How to calculate Retention?

Reading time: 6 minutes

In this post we will look at how to properly build a Retention report using Redash and SQL.
To start, let’s briefly explain what the Retention rate metric is and why it is important.

Retention rate

The Retention rate metric is widespread and particularly popular in the mobile industry, since it shows how well a product engages users in daily use. Let’s recall (or find out) how Retention is calculated:

Retention of day X is the percentage of users who return to the product on day X. In other words, if 100 new users arrived on a specific day (day 0) and 15 of them returned on day 1, then day-1 Retention equals 15/100 = 15%.
Most commonly, Retention of days 1, 3, 7 and 30 is singled out as the most descriptive set of product metrics; however, it is useful to look at the Retention curve as a whole and draw conclusions from it.
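The definition above can be checked with a small Python sketch. It is only an illustration under stated assumptions: the function name retention_rate, the dictionary layout and the toy data are hypothetical and are not part of the report built later in this post.

```python
from datetime import date


def retention_rate(install_day, visit_days_by_user, day_x):
    """Percent of the cohort that was active exactly day_x days after install_day.

    visit_days_by_user maps a user id to the set of dates on which that user
    opened the product; every key is assumed to belong to the same cohort.
    """
    cohort_size = len(visit_days_by_user)
    if cohort_size == 0:
        return 0.0
    returned = sum(
        1
        for days in visit_days_by_user.values()
        if any((d - install_day).days == day_x for d in days)
    )
    return returned * 100 / cohort_size


# Toy cohort (made-up data): 100 users installed on day 0,
# and 15 of them came back on day 1.
day0 = date(2019, 1, 1)
visits = {uid: {day0} for uid in range(100)}
for uid in range(15):
    visits[uid].add(date(2019, 1, 2))

day1_retention = retention_rate(day0, visits, 1)
print(day1_retention)  # 15.0
```

With this toy cohort the function reproduces the 15/100 = 15% example from the text.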

Retention curve

Ultimately, we want to build a curve that shows user retention from day 0 to day 30.

Retention rate curve from day 0 to day 30

Rolling Retention (RR)

Besides the classic Retention rate, there is also Rolling Retention (hereinafter, RR). When calculating RR, not only day X but also all subsequent days are taken into account. Thus, RR of day 1 is the share of users who returned on day 1 or on any later day.

Let’s compare Retention and Rolling Retention of day 10:
Retention10 = the number of users who returned on day 10 / the number of users who installed the app 10 days ago * 100%.
Rolling Retention10 = the number of users who returned on day 10 or later / the number of users who installed the app 10 days ago * 100%.
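To make the difference concrete, here is a minimal Python sketch of both formulas. The function names and the four-user cohort are hypothetical; activity is given as day offsets since install, and the cohort is assumed to be non-empty.

```python
def classic_retention(days_active_by_user, day_x):
    """Percent of users active exactly on day_x (days counted from install)."""
    returned = sum(1 for days in days_active_by_user.values() if day_x in days)
    return returned * 100 / len(days_active_by_user)


def rolling_retention(days_active_by_user, day_x):
    """Percent of users active on day_x or on any later day."""
    returned = sum(
        1 for days in days_active_by_user.values() if any(d >= day_x for d in days)
    )
    return returned * 100 / len(days_active_by_user)


# Hypothetical cohort of 4 users; values are day offsets since install.
cohort = {
    "a": {0, 10},     # came back exactly on day 10
    "b": {0, 3, 25},  # skipped day 10 but returned later
    "c": {0, 1, 2},   # stopped using the app before day 10
    "d": {0},         # never returned
}

retention_10 = classic_retention(cohort, 10)  # only user "a" counts
rolling_10 = rolling_retention(cohort, 10)    # users "a" and "b" count
print(retention_10, rolling_10)  # 25.0 50.0
```

Rolling Retention can never be lower than classic Retention for the same day, because every user counted by the classic formula is also counted by the rolling one.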

Granularity (retention of time periods)

In some industries and tasks it is useful to know Retention for a specific day (most often in the mobile industry); in others it is more useful to measure retention over longer time intervals, for example weekly or monthly periods (this is often handy in e-commerce and retail).

An example of monthly cohorts and their monthly Retention
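A coarser granularity can be sketched in Python by grouping day offsets into periods. This is a simplified illustration, assuming fixed-length periods counted from the install day (not calendar months); the name period_retention and the sample cohort are hypothetical.

```python
from collections import defaultdict


def period_retention(days_active_by_user, period_length):
    """Percent of the cohort active in each period of period_length days.

    Period 0 covers day offsets 0..period_length-1, period 1 covers the next
    period_length days, and so on. The cohort is assumed to be non-empty.
    """
    users_by_period = defaultdict(set)
    for user_id, days in days_active_by_user.items():
        for day in days:
            users_by_period[day // period_length].add(user_id)
    cohort_size = len(days_active_by_user)
    return {
        period: len(users) * 100 / cohort_size
        for period, users in sorted(users_by_period.items())
    }


# Hypothetical cohort; values are day offsets since install.
cohort = {"a": {0, 1, 8}, "b": {0, 15}, "c": {0}, "d": {0, 6, 7, 9}}

weekly = period_retention(cohort, 7)
print(weekly)  # {0: 100.0, 1: 50.0, 2: 25.0}
```

A user is counted once per period no matter how many times they were active in it, which is why user "d" with visits on days 7 and 9 contributes only once to week 1.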

How to build a Retention report in SQL?

Above, we worked out how Retention is calculated in formulas. Now let’s implement it in SQL.
Let’s assume that we have two tables: user, which stores user identifiers and meta-information, and client_session, which stores information about users’ visits to the mobile app.
Only these two tables appear in the query, so you can easily adapt it to your own data.
Note: in this code I am using Impala as the DBMS.

Collecting the size of cohorts

-- Cohort size: number of unique users per install date
SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
       ndv(user.id) AS users
FROM USER
-- keep only installs made between 60 and 31 days ago
WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
  AND from_unixtime(user.installed_at)<=date_add(now(), -31)
GROUP BY 1

Let’s go through this fairly simple query: for every day in the period [60 days ago; 31 days ago] we count the number of unique users who installed the app on that day.
To save you a trip to the documentation: the ndv() function in Impala is an analogue of count(distinct) that returns an approximate count.

Calculating the number of returned users for each cohort

-- Returned users: unique users per install date and per number of days since install
SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
       datediff(cast(cs.created_at AS TIMESTAMP), cast(user.installed_at AS TIMESTAMP)) AS date_diff,
       ndv(user.id) AS ret_base
FROM USER
LEFT JOIN client_session cs ON cs.user_id=user.id
WHERE 1=1
  -- only sessions that happened on days 0..30 after install
  AND datediff(cast(cs.created_at AS TIMESTAMP), cast(user.installed_at AS TIMESTAMP)) between 0 and 30
  AND from_unixtime(user.installed_at)>=date_add(now(), -60)
  AND from_unixtime(user.installed_at)<=date_add(now(), -31)
GROUP BY 1, 2

The key part of this query is the datediff function: for each cohort and each date_diff value we count the number of unique users with the same ndv() function (in effect, the number of users who returned on each of the days from 0 to 30).
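For readers who want to check the grouping logic outside the database, the same idea can be mirrored in Python. This is a rough sketch, not a replacement for the query: the function name returned_users and the sample data are hypothetical, timestamps are assumed to be unix seconds in UTC, the day difference is taken between calendar dates, and users are counted exactly, whereas ndv() is approximate.

```python
from collections import defaultdict
from datetime import datetime, timezone


def returned_users(installs, sessions, max_day=30):
    """Count distinct users per (install date, days since install).

    installs: {user_id: install unix time}
    sessions: iterable of (user_id, session unix time)
    """
    seen = defaultdict(set)
    for user_id, created_at in sessions:
        if user_id not in installs:
            continue  # session of a user outside the cohorts of interest
        installed = datetime.fromtimestamp(installs[user_id], tz=timezone.utc).date()
        visited = datetime.fromtimestamp(created_at, tz=timezone.utc).date()
        diff = (visited - installed).days
        if 0 <= diff <= max_day:
            seen[(installed.isoformat(), diff)].add(user_id)
    return {key: len(users) for key, users in seen.items()}


# Made-up data: three users, two install dates.
DAY = 86400
t0 = int(datetime(2019, 1, 1, tzinfo=timezone.utc).timestamp())
installs = {1: t0, 2: t0 + 3600, 3: t0 + DAY}
sessions = [
    (1, t0), (1, t0 + DAY + 10), (1, t0 + DAY + 20),  # two visits on day 1
    (2, t0 + 3600), (2, t0 + DAY),
    (3, t0 + DAY), (3, t0 + 40 * DAY),                # day 39 is out of range
]

counts = returned_users(installs, sessions)
```

Here counts holds 2 users for ("2019-01-01", 0), 2 users for ("2019-01-01", 1) and 1 user for ("2019-01-02", 0); repeated sessions on the same day do not inflate the numbers, because users are collected in sets.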

Great, now we have the size of cohorts and the number of returned users.

Putting it all together

SELECT reg.reg_date AS date_registration,
       reg.users AS cohort_size,
       cohort.date_diff AS day_difference,
       cohort.ret_base AS retention_base,
       cohort.ret_base/reg.users AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1, 2) cohort ON reg.reg_date=cohort.reg_date
ORDER BY 1, 3

We now have a query that calculates Retention for each cohort, and the result can be displayed as follows:

Retention rate, calculated for each cohort of users
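The join of the two subqueries can also be illustrated in Python. This is a minimal sketch with hypothetical names and made-up numbers: cohort sizes play the role of the reg subquery, per-day counts play the role of the cohort subquery, and a cohort without any returns still gets a row, as it would with a LEFT JOIN.

```python
def retention_table(cohort_sizes, returned_counts):
    """Left-join cohort sizes with per-day return counts.

    cohort_sizes: {reg_date: users}
    returned_counts: {(reg_date, date_diff): ret_base}
    Returns rows (reg_date, cohort_size, date_diff, ret_base, retention_rate)
    ordered by reg_date and date_diff.
    """
    rows = []
    for reg_date in sorted(cohort_sizes):
        size = cohort_sizes[reg_date]
        matches = sorted(
            (diff, base)
            for (date_key, diff), base in returned_counts.items()
            if date_key == reg_date
        )
        if not matches:
            # No sessions for this cohort: keep the row, as LEFT JOIN would.
            rows.append((reg_date, size, None, None, None))
        for diff, base in matches:
            rows.append((reg_date, size, diff, base, base / size))
    return rows


# Made-up inputs for illustration.
sizes = {"2019-01-01": 200, "2019-01-02": 50, "2019-01-03": 10}
counts = {("2019-01-01", 0): 200, ("2019-01-01", 1): 30, ("2019-01-02", 0): 50}

table = retention_table(sizes, counts)
for row in table:
    print(row)
```

As in the query, retention_rate here is a fraction between 0 and 1 rather than a percentage.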

Building a single Retention curve

Let’s modify our query a bit to obtain the data for a single Retention curve:

SELECT 
       cohort.date_diff AS day_difference,
       avg(reg.users) AS cohort_size,
       avg(cohort.ret_base) AS retention_base,
       avg(cohort.ret_base)/avg(reg.users)*100 AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1,2) cohort ON reg.reg_date=cohort.reg_date
GROUP BY 1
ORDER BY 1

Now we have the Retention rate for each day, averaged across all cohorts.
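One detail worth keeping in mind: dividing avg(ret_base) by avg(users) is not the same as averaging the per-cohort rates, because larger cohorts weigh more in the first case. The Python sketch below, with hypothetical function names and made-up numbers, shows the difference for a single day.

```python
def pooled_retention(cohorts):
    """avg(returned) / avg(size) * 100, as in the query: larger cohorts weigh more."""
    avg_returned = sum(returned for _, returned in cohorts) / len(cohorts)
    avg_size = sum(size for size, _ in cohorts) / len(cohorts)
    return avg_returned / avg_size * 100


def mean_of_rates(cohorts):
    """Simple mean of per-cohort rates: every cohort weighs the same."""
    return sum(returned / size * 100 for size, returned in cohorts) / len(cohorts)


# Hypothetical (cohort_size, returned_on_day_1) pairs:
# one large cohort with 10% retention and one tiny cohort with 50%.
cohorts = [(1000, 100), (10, 5)]

pooled = pooled_retention(cohorts)
simple = mean_of_rates(cohorts)
print(round(pooled, 2), round(simple, 2))  # 10.4 30.0
```

Which of the two is preferable depends on the question being asked; the pooled version is less sensitive to small, noisy cohorts.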

2019. Tags: analysis, BI-tools, redash, sql, visualisation