LEFT JOIN: blog on analytics, visualisation & data science, posts tagged: sql

Clickhouse Training 101 by Altinity

Mon, 09 Aug 2021 08:41:15 +0300

Just recently I have completed a Clickhouse Training by Altinity (101 Series Training). For those who are just getting to know Clickhouse, Altinity offers free basic training: Data Warehouse Basics. I recommend starting with it if you are planning to dive into learning.

Certification by Altinity

I would like to share my experience on the training as well as the training notes.
The training costs $500 and lasts 4 days for 2 hours. It is carried out in the evenings Moscow time (starting from 19:00 GMT +3).

Session # 1.

The first day mostly revises everything covered in Data Warehouse Basics, but it has several new ideas on how to get useful information on queries from system tables.

For example, this query will show which commands are running and their status.

SELECT command, is_done
FROM system.mutations
WHERE table = 'ontime'

Besides, for me it was useful to learn about column compression with the use of codecs:

ALTER TABLE ontime
 MODIFY COLUMN TailNum LowCardinality(String) CODEC(ZSTD(1))

For those who are just starting with Clickhouse, the first day will be super useful as it will help in understanding table engines and syntax for their creation, partitions, inserting data (for example directly from S3).

INSERT INTO sdata
SELECT * FROM s3(
 'https://s3.us-east-1.amazonaws.com/d1-altinity/data/sdata*.csv.gz',
 'aws_access_key_id',
 'aws_secret_access_key',
 'Parquet',
 'DevId Int32, Type String, MDate Date, MDatetime
DateTime, Value Float64')

Session # 2.

I found the second day the most intense and useful as within this session Robert from Altinity talks about aggregate functions and materialized views ( detailed scheme for the creation of materialized views ) in Clickhouse in more detail.

It was super useful for me to learn about index types in Clickhouse.

Session # 3.

During the third day, colleagues share their knowledge on how to work with Kafka and JSON objects stored in the tables.
It was interesting to find out that working with arrays in Clickhouse is very similar to arrays in Python:

WITH [1, 2, 4] AS array
SELECT
 array[1] AS First,
 array[2] AS Second,
 array[3] AS Third,
 array[-1] AS Last,
 length(array) AS Length

When working with arrays, there is a great feature called ARRAY JOIN which “unrolls” arrays to rows.

Clickhouse allows you to efficiently interact with JSON objects stored in a table:

-- Get a JSON string value
SELECT JSONExtractString(row, 'request') AS request
FROM log_row LIMIT 3
-- Get a JSON numeric value
SELECT JSONExtractInt(row, 'status') AS status
FROM log_row LIMIT 3

This piece of code is an example of how to extract the elements of the JSON array “request” and “status” separately.

ALTER TABLE log_row
 ADD COLUMN
status Int16 DEFAULT
 JSONExtractInt(row, 'status')
ALTER TABLE log_row
UPDATE status = status WHERE 1 = 1

Session # 4.

The most difficult topic from my point of view was saved for the last day – building sharding and replication patterns and building queries on distributed Clickhouse servers.

Special respect to Altinity for an excellent collection of labs during the training.

Links:

Data normalization with SQL

Fri, 21 May 2021 16:58:48 +0300

According to GIGO (garbage in, garbage out) principle, errors in input data lead to erroneous analysis results. The results of our work directly depend on the quality of data preparation.

For instance, when we need to prepare data to use in ML algorithms (like k-NN, k-means, logistic regression, etc.), features in the original dataset may vary in scale like the age and height of a person. This may lead to the incorrect performance of the algorithm. Thus, such data needs to be rescaled first.

In this tutorial, we will consider the ways to scale the data using SQL query: min-max normalization, min-max normalization for an arbitrary range, and z-score normalization. For each of these methods we have prepared two SQL query options: one using a SELECT subquery and another using a window function OVER().

We will work with the simple table students that contains the data on the height of the students:

name	height
Ivan	174
Peter	181
Dan	199
Kate	158
Mike	179
Silvia	165
Giulia	152
Robert	188
Steven	177
Sophia	165

Min-max rescaling

Min-max scaling approach scales the data using the fixed range from 0 to 1. In this case, all the data is on the same scale which will exclude the impact of outliers on the conclusions.

The formula for the min-max scaling is given as:

We multiply the numerator by 1.0 in order to get a floating point number at the end.

SQL-query with a subquery:

SELECT height, 
       1.0 * (height-t1.min_height)/(t1.max_height - t1.min_height) AS scaled_minmax
  FROM students, 
      (SELECT min(height) as min_height, 
              max(height) as max_height 
         FROM students
      ) as t1;

SQL-query with a window function:

SELECT height, 
       (height - MIN(height) OVER ()) * 1.0 / (MAX(height) OVER () - MIN(height) OVER ()) AS scaled_minmax
  FROM students;

As a result, we get the values in [0, 1] range where 0 is the height of the shortest student and 1 is the height of the tallest one.

name	height	scaled_minmax
Ivan	174	0.46809
Peter	181	0.61702
Dan	199	1
Kate	158	0.12766
Mike	179	0.57447
Silvia	165	0.2766
Giulia	152	0
Robert	188	0.76596
Steven	177	0.53191
Sophia	165	0.2766

Rescaling within a given range

This is an option of min-max normalization between an arbitrary set of values. When it comes to data scaling, the values do not always need to be in the range between 0 and 1. In these cases, the following formula is applied.

This allows us to scale the data to an arbitrary scale. In our example, let’s assume a=10.0 and b=20.0.

SQL-query with subquery:

SELECT height, 
       ((height - min_height) * (20.0 - 10.0) / (max_height - min_height)) + 10 AS scaled_ab
  FROM students,
      (SELECT MAX(height) as max_height, 
              MIN(height) as min_height
         FROM students  
      ) t1;

SQL-query with a window function:

SELECT height, 
       ((height - MIN(height) OVER() ) * (20.0 - 10.0) / (MAX(height) OVER() - MIN(height) OVER())) + 10.0 AS scaled_ab
  FROM students;

We receive similar results as before, but with data spread between 10 and 20.

name	height	scaled_ab
Ivan	174	14.68085
Peter	181	16.17021
Dan	199	20
Kate	158	11.2766
Mike	179	15.74468
Silvia	165	12.76596
Giulia	152	10
Robert	188	17.65957
Steven	177	15.31915
Sophia	165	12.76596

Z-score normalization

Using Z-score normalization, the data will be scaled so that it has the properties of a standard normal distribution where the mean (μ) is equal to 0 and the standard deviation (σ) to 1.

Z-score is calculated using the formula:

SQL-query with a subquery:

SELECT height, 
       (height - t1.mean) * 1.0 / t1.sigma AS zscore
  FROM students,
      (SELECT AVG(height) AS mean, 
              STDDEV(height) AS sigma
         FROM students
        ) t1;

SQL-query with a window function:

SELECT height, 
       (height - AVG(height) OVER()) * 1.0 / STDDEV(height) OVER() AS z-score
  FROM students;

As a result, we can easily notice the outliers that exceed the standard deviation.

name	height	zscore
Ivan	174	0.01488
Peter	181	0.53582
Dan	199	1.87538
Kate	158	-1.17583
Mike	179	0.38698
Silvia	165	-0.65489
Giulia	152	-1.62235
Robert	188	1.05676
Steven	177	0.23814
Sophia	165	-0.65489

SQL Window Functions Cheat Sheet with examples

Tue, 02 Jun 2020 12:17:13 +0300

Window functions are calculation functions that can increase the efficiency and reduce the complexity of SQL queries, making things much easier:
View as

Special Thanks to LearnSQL, best place to master your SQL skills !

Cohort analysis in Redash

Tue, 03 Mar 2020 14:17:54 +0300

In one of the previous articles we have reviewed building of Retention-report and have partially addressed the concept of cohorts therein.
Cohort usually implies group of users of a product or a company. Most often, groups are allocated on the basis of time of app installation / appearance of a user in a system.
It turns out, that, using cohort analysis, one can track down how the changes in a product affected the behaviour of users (for example, of old and new users).

Along with that, cohorts can be defined also proceeding from other parameters: geography of a user, traffic source, device platform and other important parameters of your product.

We will figure out, how to compare Retention of users of weekly cohorts in Redash, since Redash has special type of visualization for building such type of report.
Firstly, let’s sort out SQL-query. We still have two tables – user (id of a user and time of app installation) and client_session – timestamps (created_at) of activity of each user (user_id). Let’s consider the Retention of the first seven days for last 60 days.
The query is written in Cloudera Impala, let’s review it.

For starters, let’s build the total size of cohorts:

select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
	ndv(distinct user.id) as cohort_size //counting the number of users in the cohort
	from user 
	where from_unixtime(user.installed_at) between date_add(now(), -60) and now() //taking registered users for last 60 days
group by trunc(from_unixtime(user.installed_at), "WW")

The second part of the query can calculate the quantity of active users for every day during the first thirty days:

select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
        datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) as days,
	ndv(distinct user.id) as value  //counting the number of active users for every day
		from user 
		left join client_session cs on user.id=cs.user_id
where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
and from_unixtime(cs.created_at) >= date_add(now(), -60) //taking sessions for last 60 days
and datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) between 0 and 30 //cutting off only the first 30 days of activity
group by 1,2

Bottom line, all the query entirely:

select size.cohort_week, size.cohort_size, ret.days, ret.value from
(select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
		ndv(distinct user.id) as cohort_size 
	from user 
	where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
group by trunc(from_unixtime(user.installed_at), "WW")) size
left join (select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
        datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) as days,
		ndv(distinct user.id) as value 
		from user 
		left join client_session cs on user.id=cs.user_id
where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
and from_unixtime(cs.created_at) >= date_add(now(), -60)
and datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) between 0 and 30
group by 1,2) ret on size.cohort_week=ret.cohort_week

Great, now correctly calculated data is available to us.

Data of cohorts in tabular form

Let’s create new visualization in Redash and indicate the parameters correctly:

It’s important to indicate the parameters correctly – the columns of the resulting query are compliant therewith.

Let’s make sure to indicate that we have weekly cohorts:

Voila, our visualization of cohorts is ready:

You can add filters and parameters to it and use in a dashboard

Materials on the topic

Looker Overview

Wed, 08 Jan 2020 11:22:02 +0300

Today we are going to talk about BI-platform Looker, on which I managed to work in 2019.

Here is the short content of the article for convenient and fast navigation:

What is Looker?

Creators of Looker position it as a software of business intelligence class and big data analytics platform, that helps to research, analyze and share business analytics in real time mode.
Looker — is a really convenient tool and one of a few BI products, that allows to work with pre-set data cubes in a real-time mode (actually, relational tables that are described in Look ML-model).
An engineer, working with Looker, needs to describe a data model on Look ML language (it’s something between CSS and SQL), publish this data model and then set reporting and dashboards.
Look ML itself is pretty simple, the nexus between the data objects are set by a data-engineer, which consequently allows to use the data without knowledge of SQL language (to be precise: Looker engine generates the code in SQL language itself on user’s behalf).

Just recently, in June 2019, Google announced acquisition of Looker platform for $2.6 billion.

Which DBMS you can connect to via Looker and how?

The selection of DBMS that Looker is working with is pretty wide. You can see the various connections on the screen shot below as of October, 2019:

Available DBMS for connection

You can easily set a connection to the database via web-interface:

Web-interface of connection to DBMS

With regard to connections to databases, I’d like to highlight the following two facts: first of all, unfortunately, Clickhouse support from Yandex is currently missing (as well as in the foreseeable future). Most likely, the support won’t appear, considering the fact that Looker was acquired by a competitor, Google.
updated: Actually, Looker supports Clickhouse from the December 2019
The second nuisance is that you can’t build one data model, that would apply to different DBMS. There is no inbuilt storage in Looker, that could combine the results of query (unlike the same Redash).
It means, that analytical architecture should be built within one DBMS (preferably with high action speed or on aggregated data).

Building of Looker ML data model

In order to build a report or a dashboard in Looker, you need to provisionally set a data model. Syntax of Look ML language is quite thoroughly described in the documentation. Personally, I can just add that model description doesn’t require long-time immersion for a specialist with SQL knowledge. Rather, one needs to rearrange the approach to data model preparation. Look ML language is very much alike CSS:

Console of Look ML model creation

In the data model the following is set up: links with tables, keys, granularity, information of some fields being facts, and other – measurements. For facts, the aggregation is written. Obviously, at model creation one can use various IF / CASE expressions.

Explore mode

Probably, it’s the main killer-feature of Looker, since it allows any business departments to get data without attraction of analysts / data engineers. And, guess that’s why use of accounts with Explore mode is billed separately.

In fact, Explore mode is an interface, that allows to use the set up Look ML data model, select the required metrics and measurements and build customized report / visualization.
For example, we want to understand how many actions of any kind were performed in Looker’s interface last week. In order to do it, using Explore mode, we select Date field and apply a filter to it: last week (in this sense, Looker is quite smart and it and it will be enough writing ‘Last week’ in the filter), thereafter we choose “Category” from the measurements, and Quantity as a metric. After pressing the button Run the ready report will be generated.

Building report in Looker

Then, using the data received in the table form, you can set up the visualization of any type.
For example, Pie chart:

Applying visualization to report

Building of reports and their saving in Look

Sometimes you can have a desire to save the set of data / visualization received in Explore and share it with colleagues, for this purpose Looker has a separate essense – Look. That is ready constructed report with selected filters / measurements / facts.

Example of the saved Look

Examples of dashboards in Looker

Systemizing the warehouse of Look created, oftentimes you want to receive a ready composition / overview of key metrics, that could be displayed on one list.
For these purposes dashboard creation fits perfectly. Dashboard is created either on the wing, or using previously created Look. One of the dashboard’s “tricks” is configuration of parameters, that are changed on all the dashboard and can be applied to all the Look at the same time.

Interesting features in one line

In Looker you can refer to other reports and, using such function, you can create a dynamic parameter, that is passed on by a link.
For example, you’ve created a report with division of revenue by countries, and in this report you can refer to the dashboard on a separate country. Following the link, a user sees the dashboard on a specific country, that he clicked on.
On every Looker page there is a chat, where support service answers very promptly
Looker is not able to work with data merge on the level of various DBMS, however it can combine the data on the level of ready Look (in our case, this function works really weird).
Within the framework of work with various models, I have found out an extremely non-trivial use of SQL for calculation of unique values in a non-normalized data table, Looker calls it symmetric aggregates.
SQL, indeed, looks very non-trivial:

SELECT 
 order_items.order_id AS "order_items.order_id",
 order_items.sale_price AS "order_items.sale_price",
 (COALESCE(CAST( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(users.age ,0)
 *(1000000*1.0)) AS DECIMAL(38,0))) + 
 CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0)) ) 
 - SUM(DISTINCT CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))) ) 
 AS DOUBLE PRECISION) 
 / CAST((1000000*1.0) AS DOUBLE PRECISION), 0) 
 / NULLIF(COUNT(DISTINCT CASE WHEN users.age IS NOT NULL THEN users.id 
 ELSE NULL END), 0)) AS "users.average_age"
FROM order_items AS order_items
LEFT JOIN users AS users ON order_items.user_id = users.id

GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 500

At implementation of Looker to a purchase, JumpStart Kit is mandatory, which costs not less than $6k. Within this kit you receive support and consultation from Looker at tool implementation.

Diagram of BCG (Boston Consulting Group) Matrix

Tue, 19 Nov 2019 10:38:11 +0300

I will water down the blog with an interesting report, that was developed for Yota company on November, 2011. BCG Matrix has inspired us to develop this report.

We had: one Excel package, 75 VBA macro, ODBC connection to Oracle, SQL queries to databases of all sorts and colours. We will review report construction within this stack, but first, let’s speak about the very idea of the report.

BCG Matrix – is 2x2 matrix, whereon the clients’ segments are displayed by circumferences with their centres in the intersection of coordinates, formed by the relevant paces of two indicators selected.

To make it simple, we had to divide all the clients of the company into 4 segments: ARPU above average/below average, traffic consumption (main service) above average/below average. Thus, it turned out that 4 quadrants appear, and you need to place a bubble chart into each one of them, whereas the size of a bubble means the total amount of users within a segment. In addition to that, one more bubble was added to each quadrant (smaller one), that showcased the churn in each segment (author’s improvement).

What did we want to get at the output?
A chart of the following type:

Representation of the BCG matrix on the data of Yota company

The task statement is more or less clear, let’s move to the realization.
Let’s assume, that we’ve already collected all the required data (meaning that, we’ve learned to identify the average ARPU and average traffic consumption, in this post we won’t examine SQL-query), then the paramount task lies in understanding how to display the bubbles in the required places by means of Excel tools.

For this aim, a bubble chart comes to help:

Insert – Chart – Bubble

Going to the menu Selection of data source and evaluating, what is required in order to build a chart in the type that we need: coordinates X, coordinates Y, values of bubbles’ sizes.

Great, so it turns out that if we assume that our chart will be located in coordinates on the X axis from -1 to 1, and on the Y axis from -1 to 1, then the centre of the right upper bubble will be the spot (0.5; 0.5) on the chart. Likewise, we’ll place all the other bubbles.

We should separately consider the bubbles of Churn type (for displaying of the churn), they are located more to the right then the main bubble and might intersect with it, therefore we will place the right upper bubble to empirically obtained coordinates (0.65; 0.35).

Thus, for four main and four additional bubbles, we can organize the data as follows:

Let’s review more thoroughly how we’ll use them:

So, we set on X-axis – horizontal coordinates of the centres of our bubbles, that lie in the cells A9:A12, on Y-axis – vertical coordinates of the centres of our bubbles, that lie in the cells B9:B12, and the sizes of the bubbles are stored in the cells E9:E12.
Furthermore, we add another data set for the Churn, once more indicating all the required parameters.

We’ll get the following chart:

Then, we’re making it pretty: changing colours, deleting axis and getting a beautiful result.

By adding the required data labels, we receive what we initially needed in the task.

Share your experience in comments – did you build such charts and how you solved the task?

How to calculate Retention?

Sun, 03 Nov 2019 16:27:55 +0300

In this post we will discover, how to properly construct a report on Retention with application of Redash and SQL language.
For starters, let’s explain in a nutshell what the metric Retention rate is, why it is important,

Retention rate

Retention rate metric is widespread and is particularly popular within the mobile industry, since it allows to understand how well a product engages the users into daily use. Let’s recall (or discover), how Retention is calculated:

Retention of day X – is N% of users that will return to the product on day X. In other words, if on some specific day (day 0) 100 new users came, and 15 returned on the first day, then Retention of the 1st day will be equal to 15/100=15%.
Most commonly, Retention of days 1, 3, 7 and 30 are singled out as the most descriptive metrics of a product, however it’s useful to address Retention curve as a whole and make conclusions, proceeding from it.

Retention curve

In the end, we are interested in construction of such curve, that shows the retention of users from day 0 to day 30.

Retention rate curve from day 0 do day 30

Rolling Retention (RR)

Besides classic Retention rate, Rolling Retention (hereinafter, RR) is allocated. At calculation of RR, apart from day X, all the subsequent days are also considered. Thus, RR of the 1st day – the amount of users who returned on the 1st and subsequent days.

Let’s compare Retention and Rolling Retention of the 10th day:
Retention₁₀ — the amount of users, who returned on the 10th day / the amount of users, who installed the app 10 days ago * 100%.
Rolling Retention₁₀ — the amount of users, who returned on the 10th day or later / the amount of users, who installed the app 10 days ago * 100%.

Granularity (retention of time periods)

In some industries and respective tasks, it is useful to understand the Retention of a specific day (most often, in the mobile industry), in other cases it is useful to understand the retention of users on various time intervals: for example, weekly or monthly periods (oftentimes, it’s handy in e-commerce, retail).

An example of cohorts by months and monthly Retention respective thereto

How to build a Retention report on SQL language?

We have sorted out above how to calculate Retention in formulas. Now let’s apply it with SQL language.
Let’s assume, that we have two tables: user — storing data about users’ identifiers and meta-information, client_session — information on visits of the mobile app by users.
Only these two tables will be present in the query, so you can easily adapt the query to yourself.
note: within this code, I am using Impala as DBMS.

Collecting the size of cohorts

SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1

Let’s sort out this pretty simple query: for every day we calculate the number of unique users for the period [60 days ago; 31 days ago].
In order not to mess with documentation: command ndv() in Impala is analogue of a command count(distinct).

Calculating the number of returned users on each cohort

SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1, 2

In this query, the key part is contained in the command datediff: now we are calculating for each cohort and for each datediff the number of unique users with the very same command ndv() (practically, the number of users, who returned within the days from 0 to 30).

Great, now we have the size of cohorts and the number of returned users.

Combining all together

SELECT reg.reg_date AS date_registration,
       reg.users AS cohort_size,
       cohort.date_diff AS day_difference,
       cohort.ret_base AS retention_base,
       cohort.ret_base/reg.users AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1, 2) cohort ON reg.reg_date=cohort.reg_date
    ORDER BY 1,3

We have received the query, that calculates Retention for each cohort, and, eventually, the result can be displayed as follows:

Retention rate, calculated for each cohort of users

Construction of the sole Retention curve

Let’s modify our query a bit and obtain the data for construction of one Retention curve:

SELECT 
       cohort.date_diff AS day_difference,
       avg(reg.users) AS cohort_size,
       avg(cohort.ret_base) AS retention_base,
       avg(cohort.ret_base)/avg(reg.users)*100 AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1,2) cohort ON reg.reg_date=cohort.reg_date
    GROUP BY 1        
    ORDER BY 1

Now, we have average by all the cohorts Retention rate, calculated for each day.

Redash – full-fledged on-demand analytics

Wed, 14 Nov 2018 23:41:09 +0300

Today we will examine and try to get to the bottom of one tool that is quite famous nowadays – Redash. The tool is extremely convenient due to the fact that it can work with Clickhouse. All the other developers of BI-tools don’t support Clickhouse. Just recently, an ODBC driver has appeared for Tableau.

I will make a short review of Redash’s useful features, and in the following number of posts I will figure out which useful requests and reports can be constructed using Redash.

Redash Homepage

So, what is Redash? It is the tool for on-demand analytics, moreover it can be used on completely different databases. For instance, Redash can be connected to a database under MySQL or HP Vertica.
The main thing, proposed by Redash is a handy console for writing SQL-queries to a database.

Console for writing SQL-queries

Consequently, the first major observation: in order to use Redash you should know SQL or have an employee in team who knows SQL.
Apart from the basic console, Redash provides visualization tools (inter alia, construction of Funnel-charts, cohort analysis and pivot tables (however, the functionality of the latter is very limited)), as well as tools for construction of dashboards and alert systems (via mail or in slack).
Let’s examine the basic console a bit more thoroughly. A user has an access to some useful features: application of filters, multifilters and query’s parameters:

Simultaneous application of parameters, filters and multifilters in queries

Further, opportunities of visualizations should be mentioned separately. They are, indeed, convenient and extensive, especially considering that one can set automatic updating of query results in accordance with the required timetable.
In the following note we will discuss visualization in Redash more closely, and later – construction of dashboards and alert systems.

LEFT JOIN: blog on analytics, visualisation & data science, posts tagged: sql

Clickhouse Training 101 by Altinity

Session # 1.

Session # 2.

Session # 3.

Session # 4.

Data normalization with SQL

Min-max rescaling

Rescaling within a given range

Z-score normalization

SQL Window Functions Cheat Sheet with examples

Cohort analysis in Redash

Looker Overview

What is Looker?

Which DBMS you can connect to via Looker and how?

Building of Looker ML data model

Explore mode

Building of reports and their saving in Look

Examples of dashboards in Looker

Interesting features in one line

Diagram of BCG (Boston Consulting Group) Matrix

How to calculate Retention?

Retention rate

Retention curve

Rolling Retention (RR)

Granularity (retention of time periods)

How to build a Retention report on SQL language?

Collecting the size of cohorts

Calculating the number of returned users on each cohort

Combining all together

Construction of the sole Retention curve

More on the subject

Redash – full-fledged on-demand analytics