8 posts tagged


Clickhouse Training 101 by Altinity

Estimated read time – 5 min

Just recently I have completed a Clickhouse Training by Altinity (101 Series Training). For those who are just getting to know Clickhouse, Altinity offers free basic training: Data Warehouse Basics. I recommend starting with it if you are planning to dive into learning.

Certification by Altinity

I would like to share my experience on the training as well as the training notes.
The training costs $500 and lasts 4 days for 2 hours. It is carried out in the evenings Moscow time (starting from 19:00 GMT +3).

Session # 1.

The first day mostly revises everything covered in Data Warehouse Basics, but it has several new ideas on how to get useful information on queries from system tables.

For example, this query will show which commands are running and their status.

SELECT command, is_done
FROM system.mutations
WHERE table = 'ontime'

Besides, for me it was useful to learn about column compression with the use of codecs:

 MODIFY COLUMN TailNum LowCardinality(String) CODEC(ZSTD(1))

For those who are just starting with Clickhouse, the first day will be super useful as it will help in understanding table engines and syntax for their creation, partitions, inserting data (for example directly from S3).

 'DevId Int32, Type String, MDate Date, MDatetime
DateTime, Value Float64')

Session # 2.

I found the second day the most intense and useful as within this session Robert from Altinity talks about aggregate functions and materialized views ( detailed scheme for the creation of materialized views ) in Clickhouse in more detail.

It was super useful for me to learn about index types in Clickhouse.

Session # 3.

During the third day, colleagues share their knowledge on how to work with Kafka and JSON objects stored in the tables.
It was interesting to find out that working with arrays in Clickhouse is very similar to arrays in Python:

WITH [1, 2, 4] AS array
 array[1] AS First,
 array[2] AS Second,
 array[3] AS Third,
 array[-1] AS Last,
 length(array) AS Length

When working with arrays, there is a great feature called ARRAY JOIN which “unrolls” arrays to rows.

Clickhouse allows you to efficiently interact with JSON objects stored in a table:

-- Get a JSON string value
SELECT JSONExtractString(row, 'request') AS request
FROM log_row LIMIT 3
-- Get a JSON numeric value
SELECT JSONExtractInt(row, 'status') AS status
FROM log_row LIMIT 3

This piece of code is an example of how to extract the elements of the JSON array “request” and “status” separately.

status Int16 DEFAULT
 JSONExtractInt(row, 'status')
UPDATE status = status WHERE 1 = 1

Session # 4.

The most difficult topic from my point of view was saved for the last day – building sharding and replication patterns and building queries on distributed Clickhouse servers.

Special respect to Altinity for an excellent collection of labs during the training.


 No comments    64   1 mon   clickhouse   sql

Data normalization with SQL

Estimated read time – 5 min

According to GIGO (garbage in, garbage out) principle, errors in input data lead to erroneous analysis results. The results of our work directly depend on the quality of data preparation.

For instance, when we need to prepare data to use in ML algorithms (like k-NN, k-means, logistic regression, etc.), features in the original dataset may vary in scale like the age and height of a person. This may lead to the incorrect performance of the algorithm. Thus, such data needs to be rescaled first.

In this tutorial, we will consider the ways to scale the data using SQL query: min-max normalization, min-max normalization for an arbitrary range, and z-score normalization. For each of these methods we have prepared two SQL query options: one using a SELECT subquery and another using a window function OVER().

We will work with the simple table students that contains the data on the height of the students:

name height
Ivan 174
Peter 181
Dan 199
Kate 158
Mike 179
Silvia 165
Giulia 152
Robert 188
Steven 177
Sophia 165

Min-max rescaling

Min-max scaling approach scales the data using the fixed range from 0 to 1. In this case, all the data is on the same scale which will exclude the impact of outliers on the conclusions.

The formula for the min-max scaling is given as:

We multiply the numerator by 1.0 in order to get a floating point number at the end.

SQL-query with a subquery:

SELECT height, 
       1.0 * (height-t1.min_height)/(t1.max_height - t1.min_height) AS scaled_minmax
  FROM students, 
      (SELECT min(height) as min_height, 
              max(height) as max_height 
         FROM students
      ) as t1;

SQL-query with a window function:

SELECT height, 
       (height - MIN(height) OVER ()) * 1.0 / (MAX(height) OVER () - MIN(height) OVER ()) AS scaled_minmax
  FROM students;

As a result, we get the values in [0, 1] range where 0 is the height of the shortest student and 1 is the height of the tallest one.

name height scaled_minmax
Ivan 174 0.46809
Peter 181 0.61702
Dan 199 1
Kate 158 0.12766
Mike 179 0.57447
Silvia 165 0.2766
Giulia 152 0
Robert 188 0.76596
Steven 177 0.53191
Sophia 165 0.2766

Rescaling within a given range

This is an option of min-max normalization between an arbitrary set of values. When it comes to data scaling, the values do not always need to be in the range between 0 and 1. In these cases, the following formula is applied.

This allows us to scale the data to an arbitrary scale. In our example, let’s assume a=10.0 and b=20.0.

SQL-query with subquery:

SELECT height, 
       ((height - min_height) * (20.0 - 10.0) / (max_height - min_height)) + 10 AS scaled_ab
  FROM students,
      (SELECT MAX(height) as max_height, 
              MIN(height) as min_height
         FROM students  
      ) t1;

SQL-query with a window function:

SELECT height, 
       ((height - MIN(height) OVER() ) * (20.0 - 10.0) / (MAX(height) OVER() - MIN(height) OVER())) + 10.0 AS scaled_ab
  FROM students;

We receive similar results as before, but with data spread between 10 and 20.

name height scaled_ab
Ivan 174 14.68085
Peter 181 16.17021
Dan 199 20
Kate 158 11.2766
Mike 179 15.74468
Silvia 165 12.76596
Giulia 152 10
Robert 188 17.65957
Steven 177 15.31915
Sophia 165 12.76596

Z-score normalization

Using Z-score normalization, the data will be scaled so that it has the properties of a standard normal distribution where the mean (μ) is equal to 0 and the standard deviation (σ) to 1.

Z-score is calculated using the formula:

SQL-query with a subquery:

SELECT height, 
       (height - t1.mean) * 1.0 / t1.sigma AS zscore
  FROM students,
      (SELECT AVG(height) AS mean, 
              STDDEV(height) AS sigma
         FROM students
        ) t1;

SQL-query with a window function:

SELECT height, 
       (height - AVG(height) OVER()) * 1.0 / STDDEV(height) OVER() AS z-score
  FROM students;

As a result, we can easily notice the outliers that exceed the standard deviation.

name height zscore
Ivan 174 0.01488
Peter 181 0.53582
Dan 199 1.87538
Kate 158 -1.17583
Mike 179 0.38698
Silvia 165 -0.65489
Giulia 152 -1.62235
Robert 188 1.05676
Steven 177 0.23814
Sophia 165 -0.65489
 No comments    157   3 mon   Analytics engineering   sql

SQL Window Functions Cheat Sheet with examples

Estimated read time – 1 min

Window functions are calculation functions that can increase the efficiency and reduce the complexity of SQL queries, making things much easier:
View as  PDF

Special Thanks to LearnSQL, best place to master your SQL skills !

 No comments    340   2020   Analytics engineering   sql

Cohort analysis in Redash

Estimated read time – 5 min

In one of the previous articles we have reviewed building of Retention-report and have partially addressed the concept of cohorts therein.
Cohort usually implies group of users of a product or a company. Most often, groups are allocated on the basis of time of app installation / appearance of a user in a system.
It turns out, that, using cohort analysis, one can track down how the changes in a product affected the behaviour of users (for example, of old and new users).

Along with that, cohorts can be defined also proceeding from other parameters: geography of a user, traffic source, device platform and other important parameters of your product.

We will figure out, how to compare Retention of users of weekly cohorts in Redash, since Redash has special type of visualization for building such type of report.
Firstly, let’s sort out SQL-query. We still have two tables – user (id of a user and time of app installation) and client_session – timestamps (created_at) of activity of each user (user_id). Let’s consider the Retention of the first seven days for last 60 days.
The query is written in Cloudera Impala, let’s review it.

For starters, let’s build the total size of cohorts:

select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
	ndv(distinct user.id) as cohort_size //counting the number of users in the cohort
	from user 
	where from_unixtime(user.installed_at) between date_add(now(), -60) and now() //taking registered users for last 60 days
group by trunc(from_unixtime(user.installed_at), "WW")

The second part of the query can calculate the quantity of active users for every day during the first thirty days:

select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
        datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) as days,
	ndv(distinct user.id) as value  //counting the number of active users for every day
		from user 
		left join client_session cs on user.id=cs.user_id
where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
and from_unixtime(cs.created_at) >= date_add(now(), -60) //taking sessions for last 60 days
and datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) between 0 and 30 //cutting off only the first 30 days of activity
group by 1,2

Bottom line, all the query entirely:

select size.cohort_week, size.cohort_size, ret.days, ret.value from
(select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
		ndv(distinct user.id) as cohort_size 
	from user 
	where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
group by trunc(from_unixtime(user.installed_at), "WW")) size
left join (select trunc(from_unixtime(user.installed_at), "WW") AS cohort_week, 
        datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) as days,
		ndv(distinct user.id) as value 
		from user 
		left join client_session cs on user.id=cs.user_id
where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
and from_unixtime(cs.created_at) >= date_add(now(), -60)
and datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) between 0 and 30
group by 1,2) ret on size.cohort_week=ret.cohort_week

Great, now correctly calculated data is available to us.

Data of cohorts in tabular form

Let’s create new visualization in Redash and indicate the parameters correctly:

It’s important to indicate the parameters correctly – the columns of the resulting query are compliant therewith.

Let’s make sure to indicate that we have weekly cohorts:

Voila, our visualization of cohorts is ready:

You can add filters and parameters to it and use in a dashboard

Materials on the topic

 No comments    1134   2020   BI-tools   redash   sql   visualisation

Looker Overview

Estimated read time – 10 min

Today we are going to talk about BI-platform Looker, on which I managed to work in 2019.

Here is the short content of the article for convenient and fast navigation:

  1. What is Looker?
  2. Which DBMS you can connect to via Looker and how?
  3. Building of Looker ML data model
  4. Explore Mode (data research on the model built)
  5. Building of reports and their saving in Look
  6. Examples of dashboards in Looker

What is Looker?

Creators of Looker position it as a software of business intelligence class and big data analytics platform, that helps to research, analyze and share business analytics in real time mode.
Looker — is a really convenient tool and one of a few BI products, that allows to work with pre-set data cubes in a real-time mode (actually, relational tables that are described in Look ML-model).
An engineer, working with Looker, needs to describe a data model on Look ML language (it’s something between CSS and SQL), publish this data model and then set reporting and dashboards.
Look ML itself is pretty simple, the nexus between the data objects are set by a data-engineer, which consequently allows to use the data without knowledge of SQL language (to be precise: Looker engine generates the code in SQL language itself on user’s behalf).

Just recently, in June 2019, Google announced acquisition of Looker platform for $2.6 billion.

Which DBMS you can connect to via Looker and how?

The selection of DBMS that Looker is working with is pretty wide. You can see the various connections on the screen shot below as of October, 2019:

Available DBMS for connection

You can easily set a connection to the database via web-interface:

Web-interface of connection to DBMS

With regard to connections to databases, I’d like to highlight the following two facts: first of all, unfortunately, Clickhouse support from Yandex is currently missing (as well as in the foreseeable future). Most likely, the support won’t appear, considering the fact that Looker was acquired by a competitor, Google.
updated: Actually, Looker supports Clickhouse from the December 2019
The second nuisance is that you can’t build one data model, that would apply to different DBMS. There is no inbuilt storage in Looker, that could combine the results of query (unlike the same Redash).
It means, that analytical architecture should be built within one DBMS (preferably with high action speed or on aggregated data).

Building of Looker ML data model

In order to build a report or a dashboard in Looker, you need to provisionally set a data model. Syntax of Look ML language is quite thoroughly described in the documentation. Personally, I can just add that model description doesn’t require long-time immersion for a specialist with SQL knowledge. Rather, one needs to rearrange the approach to data model preparation. Look ML language is very much alike CSS:

Console of Look ML model creation

In the data model the following is set up: links with tables, keys, granularity, information of some fields being facts, and other – measurements. For facts, the aggregation is written. Obviously, at model creation one can use various IF / CASE expressions.

Explore mode

Probably, it’s the main killer-feature of Looker, since it allows any business departments to get data without attraction of analysts / data engineers. And, guess that’s why use of accounts with Explore mode is billed separately.

In fact, Explore mode is an interface, that allows to use the set up Look ML data model, select the required metrics and measurements and build customized report / visualization.
For example, we want to understand how many actions of any kind were performed in Looker’s interface last week. In order to do it, using Explore mode, we select Date field and apply a filter to it: last week (in this sense, Looker is quite smart and it and it will be enough writing ‘Last week’ in the filter), thereafter we choose “Category” from the measurements, and Quantity as a metric. After pressing the button Run the ready report will be generated.

Building report in Looker

Then, using the data received in the table form, you can set up the visualization of any type.
For example, Pie chart:

Applying visualization to report

Building of reports and their saving in Look

Sometimes you can have a desire to save the set of data / visualization received in Explore and share it with colleagues, for this purpose Looker has a separate essense – Look. That is ready constructed report with selected filters / measurements / facts.

Example of the saved Look

Examples of dashboards in Looker

Systemizing the warehouse of Look created, oftentimes you want to receive a ready composition / overview of key metrics, that could be displayed on one list.
For these purposes dashboard creation fits perfectly. Dashboard is created either on the wing, or using previously created Look. One of the dashboard’s “tricks” is configuration of parameters, that are changed on all the dashboard and can be applied to all the Look at the same time.

Interesting features in one line

  • In Looker you can refer to other reports and, using such function, you can create a dynamic parameter, that is passed on by a link.
    For example, you’ve created a report with division of revenue by countries, and in this report you can refer to the dashboard on a separate country. Following the link, a user sees the dashboard on a specific country, that he clicked on.
  • On every Looker page there is a chat, where support service answers very promptly
  • Looker is not able to work with data merge on the level of various DBMS, however it can combine the data on the level of ready Look (in our case, this function works really weird).
  • Within the framework of work with various models, I have found out an extremely non-trivial use of SQL for calculation of unique values in a non-normalized data table, Looker calls it symmetric aggregates.
    SQL, indeed, looks very non-trivial:
 order_items.order_id AS "order_items.order_id",
 order_items.sale_price AS "order_items.sale_price",
 *(1000000*1.0)) AS DECIMAL(38,0))) + 
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0)) ) 
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))) ) 
 / CAST((1000000*1.0) AS DOUBLE PRECISION), 0) 
 ELSE NULL END), 0)) AS "users.average_age"
FROM order_items AS order_items
LEFT JOIN users AS users ON order_items.user_id = users.id

  • At implementation of Looker to a purchase, JumpStart Kit is mandatory, which costs not less than $6k. Within this kit you receive support and consultation from Looker at tool implementation.
 No comments    339   2020   analysis   Analytics engineering   BI-tools   looker   sql
Earlier Ctrl + ↓