<?xml version="1.0" encoding="utf-8"?> 
<rss version="2.0">

<channel>

<title>LEFT JOIN: blog on analytics, visualisation &amp; data science, posts tagged: sql</title>
<link>https://en.leftjoin.ru/tags/sql/</link>
<description></description>
<generator>E2 (v3386; Aegea)</generator>

<item>
<title>Clickhouse Training 101 by Altinity</title>
<guid isPermaLink="false">65</guid>
<link>https://en.leftjoin.ru/all/altinity-clickhouse-training-101/</link>
<comments>https://en.leftjoin.ru/all/altinity-clickhouse-training-101/</comments>
<description>
&lt;p&gt;Just recently I have completed a &lt;a href="https://altinity.com/clickhouse-training/?utm_source=leftjoin"&gt;Clickhouse Training by Altinity (101 Series Training). &lt;/a&gt; For those who are just getting to know Clickhouse, Altinity offers free basic training: &lt;a href="https://altinity.com/data-warehouse-basics/?utm_source=leftjoin"&gt;Data Warehouse Basics&lt;/a&gt;. I recommend starting with it if you are planning to dive into learning.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/altinity-clickhouse-developer-300px.png" width="300" height="300" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Certification by Altinity&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I would like to share my experience on the training as well as the &lt;a href="https://capable-stream-f18.notion.site/Clickhouse-Training-101-by-Altinity-notes-120f1b6467f44a30956d6d7ffeff7b08"&gt;training notes&lt;/a&gt;.&lt;br /&gt;
The training costs $500 and lasts 4 days for 2 hours. It is carried out in the evenings Moscow time (starting from 19:00 GMT +3).&lt;/p&gt;
&lt;h2&gt;Session # 1.&lt;/h2&gt;
&lt;p&gt;The first day mostly revises everything covered in Data Warehouse Basics, but it has several new ideas on how to get useful information on queries from system tables.&lt;/p&gt;
&lt;p&gt;For example, this query will show which commands are running and their status.&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT command, is_done
FROM system.mutations
WHERE table = 'ontime'&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Besides, for me it was useful to learn about column compression with the use of codecs:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;ALTER TABLE ontime
 MODIFY COLUMN TailNum LowCardinality(String) CODEC(ZSTD(1))&lt;/code&gt;&lt;/pre&gt;&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/--2021-08-01--11.53.59.png" width="1732" height="1048" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;For those who are just starting with Clickhouse, the first day will be super useful as it will help in understanding table engines and syntax for their creation, partitions, inserting data (for example directly from S3).&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;INSERT INTO sdata
SELECT * FROM s3(
 'https://s3.us-east-1.amazonaws.com/d1-altinity/data/sdata*.csv.gz',
 'aws_access_key_id',
 'aws_secret_access_key',
 'Parquet',
 'DevId Int32, Type String, MDate Date, MDatetime
DateTime, Value Float64')&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Session # 2.&lt;/h2&gt;
&lt;p&gt;I found the second day the most intense and useful as within this session Robert from Altinity talks about aggregate functions and materialized views ( &lt;a href="https://www.notion.so/Session-2-35af1ed8d2c54c6fa7fcbea3c9385810#f36adc3df7d74deebedcb3c04e019661"&gt;detailed scheme for the creation of materialized views&lt;/a&gt; ) in Clickhouse in more detail.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/--2021-08-02--18.11.45.png" width="877" height="495" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;It was super useful for me to learn about index types in Clickhouse.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/--2021-08-02--18.20.35.png" width="874" height="472" alt="" /&gt;
&lt;/div&gt;
&lt;h2&gt;Session # 3.&lt;/h2&gt;
&lt;p&gt;During the third day, colleagues share their knowledge on how to work with Kafka and JSON objects stored in the tables.&lt;br /&gt;
It was interesting to find out that working with arrays in Clickhouse is very similar to arrays in Python:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;WITH [1, 2, 4] AS array
SELECT
 array[1] AS First,
 array[2] AS Second,
 array[3] AS Third,
 array[-1] AS Last,
 length(array) AS Length&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When working with arrays, there is a great feature called ARRAY JOIN which “unrolls” arrays to rows.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/--2021-08-02--19.14.28.png" width="812" height="518" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;Clickhouse allows you to efficiently interact with JSON objects stored in a table:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;-- Get a JSON string value
SELECT JSONExtractString(row, 'request') AS request
FROM log_row LIMIT 3
-- Get a JSON numeric value
SELECT JSONExtractInt(row, 'status') AS status
FROM log_row LIMIT 3&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This piece of code is an example of how to extract the elements of the JSON array “request” and “status” separately.&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;ALTER TABLE log_row
 ADD COLUMN
status Int16 DEFAULT
 JSONExtractInt(row, 'status')
ALTER TABLE log_row
UPDATE status = status WHERE 1 = 1&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Session # 4.&lt;/h2&gt;
&lt;p&gt;The most difficult topic from my point of view was saved for the last day – &lt;a href="https://www.notion.so/Session-4-f2aa33b6fe434a4e8542f0f64f9439bc#3a3038e94dbf4b47a10284dc1dc226ec"&gt;building sharding and replication patterns &lt;/a&gt; and building queries on distributed Clickhouse servers.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;div class="fotorama" data-width="938" data-ratio="1.8073217726397"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/--2021-08-02--19.43.22.png" width="938" height="519" alt="" /&gt;
&lt;img src="https://en.leftjoin.ru/pictures/--2021-08-02--19.47.52.png" width="934" height="520" alt="" /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Special respect to Altinity for an excellent collection of labs during the training.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Links&lt;/b&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://capable-stream-f18.notion.site/Clickhouse-Training-101-by-Altinity-notes-120f1b6467f44a30956d6d7ffeff7b08"&gt;Notes in Notion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://altinity.com/clickhouse-training/?utm_source=leftjoin"&gt;ClickHouse 101 Training by Altinity&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
<pubDate>Mon, 09 Aug 2021 08:41:15 +0300</pubDate>
</item>

<item>
<title>Data normalization with SQL</title>
<guid isPermaLink="false">64</guid>
<link>https://en.leftjoin.ru/all/data-normalization-with-sql/</link>
<comments>https://en.leftjoin.ru/all/data-normalization-with-sql/</comments>
<description>
&lt;p&gt;According to GIGO (garbage in, garbage out) principle, errors in input data lead to erroneous analysis results. The results of our work directly depend on the quality of data preparation.&lt;/p&gt;
&lt;p&gt;For instance, when we need to prepare data to use in ML algorithms (like k-NN, k-means, logistic regression, etc.), features in the original dataset may vary in scale like the age and height of a person. This may lead to the incorrect performance of the algorithm. Thus, such data needs to be rescaled first.&lt;/p&gt;
&lt;p&gt;In this tutorial, we will consider the ways to scale the data using SQL query: min-max normalization, min-max normalization for an arbitrary range, and z-score normalization. For each of these methods we have prepared two SQL query options: one using a SELECT subquery and another using a window function OVER().&lt;/p&gt;
&lt;p&gt;We will work with the simple table &lt;b&gt;students&lt;/b&gt; that contains the data on the height of the students:&lt;/p&gt;
&lt;div class="e2-text-table"&gt;
&lt;table cellpadding="0" cellspacing="0" border="0"&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;height&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ivan&lt;/td&gt;
&lt;td style="text-align: right"&gt;174&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peter&lt;/td&gt;
&lt;td style="text-align: right"&gt;181&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td style="text-align: right"&gt;199&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kate&lt;/td&gt;
&lt;td style="text-align: right"&gt;158&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mike&lt;/td&gt;
&lt;td style="text-align: right"&gt;179&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Silvia&lt;/td&gt;
&lt;td style="text-align: right"&gt;165&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Giulia&lt;/td&gt;
&lt;td style="text-align: right"&gt;152&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Robert&lt;/td&gt;
&lt;td style="text-align: right"&gt;188&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steven&lt;/td&gt;
&lt;td style="text-align: right"&gt;177&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td style="text-align: right"&gt;165&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;h2&gt;Min-max rescaling&lt;/h2&gt;
&lt;p&gt;Min-max scaling approach scales the data using the fixed range from 0 to 1. In this case, all the data is on the same scale which will exclude the impact of outliers on the conclusions.&lt;/p&gt;
&lt;p&gt;The formula for the min-max scaling is given as:&lt;/p&gt;
&lt;p&gt;We multiply the numerator by 1.0 in order to get a floating point number at the end.&lt;/p&gt;
&lt;p&gt;SQL-query with a subquery:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT height, 
       1.0 * (height-t1.min_height)/(t1.max_height - t1.min_height) AS scaled_minmax
  FROM students, 
      (SELECT min(height) as min_height, 
              max(height) as max_height 
         FROM students
      ) as t1;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;SQL-query with a window function:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT height, 
       (height - MIN(height) OVER ()) * 1.0 / (MAX(height) OVER () - MIN(height) OVER ()) AS scaled_minmax
  FROM students;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;As a result, we get the values in [0, 1] range where 0 is the height of the shortest student and 1 is the height of the tallest one.&lt;/p&gt;
&lt;div class="e2-text-table"&gt;
&lt;table cellpadding="0" cellspacing="0" border="0"&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;height&lt;/td&gt;
&lt;td&gt;scaled_minmax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ivan&lt;/td&gt;
&lt;td style="text-align: right"&gt;174&lt;/td&gt;
&lt;td&gt;0.46809&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peter&lt;/td&gt;
&lt;td style="text-align: right"&gt;181&lt;/td&gt;
&lt;td&gt;0.61702&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td style="text-align: right"&gt;199&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kate&lt;/td&gt;
&lt;td style="text-align: right"&gt;158&lt;/td&gt;
&lt;td&gt;0.12766&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mike&lt;/td&gt;
&lt;td style="text-align: right"&gt;179&lt;/td&gt;
&lt;td&gt;0.57447&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Silvia&lt;/td&gt;
&lt;td style="text-align: right"&gt;165&lt;/td&gt;
&lt;td&gt;0.2766&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Giulia&lt;/td&gt;
&lt;td style="text-align: right"&gt;152&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Robert&lt;/td&gt;
&lt;td style="text-align: right"&gt;188&lt;/td&gt;
&lt;td&gt;0.76596&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steven&lt;/td&gt;
&lt;td style="text-align: right"&gt;177&lt;/td&gt;
&lt;td&gt;0.53191&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td style="text-align: right"&gt;165&lt;/td&gt;
&lt;td&gt;0.2766&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;h2&gt;Rescaling within a given range&lt;/h2&gt;
&lt;p&gt;This is an option of min-max normalization between an arbitrary set of values. When it comes to data scaling, the values do not always need to be in the range between 0 and 1. In these cases, the following formula is applied.&lt;/p&gt;
&lt;p&gt;This allows us to scale the data to an arbitrary scale. In our example, let’s assume a=10.0 and b=20.0.&lt;/p&gt;
&lt;p&gt;SQL-query with subquery:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT height, 
       ((height - min_height) * (20.0 - 10.0) / (max_height - min_height)) + 10 AS scaled_ab
  FROM students,
      (SELECT MAX(height) as max_height, 
              MIN(height) as min_height
         FROM students  
      ) t1;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;SQL-query with a window function:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT height, 
       ((height - MIN(height) OVER() ) * (20.0 - 10.0) / (MAX(height) OVER() - MIN(height) OVER())) + 10.0 AS scaled_ab
  FROM students;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We receive similar results as before, but with data spread between 10 and 20.&lt;/p&gt;
&lt;div class="e2-text-table"&gt;
&lt;table cellpadding="0" cellspacing="0" border="0"&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;height&lt;/td&gt;
&lt;td&gt;scaled_ab&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ivan&lt;/td&gt;
&lt;td style="text-align: right"&gt;174&lt;/td&gt;
&lt;td&gt;14.68085&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peter&lt;/td&gt;
&lt;td style="text-align: right"&gt;181&lt;/td&gt;
&lt;td&gt;16.17021&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td style="text-align: right"&gt;199&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kate&lt;/td&gt;
&lt;td style="text-align: right"&gt;158&lt;/td&gt;
&lt;td&gt;11.2766&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mike&lt;/td&gt;
&lt;td style="text-align: right"&gt;179&lt;/td&gt;
&lt;td&gt;15.74468&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Silvia&lt;/td&gt;
&lt;td style="text-align: right"&gt;165&lt;/td&gt;
&lt;td&gt;12.76596&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Giulia&lt;/td&gt;
&lt;td style="text-align: right"&gt;152&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Robert&lt;/td&gt;
&lt;td style="text-align: right"&gt;188&lt;/td&gt;
&lt;td&gt;17.65957&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steven&lt;/td&gt;
&lt;td style="text-align: right"&gt;177&lt;/td&gt;
&lt;td&gt;15.31915&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td style="text-align: right"&gt;165&lt;/td&gt;
&lt;td&gt;12.76596&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;h2&gt;Z-score normalization&lt;/h2&gt;
&lt;p&gt;Using Z-score normalization, the data will be scaled so that it has the properties of a standard normal distribution where the mean (μ) is equal to 0 and the standard deviation (σ) to 1.&lt;/p&gt;
&lt;p&gt;Z-score is calculated using the formula:&lt;/p&gt;
&lt;p&gt;SQL-query with a subquery:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT height, 
       (height - t1.mean) * 1.0 / t1.sigma AS zscore
  FROM students,
      (SELECT AVG(height) AS mean, 
              STDDEV(height) AS sigma
         FROM students
        ) t1;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;SQL-query with a window function:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT height, 
       (height - AVG(height) OVER()) * 1.0 / STDDEV(height) OVER() AS z-score
  FROM students;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;As a result, we can easily notice the outliers that exceed the standard deviation.&lt;/p&gt;
&lt;div class="e2-text-table"&gt;
&lt;table cellpadding="0" cellspacing="0" border="0"&gt;
&lt;tr&gt;
&lt;td&gt;name&lt;/td&gt;
&lt;td&gt;height&lt;/td&gt;
&lt;td&gt;zscore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ivan&lt;/td&gt;
&lt;td style="text-align: right"&gt;174&lt;/td&gt;
&lt;td&gt;0.01488&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peter&lt;/td&gt;
&lt;td style="text-align: right"&gt;181&lt;/td&gt;
&lt;td&gt;0.53582&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td style="text-align: right"&gt;199&lt;/td&gt;
&lt;td&gt;1.87538&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kate&lt;/td&gt;
&lt;td style="text-align: right"&gt;158&lt;/td&gt;
&lt;td&gt;-1.17583&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mike&lt;/td&gt;
&lt;td style="text-align: right"&gt;179&lt;/td&gt;
&lt;td&gt;0.38698&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Silvia&lt;/td&gt;
&lt;td style="text-align: right"&gt;165&lt;/td&gt;
&lt;td&gt;-0.65489&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Giulia&lt;/td&gt;
&lt;td style="text-align: right"&gt;152&lt;/td&gt;
&lt;td&gt;-1.62235&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Robert&lt;/td&gt;
&lt;td style="text-align: right"&gt;188&lt;/td&gt;
&lt;td&gt;1.05676&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steven&lt;/td&gt;
&lt;td style="text-align: right"&gt;177&lt;/td&gt;
&lt;td&gt;0.23814&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sophia&lt;/td&gt;
&lt;td style="text-align: right"&gt;165&lt;/td&gt;
&lt;td&gt;-0.65489&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
</description>
<pubDate>Fri, 21 May 2021 16:58:48 +0300</pubDate>
</item>

<item>
<title>SQL Window Functions Cheat Sheet with examples</title>
<guid isPermaLink="false">30</guid>
<link>https://en.leftjoin.ru/all/sql-window-functions-cheat-sheet-with-examples/</link>
<comments>https://en.leftjoin.ru/all/sql-window-functions-cheat-sheet-with-examples/</comments>
<description>
&lt;p&gt;Window functions are calculation functions that can increase the efficiency and reduce the complexity of SQL queries, making things much easier:&lt;br /&gt;
View as &lt;a href="http://valiotti.com/leftjoin/files/Wndow_Functions_Cheat_Sheet.pdf"&gt;&lt;i  class="fa fa-file-pdf-o"&gt; PDF &lt;/i&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/Window_Functions_Cheat_Sheet-1.png" width="2339" height="1654" alt="" /&gt;
&lt;/div&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/Window_Functions_Cheat_Sheet-2.png" width="2339" height="1654" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;Special Thanks to &lt;a href="https://learnsql.com/"&gt;LearnSQL&lt;/a&gt;, best place to master your SQL skills !&lt;/p&gt;
</description>
<pubDate>Tue, 02 Jun 2020 12:17:13 +0300</pubDate>
</item>

<item>
<title>Cohort analysis in Redash</title>
<guid isPermaLink="false">22</guid>
<link>https://en.leftjoin.ru/all/cohort-analysis-in-redash/</link>
<comments>https://en.leftjoin.ru/all/cohort-analysis-in-redash/</comments>
<description>
&lt;p&gt;In one of the previous articles we have reviewed &lt;a href="all/retention-rate/"&gt;building of Retention-report&lt;/a&gt; and have partially addressed the concept of &lt;a href="https://en.wikipedia.org/wiki/Cohort_analysis"&gt;cohorts&lt;/a&gt; therein.&lt;br /&gt;
Cohort usually implies group of users of a product or a company. Most often, groups are allocated on the basis of time of app installation / appearance of a user in a system.&lt;br /&gt;
It turns out, that, using cohort analysis, one can track down how the changes in a product affected the behaviour of users (for example, of old and new users).&lt;/p&gt;
&lt;p&gt;Along with that, cohorts can be defined also proceeding from other parameters: geography of a user, traffic source, device platform and other important parameters of your product.&lt;/p&gt;
&lt;p&gt;We will figure out, how to compare Retention of users of weekly cohorts in Redash, since Redash has special type of visualization for building such type of report.&lt;br /&gt;
Firstly, let’s sort out SQL-query. We still have two tables – &lt;i&gt;user&lt;/i&gt; (id of a user and time of app installation) and &lt;i&gt;client_session&lt;/i&gt; – timestamps (&lt;i&gt;created_at&lt;/i&gt;) of activity of each user (&lt;i&gt;user_id&lt;/i&gt;). Let’s consider the Retention of the first seven days for last 60 days.&lt;br /&gt;
The query is written in Cloudera Impala, let’s review it.&lt;/p&gt;
&lt;p&gt;For starters, let’s build the total size of cohorts:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;select trunc(from_unixtime(user.installed_at), &amp;quot;WW&amp;quot;) AS cohort_week, 
	ndv(distinct user.id) as cohort_size //counting the number of users in the cohort
	from user 
	where from_unixtime(user.installed_at) between date_add(now(), -60) and now() //taking registered users for last 60 days
group by trunc(from_unixtime(user.installed_at), &amp;quot;WW&amp;quot;)&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The second part of the query can calculate the quantity of active users for every day during the first thirty days:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;select trunc(from_unixtime(user.installed_at), &amp;quot;WW&amp;quot;) AS cohort_week, 
        datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) as days,
	ndv(distinct user.id) as value  //counting the number of active users for every day
		from user 
		left join client_session cs on user.id=cs.user_id
where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
and from_unixtime(cs.created_at) &amp;gt;= date_add(now(), -60) //taking sessions for last 60 days
and datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) between 0 and 30 //cutting off only the first 30 days of activity
group by 1,2&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Bottom line, all the query entirely:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;select size.cohort_week, size.cohort_size, ret.days, ret.value from
(select trunc(from_unixtime(user.installed_at), &amp;quot;WW&amp;quot;) AS cohort_week, 
		ndv(distinct user.id) as cohort_size 
	from user 
	where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
group by trunc(from_unixtime(user.installed_at), &amp;quot;WW&amp;quot;)) size
left join (select trunc(from_unixtime(user.installed_at), &amp;quot;WW&amp;quot;) AS cohort_week, 
        datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) as days,
		ndv(distinct user.id) as value 
		from user 
		left join client_session cs on user.id=cs.user_id
where from_unixtime(user.installed_at) between date_add(now(), -60) and now()
and from_unixtime(cs.created_at) &amp;gt;= date_add(now(), -60)
and datediff(cast(cs.created_at as timestamp),cast(user.installed_at as timestamp)) between 0 and 30
group by 1,2) ret on size.cohort_week=ret.cohort_week&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Great, now correctly calculated data is available to us.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/cohort-data@2x.png" width="326" height="180" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Data of cohorts in tabular form&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Let’s create new visualization in Redash and indicate the parameters correctly:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/cohort-viz@2x.png" width="589" height="471" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;It’s important to indicate the parameters correctly – the columns of the resulting query are compliant therewith.&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Let’s make sure to indicate that we have weekly cohorts:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/cohort-options@2x.png" width="580" height="187" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;Voila, our visualization of cohorts is ready:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/cohort-graph@2x.png" width="807" height="366" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;You can add filters and parameters to it and use in a dashboard&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Materials on the topic&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gopractice.ru/cohort_analysis/"&gt;Cohort analysis. Metrics of product vs metrics of growth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-for-humans/a-beginners-guide-to-cohort-analysis-the-most-actionable-and-underrated-report-on-google-c0797d826bf4"&gt;A beginners guide to cohort analysis&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
<pubDate>Tue, 03 Mar 2020 14:17:54 +0300</pubDate>
</item>

<item>
<title>Looker Overview</title>
<guid isPermaLink="false">20</guid>
<link>https://en.leftjoin.ru/all/looker-overview/</link>
<comments>https://en.leftjoin.ru/all/looker-overview/</comments>
<description>
&lt;p&gt;Today we are going to talk about BI-platform Looker, on which I managed to work in 2019.&lt;/p&gt;
&lt;p&gt;Here is the short content of the article for convenient and fast navigation:&lt;/p&gt;
&lt;ol start="1"&gt;
&lt;li&gt;&lt;a href="http://leftjoin.ru/drafts/looker-overview/#part1"&gt;What is Looker?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://leftjoin.ru/drafts/looker-overview/#part2"&gt;Which DBMS you can connect to via Looker and how?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://leftjoin.ru/drafts/looker-overview/#part3"&gt;Building of Looker ML data model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://leftjoin.ru/drafts/looker-overview/#part4"&gt;Explore Mode (data research on the model built&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="http://leftjoin.ru/drafts/looker-overview/#part5"&gt;Building of reports and their saving in Look&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://leftjoin.ru/drafts/looker-overview/#part6"&gt;Examples of dashboards in Looker&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;&lt;a id="part1"&gt;&lt;/a&gt;What is Looker?&lt;/h2&gt;
&lt;p&gt;Creators of &lt;a href="https://looker.com/"&gt;Looker&lt;/a&gt; position it as a software of business intelligence class and big data analytics platform, that helps to research, analyze and share business analytics in real time mode.&lt;br /&gt;
Looker —  is a really convenient tool and one of a few BI products, that allows to work with pre-set data cubes in a real-time mode (actually, relational tables that are described in Look ML-model).&lt;br /&gt;
An engineer, working with Looker, needs to describe a data model on Look ML language (it’s something between CSS and SQL), publish this data model and then set reporting and dashboards.&lt;br /&gt;
Look ML itself is pretty simple, the nexus between the data objects are set by a data-engineer, which consequently allows to use the data without knowledge of SQL language (to be precise: Looker engine generates the code in SQL language itself on user’s behalf).&lt;/p&gt;
&lt;p&gt;Just recently, in June 2019, Google &lt;a href="https://cloud.google.com/blog/topics/inside-google-cloud/expanding-our-platform-for-business-intelligence-and-embedded-analytics"&gt;announced&lt;/a&gt; acquisition of Looker platform for $2.6 billion.&lt;/p&gt;
&lt;h2&gt;&lt;a id="part2"&gt;&lt;/a&gt;Which DBMS you can connect to via Looker and how?&lt;/h2&gt;
&lt;p&gt;The selection of DBMS that Looker is working with is pretty wide. You can see the various connections on the screen shot below as of October, 2019:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/db-list@2x.png" width="418" height="852" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Available DBMS for connection&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;You can easily set a connection to the database via web-interface:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/connection-setting@2x.png.jpg" width="2560" height="1460" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Web-interface of connection to DBMS&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;s&gt;With regard to connections to databases, I’d like to highlight the following two facts: first of all, unfortunately, Clickhouse support from Yandex is currently missing (as well as in the foreseeable future). Most likely, the support won’t appear, considering the fact that Looker was acquired by a competitor, Google.&lt;/s&gt;&lt;br /&gt;
&lt;i&gt;updated:&lt;/i&gt; Actually, Looker supports Clickhouse from the December 2019&lt;br /&gt;
The second nuisance is that you can’t build one data model, that would apply to different DBMS. There is no inbuilt storage in Looker, that could combine the results of query (unlike the same Redash).&lt;br /&gt;
It means, that analytical architecture should be built within one DBMS (preferably with high action speed or on aggregated data).&lt;/p&gt;
&lt;h2&gt;&lt;a id="part3"&gt;&lt;/a&gt;Building of Looker ML data model&lt;/h2&gt;
&lt;p&gt;In order to build a report or a dashboard in Looker, you need to provisionally set a data model. Syntax of Look ML language is quite thoroughly &lt;a href="https://docs.looker.com/data-modeling/learning-lookml/what-is-lookml"&gt;described in the documentation&lt;/a&gt;. Personally, I can just add that model description doesn’t require long-time immersion for a specialist with SQL knowledge. Rather, one needs to rearrange the approach to data model preparation. Look ML language is very much alike CSS:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/lookml@2x.jpg" width="1280" height="703.5" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Console of Look ML model creation&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;In the data model the following is set up: links with tables, keys, granularity, information of some fields being facts, and other – measurements. For facts, the aggregation is written. Obviously, at model creation one can use various IF / CASE expressions.&lt;/p&gt;
&lt;h2&gt;&lt;a id="part4"&gt;&lt;/a&gt;Explore mode&lt;/h2&gt;
&lt;p&gt;Probably, it’s the main killer-feature of Looker, since it allows any business departments to get data without attraction of analysts / data engineers. And, guess that’s why use of accounts with Explore mode is billed separately.&lt;/p&gt;
&lt;p&gt;In fact, Explore mode is an interface, that allows to use the set up Look ML data model, select the required metrics and measurements and build customized report / visualization.&lt;br /&gt;
For example, we want to understand how many actions of any kind were performed in Looker’s interface last week. In order to do it, using Explore mode, we select Date field and apply a filter to it: last week (in this sense, Looker is quite smart and it and it will be enough writing ‘Last week’ in the filter), thereafter we choose “Category” from the measurements, and Quantity as a metric. After pressing the button &lt;i&gt;Run&lt;/i&gt; the ready report will be generated.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/explore-tab@2x.png.jpg" width="2560" height="1012" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Building report in Looker&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Then, using the data received in the table form, you can set up the visualization of any type.&lt;br /&gt;
For example, Pie chart:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/vis@2x.png" width="1081" height="566" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Applying visualization to report&lt;/div&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;a id="part5"&gt;&lt;/a&gt;Building of reports and their saving in Look&lt;/h2&gt;
&lt;p&gt;Sometimes you can have a desire to save the set of data / visualization received in Explore and share it with colleagues, for this purpose Looker has a separate essense – Look. That is ready constructed report with selected filters / measurements / facts.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/look@2x.png.jpg" width="2560" height="1457" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Example of the saved Look&lt;/div&gt;
&lt;/div&gt;
&lt;h2&gt;&lt;a id="part6"&gt;&lt;/a&gt;Examples of dashboards in Looker&lt;/h2&gt;
&lt;p&gt;Systemizing the warehouse of Look created, oftentimes you want to receive a ready composition / overview of key metrics, that could be displayed on one list.&lt;br /&gt;
For these purposes dashboard creation fits perfectly. Dashboard is created either on the wing, or using previously created Look. One of the dashboard’s “tricks” is configuration of parameters, that are changed on all the dashboard and can be applied to all the Look at the same time.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;div class="fotorama" data-width="2560" data-ratio="1.8591140159768"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/dashboard@2x.png.jpg" width="2560" height="1377" alt="" /&gt;
&lt;img src="https://en.leftjoin.ru/pictures/dashboard_v2@2x.png.jpg" width="2560" height="1456" alt="" /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2&gt;Interesting features in one line&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;In Looker you can refer to other reports and, using such function, you can create a dynamic parameter, that is passed on by a link.&lt;br /&gt;
For example, you’ve created a report with division of revenue by countries, and in this report you can refer to the dashboard on a separate country. Following the link, a user sees the dashboard on a specific country, that he clicked on.&lt;/li&gt;
&lt;li&gt;On every Looker page there is a chat, where support service answers very promptly&lt;/li&gt;
&lt;li&gt;Looker is not able to work with data merge on the level of various DBMS, however it can combine the data on the level of ready Look (in our case, this function works really weird).&lt;/li&gt;
&lt;li&gt;Within the framework of work with various models, I have found out an extremely non-trivial use of SQL for calculation of unique values in a non-normalized data table, Looker calls it &lt;a href="https://help.looker.com/hc/en-us/articles/360023722974-A-Simple-Explanation-of-Symmetric-Aggregates-or-Why-On-Earth-Does-My-SQL-Look-Like-That-"&gt;symmetric aggregates&lt;/a&gt;.&lt;br /&gt;
SQL, indeed, looks very non-trivial:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT 
 order_items.order_id AS &amp;quot;order_items.order_id&amp;quot;,
 order_items.sale_price AS &amp;quot;order_items.sale_price&amp;quot;,
 (COALESCE(CAST( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(users.age ,0)
 *(1000000*1.0)) AS DECIMAL(38,0))) + 
 CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0)) ) 
 - SUM(DISTINCT CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))) ) 
 AS DOUBLE PRECISION) 
 / CAST((1000000*1.0) AS DOUBLE PRECISION), 0) 
 / NULLIF(COUNT(DISTINCT CASE WHEN users.age IS NOT NULL THEN users.id 
 ELSE NULL END), 0)) AS &amp;quot;users.average_age&amp;quot;
FROM order_items AS order_items
LEFT JOIN users AS users ON order_items.user_id = users.id

GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 500&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;At implementation of Looker to a purchase, JumpStart Kit is mandatory, which costs not less than $6k. Within this kit you receive support and consultation from Looker at tool implementation.&lt;/li&gt;
&lt;/ul&gt;
</description>
<pubDate>Wed, 08 Jan 2020 11:22:02 +0300</pubDate>
</item>

<item>
<title>Diagram of BCG (Boston Consulting Group) Matrix</title>
<guid isPermaLink="false">13</guid>
<link>https://en.leftjoin.ru/all/excel-chart-matrix-bcg/</link>
<comments>https://en.leftjoin.ru/all/excel-chart-matrix-bcg/</comments>
<description>
&lt;p&gt;I will water down the blog with an interesting report, that was developed for Yota company on November, 2011. &lt;a href="https://ru.wikipedia.org/wiki/%D0%9C%D0%B0%D1%82%D1%80%D0%B8%D1%86%D0%B0_%D0%91%D0%9A%D0%93"&gt;BCG Matrix&lt;/a&gt; has inspired us to develop this report.&lt;/p&gt;
&lt;p&gt;We had: one Excel package, 75 VBA macro, ODBC connection to Oracle, SQL queries to databases of all sorts and colours. We will review report construction within this stack, but first, let’s speak about the very idea of the report.&lt;/p&gt;
&lt;p&gt;BCG Matrix – is 2x2 matrix, whereon the clients’ segments are displayed by circumferences with their centres in the intersection of coordinates, formed by the relevant paces of two indicators selected.&lt;/p&gt;
&lt;p&gt;To make it simple, we had to divide all the clients of the company into 4 segments: ARPU above average/below average, traffic consumption (main service) above average/below average. Thus, it turned out that 4 quadrants appear, and you need to place a bubble chart into each one of them, whereas the size of a bubble means the total amount of users within a segment. In addition to that, one more bubble was added to each quadrant (smaller one), that showcased the churn in each segment (author’s improvement).&lt;/p&gt;
&lt;p&gt;&lt;b&gt;What did we want to get at the output?&lt;/b&gt;&lt;br /&gt;
A chart of the following type:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/bubbles-chart@2x.png" width="1039" height="681" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Representation of the BCG matrix on the data of Yota company&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The task statement is more or less clear, let’s move to the realization.&lt;br /&gt;
Let’s assume, that we’ve already collected all the required data (meaning that, we’ve learned to identify the average ARPU and average traffic consumption, in this post we won’t examine SQL-query), then the paramount task lies in understanding how to display the bubbles in the required places by means of Excel tools.&lt;/p&gt;
&lt;p&gt;For this aim, a bubble chart comes to help:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/viz-type@2x.png" width="252" height="345" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;&lt;i&gt;Insert – Chart – Bubble&lt;/i&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Going to the menu &lt;i&gt;Selection of data source&lt;/i&gt; and evaluating, what is required in order to build a chart in the type that we need: coordinates &lt;i&gt;X&lt;/i&gt;, coordinates &lt;i&gt;Y&lt;/i&gt;, values of bubbles’ sizes.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/data-source@2x.png" width="518" height="529" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;Great, so it turns out that if we assume that our chart will be located in coordinates on the &lt;i&gt;X&lt;/i&gt; axis from -1 to 1, and on the &lt;i&gt;Y&lt;/i&gt; axis from -1 to 1, then the centre of the right upper bubble will be the spot (0.5; 0.5) on the chart. Likewise, we’ll place all the other bubbles.&lt;/p&gt;
&lt;p&gt;We should separately consider the bubbles of &lt;i&gt;Churn&lt;/i&gt; type (for displaying of the churn), they are located more to the right then the main bubble and might intersect with it, therefore we will place the right upper bubble to empirically obtained coordinates (0.65; 0.35).&lt;/p&gt;
&lt;p&gt;Thus, for four main and four additional bubbles, we can organize the data as follows:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/bubbles-data@2x.png" width="562" height="97" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;Let’s review more thoroughly how we’ll use them:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;div class="fotorama" data-width="535" data-ratio="0.9006734006734"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/data-source-active@2x.png" width="535" height="594" alt="" /&gt;
&lt;img src="https://en.leftjoin.ru/pictures/data-source-churn@2x.png" width="534" height="591" alt="" /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So, we set on X-axis – horizontal coordinates of the centres of our bubbles, that lie in the cells &lt;i&gt;A9:A12&lt;/i&gt;, on Y-axis – vertical coordinates of the centres of our bubbles, that lie in the cells &lt;i&gt;B9:B12&lt;/i&gt;, and the sizes of the bubbles are stored in the cells &lt;i&gt;E9:E12&lt;/i&gt;.&lt;br /&gt;
Furthermore, we add another data set for the Churn, once more indicating all the required parameters.&lt;/p&gt;
&lt;p&gt;We’ll get the following chart:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/bubbles-preparing@2x.png" width="503" height="423" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;Then, we’re making it pretty: changing colours, deleting axis and getting a beautiful result.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/bubbles-preparing-step2@2x.png" width="568" height="458" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;By adding the required data labels, we receive what we initially needed in the task.&lt;/p&gt;
&lt;p&gt;Share your experience in comments – did you build such charts and how you solved the task?&lt;/p&gt;
</description>
<pubDate>Tue, 19 Nov 2019 10:38:11 +0300</pubDate>
</item>

<item>
<title>How to calculate Retention?</title>
<guid isPermaLink="false">12</guid>
<link>https://en.leftjoin.ru/all/retention-rate/</link>
<comments>https://en.leftjoin.ru/all/retention-rate/</comments>
<description>
&lt;p&gt;In this post we will discover, how to properly construct a report on Retention with application of &lt;a href="all/redash-full-fledged-on-demand-analytics/"&gt;Redash&lt;/a&gt; and SQL language.&lt;br /&gt;
For starters, let’s explain in a nutshell what the metric &lt;b&gt;Retention rate&lt;/b&gt; is, why it is important,&lt;/p&gt;
&lt;h2&gt;Retention rate&lt;/h2&gt;
&lt;p&gt;&lt;b&gt;Retention rate&lt;/b&gt; metric is widespread and is particularly popular within the mobile industry, since it allows to understand how well a product engages the users into daily use. Let’s recall (or discover), how &lt;b&gt;Retention&lt;/b&gt; is calculated:&lt;/p&gt;
&lt;p&gt;Retention of day &lt;i&gt;X&lt;/i&gt; – is &lt;i&gt;N%&lt;/i&gt; of users that will return to the product on day &lt;i&gt;X&lt;/i&gt;. In other words, if on some specific day (day 0) 100 new users came, and 15 returned on the first day, then Retention of the 1st day will be equal to 15/100=15%.&lt;br /&gt;
Most commonly, Retention of days 1, 3, 7 and 30 are singled out as the most descriptive metrics of a product, however it’s useful to address Retention curve as a whole and make conclusions, proceeding from it.&lt;/p&gt;
&lt;h2&gt;Retention curve&lt;/h2&gt;
&lt;p&gt;In the end, we are interested in construction of such curve, that shows the retention of users from day 0 to day 30.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/Retention@2x.png" width="585" height="200" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Retention rate curve from day 0 do day 30&lt;/div&gt;
&lt;/div&gt;
&lt;h2&gt;Rolling Retention (RR)&lt;/h2&gt;
&lt;p&gt;Besides classic Retention rate, Rolling Retention (hereinafter, RR) is allocated. At calculation of RR, apart from day X, all the subsequent days are also considered. Thus, RR of the 1st day – the amount of users who returned on the 1st and subsequent days.&lt;/p&gt;
&lt;p&gt;Let’s compare Retention and Rolling Retention of the 10th day:&lt;br /&gt;
&lt;b&gt;Retention&lt;sub&gt;10&lt;/sub&gt;&lt;/b&gt; — the amount of users, who returned on the 10th day / the amount of users, who installed the app 10 days ago * 100%.&lt;br /&gt;
&lt;b&gt;Rolling Retention&lt;sub&gt;10&lt;/sub&gt;&lt;/b&gt; — the amount of users, who returned on the 10th day &lt;i&gt;or later&lt;/i&gt; / the amount of users, who installed the app 10 days ago * 100%.&lt;/p&gt;
&lt;h2&gt;Granularity (retention of time periods)&lt;/h2&gt;
&lt;p&gt;In some industries and respective tasks, it is useful to understand the Retention of a specific day (most often, in the mobile industry), in other cases it is useful to understand the retention of users on various time intervals: for example, weekly or monthly periods (oftentimes, it’s handy in e-commerce, retail).&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/Monthly-Retention@2x.png" width="669" height="527" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;An example of cohorts by months and monthly Retention respective thereto&lt;/div&gt;
&lt;/div&gt;
&lt;h2&gt;How to build a Retention report on SQL language?&lt;/h2&gt;
&lt;p&gt;We have sorted out above how to calculate Retention in formulas. Now let’s apply it with SQL language.&lt;br /&gt;
Let’s assume, that we have two tables: &lt;i&gt;user&lt;/i&gt; — storing data about users’ identifiers and meta-information, &lt;i&gt;client_session&lt;/i&gt; — information on visits of the mobile app by users.&lt;br /&gt;
Only these two tables will be present in the query, so you can easily adapt the query to yourself.&lt;br /&gt;
&lt;i&gt;note&lt;/i&gt;: within this code, I am using Impala as DBMS.&lt;/p&gt;
&lt;h3&gt;Collecting the size of cohorts&lt;/h3&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT from_unixtime(user.installed_at, &amp;quot;yyyy-MM-dd&amp;quot;) AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)&amp;gt;=date_add(now(), -60)
     AND from_unixtime(user.installed_at)&amp;lt;=date_add(now(), -31)
   GROUP BY 1&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Let’s sort out this pretty simple query: for every day we calculate the number of unique users for the period [60 days ago; 31 days ago].&lt;br /&gt;
In order not to mess with documentation: command &lt;i&gt;ndv()&lt;/i&gt; in Impala is analogue of a command &lt;i&gt;count(distinct)&lt;/i&gt;.&lt;/p&gt;
&lt;h3&gt;Calculating the number of returned users on each cohort&lt;/h3&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT from_unixtime(user.installed_at, &amp;quot;yyyy-MM-dd&amp;quot;) AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)&amp;gt;=date_add(now(), -60)
     AND from_unixtime(user.installed_at)&amp;lt;=date_add(now(), -31)
   GROUP BY 1, 2&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In this query, the key part is contained in the command &lt;i&gt;datediff&lt;/i&gt;: now we are calculating for each cohort and for each &lt;i&gt;datediff&lt;/i&gt; the number of unique users with the very same command &lt;i&gt;ndv()&lt;/i&gt; (practically, the number of users, who returned within the days from 0 to 30).&lt;/p&gt;
&lt;p&gt;Great, now we have the size of cohorts and the number of returned users.&lt;/p&gt;
&lt;h3&gt;Combining all together&lt;/h3&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT reg.reg_date AS date_registration,
       reg.users AS cohort_size,
       cohort.date_diff AS day_difference,
       cohort.ret_base AS retention_base,
       cohort.ret_base/reg.users AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, &amp;quot;yyyy-MM-dd&amp;quot;) AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)&amp;gt;=date_add(now(), -60)
     AND from_unixtime(user.installed_at)&amp;lt;=date_add(now(), -31)
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, &amp;quot;yyyy-MM-dd&amp;quot;) AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)&amp;gt;=date_add(now(), -60)
     AND from_unixtime(user.installed_at)&amp;lt;=date_add(now(), -31)
   GROUP BY 1, 2) cohort ON reg.reg_date=cohort.reg_date
    ORDER BY 1,3&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We have received the query, that calculates &lt;b&gt;Retention&lt;/b&gt; for each cohort, and, eventually, the result can be displayed as follows:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/Cohort-retention@2x.png" width="585" height="200" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Retention rate, calculated for each cohort of users&lt;/div&gt;
&lt;/div&gt;
&lt;h3&gt;Construction of the sole Retention curve&lt;/h3&gt;
&lt;p&gt;Let’s modify our query a bit and obtain the data for construction of one Retention curve:&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;SELECT 
       cohort.date_diff AS day_difference,
       avg(reg.users) AS cohort_size,
       avg(cohort.ret_base) AS retention_base,
       avg(cohort.ret_base)/avg(reg.users)*100 AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, &amp;quot;yyyy-MM-dd&amp;quot;) AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)&amp;gt;=date_add(now(), -60)
     AND from_unixtime(user.installed_at)&amp;lt;=date_add(now(), -31)
   GROUP BY 1) reg
LEFT JOIN
  (SELECT from_unixtime(user.installed_at, &amp;quot;yyyy-MM-dd&amp;quot;) AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)&amp;gt;=date_add(now(), -60)
     AND from_unixtime(user.installed_at)&amp;lt;=date_add(now(), -31)
   GROUP BY 1,2) cohort ON reg.reg_date=cohort.reg_date
    GROUP BY 1        
    ORDER BY 1&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Now, we have average by all the cohorts &lt;b&gt;Retention rate&lt;/b&gt;, calculated for each day.&lt;/p&gt;
&lt;h2&gt;More on the subject&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gopractice.ru/retention/"&gt;How to create products, forming habits?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.braze.com/blog/calculate-retention-rate/"&gt;Top 3 Ways To Calculate User Retention Rate With Formulas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
<pubDate>Sun, 03 Nov 2019 16:27:55 +0300</pubDate>
</item>

<item>
<title>Redash – full-fledged on-demand analytics</title>
<guid isPermaLink="false">2</guid>
<link>https://en.leftjoin.ru/all/redash-full-fledged-on-demand-analytics/</link>
<comments>https://en.leftjoin.ru/all/redash-full-fledged-on-demand-analytics/</comments>
<description>
&lt;p&gt;Today we will examine and try to get to the bottom of one tool that is quite famous nowadays – Redash. The tool is extremely convenient due to the fact that it can work with Clickhouse. All the other developers of BI-tools don’t support &lt;a href="https://clickhouse.yandex/docs/en/"&gt;Clickhouse&lt;/a&gt;. Just recently, an ODBC driver has appeared for Tableau.&lt;/p&gt;
&lt;p&gt;I will make a short review of Redash’s useful features, and in the following number of posts I will figure out which useful requests and reports can be constructed using Redash.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/redash_home.png" width="900" height="504" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Redash Homepage&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;So, what is Redash? It is the tool for on-demand analytics, moreover it can be used on completely different databases. For instance, Redash can be connected to a database under MySQL or HP Vertica.&lt;br /&gt;
The main thing, proposed by Redash is a handy console for writing SQL-queries to a database.&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/redash_V2.png" width="900" height="511" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Console for writing SQL-queries&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Consequently, the first major observation: in order to use Redash you should know SQL or have an employee in team who knows SQL.&lt;br /&gt;
Apart from the basic console, Redash provides visualization tools (inter alia, construction of Funnel-charts, cohort analysis and pivot tables (however, the functionality of the latter is very limited)), as well as tools for construction of dashboards and alert systems (via mail or in slack).&lt;br /&gt;
Let’s examine the basic console a bit more thoroughly. A user has an access to some useful features: application of filters, multifilters and query’s parameters:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/redash_v3.png" width="900" height="528" alt="" /&gt;
&lt;div class="e2-text-caption"&gt;Simultaneous application of parameters, filters and multifilters in queries&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Further, opportunities of visualizations should be mentioned separately. They are, indeed, convenient and extensive, especially considering that one can set automatic updating of query results in accordance with the required timetable.&lt;br /&gt;
In the following note we will discuss visualization in Redash more closely, and later – construction of dashboards and alert systems.&lt;/p&gt;
</description>
<pubDate>Wed, 14 Nov 2018 23:41:09 +0300</pubDate>
</item>


</channel>
</rss>