{
    "version": "https:\/\/jsonfeed.org\/version\/1",
    "title": "LEFT JOIN: blog on analytics, visualisation & data science, posts tagged: analysis",
    "home_page_url": "https:\/\/en.leftjoin.ru\/tags\/analysis\/",
    "feed_url": "https:\/\/en.leftjoin.ru\/tags\/analysis\/json\/",
    "icon": "https:\/\/en.leftjoin.ru\/user\/userpic@2x.jpg",
    "author": {
        "name": "Nikolay Valiotti",
        "url": "https:\/\/en.leftjoin.ru\/",
        "avatar": "https:\/\/en.leftjoin.ru\/user\/userpic@2x.jpg"
    },
    "items": [
        {
            "id": "71",
            "url": "https:\/\/en.leftjoin.ru\/all\/mean-vs-median\/",
            "title": "Mean VS median: how to choose a target metric?",
            "content_html": "<p>In today’s article, we would like to highlight a simple, but important topic – how to choose a simple metric to evaluate a particular dataset. Everyone has been familiar with the arithmetic mean for a long time, almost every student knows very well that you should sum up all the available values, divide by their number and get the average value. However, school knowledge does not include any alternative options, of which, in fact, there are many in statistics – almost for every occasion. In solving research and marketing problems, people often take mean as a target. Is this legitimate or is there a better option? Let’s figure it out.<\/p>\n<p>To begin with, it is worth remembering the definitions of the two metrics that we will talk about today.<br \/>\nMean is the most popular statistic used to calculate a data center. What is the median? Median is a value that splits data, sorted in order of increasing values, into two equal parts. This means that the median shows the central value in the sample if the number of cases is odd and the arithmetic mean of the two values ​​if the number of cases in the sample is even.<\/p>\n<h2>Research tasks<\/h2>\n<p>So, the estimation of the sample mean is often important in many research questions. For instance, specialists studying demography often study changes in the number of regions in Russia in order to track the dynamics and reflect it in reports. Let’s try to calculate the average size of the Russian city, as well as the median, and then compare the results.<br \/>\nFirst, you need to find and load data by connecting the pandas library for this.<\/p>\n<pre class=\"e2-text-code\"><code>import pandas as pd\r\ncity = pd.read_csv('city.csv', sep = ';')<\/code><\/pre><p>Then, you need to calculate the mean and median of the sample.<\/p>\n<pre class=\"e2-text-code\"><code>mean_pop = round (city.population_2020.mean (), 0)\r\nmedian_pop = round (city.population_2020.median (), 0)<\/code><\/pre><p>The values, of course, are different, since the distribution of observations in the sample is different from the normal one. In order to understand whether they are very different, let’s build a distribution graph and display the mean and median.<\/p>\n<pre class=\"e2-text-code\"><code>import matplotlib.pyplot as plt\r\nimport seaborn as sns\r\n\r\nsns.set_palette('rainbow')\r\nfig = plt.figure(figsize = (20, 15))\r\nax = fig.add_subplot(1, 1, 1)\r\ng = sns.histplot(data = city, x= 'population_2020', alpha=0.6, bins = 100, ax=ax)\r\n\r\ng.axvline(mean_pop, linewidth=2, color='r', alpha=0.9, linestyle='--', label = 'Mean = {:,.0f}'.format(mean_pop).replace(',', ' '))\r\ng.axvline(median_pop, linewidth=2, color='darkgreen', alpha=0.9, linestyle='--', label = 'Median = {:,.0f}'.format(median_pop).replace(',', ' '))\r\n\r\nplt.ticklabel_format(axis='x', style='plain')\r\nplt.xlabel(&quot;Population&quot;, fontsize=20)\r\nplt.ylabel(&quot;Number of cities&quot;, fontsize=20)\r\nplt.title(&quot;Distribution of population of russian cities&quot;, fontsize=20)\r\nplt.legend(fontsize=&quot;xx-large&quot;)\r\nplt.show()<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/first.jpg\" width=\"1440\" height=\"1080\" alt=\"\" \/>\n<\/div>\n<p>Also, on this data it is worth building a boxplot for more accurate visualization with the main distribution quantiles, median, mean and outliers.<\/p>\n<pre class=\"e2-text-code\"><code>fig = plt.figure(figsize = (10, 10))\r\nsns.set_theme(style=&quot;whitegrid&quot;)\r\nsns.set_palette(palette=&quot;pastel&quot;)\r\n\r\nsns.boxplot(y = city['population_2020'], showfliers = False)\r\n\r\nplt.scatter(0, 550100, marker='*', s=100, color = 'black', label = 'Outlier')\r\nplt.scatter(0, 560200, marker='*', s=100, color = 'black')\r\nplt.scatter(0, 570300, marker='*', s=100, color = 'black')\r\nplt.scatter(0, mean_pop, marker='o', s=100, color = 'red', edgecolors = 'black', label = 'Mean')\r\nplt.legend()\r\n\r\nplt.ylabel(&quot;Population&quot;, fontsize=15)\r\nplt.ticklabel_format(axis='y', style='plain')\r\nplt.title(&quot;Boxplot of population of russian cities&quot;, fontsize=15)\r\nplt.show()<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/second.jpg\" width=\"720\" height=\"720\" alt=\"\" \/>\n<\/div>\n<p>It follows from the graphs that the median is significantly less than the average, and it is also clear that this is a consequence of the presence of outliers which are Moscow and St. Petersburg. Since the arithmetic mean is an extremely sensitive metric to outliers, it is not worth relying on conclusions about the mean if they are present in the sample. An increase or decrease in the population of Moscow can greatly change the average population in Russia, but this will not affect the real regional trend.<br \/>\nUsing the arithmetic mean, we say that the number of a typical (average) city in the Russian Federation is 268 thousand people. However, this misleads us, since the mean value is significantly higher than the median solely due to the size of the population of Moscow and St. Petersburg. In fact, the number of a typical Russian city is significantly less (2 times less!) and is approximately 104 thousand citizens.<\/p>\n<h2>Marketing tasks<\/h2>\n<p>In a business tasks, the difference between the mean and the median is also important, as using the wrong metric can seriously affect the results of the advertising campaign or make it difficult to achieve the goal. Let’s take a look at a real example of the difficulties an entrepreneur can face in retail if he chooses the wrong target metric.<br \/>\nTo begin with, as in the previous example, let’s load a dataset about supermarket purchases. Let’s select the dataset columns necessary for analysis and rename them to simplify the code in the future. Since this data is not as well prepared as the previous ones, it is necessary to group all purchased items by receipts. In this case, it is necessary to group by two variables: by the customer’s id and by the date of purchase (the date and time is determined by the moment of closing the bill, therefore, all purchases within one bill coincide by date variable). Then, let’s name the resulting column “total_bill”, that is, the check amount and calculate the average and median.<\/p>\n<pre class=\"e2-text-code\"><code>df = pd.read_excel ('invoice_data.xlsx')\r\ndf.columns = ['user', 'total_price', 'date']\r\ngroupped_df = pd.DataFrame (df.groupby (['user', 'date']). total_price.sum ())\r\ngroupped_df.columns = ['total_bill']\r\nmean_bill = groupped_df.total_bill.mean ()\r\nmedian_bill = groupped_df.total_bill.median ()<\/code><\/pre><p>Now, as in the previous example, you need to plot the distribution of customer checks and boxplot, and also display the median and mean on each of them.<\/p>\n<pre class=\"e2-text-code\"><code>sns.set_palette('rainbow')\r\nfig = plt.figure(figsize = (20, 15))\r\nax = fig.add_subplot(1, 1, 1)\r\nsns.histplot(groupped_df, x = 'total_bill', binwidth=200, alpha=0.6, ax=ax)\r\nplt.xlabel(&quot;Purchases&quot;, fontsize=20)\r\nplt.ylabel(&quot;Total bill&quot;, fontsize=20)\r\nplt.title(&quot;Distribution of total bills&quot;, fontsize=20)\r\nplt.axvline(mean_bill, linewidth=2, color='r', alpha=1, linestyle='--', label = 'Mean = {:.0f}'.format(mean_bill))\r\nplt.axvline(median_bill, linewidth=2, color='darkgreen', alpha=1, linestyle='--', label = 'Median = {:.0f}'.format(median_bill))\r\nplt.legend(fontsize=&quot;xx-large&quot;)\r\nplt.show()<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/third.jpg\" width=\"1440\" height=\"1080\" alt=\"\" \/>\n<\/div>\n<pre class=\"e2-text-code\"><code>fig = plt.figure(figsize = (10, 10))\r\nsns.set_theme(style=&quot;whitegrid&quot;)\r\nsns.set_palette(palette=&quot;pastel&quot;)\r\n\r\nsns.boxplot(y = groupped_df['total_bill'], showfliers = False)\r\n\r\nplt.scatter(0, 1800, marker='*', s=100, color = 'black', label = 'Outlier')\r\nplt.scatter(0, 1850, marker='*', s=100, color = 'black')\r\nplt.scatter(0, 1900, marker='*', s=100, color = 'black')\r\nplt.scatter(0, mean_bill, marker='o', s=100, color = 'red', edgecolors = 'black', label = 'Mean')\r\nplt.legend()\r\n\r\nplt.ticklabel_format(axis='y', style='plain')\r\nplt.ylabel(&quot;Total bill&quot;, fontsize=15)\r\nplt.title(&quot;Boxplot of total bills&quot;, fontsize=15)\r\nplt.show()<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/forth.jpg\" width=\"720\" height=\"720\" alt=\"\" \/>\n<\/div>\n<p>The graphs show that the distribution is different from normal, which means that the median and mean are not equal. The median value is smaller than the average by about 220 rubles.<br \/>\nNow, imagine that marketers have a task to increase the average buyer’s bill. A marketer may decide that since the average check is 601 rubles, the following promotion can be offered: “All buyers who make a purchase over 600 rubles, will get 20% discount on any good for 100 rubles.” In general, it is a reasonable offer, however, in reality, the average check is lower – 378 rubles. Thus, the majority of buyers will not be interested in the offer, since their purchase usually does not reach the proposed threshold. This means, that they will not take advantage of the offer and will not receive a discount, and the company will not be able to achieve its goal and increase the profit. The point is that the initial assumptions were wrong.<\/p>\n<h2>Conclusions<\/h2>\n<p>As you already understood, the mean often shows a more pleasant result, both for business and for research tasks, because it is always nicer to imagine the situation with the average check or the demographic situation in the country better than it really is. However, one must always remember about the shortcomings of mean in order to be able to correctly choose the appropriate analogue for assessing a particular situation.<\/p>\n",
            "date_published": "2021-10-27T14:20:56+03:00",
            "date_modified": "2021-10-27T14:20:52+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/first.jpg",
            "_date_published_rfc2822": "Wed, 27 Oct 2021 14:20:56 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "71",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/first.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/second.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/third.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/forth.jpg"
                ]
            }
        },
        {
            "id": "66",
            "url": "https:\/\/en.leftjoin.ru\/all\/python-and-lyrics-of-zemfiras-new-album-capturing-the-spirit-of\/",
            "title": "Python and lyrics of Zemfira’s new album: capturing the spirit of her songs",
            "content_html": "<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/header.jpg\" width=\"600\" height=\"600\" alt=\"\" \/>\n<\/div>\n<p>Zemfira’s latest studio album, Borderline, was released in February, 8 years after the previous one. For this album, various people cooperated with her, including her relatives – the riff for the song “Таблетки” was written by her nephew from London. The album turned out to be diverse: for instance, the song “Остин” is dedicated to the main character of the Homescapes game by the Russian studio Playrix (by the way, check out the latest <a href=\"https:\/\/youtu.be\/SOx8afEUTnE\">Business Secrets with the Bukhman brothers<\/a>, they also mention it there). Zemfira liked the game a lot, thus, she contacted Playrix to create this song. Also, the song “Крым” was written as a soundtrack to a new film by Zemfira’s colleague Renata Litvinova.<\/p>\n<p class=\"note\">Listen new album in <a href=\"https:\/\/music.apple.com\/ru\/album\/бордерлайн\/1554865105\">Apple Music<\/a> \/ <a href=\"https:\/\/music.yandex.ru\/album\/14052981\">Яндекс.Музыке<\/a> \/ <a href=\"https:\/\/open.spotify.com\/album\/6khBsXmKA1FKjYVCIBy9kt\">Spotify<\/a><\/p>\n<p>Nevertheless, the spirit of the whole album is rather gloomy – the songs often repeat the words ‘боль’, ‘ад’, ‘бесишь’ and other synonyms. We decided to conduct an exploratory analysis of her album, and then, using the Word2Vec model and a cosine measure, look at the semantic closeness of the songs and calculate the general mood of the album.<\/p>\n<p>For those who are bored with reading about data preparation and analysis steps, you can <a href=\"https:\/\/leftjoin.ru\/all\/borderline-text-analysis\/#result\">go directly to the results<\/a>.<\/p>\n<h2>Data preparation<\/h2>\n<p>For starters, we write a data processing script. The purpose of the script is to collect a united csv-table from a set of text files, each of which contains a song. At the same time, we have to get rid of all punctuation marks and unnecessary words as we need to focus only on significant content.<\/p>\n<pre class=\"e2-text-code\"><code>import pandas as pd\r\nimport re\r\nimport string\r\nimport pymorphy2\r\nfrom nltk.corpus import stopwords<\/code><\/pre><p>Then we create a morphological analyzer and expand the list of everything that needs to be discarded:<\/p>\n<pre class=\"e2-text-code\"><code>morph = pymorphy2.MorphAnalyzer()\r\nstopwords_list = stopwords.words('russian')\r\nstopwords_list.extend(['куплет', 'это', 'я', 'мы', 'ты', 'припев', 'аутро', 'предприпев', 'lyrics', '1', '2', '3', 'то'])\r\nstring.punctuation += '—'<\/code><\/pre><p>The names of the songs are given in English, so we have to create a dictionary for translation into Russian and a dictionary, from which we will later make a table:<\/p>\n<pre class=\"e2-text-code\"><code>result_dict = dict()\r\n\r\nsongs_dict = {\r\n    'snow':'снег идёт',\r\n    'crimea':'крым',\r\n    'mother':'мама',\r\n    'ostin':'остин',\r\n    'abuse':'абьюз',\r\n    'wait_for_me':'жди меня',\r\n    'tom':'том',\r\n    'come_on':'камон',\r\n    'coat':'пальто',\r\n    'this_summer':'этим летом',\r\n    'ok':'ок',\r\n    'pills':'таблетки'\r\n}<\/code><\/pre><p>Let’s define several necessary functions. The first one reads the entire song from the file and removes line breaks, the second clears the text from unnecessary characters and words, and the third one converts the words to normal form, using the pymorphy2 morphological analyzer. The pymorphy2 module does not always handle ambiguity well – additional processing is required for the words ‘ад’ and ‘рай’.<\/p>\n<pre class=\"e2-text-code\"><code>def read_song(filename):\r\n    f = open(f'{filename}.txt', 'r').read()\r\n    f = f.replace('\\n', ' ')\r\n    return f\r\n\r\ndef clean_string(text):\r\n    text = re.split(' |:|\\.|\\(|\\)|,|&quot;|;|\/|\\n|\\t|-|\\?|\\[|\\]|!', text)\r\n    text = ' '.join([word for word in text if word not in string.punctuation])\r\n    text = text.lower()\r\n    text = ' '.join([word for word in text.split() if word not in stopwords_list])\r\n    return text\r\n\r\ndef string_to_normal_form(string):\r\n    string_lst = string.split()\r\n    for i in range(len(string_lst)):\r\n        string_lst[i] = morph.parse(string_lst[i])[0].normal_form\r\n        if (string_lst[i] == 'аду'):\r\n            string_lst[i] = 'ад'\r\n        if (string_lst[i] == 'рая'):\r\n            string_lst[i] = 'рай'\r\n    string = ' '.join(string_lst)\r\n    return string<\/code><\/pre><p>After all this preparation, we can get back to the data and process each song and read the file with the corresponding name:<\/p>\n<pre class=\"e2-text-code\"><code>name_list = []\r\ntext_list = []\r\nfor song, name in songs_dict.items():\r\n    text = string_to_normal_form(clean_string(read_song(song)))\r\n    name_list.append(name)\r\n    text_list.append(text)<\/code><\/pre><p>Then we combine everything into a DataFrame and save it as a csv-file.<\/p>\n<pre class=\"e2-text-code\"><code>df = pd.DataFrame()\r\ndf['name'] = name_list\r\ndf['text'] = text_list\r\ndf['time'] = [290, 220, 187, 270, 330, 196, 207, 188, 269, 189, 245, 244]\r\ndf.to_csv('borderline.csv', index=False)<\/code><\/pre><p>Result:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2-table.png\" width=\"477\" height=\"365\" alt=\"\" \/>\n<\/div>\n<h2>Word cloud for the whole album<\/h2>\n<p>To begin with the analysis, we have to construct a word cloud, because it can display the most common words found in these songs. We import the required libraries, read the csv-file and set the configurations:<\/p>\n<pre class=\"e2-text-code\"><code>import nltk\r\nfrom wordcloud import WordCloud\r\nimport pandas as pd\r\nimport matplotlib.pyplot as plt\r\nfrom nltk import word_tokenize, ngrams\r\n\r\n%matplotlib inline\r\nnltk.download('punkt')\r\ndf = pd.read_csv('borderline.csv')<\/code><\/pre><p>Now we create a new figure, set the design parameters and, using the word cloud library, display words with the size directly proportional to the frequency of the word. We additionally indicate the name of the song above the corresponding graph.<\/p>\n<pre class=\"e2-text-code\"><code>fig = plt.figure()\r\nfig.patch.set_facecolor('white')\r\nplt.subplots_adjust(wspace=0.3, hspace=0.2)\r\ni = 1\r\nfor name, text in zip(df.name, df.text):\r\n    tokens = word_tokenize(text)\r\n    text_raw = &quot; &quot;.join(tokens)\r\n    wordcloud = WordCloud(colormap='PuBu', background_color='white', contour_width=10).generate(text_raw)\r\n    plt.subplot(4, 3, i, label=name,frame_on=True)\r\n    plt.tick_params(labelsize=10)\r\n    plt.imshow(wordcloud)\r\n    plt.axis(&quot;off&quot;)\r\n    plt.title(name,fontdict={'fontsize':7,'color':'grey'},y=0.93)\r\n    plt.tick_params(labelsize=10)\r\n    i += 1<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/3-wordcloud.jpg\" width=\"2560\" height=\"1707\" alt=\"\" \/>\n<\/div>\n<h2>EDA of the lyrics<\/h2>\n<p>Let us move to the next part and analyze the lyrics. To do this, we have to import special libraries to deal with data and visualization:<\/p>\n<pre class=\"e2-text-code\"><code>import plotly.graph_objects as go\r\nimport plotly.figure_factory as ff\r\nfrom scipy import spatial\r\nimport collections\r\nimport pymorphy2\r\nimport gensim\r\n\r\nmorph = pymorphy2.MorphAnalyzer()<\/code><\/pre><p>Firstly, we should count the overall number of words in each song, the number of unique words, and their percentage:<\/p>\n<pre class=\"e2-text-code\"><code>songs = []\r\ntotal = []\r\nuniq = []\r\npercent = []\r\n\r\nfor song, text in zip(df.name, df.text):\r\n    songs.append(song)\r\n    total.append(len(text.split()))\r\n    uniq.append(len(set(text.split())))\r\n    percent.append(round(len(set(text.split())) \/ len(text.split()), 2) * 100)<\/code><\/pre><p>All this information should be written in a DataFrame and additionally we want to count the number of words per minute for each song:<\/p>\n<pre class=\"e2-text-code\"><code>df_words = pd.DataFrame()\r\ndf_words['song'] = songs\r\ndf_words['total words'] = total\r\ndf_words['uniq words'] = uniq\r\ndf_words['percent'] = percent\r\ndf_words['time'] = df['time']\r\ndf_words['words per minute'] = round(total \/ (df['time'] \/\/ 60))\r\ndf_words = df_words[::-1]<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/4-table.png\" width=\"480\" height=\"369\" alt=\"\" \/>\n<\/div>\n<p>It would be great to visualize the data, so let us build two bar charts: one for the number of words in the song, and the other one for the number of words per minute.<\/p>\n<pre class=\"e2-text-code\"><code>colors_1 = ['rgba(101,181,205,255)'] * 12\r\ncolors_2 = ['rgba(62,142,231,255)'] * 12\r\n\r\nfig = go.Figure(data=[\r\n    go.Bar(name='📝 Total number of words,\r\n           text=df_words['total words'],\r\n           textposition='auto',\r\n           x=df_words.song,\r\n           y=df_words['total words'],\r\n           marker_color=colors_1,\r\n           marker=dict(line=dict(width=0)),),\r\n    go.Bar(name='🌀 Unique words',\r\n           text=df_words['uniq words'].astype(str) + '&lt;br&gt;'+ df_words.percent.astype(int).astype(str) + '%' ,\r\n           textposition='inside',\r\n           x=df_words.song,\r\n           y=df_words['uniq words'],\r\n           textfont_color='white',\r\n           marker_color=colors_2,\r\n           marker=dict(line=dict(width=0)),),\r\n])\r\n\r\nfig.update_layout(barmode='group')\r\n\r\nfig.update_layout(\r\n    title = \r\n        {'text':'&lt;b&gt;The ratio of the number of unique words to the total&lt;\/b&gt;&lt;br&gt;&lt;span style=&quot;color:#666666&quot;&gt;&lt;\/span&gt;'},\r\n    showlegend = True,\r\n    height=650,\r\n    font={\r\n        'family':'Open Sans, light',\r\n        'color':'black',\r\n        'size':14\r\n    },\r\n    plot_bgcolor='rgba(0,0,0,0)',\r\n)\r\nfig.update_layout(legend=dict(\r\n    yanchor=&quot;top&quot;,\r\n    xanchor=&quot;right&quot;,\r\n))\r\n\r\nfig.show()<\/code><\/pre><iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https:\/\/plotly.com\/~Elisejj\/96.embed?showlink=false\" height=\"650\" width=\"100%\"><\/iframe>\n<pre class=\"e2-text-code\"><code>colors_1 = ['rgba(101,181,205,255)'] * 12\r\ncolors_2 = ['rgba(238,85,59,255)'] * 12\r\n\r\nfig = go.Figure(data=[\r\n    go.Bar(name='⏱️ Track length, min.',\r\n           text=round(df_words['time'] \/ 60, 1),\r\n           textposition='auto',\r\n           x=df_words.song,\r\n           y=-df_words['time'] \/\/ 60,\r\n           marker_color=colors_1,\r\n           marker=dict(line=dict(width=0)),\r\n          ),\r\n    go.Bar(name='🔄 Words per minute',\r\n           text=df_words['words per minute'],\r\n           textposition='auto',\r\n           x=df_words.song,\r\n           y=df_words['words per minute'],\r\n           marker_color=colors_2,\r\n           textfont_color='white',\r\n           marker=dict(line=dict(width=0)),\r\n          ),\r\n])\r\n\r\nfig.update_layout(barmode='overlay')\r\n\r\nfig.update_layout(\r\n    title = \r\n        {'text':'&lt;b&gt;Track length and words per minute&lt;\/b&gt;&lt;br&gt;&lt;span style=&quot;color:#666666&quot;&gt;&lt;\/span&gt;'},\r\n    showlegend = True,\r\n    height=650,\r\n    font={\r\n        'family':'Open Sans, light',\r\n        'color':'black',\r\n        'size':14\r\n    },\r\n    plot_bgcolor='rgba(0,0,0,0)'\r\n)\r\n\r\n\r\nfig.show()<\/code><\/pre><iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https:\/\/plotly.com\/~Elisejj\/98.embed?showlink=false\" height=\"650\" width=\"100%\"><\/iframe>\n<h2>Working with Word2Vec model<\/h2>\n<p>Using the gensim module, load the model pointing to a binary file:<\/p>\n<pre class=\"e2-text-code\"><code>model = gensim.models.KeyedVectors.load_word2vec_format('model.bin', binary=True)<\/code><\/pre><p class=\"note\">Для материала мы использовали готовую обученную на Национальном Корпусе Русского Языка модель от сообщества <a href=\"https:\/\/rusvectores.org\/ru\/models\/\">RusVectōrēs<\/a><\/p>\n<p>The Word2Vec model is based on neural networks and allows you to represent words in the form of vectors, taking into account the semantic component. It means that if we take two words – for instance, “mom” and “dad”, then represent them as two vectors and calculate the cosine, the values ​​will be close to 1. Similarly, two words that have nothing in common in their meaning have a cosine measure close to 0.<\/p>\n<p>Now we will define the get_vector function: it will take a list of words, recognize a part of speech for each word, and then receive and summarize vectors, so that we can find vectors even for whole sentences and texts.<\/p>\n<pre class=\"e2-text-code\"><code>def get_vector(word_list):\r\n    vector = 0\r\n    for word in word_list:\r\n        pos = morph.parse(word)[0].tag.POS\r\n        if pos == 'INFN':\r\n            pos = 'VERB'\r\n        if pos in ['ADJF', 'PRCL', 'ADVB', 'NPRO']:\r\n            pos = 'NOUN'\r\n        if word and pos:\r\n            try:\r\n                word_pos = word + '_' + pos\r\n                this_vector = model.word_vec(word_pos)\r\n                vector += this_vector\r\n            except KeyError:\r\n                continue\r\n    return vector<\/code><\/pre><p>For each song, find a vector and select the corresponding column in the DataFrame:<\/p>\n<pre class=\"e2-text-code\"><code>vec_list = []\r\nfor word in df['text']:\r\n    vec_list.append(get_vector(word.split()))\r\ndf['vector'] = vec_list<\/code><\/pre><p>So, now we should compare these vectors with one another, calculating their cosine proximity. Those songs with a cosine metric higher than 0.5 will be saved separately – this way we will get the closest pairs of songs. We will write the information about the comparison of vectors into the two-dimensional list result.<\/p>\n<pre class=\"e2-text-code\"><code>similar = dict()\r\nresult = []\r\nfor song_1, vector_1 in zip(df.name, df.vector):\r\n    sub_list = []\r\n    for song_2, vector_2 in zip(df.name.iloc[::-1], df.vector.iloc[::-1]):\r\n        res = 1 - spatial.distance.cosine(vector_1, vector_2)\r\n        if res &gt; 0.5 and song_1 != song_2 and (song_1 + ' \/ ' + song_2 not in similar.keys() and song_2 + ' \/ ' + song_1 not in similar.keys()):\r\n            similar[song_1 + ' \/ ' + song_2] = round(res, 2)\r\n        sub_list.append(round(res, 2))\r\n    result.append(sub_list)<\/code><\/pre><p>Moreover, we can construct the same bar chart:<\/p>\n<pre class=\"e2-text-code\"><code>df_top_sim = pd.DataFrame()\r\ndf_top_sim['name'] = list(similar.keys())\r\ndf_top_sim['value'] = list(similar.values())\r\ndf_top_sim.sort_values(by='value', ascending=False)<\/code><\/pre><p>И построим такой же bar chart:<\/p>\n<pre class=\"e2-text-code\"><code>colors = ['rgba(101,181,205,255)'] * 5\r\n\r\nfig = go.Figure([go.Bar(x=df_top_sim['name'],\r\n                        y=df_top_sim['value'],\r\n                        marker_color=colors,\r\n                        width=[0.4,0.4,0.4,0.4,0.4],\r\n                        text=df_top_sim['value'],\r\n                        textfont_color='white',\r\n                        textposition='auto')])\r\n\r\nfig.update_layout(\r\n    title = \r\n        {'text':'&lt;b&gt;Топ-5 closest songs&lt;\/b&gt;&lt;br&gt;&lt;span style=&quot;color:#666666&quot;&gt;&lt;\/span&gt;'},\r\n    showlegend = False,\r\n    height=650,\r\n    font={\r\n        'family':'Open Sans, light',\r\n        'color':'black',\r\n        'size':14\r\n    },\r\n    plot_bgcolor='rgba(0,0,0,0)',\r\n    xaxis={'categoryorder':'total descending'}\r\n)\r\n\r\nfig.show()<\/code><\/pre><iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https:\/\/plotly.com\/~Elisejj\/100.embed?showlink=false\" height=\"650\" width=\"100%\"><\/iframe>\n<p>Given the vector of each song, let us calculate the vector of the entire album – add the vectors of the songs. Then, for such a vector, using the model, we get the words that are the closest in spirit and meaning.<\/p>\n<pre class=\"e2-text-code\"><code>def get_word_from_tlist(lst):\r\n    for word in lst:\r\n        word = word[0].split('_')[0]\r\n        print(word, end=' ')\r\n\r\nvec_sum = 0\r\nfor vec in df.vector:\r\n    vec_sum += vec\r\nsim_word = model.similar_by_vector(vec_sum)\r\nget_word_from_tlist(sim_word)<\/code><\/pre><p><span style=\"color: '#65b5cd'; font-size: 1.2em\"><b>небо тоска тьма пламень плакать горе печаль сердце солнце мрак<\/b><\/span><\/p>\n<p>This is probably the key result and the description of Zemfira’s album in just 10 words.<\/p>\n<p>Finally, we build a general heat map, each cell of which is the result of comparing the texts of two tracks with a cosine measure.<\/p>\n<pre class=\"e2-text-code\"><code>colorscale=[[0.0, &quot;rgba(255,255,255,255)&quot;],\r\n            [0.1, &quot;rgba(229,232,237,255)&quot;],\r\n            [0.2, &quot;rgba(216,222,232,255)&quot;],\r\n            [0.3, &quot;rgba(205,214,228,255)&quot;],\r\n            [0.4, &quot;rgba(182,195,218,255)&quot;],\r\n            [0.5, &quot;rgba(159,178,209,255)&quot;],\r\n            [0.6, &quot;rgba(137,161,200,255)&quot;],\r\n            [0.7, &quot;rgba(107,137,188,255)&quot;],\r\n            [0.8, &quot;rgba(96,129,184,255)&quot;],\r\n            [1.0, &quot;rgba(76,114,176,255)&quot;]]\r\n\r\nfont_colors = ['black']\r\nx = list(df.name.iloc[::-1])\r\ny = list(df.name)\r\nfig = ff.create_annotated_heatmap(result, x=x, y=y, colorscale=colorscale, font_colors=font_colors)\r\nfig.show()<\/code><\/pre><iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https:\/\/plotly.com\/~Elisejj\/82.embed?showlink=false\" height=\"650\" width=\"100%\"><\/iframe>\n<h2><a name=\"result\">Results and data interpretation<\/a><\/h2>\n<p>To give valuable conclusions, we would like to take another look at everything we got. First of all, let us focus on the word cloud. It is easy to see that the words ‘боль’, ‘невозможно’, ‘сорваться’, ‘растерзаны’, ‘сложно’, ‘терпеть’, ‘любить’ have a very decent size, because such words are often found throughout the entire lyrics:<\/p>\n<p>Давайте ещё раз посмотрим на всё, что у нас получилось — начнём с облака слов. Нетрудно заметить, что у слов «боль», «невозможно», «сорваться», «растерзаны», «сложно», «терпеть», «любить» размер весьма приличный — всё потому, что такие слова встречаются часто на протяжении всего текста песен:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/9-wordcloud.jpg\" width=\"2560\" height=\"1707\" alt=\"\" \/>\n<\/div>\n<p>The song “Крым” turned out to be one of the most diverse songs – it contains 74% of unique words. Also, the song “Снег идет” contains very few words, so the majority, which is 82%, are unique. The largest song on the album in terms of amount of words is the track “Таблетки” – there are about 150 words in total.<\/p>\n<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https:\/\/plotly.com\/~Elisejj\/96.embed?showlink=false\" height=\"650\" width=\"100%\"><\/iframe>\n<p>As it was shown on the last chart, the most dynamic track is “Таблетки”, as much as 37 words per minute – nearly one word for every two seconds – and the longest track is “Абьюз”, and according to the previous chart, it also has the lowest percentage of unique words – 46%.<\/p>\n<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https:\/\/plotly.com\/~Elisejj\/98.embed?showlink=false\" height=\"650\" width=\"100%\"><\/iframe>\n<p>Top 5 most semantically similar text pairs:<\/p>\n<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https:\/\/plotly.com\/~Elisejj\/100.embed?showlink=false\" height=\"650\" width=\"100%\"><\/iframe>\n<p>We also got the vector of the entire album and found the closest words. Just take a look at them – ‘тьма’, ‘тоска’, ‘плакать’, ‘горе’, ‘печаль’, ‘сердце’ – this is the list of words that characterizes Zemfira’s lyrics!<\/p>\n<p><span style=\"color: '#65b5cd'; font-size: 1.2em\"><b>небо тоска тьма пламень плакать горе печаль сердце солнце мрак<\/b><\/span><\/p>\n<p>The final result is a heat map. From the visualization, it is noticeable that almost all songs are quite similar to each other – the cosine measure for many pairs exceeds the value of 0.4.<\/p>\n<iframe id=\"igraph\" scrolling=\"no\" style=\"border:none;\" seamless=\"seamless\" src=\"https:\/\/plotly.com\/~Elisejj\/82.embed?showlink=false\" height=\"650\" width=\"100%\"><\/iframe>\n<h2>Conclusions<\/h2>\n<p>In the material, we carried out an EDA of the entire text of the new album and, using the pre-trained Word2Vec model, we proved the hypothesis – most of the “Borderline” songs are permeated with rather dark lyrics. However, this is normal, because we love Zemfira precisely for her sincerity and straightforwardness.<\/p>\n",
            "date_published": "2021-09-07T13:46:21+03:00",
            "date_modified": "2021-09-07T13:46:08+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/header.jpg",
            "_date_published_rfc2822": "Tue, 07 Sep 2021 13:46:21 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "66",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/header.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/2-table.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/3-wordcloud.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/4-table.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/9-wordcloud.jpg"
                ]
            }
        },
        {
            "id": "58",
            "url": "https:\/\/en.leftjoin.ru\/all\/powerbi-dashboard-overview\/",
            "title": "PowerBI Dashboard Overview",
            "content_html": "<p>We continue the series of materials on BI-systems and today we will have a look at the dashboard prepared in PowerBI using the SuperStore Sales dataset. We will cover how to connect the data to the system, set custom colors for visualizations and create new measures, implement switching between charts using bookmarks and we will discuss the challenges that we faced when building the dashboard.<\/p>\n<p>This is the how the final dashboard looks like:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/--2021-04-28-163157.png\" width=\"1219\" height=\"672\" alt=\"\" \/>\n<\/div>\n<p>The most notable feature of the dashboard is data cards that show the company’s KPI. The cards compare the parameters to the same period in the previous year and show the previous year’s dynamics in the background.<\/p>\n<p>Below we can see the chart that shows top-performing provinces. The bluer the rectangle the more profitable the province, the more orange the rectangle the more losses the province sustains. The size of the rectangle corresponds to the quantity of sales. We can click on rectangles to see more detailed information about profits and sales dynamics in the region on the graph on the left and their KPI at the top. On the graph, there are green and blue points that indicate the month of the current year and the previous year respectively. Hovering over these points, you can see a trend line.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/--2021-04-28-163527.png\" width=\"1222\" height=\"664\" alt=\"\" \/>\n<\/div>\n<p>The next part of the dashboard shows product and customer analysis. This part allows us to answer questions such as “which products were the most profitable or unprofitable” or “which customers contributed to most of the profits or most of the losses”.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/--2021-04-28-163804.png\" width=\"1219\" height=\"690\" alt=\"\" \/>\n<\/div>\n<h2>Data collection<\/h2>\n<p>To connect the data we used an Excel file. PowerBI offers a number of sources to connect your data from such as Excel, csv, json files and various databases.<\/p>\n<h2>Configuring reports and visualizations<\/h2>\n<p>When building a dashboard in PowerBI we wanted to copy the color themes from Tableau. To do this, we have created a JSON file with the list of colors that we want to use. You can see the content of our file below. Then in the views tab, we clicked on the “browse for themes” button and uploaded our colors.<\/p>\n<pre class=\"e2-text-code\"><code>{\r\n\t&quot;name&quot;:&quot;Orange-Blue Diverging&quot;,\r\n\t&quot;dataColors&quot;: [\r\n\t\t\r\n\t\t&quot;#1c5998&quot;,\r\n\t\t&quot;#1c73b1&quot;,\r\n\t\t&quot;#3a87b7&quot;,\r\n\t\t&quot;#67add4&quot;,\r\n\t\t&quot;#7bc8e2&quot;,\r\n\t\t&quot;#cacaca&quot;,\r\n\t\t&quot;#fdab67&quot;,\r\n\t\t&quot;#fd8938&quot;,\r\n\t\t&quot;#f06511&quot;,\r\n\t\t&quot;#d74401&quot;,\r\n\t\t&quot;#a33202&quot;,\r\n\t\t&quot;#7b3014&quot;,\r\n\t\t&quot;#F07C28&quot;,\r\n\t\t&quot;#2B5C8A&quot;,\r\n\t\t&quot;#94C6E1&quot;,\r\n\t\t&quot;#87d180&quot;,\r\n\t]\r\n}<\/code><\/pre><p>Then we have created a separate table called Calendar and populated it with all order dates. After that, we created a column with just a month and a year to create a filter based on it.<\/p>\n<h2>Creating necessary measures<\/h2>\n<p>When creating a dashboard with PowerBI we often need to create new measures. For the data cards, we created such measures as Total Profit, Total Sales, Total Orders, Total Clients and so on. The arrows that you can see in the data cards are also customized and a measure was created for each of them. To apply the color to arrows we formatted the color by rules and indicated red if the value is less than 0, green if the color is more than 0.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/--2021-04-28-172058.png\" width=\"1348\" height=\"928\" alt=\"\" \/>\n<\/div>\n<h2>Adding bookmarks to switch between charts<\/h2>\n<p>To switch between charts, we added bookmarks for sales and profits. For the sales chart, the profits bookmark is hidden and vice versa. The button was downloaded from the internet and added to the respective bookmarks.<\/p>\n<h2>Interesting features and challenges we faced when building the dashboard<\/h2>\n<p>We have created custom data cards for KPI which are different from the default ones offered by PowerBI. The original features of cards include the background trend, the name and value while the arrows and changes are a custom feature. Another interesting feature that we used is cross filtration which allowed us to apply the filter to both the profits\/sales chart and KPI cards.<\/p>\n<p>One of the challenges that we have faced was the inability to build a bar chart with 2 categories. This feature was not implemented in PowerBI at the moment of writing this overview (maybe it is implemented now), so we had to create a table and add bar charts into it. Similarly, we inserted bar charts into the Top Customers table.<\/p>\n<h2>Conclusion<\/h2>\n<p>Our team has evaluated the dashboard and has given the following scores from 1-10 scale (10 being the highest) to this dashboard:<\/p>\n<ol start=\"1\">\n<li>Meets the tasks – 9.8<\/li>\n<li>Learning curve  – 3.0<\/li>\n<li>Tool functionality – 9.5<\/li>\n<li>Ease of use – 7.5<\/li>\n<li>Compliance with the layout – 9.5<\/li>\n<li>Visual evaluation – 8.8<\/li>\n<\/ol>\n<p>Overall: 8.0 out of 10. Have a look at the final dashboard <a href=\"https:\/\/app.powerbi.com\/view?r=eyJrIjoiYmQ0NGY5ZjctZTQzMy00NmVkLWJlYTItNDU1ODlhOGY0ODQ0IiwidCI6ImJmZjcxOTk4LWM0NDQtNGQ2ZS05OGU1LTRiNDZkMTA0Yzc5ZiIsImMiOjJ9\">here<\/a>.<\/p>\n",
            "date_published": "2021-05-05T18:38:05+03:00",
            "date_modified": "2021-05-05T18:32:23+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/--2021-04-28-163157.png",
            "_date_published_rfc2822": "Wed, 05 May 2021 18:38:05 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "58",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/--2021-04-28-163157.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/--2021-04-28-163527.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/--2021-04-28-163804.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/--2021-04-28-172058.png"
                ]
            }
        },
        {
            "id": "15",
            "url": "https:\/\/en.leftjoin.ru\/all\/kazahstan-marketing-conference-2020\/",
            "title": "Kazakhstan Marketing Conference 2020",
            "content_html": "<p>Yesterday I had a chance to address the largest marketing conference in Kazakhstan: <a href=\"https:\/\/k50.kz\/sobytiya\/kmc-marketing-conference-2020.html\">Kazakhstan Marketing Conference 2020<\/a>.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/valiotti@2x.jpeg\" width=\"494.5\" height=\"348\" alt=\"\" \/>\n<\/div>\n<p>Almaty, as a city, has made a positive impression on me, whereas the conference itself turned out to be highly professional event, filled with plenty of smart, versatile and kind people.<\/p>\n<p>A pleasant <i>bonus<\/i> for conference participants: presentation of my speech <a href=\"https:\/\/www.slideshare.net\/NikolayValiotti\/kazahstan-marketing-conference-2020-223394647\">available on slideshare<\/a> (careful, VPN!), so one can recall what it was about.<\/p>\n<p>Apart from the speech, in the main forum’s section I was holding a masterclass on “How to construct a comprehensible technical specification on analytics?”.<br \/>\nAnd, within the framework of work with the audience, we managed to formulate points for a template of a technical specification.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/template_v2@2x.png\" width=\"449\" height=\"346\" alt=\"\" \/>\n<\/div>\n<p><a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1JdlMLrrwHDu1LuQ4apRYNnOvCwsQReugTKkb2DWddu4\/edit#gid=815634630\">Sharing the template<\/a>, it will be helpful for those, who faced with difficulties in translating of a task from business language to the technical one.<\/p>\n",
            "date_published": "2020-01-23T11:34:55+03:00",
            "date_modified": "2020-01-27T11:34:54+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/valiotti@2x.jpeg",
            "_date_published_rfc2822": "Thu, 23 Jan 2020 11:34:55 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "15",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/valiotti@2x.jpeg",
                    "https:\/\/en.leftjoin.ru\/pictures\/template_v2@2x.png"
                ]
            }
        },
        {
            "id": "20",
            "url": "https:\/\/en.leftjoin.ru\/all\/looker-overview\/",
            "title": "Looker Overview",
            "content_html": "<p>Today we are going to talk about BI-platform Looker, on which I managed to work in 2019.<\/p>\n<p>Here is the short content of the article for convenient and fast navigation:<\/p>\n<ol start=\"1\">\n<li><a href=\"http:\/\/leftjoin.ru\/drafts\/looker-overview\/#part1\">What is Looker?<\/a><\/li>\n<li><a href=\"http:\/\/leftjoin.ru\/drafts\/looker-overview\/#part2\">Which DBMS you can connect to via Looker and how?<\/a><\/li>\n<li><a href=\"http:\/\/leftjoin.ru\/drafts\/looker-overview\/#part3\">Building of Looker ML data model<\/a><\/li>\n<li><a href=\"http:\/\/leftjoin.ru\/drafts\/looker-overview\/#part4\">Explore Mode (data research on the model built<\/a>)<\/li>\n<li><a href=\"http:\/\/leftjoin.ru\/drafts\/looker-overview\/#part5\">Building of reports and their saving in Look<\/a><\/li>\n<li><a href=\"http:\/\/leftjoin.ru\/drafts\/looker-overview\/#part6\">Examples of dashboards in Looker<\/a><\/li>\n<\/ol>\n<h2><a id=\"part1\"><\/a>What is Looker?<\/h2>\n<p>Creators of <a href=\"https:\/\/looker.com\/\">Looker<\/a> position it as a software of business intelligence class and big data analytics platform, that helps to research, analyze and share business analytics in real time mode.<br \/>\nLooker —  is a really convenient tool and one of a few BI products, that allows to work with pre-set data cubes in a real-time mode (actually, relational tables that are described in Look ML-model).<br \/>\nAn engineer, working with Looker, needs to describe a data model on Look ML language (it’s something between CSS and SQL), publish this data model and then set reporting and dashboards.<br \/>\nLook ML itself is pretty simple, the nexus between the data objects are set by a data-engineer, which consequently allows to use the data without knowledge of SQL language (to be precise: Looker engine generates the code in SQL language itself on user’s behalf).<\/p>\n<p>Just recently, in June 2019, Google <a href=\"https:\/\/cloud.google.com\/blog\/topics\/inside-google-cloud\/expanding-our-platform-for-business-intelligence-and-embedded-analytics\">announced<\/a> acquisition of Looker platform for $2.6 billion.<\/p>\n<h2><a id=\"part2\"><\/a>Which DBMS you can connect to via Looker and how?<\/h2>\n<p>The selection of DBMS that Looker is working with is pretty wide. You can see the various connections on the screen shot below as of October, 2019:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/db-list@2x.png\" width=\"418\" height=\"852\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Available DBMS for connection<\/div>\n<\/div>\n<p>You can easily set a connection to the database via web-interface:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/connection-setting@2x.png.jpg\" width=\"2560\" height=\"1460\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Web-interface of connection to DBMS<\/div>\n<\/div>\n<p><s>With regard to connections to databases, I’d like to highlight the following two facts: first of all, unfortunately, Clickhouse support from Yandex is currently missing (as well as in the foreseeable future). Most likely, the support won’t appear, considering the fact that Looker was acquired by a competitor, Google.<\/s><br \/>\n<i>updated:<\/i> Actually, Looker supports Clickhouse from the December 2019<br \/>\nThe second nuisance is that you can’t build one data model, that would apply to different DBMS. There is no inbuilt storage in Looker, that could combine the results of query (unlike the same Redash).<br \/>\nIt means, that analytical architecture should be built within one DBMS (preferably with high action speed or on aggregated data).<\/p>\n<h2><a id=\"part3\"><\/a>Building of Looker ML data model<\/h2>\n<p>In order to build a report or a dashboard in Looker, you need to provisionally set a data model. Syntax of Look ML language is quite thoroughly <a href=\"https:\/\/docs.looker.com\/data-modeling\/learning-lookml\/what-is-lookml\">described in the documentation<\/a>. Personally, I can just add that model description doesn’t require long-time immersion for a specialist with SQL knowledge. Rather, one needs to rearrange the approach to data model preparation. Look ML language is very much alike CSS:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/lookml@2x.jpg\" width=\"1280\" height=\"703.5\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Console of Look ML model creation<\/div>\n<\/div>\n<p>In the data model the following is set up: links with tables, keys, granularity, information of some fields being facts, and other – measurements. For facts, the aggregation is written. Obviously, at model creation one can use various IF \/ CASE expressions.<\/p>\n<h2><a id=\"part4\"><\/a>Explore mode<\/h2>\n<p>Probably, it’s the main killer-feature of Looker, since it allows any business departments to get data without attraction of analysts \/ data engineers. And, guess that’s why use of accounts with Explore mode is billed separately.<\/p>\n<p>In fact, Explore mode is an interface, that allows to use the set up Look ML data model, select the required metrics and measurements and build customized report \/ visualization.<br \/>\nFor example, we want to understand how many actions of any kind were performed in Looker’s interface last week. In order to do it, using Explore mode, we select Date field and apply a filter to it: last week (in this sense, Looker is quite smart and it and it will be enough writing ‘Last week’ in the filter), thereafter we choose “Category” from the measurements, and Quantity as a metric. After pressing the button <i>Run<\/i> the ready report will be generated.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/explore-tab@2x.png.jpg\" width=\"2560\" height=\"1012\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Building report in Looker<\/div>\n<\/div>\n<p>Then, using the data received in the table form, you can set up the visualization of any type.<br \/>\nFor example, Pie chart:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/vis@2x.png\" width=\"1081\" height=\"566\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Applying visualization to report<\/div>\n<\/div>\n<h2><a id=\"part5\"><\/a>Building of reports and their saving in Look<\/h2>\n<p>Sometimes you can have a desire to save the set of data \/ visualization received in Explore and share it with colleagues, for this purpose Looker has a separate essense – Look. That is ready constructed report with selected filters \/ measurements \/ facts.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/look@2x.png.jpg\" width=\"2560\" height=\"1457\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Example of the saved Look<\/div>\n<\/div>\n<h2><a id=\"part6\"><\/a>Examples of dashboards in Looker<\/h2>\n<p>Systemizing the warehouse of Look created, oftentimes you want to receive a ready composition \/ overview of key metrics, that could be displayed on one list.<br \/>\nFor these purposes dashboard creation fits perfectly. Dashboard is created either on the wing, or using previously created Look. One of the dashboard’s “tricks” is configuration of parameters, that are changed on all the dashboard and can be applied to all the Look at the same time.<\/p>\n<div class=\"e2-text-picture\">\n<div class=\"fotorama\" data-width=\"2560\" data-ratio=\"1.8591140159768\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/dashboard@2x.png.jpg\" width=\"2560\" height=\"1377\" alt=\"\" \/>\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/dashboard_v2@2x.png.jpg\" width=\"2560\" height=\"1456\" alt=\"\" \/>\n<\/div>\n<\/div>\n<h2>Interesting features in one line<\/h2>\n<ul>\n<li>In Looker you can refer to other reports and, using such function, you can create a dynamic parameter, that is passed on by a link.<br \/>\nFor example, you’ve created a report with division of revenue by countries, and in this report you can refer to the dashboard on a separate country. Following the link, a user sees the dashboard on a specific country, that he clicked on.<\/li>\n<li>On every Looker page there is a chat, where support service answers very promptly<\/li>\n<li>Looker is not able to work with data merge on the level of various DBMS, however it can combine the data on the level of ready Look (in our case, this function works really weird).<\/li>\n<li>Within the framework of work with various models, I have found out an extremely non-trivial use of SQL for calculation of unique values in a non-normalized data table, Looker calls it <a href=\"https:\/\/help.looker.com\/hc\/en-us\/articles\/360023722974-A-Simple-Explanation-of-Symmetric-Aggregates-or-Why-On-Earth-Does-My-SQL-Look-Like-That-\">symmetric aggregates<\/a>.<br \/>\nSQL, indeed, looks very non-trivial:<\/li>\n<\/ul>\n<pre class=\"e2-text-code\"><code>SELECT \r\n order_items.order_id AS &quot;order_items.order_id&quot;,\r\n order_items.sale_price AS &quot;order_items.sale_price&quot;,\r\n (COALESCE(CAST( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(users.age ,0)\r\n *(1000000*1.0)) AS DECIMAL(38,0))) + \r\n CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))\r\n * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0)) ) \r\n - SUM(DISTINCT CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))\r\n * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))) ) \r\n AS DOUBLE PRECISION) \r\n \/ CAST((1000000*1.0) AS DOUBLE PRECISION), 0) \r\n \/ NULLIF(COUNT(DISTINCT CASE WHEN users.age IS NOT NULL THEN users.id \r\n ELSE NULL END), 0)) AS &quot;users.average_age&quot;\r\nFROM order_items AS order_items\r\nLEFT JOIN users AS users ON order_items.user_id = users.id\r\n\r\nGROUP BY 1,2\r\nORDER BY 3 DESC\r\nLIMIT 500<\/code><\/pre><ul>\n<li>At implementation of Looker to a purchase, JumpStart Kit is mandatory, which costs not less than $6k. Within this kit you receive support and consultation from Looker at tool implementation.<\/li>\n<\/ul>\n",
            "date_published": "2020-01-08T11:22:02+03:00",
            "date_modified": "2020-05-13T14:25:54+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/db-list@2x.png",
            "_date_published_rfc2822": "Wed, 08 Jan 2020 11:22:02 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "20",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/fotorama\/fotorama.css",
                    "system\/library\/fotorama\/fotorama.js",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/db-list@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/connection-setting@2x.png.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/lookml@2x.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/explore-tab@2x.png.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/vis@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/look@2x.png.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/dashboard@2x.png.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/dashboard_v2@2x.png.jpg"
                ]
            }
        },
        {
            "id": "13",
            "url": "https:\/\/en.leftjoin.ru\/all\/excel-chart-matrix-bcg\/",
            "title": "Diagram of BCG (Boston Consulting Group) Matrix",
            "content_html": "<p>I will water down the blog with an interesting report, that was developed for Yota company on November, 2011. <a href=\"https:\/\/ru.wikipedia.org\/wiki\/%D0%9C%D0%B0%D1%82%D1%80%D0%B8%D1%86%D0%B0_%D0%91%D0%9A%D0%93\">BCG Matrix<\/a> has inspired us to develop this report.<\/p>\n<p>We had: one Excel package, 75 VBA macro, ODBC connection to Oracle, SQL queries to databases of all sorts and colours. We will review report construction within this stack, but first, let’s speak about the very idea of the report.<\/p>\n<p>BCG Matrix – is 2x2 matrix, whereon the clients’ segments are displayed by circumferences with their centres in the intersection of coordinates, formed by the relevant paces of two indicators selected.<\/p>\n<p>To make it simple, we had to divide all the clients of the company into 4 segments: ARPU above average\/below average, traffic consumption (main service) above average\/below average. Thus, it turned out that 4 quadrants appear, and you need to place a bubble chart into each one of them, whereas the size of a bubble means the total amount of users within a segment. In addition to that, one more bubble was added to each quadrant (smaller one), that showcased the churn in each segment (author’s improvement).<\/p>\n<p><b>What did we want to get at the output?<\/b><br \/>\nA chart of the following type:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/bubbles-chart@2x.png\" width=\"1039\" height=\"681\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Representation of the BCG matrix on the data of Yota company<\/div>\n<\/div>\n<p>The task statement is more or less clear, let’s move to the realization.<br \/>\nLet’s assume, that we’ve already collected all the required data (meaning that, we’ve learned to identify the average ARPU and average traffic consumption, in this post we won’t examine SQL-query), then the paramount task lies in understanding how to display the bubbles in the required places by means of Excel tools.<\/p>\n<p>For this aim, a bubble chart comes to help:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/viz-type@2x.png\" width=\"252\" height=\"345\" alt=\"\" \/>\n<div class=\"e2-text-caption\"><i>Insert – Chart – Bubble<\/i><\/div>\n<\/div>\n<p>Going to the menu <i>Selection of data source<\/i> and evaluating, what is required in order to build a chart in the type that we need: coordinates <i>X<\/i>, coordinates <i>Y<\/i>, values of bubbles’ sizes.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/data-source@2x.png\" width=\"518\" height=\"529\" alt=\"\" \/>\n<\/div>\n<p>Great, so it turns out that if we assume that our chart will be located in coordinates on the <i>X<\/i> axis from -1 to 1, and on the <i>Y<\/i> axis from -1 to 1, then the centre of the right upper bubble will be the spot (0.5; 0.5) on the chart. Likewise, we’ll place all the other bubbles.<\/p>\n<p>We should separately consider the bubbles of <i>Churn<\/i> type (for displaying of the churn), they are located more to the right then the main bubble and might intersect with it, therefore we will place the right upper bubble to empirically obtained coordinates (0.65; 0.35).<\/p>\n<p>Thus, for four main and four additional bubbles, we can organize the data as follows:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/bubbles-data@2x.png\" width=\"562\" height=\"97\" alt=\"\" \/>\n<\/div>\n<p>Let’s review more thoroughly how we’ll use them:<\/p>\n<div class=\"e2-text-picture\">\n<div class=\"fotorama\" data-width=\"535\" data-ratio=\"0.9006734006734\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/data-source-active@2x.png\" width=\"535\" height=\"594\" alt=\"\" \/>\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/data-source-churn@2x.png\" width=\"534\" height=\"591\" alt=\"\" \/>\n<\/div>\n<\/div>\n<p>So, we set on X-axis – horizontal coordinates of the centres of our bubbles, that lie in the cells <i>A9:A12<\/i>, on Y-axis – vertical coordinates of the centres of our bubbles, that lie in the cells <i>B9:B12<\/i>, and the sizes of the bubbles are stored in the cells <i>E9:E12<\/i>.<br \/>\nFurthermore, we add another data set for the Churn, once more indicating all the required parameters.<\/p>\n<p>We’ll get the following chart:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/bubbles-preparing@2x.png\" width=\"503\" height=\"423\" alt=\"\" \/>\n<\/div>\n<p>Then, we’re making it pretty: changing colours, deleting axis and getting a beautiful result.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/bubbles-preparing-step2@2x.png\" width=\"568\" height=\"458\" alt=\"\" \/>\n<\/div>\n<p>By adding the required data labels, we receive what we initially needed in the task.<\/p>\n<p>Share your experience in comments – did you build such charts and how you solved the task?<\/p>\n",
            "date_published": "2019-11-19T10:38:11+03:00",
            "date_modified": "2020-05-12T11:24:25+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/bubbles-chart@2x.png",
            "_date_published_rfc2822": "Tue, 19 Nov 2019 10:38:11 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "13",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/fotorama\/fotorama.css",
                    "system\/library\/fotorama\/fotorama.js"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/bubbles-chart@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/viz-type@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/data-source@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/bubbles-data@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/data-source-active@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/data-source-churn@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/bubbles-preparing@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/bubbles-preparing-step2@2x.png"
                ]
            }
        },
        {
            "id": "12",
            "url": "https:\/\/en.leftjoin.ru\/all\/retention-rate\/",
            "title": "How to calculate Retention?",
            "content_html": "<p>In this post we will discover, how to properly construct a report on Retention with application of <a href=\"all\/redash-full-fledged-on-demand-analytics\/\">Redash<\/a> and SQL language.<br \/>\nFor starters, let’s explain in a nutshell what the metric <b>Retention rate<\/b> is, why it is important,<\/p>\n<h2>Retention rate<\/h2>\n<p><b>Retention rate<\/b> metric is widespread and is particularly popular within the mobile industry, since it allows to understand how well a product engages the users into daily use. Let’s recall (or discover), how <b>Retention<\/b> is calculated:<\/p>\n<p>Retention of day <i>X<\/i> – is <i>N%<\/i> of users that will return to the product on day <i>X<\/i>. In other words, if on some specific day (day 0) 100 new users came, and 15 returned on the first day, then Retention of the 1st day will be equal to 15\/100=15%.<br \/>\nMost commonly, Retention of days 1, 3, 7 and 30 are singled out as the most descriptive metrics of a product, however it’s useful to address Retention curve as a whole and make conclusions, proceeding from it.<\/p>\n<h2>Retention curve<\/h2>\n<p>In the end, we are interested in construction of such curve, that shows the retention of users from day 0 to day 30.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/Retention@2x.png\" width=\"585\" height=\"200\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Retention rate curve from day 0 do day 30<\/div>\n<\/div>\n<h2>Rolling Retention (RR)<\/h2>\n<p>Besides classic Retention rate, Rolling Retention (hereinafter, RR) is allocated. At calculation of RR, apart from day X, all the subsequent days are also considered. Thus, RR of the 1st day – the amount of users who returned on the 1st and subsequent days.<\/p>\n<p>Let’s compare Retention and Rolling Retention of the 10th day:<br \/>\n<b>Retention<sub>10<\/sub><\/b> — the amount of users, who returned on the 10th day \/ the amount of users, who installed the app 10 days ago * 100%.<br \/>\n<b>Rolling Retention<sub>10<\/sub><\/b> — the amount of users, who returned on the 10th day <i>or later<\/i> \/ the amount of users, who installed the app 10 days ago * 100%.<\/p>\n<h2>Granularity (retention of time periods)<\/h2>\n<p>In some industries and respective tasks, it is useful to understand the Retention of a specific day (most often, in the mobile industry), in other cases it is useful to understand the retention of users on various time intervals: for example, weekly or monthly periods (oftentimes, it’s handy in e-commerce, retail).<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/Monthly-Retention@2x.png\" width=\"669\" height=\"527\" alt=\"\" \/>\n<div class=\"e2-text-caption\">An example of cohorts by months and monthly Retention respective thereto<\/div>\n<\/div>\n<h2>How to build a Retention report on SQL language?<\/h2>\n<p>We have sorted out above how to calculate Retention in formulas. Now let’s apply it with SQL language.<br \/>\nLet’s assume, that we have two tables: <i>user<\/i> — storing data about users’ identifiers and meta-information, <i>client_session<\/i> — information on visits of the mobile app by users.<br \/>\nOnly these two tables will be present in the query, so you can easily adapt the query to yourself.<br \/>\n<i>note<\/i>: within this code, I am using Impala as DBMS.<\/p>\n<h3>Collecting the size of cohorts<\/h3>\n<pre class=\"e2-text-code\"><code>SELECT from_unixtime(user.installed_at, &quot;yyyy-MM-dd&quot;) AS reg_date,\r\n          ndv(user.id) AS users\r\n   FROM USER\r\n   WHERE from_unixtime(user.installed_at)&gt;=date_add(now(), -60)\r\n     AND from_unixtime(user.installed_at)&lt;=date_add(now(), -31)\r\n   GROUP BY 1<\/code><\/pre><p>Let’s sort out this pretty simple query: for every day we calculate the number of unique users for the period [60 days ago; 31 days ago].<br \/>\nIn order not to mess with documentation: command <i>ndv()<\/i> in Impala is analogue of a command <i>count(distinct)<\/i>.<\/p>\n<h3>Calculating the number of returned users on each cohort<\/h3>\n<pre class=\"e2-text-code\"><code>SELECT from_unixtime(user.installed_at, &quot;yyyy-MM-dd&quot;) AS reg_date,\r\n          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,\r\n          ndv(user.id) AS ret_base\r\n   FROM USER\r\n   LEFT JOIN client_session cs ON cs.user_id=user.id\r\n   WHERE 1=1\r\n     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30\r\n     AND from_unixtime(user.installed_at)&gt;=date_add(now(), -60)\r\n     AND from_unixtime(user.installed_at)&lt;=date_add(now(), -31)\r\n   GROUP BY 1, 2<\/code><\/pre><p>In this query, the key part is contained in the command <i>datediff<\/i>: now we are calculating for each cohort and for each <i>datediff<\/i> the number of unique users with the very same command <i>ndv()<\/i> (practically, the number of users, who returned within the days from 0 to 30).<\/p>\n<p>Great, now we have the size of cohorts and the number of returned users.<\/p>\n<h3>Combining all together<\/h3>\n<pre class=\"e2-text-code\"><code>SELECT reg.reg_date AS date_registration,\r\n       reg.users AS cohort_size,\r\n       cohort.date_diff AS day_difference,\r\n       cohort.ret_base AS retention_base,\r\n       cohort.ret_base\/reg.users AS retention_rate\r\nFROM\r\n  (SELECT from_unixtime(user.installed_at, &quot;yyyy-MM-dd&quot;) AS reg_date,\r\n          ndv(user.id) AS users\r\n   FROM USER\r\n   WHERE from_unixtime(user.installed_at)&gt;=date_add(now(), -60)\r\n     AND from_unixtime(user.installed_at)&lt;=date_add(now(), -31)\r\n   GROUP BY 1) reg\r\nLEFT JOIN\r\n  (SELECT from_unixtime(user.installed_at, &quot;yyyy-MM-dd&quot;) AS reg_date,\r\n          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,\r\n          ndv(user.id) AS ret_base\r\n   FROM USER\r\n   LEFT JOIN client_session cs ON cs.user_id=user.id\r\n   WHERE 1=1\r\n     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30\r\n     AND from_unixtime(user.installed_at)&gt;=date_add(now(), -60)\r\n     AND from_unixtime(user.installed_at)&lt;=date_add(now(), -31)\r\n   GROUP BY 1, 2) cohort ON reg.reg_date=cohort.reg_date\r\n    ORDER BY 1,3<\/code><\/pre><p>We have received the query, that calculates <b>Retention<\/b> for each cohort, and, eventually, the result can be displayed as follows:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/Cohort-retention@2x.png\" width=\"585\" height=\"200\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Retention rate, calculated for each cohort of users<\/div>\n<\/div>\n<h3>Construction of the sole Retention curve<\/h3>\n<p>Let’s modify our query a bit and obtain the data for construction of one Retention curve:<\/p>\n<pre class=\"e2-text-code\"><code>SELECT \r\n       cohort.date_diff AS day_difference,\r\n       avg(reg.users) AS cohort_size,\r\n       avg(cohort.ret_base) AS retention_base,\r\n       avg(cohort.ret_base)\/avg(reg.users)*100 AS retention_rate\r\nFROM\r\n  (SELECT from_unixtime(user.installed_at, &quot;yyyy-MM-dd&quot;) AS reg_date,\r\n          ndv(user.id) AS users\r\n   FROM USER\r\n   WHERE from_unixtime(user.installed_at)&gt;=date_add(now(), -60)\r\n     AND from_unixtime(user.installed_at)&lt;=date_add(now(), -31)\r\n   GROUP BY 1) reg\r\nLEFT JOIN\r\n  (SELECT from_unixtime(user.installed_at, &quot;yyyy-MM-dd&quot;) AS reg_date,\r\n          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,\r\n          ndv(user.id) AS ret_base\r\n   FROM USER\r\n   LEFT JOIN client_session cs ON cs.user_id=user.id\r\n   WHERE 1=1\r\n     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30\r\n     AND from_unixtime(user.installed_at)&gt;=date_add(now(), -60)\r\n     AND from_unixtime(user.installed_at)&lt;=date_add(now(), -31)\r\n   GROUP BY 1,2) cohort ON reg.reg_date=cohort.reg_date\r\n    GROUP BY 1        \r\n    ORDER BY 1<\/code><\/pre><p>Now, we have average by all the cohorts <b>Retention rate<\/b>, calculated for each day.<\/p>\n<h2>More on the subject<\/h2>\n<ul>\n<li><a href=\"https:\/\/gopractice.ru\/retention\/\">How to create products, forming habits?<\/a><\/li>\n<li><a href=\"https:\/\/www.braze.com\/blog\/calculate-retention-rate\/\">Top 3 Ways To Calculate User Retention Rate With Formulas<\/a><\/li>\n<\/ul>\n",
            "date_published": "2019-11-03T16:27:55+03:00",
            "date_modified": "2020-05-13T14:20:47+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/Retention@2x.png",
            "_date_published_rfc2822": "Sun, 03 Nov 2019 16:27:55 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "12",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/Retention@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/Monthly-Retention@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/Cohort-retention@2x.png"
                ]
            }
        },
        {
            "id": "11",
            "url": "https:\/\/en.leftjoin.ru\/all\/parse-website-with-python-p2\/",
            "title": "Parsing the data of site’s catalogue, using Beautiful Soup and Selenium (part 2)",
            "content_html": "<p>Follow-up <a href=\"https:\/\/www.valiotti.com\/leftjoin\/all\/parse-website-with-python-p1\/\">of the previous article<\/a> on data collection from the famous online catalogue of goods.<br \/>\nIf analyzing the behaviour of page with goods thoroughly, one can notice, that the goods are uploaded dynamically, i.e. scrolling the page down, you will receive a new set of goods, and, thus, the code from the previous article will turn out to be useless for this task.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/optimized_goods.gif\" width=\"640\" height=\"480\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Dynamic uploading of goods on the page<\/div>\n<\/div>\n<p>For these cases, there is also a solution in python – Selenium library, it launches the browser’s engine and emulates human’s behaviour.<\/p>\n<p>In the first part of the script we will assemble a tree of categories similarly <a href=\"https:\/\/www.valiotti.com\/leftjoin\/all\/parse-website-with-python-p1\/\">to the previous article<\/a>, but already using Selenium.<\/p>\n<pre class=\"e2-text-code\"><code>import time\r\nfrom selenium import webdriver\r\nfrom bs4 import BeautifulSoup as bs\r\n\r\nbrowser = webdriver.Chrome()\r\nbrowser.get(&quot;https:\/\/i****.ru\/products?category_id=1-ovoschi-frukty-griby-yagody&amp;from_category=true&quot;)\r\ncookies_1= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': '_igooods_session_cross_domain', 'path': '\/', 'secure': False, 'value': 'WWJFaU8wMTBMSE9uVlR2YnRLKzlvdHE3MVgyTjVlS1JKVm1qMjVNK2JSbEYxcVZNQk9OR3A4VU1LUzZwY1lCeVlTNDVsSkFmUFNSRWt3cXdUYytxQlhnYk5BbnVoZktTMUJLRWQyaWxFeXRsR1ZCVzVnSGJRU0tLVVR0MjRYR2hXbXpaZnRnYWRzV0VnbmpjdjA5T1RzZEFkallmMEVySVA3ZkV3cjU5dVVaZjBmajU5bDIxVkEwbUQvSUVyWGdqaTc5WEJyT2tvNTVsWWx1TEZhQXB1L3dKUXl5aWpOQllEV245VStIajFDdXphWFQxVGVpeGJDV3JseU9lbE1vQmxhRklLa3BsRm9XUkNTakIrWXlDc3I5ZjdZOGgwYmplMFpGRGRxKzg3QTJFSGpkNWh5RmdxZzhpTXVvTUV5SFZnM2dzNHVqWkJRaTlwdmhkclEyNVNDSHJsVkZzeVpBaGc1ZmQ0NlhlSG43YnVHRUVDL0ZmUHVIelNhRkRZSVFYLS05UkJqM24yM0d4bjFBRWFVQjlYSzJnPT0%3D--e17089851778bedd374f240c353f399027fe0fb1'}\r\ncookies_2= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_coordinates_cross_domain', 'path': '\/', 'secure': False, 'value': '%5B59.91815364%2C30.305578%5D'}\r\ncookies_3= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_cross_domain', 'path': '\/', 'secure': False, 'value': '%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3'}\r\nbrowser.add_cookie(cookies_1)\r\nbrowser.add_cookie(cookies_2)\r\nbrowser.add_cookie(cookies_3)\r\nbrowser.get(&quot;https:\/\/i****.ru\/products?category_id=1-ovoschi-frukty-griby-yagody&amp;from_category=true&quot;)\r\nsource_data = browser.page_source\r\nsoup = bs(source_data)\r\ncategories=soup.find_all('div', {'class':['with-children']})\r\ntree = {}\r\nfor x in categories:\r\n    tree[x.findNext('span').text]=x.findNext('a').get('href')<\/code><\/pre><p>In this snippet, the same way as before, by a get-request with parameters, we call the desired browser page and download the data, then we get an object of <b>bs<\/b> class, with which we make the similar operations. Thus, we receive a dictionary <i>tree<\/i>, where URL pages are stored for each category. Subsequently, we will need this dictionary for item-by-item examination in the cycle.<\/p>\n<p>Let’s initiate the data collection for goods. In order to do it, we import the library <i>pandas<\/i> and create a new dataframe with four columns.<\/p>\n<pre class=\"e2-text-code\"><code>import pandas as pd\r\ndf = pd.DataFrame(columns=['SKU', 'Weight', 'Price','Category'])<\/code><\/pre><p>Thereafter, we’ll use our dictionary <i>tree<\/i> and obtain page’s data for each category. You can see the code below. We still want to install cookie, that a user has installed, and also to perform some tricky commands for operation of browser’s engine, that can emulate cursor’s movement down the page.<\/p>\n<pre class=\"e2-text-code\"><code>for cat, link in tree.items():\r\n    browser.maximize_window()\r\n    browser.get('https:\/\/i****.ru'+link)\r\n    cookies_1= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': '_i****_session_cross_domain', 'path': '\/', 'secure': False, 'value': 'WWJFaU8wMTBMSE9uVlR2YnRLKzlvdHE3MVgyTjVlS1JKVm1qMjVNK2JSbEYxcVZNQk9OR3A4VU1LUzZwY1lCeVlTNDVsSkFmUFNSRWt3cXdUYytxQlhnYk5BbnVoZktTMUJLRWQyaWxFeXRsR1ZCVzVnSGJRU0tLVVR0MjRYR2hXbXpaZnRnYWRzV0VnbmpjdjA5T1RzZEFkallmMEVySVA3ZkV3cjU5dVVaZjBmajU5bDIxVkEwbUQvSUVyWGdqaTc5WEJyT2tvNTVsWWx1TEZhQXB1L3dKUXl5aWpOQllEV245VStIajFDdXphWFQxVGVpeGJDV3JseU9lbE1vQmxhRklLa3BsRm9XUkNTakIrWXlDc3I5ZjdZOGgwYmplMFpGRGRxKzg3QTJFSGpkNWh5RmdxZzhpTXVvTUV5SFZnM2dzNHVqWkJRaTlwdmhkclEyNVNDSHJsVkZzeVpBaGc1ZmQ0NlhlSG43YnVHRUVDL0ZmUHVIelNhRkRZSVFYLS05UkJqM24yM0d4bjFBRWFVQjlYSzJnPT0%3D--e17089851778bedd374f240c353f399027fe0fb1'}\r\n    cookies_2= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_coordinates_cross_domain', 'path': '\/', 'secure': False, 'value': '%5B59.91815364%2C30.305578%5D'}\r\n    cookies_3= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_cross_domain', 'path': '\/', 'secure': False, 'value': '%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3'}\r\n    browser.add_cookie(cookies_1)\r\n    browser.add_cookie(cookies_2)\r\n    browser.add_cookie(cookies_3)\r\n    browser.get('https:\/\/i****.ru'+link)\r\n    \r\n    # Script, that searches the end of the page every 3 seconds and is performed until the receipt of new data is finished.\r\n    lenOfPage = browser.execute_script(&quot;window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;&quot;)\r\n    match=False\r\n    while(match==False):\r\n        lastCount = lenOfPage\r\n        time.sleep(3)\r\n        lenOfPage = browser.execute_script(&quot;window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;&quot;)\r\n        if lastCount==lenOfPage:\r\n             match=True<\/code><\/pre><p>Now, we have made it to the end of the page and can collect the data for work of the library <i>beautifulsoup<\/i>.<\/p>\n<pre class=\"e2-text-code\"><code># Collecting data from the page\r\n    source_data = browser.page_source\r\n    soup = bs(source_data)\r\n    skus=soup.find_all('div', {'class':['b-product-small-card']})\r\n    last_value=len(df)+1 if len(df)&gt;0 else 0\r\n    for i,x in enumerate(skus):\r\n        df.loc[last_value+i]=[x.findNext('a').contents[0],\\\r\n                   x.findNext('div',{'class':'product-weight'}).contents[0],\\\r\n                   x.findNext('div',{'class':'g-cart-action small'})['data-price'],\\\r\n                   cat]\r\nbrowser.close()<\/code><\/pre><p>In the code fragment, presented above, we look for all the elements <i>&lt;div><\/i>, that have class – <b>b-product-small-card<\/b>, and then, for every found good, we collect the values of the fields of weight and price.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/source_code@2x.png\" width=\"535\" height=\"340\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Source code of the product card site.<\/div>\n<\/div>\n<p>We launch the script’s performance and go to have a cup of coffee. Voila, now we have pandas dataframe with data of all the goods:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/pd-dataframe@2x.png\" width=\"632\" height=\"405\" alt=\"\" \/>\n<div class=\"e2-text-caption\">DataFrame with goods, collected from the site.<\/div>\n<\/div>\n<p>Now we possess great data for training of the NLP model – names of goods and their affiliation to various categories.<\/p>\n",
            "date_published": "2019-07-30T11:51:25+03:00",
            "date_modified": "2020-01-27T14:24:35+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/optimized_goods.gif",
            "_date_published_rfc2822": "Tue, 30 Jul 2019 11:51:25 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "11",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/optimized_goods.gif",
                    "https:\/\/en.leftjoin.ru\/pictures\/source_code@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/pd-dataframe@2x.png"
                ]
            }
        },
        {
            "id": "8",
            "url": "https:\/\/en.leftjoin.ru\/all\/yandex-datalens-review\/",
            "title": "Overview of Yandex DataLens",
            "content_html": "<p>Let’s take our minds off of the project on receipt data collection for a while. We will speak about the project’s following steps a bit later.<\/p>\n<p>Today we’ll be discussing a new service from <a href=\"https:\/\/datalens.yandex.ru\">Yandex – DataLens<\/a> (the access to demo was kindly provided to me by my great friend <a href=\"https:\/\/fevlake.com\/\">Vasiliy Ozerov <\/a> and the team <a href=\"https:\/\/fevlake.com\">Fevlake <\/a> \/ <a href=\"http:\/\/rebrainme.com\">Rebrain<\/a>). Currently, the service is in <i>Preview<\/i> mode and is, in essence, a cloud BI. The main shtick of the service is that it can easily and handy work with clickhouse (<a href=\"https:\/\/tech.yandex.ru\/clickhouse\/\">Yandex Clickhouse<\/a>).<\/p>\n<h2>Connection of data sources<\/h2>\n<p>Let’s review the major things: connection of a data source and dataset setting.<br \/>\nThe selection of DBMS is not vast, nevertheless some main things are present. For the purpose of our testing, let’s take MySQL.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-11-17@2x.png\" width=\"1001\" height=\"713\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Selection of data sources DataLens<\/div>\n<\/div>\n<p>On the basis of the connection created, it is suggested to create a dataset:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-13-52@2x.png\" width=\"1032\" height=\"698\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Interface of dataset settings, definition of measurements and metrics<\/div>\n<\/div>\n<p>On this stage it’s defined which table’s attributes are becoming measurements, and which are turning into metrics. You can choose data aggregation type for the metrics.<br \/>\nUnfortunately, I didn’t manage to discover how it’s possible to state several interconnected tables (for example, attach a handbook for measurements) instead of a single table. Perhaps, on this stage developers suggest us to solve this issue by creating of required view.<\/p>\n<h2>Data visualization<\/h2>\n<p>Regarding the interface itself – everything is pretty easy and handy. It reminds of a cloud version of Tableau. If comparing to Redash, which is most frequently used in conjunction with Clickhouse, the opportunities of visualization are simply staggering.<br \/>\nEven pivot tables, in which one can use Measure Names as columns’ names are worth something:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-17-35@2x.png\" width=\"854\" height=\"423\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Setting of pivot tables in DataLens<\/div>\n<\/div>\n<p>Obviously, there is an opportunity to make also basic charts in DataLens from Yandex:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-18-15@2x.png\" width=\"1168\" height=\"679\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Construction of a linear chart in DataLens<\/div>\n<\/div>\n<p>There are also area charts:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-20-27@2x.png\" width=\"1165\" height=\"669\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Construction of an area chart in DataLens<\/div>\n<\/div>\n<p>However, I didn’t manage to find out how data classification by months \/ quarters \/ weeks is carried out. According to an example of data, available in the demo version, developers are still solving this issue by creating additional attributes (DayMonth, DayWeek, etc).<\/p>\n<h2>Dashboards<\/h2>\n<p>For now, interface of dashboard blocks’ creation looks bulky, and interface windows are not always comprehensive. Here is, for instance, a window, allowing to state a parameter:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-22-46@2x.png\" width=\"1177\" height=\"615\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Not really apparent setting window for dashboard parameters<\/div>\n<\/div>\n<p>However, in the gallery of examples we can see highly functional and convenient dashboards with selectors, tabs and parameters:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-25-19@2x.png\" width=\"1280\" height=\"691\" alt=\"\" \/>\n<div class=\"e2-text-caption\">An example of a working dashboard with parameters and tabs in DataLens<\/div>\n<\/div>\n<p>Looking forward to fixing of interface shortcomings, improving of Datalens and preparing to use it together with Clickhouse!<\/p>\n",
            "date_published": "2019-04-08T11:42:38+03:00",
            "date_modified": "2020-05-13T14:22:57+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-11-17@2x.png",
            "_date_published_rfc2822": "Mon, 08 Apr 2019 11:42:38 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "8",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-11-17@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-13-52@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-17-35@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-18-15@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-20-27@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-22-46@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2019-04-08_11-25-19@2x.png"
                ]
            }
        },
        {
            "id": "6",
            "url": "https:\/\/en.leftjoin.ru\/all\/collecting-receipts-with-python-p1\/",
            "title": "Collecting data from hypermarket receipts on Python",
            "content_html": "<p>Recently, once again buying products in a hypermarket, I recalled that, according to the Russian Federal Act FZ-54, any trade operator, that issues a receipt, is obliged to send the data thereof to the Tax Service.<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/lenta-receipt@2x.jpg\" width=\"787\" height=\"762\" alt=\"\" \/>\n<div class=\"e2-text-caption\">Receipt from “Lenta” hypermarket. The QR-code of our interest is circled.<\/div>\n<\/div>\n<p>So, what does it mean for us, data analysts? It means that we can know ourselves and our needs better, and also acquire interesting data on own purchases.<\/p>\n<p>Let’s try to assemble a small prototype of an app that will allow to make a dynamic of our purchases within the framework of blog posts’ series. So, we’ll start from the fact, that each receipt has a QR-code, and if you identify it, you’ll receive the following line:<\/p>\n<blockquote>\n<p><i>t=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1<\/i><\/p>\n<\/blockquote>\n<p>This line comprises:<\/p>\n<blockquote>\n<p><i>t<\/i> – timestamp, the time when you made a purchase<br \/>\n<i>s<\/i> – sum of the receipt<br \/>\n<i>fn<\/i> – code number of fss, will be needed further in a request to API<br \/>\n<i>i<\/i> – receipt number, will be needed further in a request to API<br \/>\n<i>fp<\/i> – fiscalsign parameter, will be needed further in a request to API<\/p>\n<\/blockquote>\n<p>Within the solution of the first step, we will parse the receipt data and collect it in <i>pandas<\/i> dataframe, using Python modules.<\/p>\n<p>We will use <a href=\"https:\/\/habr.com\/ru\/post\/358966\/\">API<\/a>, that provides data on the receipt from the Tax Service website.<\/p>\n<p>Initially, we will receive authentication data:<\/p>\n<pre class=\"e2-text-code\"><code>import requests\r\nyour_phone = '+7XXXYYYZZZZ' #you need to state your phone number, SMS with password will arrive thereon\r\nr = requests.post('https:\/\/proverkacheka.nalog.ru:9999\/v1\/mobile\/users\/signup', json = {&quot;email&quot;:&quot;email@email.com&quot;,&quot;name&quot;:&quot;USERNAME&quot;,&quot;phone&quot;:your_phone})<\/code><\/pre><p>As a result of performing POST request we receive a password in SMS to the indicated phone number. Further on, we will be using it in a variable <i><b>pwd<\/b><\/i><\/p>\n<p>Now we’ll parse our line with values from QR-code:<\/p>\n<pre class=\"e2-text-code\"><code>import re\r\nqr_string='t=20190320T2303&amp;s=5803.00&amp;fn=9251440300007971&amp;i=141637&amp;fp=4087570038&amp;n=1'\r\nt=re.findall(r't=(\\w+)', qr_string)[0]\r\ns=re.findall(r's=(\\w+)', qr_string)[0]\r\nfn=re.findall(r'fn=(\\w+)', qr_string)[0]\r\ni=re.findall(r'i=(\\w+)', qr_string)[0]\r\nfp=re.findall(r'fp=(\\w+)', qr_string)[0]<\/code><\/pre><p>We’ll use the variables obtained in order to extract the data.<br \/>\nOne <a href=\"https:\/\/habr.com\/ru\/post\/358966\/\">Habr post<\/a> pretty thoroughly examines status of errors at formation of API request, therefore I won’t repeat this information.<\/p>\n<p>In the beginning, we need to verify the presence of data on this receipt, so we form a GET request.<\/p>\n<pre class=\"e2-text-code\"><code>headers = {'Device-Id':'', 'Device-OS':''}\r\npayload = {'fiscalSign': fp, 'date': t,'sum':s}\r\ncheck_request=requests.get('https:\/\/proverkacheka.nalog.ru:9999\/v1\/ofds\/*\/inns\/*\/fss\/'+fn+'\/operations\/1\/tickets\/'+i,params=payload, headers=headers,auth=(your_phone, pwd))\r\nprint(check_request.status_code)<\/code><\/pre><p>In the request one needs to indicate headers, at least empty ones. In my case, GET request returns error 406, thus I get that such receipt is found (why GET request returns 406 remains a mystery to me, so I will be glad to receive some clues in comments). If not indicating sum or date, GET request returns error 400 – bad request.<\/p>\n<p>Let’s move on to the most interesting part, receiving data of the receipt:<\/p>\n<pre class=\"e2-text-code\"><code>request_info=requests.get('https:\/\/proverkacheka.nalog.ru:9999\/v1\/inns\/*\/kkts\/*\/fss\/'+fn+'\/tickets\/'+i+'?fiscalSign='+fp+'&amp;sendToEmail=no',headers=headers,auth=(your_phone, pwd))\r\nprint(request_info.status_code)\r\nproducts=request_info.json()<\/code><\/pre><p>We should receive code 200 (successful execution of the request), and in the variable <i>products<\/i> – everything, that applies to our receipt.<\/p>\n<p>In order to further work with this data, let’s use <i>pandas<\/i> and transform everything in dataframe.<\/p>\n<pre class=\"e2-text-code\"><code>import pandas as pd\r\nfrom datetime import datetime\r\nmy_products=pd.DataFrame(products['document']['receipt']['items'])\r\nmy_products['price']=my_products['price']\/100\r\nmy_products['sum']=my_products['sum']\/100\r\ndatetime_check = datetime.strptime(t, '%Y%m%dT%H%M') #((https:\/\/docs.python.org\/3\/library\/datetime.html#strftime-and-strptime-behavior formate the date))\r\nmy_products['date']=datetime_check\r\nmy_products.set_index('date',inplace=True)<\/code><\/pre><p>Now we have working pandas.dataframe with receipts, visually it looks as follows:<\/p>\n<div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2019-03-22_17-14-33@2x.png\" width=\"679\" height=\"374\" alt=\"\" \/>\n<div class=\"e2-text-caption\">“Header” of receipt data<\/div>\n<\/div>\n<p>You can construct a bar chart of purchases or observe everything as a box plot:<\/p>\n<pre class=\"e2-text-code\"><code>import matplotlib.pyplot as plt\r\n%matplotlib inline\r\nmy_products['sum'].plot(kind='hist', bins=20)\r\nplt.show()\r\nmy_products['sum'].plot(kind='box')\r\nplt.show()<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/hist_cheques.png\" width=\"386\" height=\"252\" alt=\"\" \/>\n<div class=\"e2-text-caption\">boxplot_cheques.png<\/div>\n<\/div>\n<p>In conclusion, we will simply get descriptive statistics as text, using a command <i>.describe()<\/i>:<\/p>\n<pre class=\"e2-text-code\"><code>my_products.describe()<\/code><\/pre><div class=\"e2-text-picture\">\n<img src=\"https:\/\/en.leftjoin.ru\/pictures\/2019-03-22_17-27-06@2x.png\" width=\"362\" height=\"268\" alt=\"\" \/>\n<\/div>\n<p>It’s convenient to write down data as .csv file, so that the next time you can amend the statistics:<\/p>\n<pre class=\"e2-text-code\"><code>with open('hyper_receipts.csv', 'a') as f:\r\n             my_products.to_csv(f, header=True)<\/code><\/pre>",
            "date_published": "2019-03-22T17:41:37+03:00",
            "date_modified": "2020-01-27T11:33:13+03:00",
            "image": "https:\/\/en.leftjoin.ru\/pictures\/lenta-receipt@2x.jpg",
            "_date_published_rfc2822": "Fri, 22 Mar 2019 17:41:37 +0300",
            "_rss_guid_is_permalink": "false",
            "_rss_guid": "6",
            "_e2_data": {
                "is_favourite": false,
                "links_required": [
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css",
                    "system\/library\/highlight\/highlight.js",
                    "system\/library\/highlight\/highlight.css"
                ],
                "og_images": [
                    "https:\/\/en.leftjoin.ru\/pictures\/lenta-receipt@2x.jpg",
                    "https:\/\/en.leftjoin.ru\/pictures\/2019-03-22_17-14-33@2x.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/hist_cheques.png",
                    "https:\/\/en.leftjoin.ru\/pictures\/2019-03-22_17-27-06@2x.png"
                ]
            }
        }
    ],
    "_e2_version": 3386,
    "_e2_ua_string": "E2 (v3386; Aegea)"
}