10 posts tagged

analysis

Mean VS median: how to choose a target metric?

Estimated read time – 8 min

In today’s article, we would like to highlight a simple, but important topic – how to choose a simple metric to evaluate a particular dataset. Everyone has been familiar with the arithmetic mean for a long time, almost every student knows very well that you should sum up all the available values, divide by their number and get the average value. However, school knowledge does not include any alternative options, of which, in fact, there are many in statistics – almost for every occasion. In solving research and marketing problems, people often take mean as a target. Is this legitimate or is there a better option? Let’s figure it out.

To begin with, it is worth remembering the definitions of the two metrics that we will talk about today.
Mean is the most popular statistic used to calculate a data center. What is the median? Median is a value that splits data, sorted in order of increasing values, into two equal parts. This means that the median shows the central value in the sample if the number of cases is odd and the arithmetic mean of the two values ​​if the number of cases in the sample is even.

Research tasks

So, the estimation of the sample mean is often important in many research questions. For instance, specialists studying demography often study changes in the number of regions in Russia in order to track the dynamics and reflect it in reports. Let’s try to calculate the average size of the Russian city, as well as the median, and then compare the results.
First, you need to find and load data by connecting the pandas library for this.

import pandas as pd
city = pd.read_csv('city.csv', sep = ';')

Then, you need to calculate the mean and median of the sample.

mean_pop = round (city.population_2020.mean (), 0)
median_pop = round (city.population_2020.median (), 0)

The values, of course, are different, since the distribution of observations in the sample is different from the normal one. In order to understand whether they are very different, let’s build a distribution graph and display the mean and median.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_palette('rainbow')
fig = plt.figure(figsize = (20, 15))
ax = fig.add_subplot(1, 1, 1)
g = sns.histplot(data = city, x= 'population_2020', alpha=0.6, bins = 100, ax=ax)

g.axvline(mean_pop, linewidth=2, color='r', alpha=0.9, linestyle='--', label = 'Mean = {:,.0f}'.format(mean_pop).replace(',', ' '))
g.axvline(median_pop, linewidth=2, color='darkgreen', alpha=0.9, linestyle='--', label = 'Median = {:,.0f}'.format(median_pop).replace(',', ' '))

plt.ticklabel_format(axis='x', style='plain')
plt.xlabel("Population", fontsize=20)
plt.ylabel("Number of cities", fontsize=20)
plt.title("Distribution of population of russian cities", fontsize=20)
plt.legend(fontsize="xx-large")
plt.show()

Also, on this data it is worth building a boxplot for more accurate visualization with the main distribution quantiles, median, mean and outliers.

fig = plt.figure(figsize = (10, 10))
sns.set_theme(style="whitegrid")
sns.set_palette(palette="pastel")

sns.boxplot(y = city['population_2020'], showfliers = False)

plt.scatter(0, 550100, marker='*', s=100, color = 'black', label = 'Outlier')
plt.scatter(0, 560200, marker='*', s=100, color = 'black')
plt.scatter(0, 570300, marker='*', s=100, color = 'black')
plt.scatter(0, mean_pop, marker='o', s=100, color = 'red', edgecolors = 'black', label = 'Mean')
plt.legend()

plt.ylabel("Population", fontsize=15)
plt.ticklabel_format(axis='y', style='plain')
plt.title("Boxplot of population of russian cities", fontsize=15)
plt.show()

It follows from the graphs that the median is significantly less than the average, and it is also clear that this is a consequence of the presence of outliers which are Moscow and St. Petersburg. Since the arithmetic mean is an extremely sensitive metric to outliers, it is not worth relying on conclusions about the mean if they are present in the sample. An increase or decrease in the population of Moscow can greatly change the average population in Russia, but this will not affect the real regional trend.
Using the arithmetic mean, we say that the number of a typical (average) city in the Russian Federation is 268 thousand people. However, this misleads us, since the mean value is significantly higher than the median solely due to the size of the population of Moscow and St. Petersburg. In fact, the number of a typical Russian city is significantly less (2 times less!) and is approximately 104 thousand citizens.

Marketing tasks

In a business tasks, the difference between the mean and the median is also important, as using the wrong metric can seriously affect the results of the advertising campaign or make it difficult to achieve the goal. Let’s take a look at a real example of the difficulties an entrepreneur can face in retail if he chooses the wrong target metric.
To begin with, as in the previous example, let’s load a dataset about supermarket purchases. Let’s select the dataset columns necessary for analysis and rename them to simplify the code in the future. Since this data is not as well prepared as the previous ones, it is necessary to group all purchased items by receipts. In this case, it is necessary to group by two variables: by the customer’s id and by the date of purchase (the date and time is determined by the moment of closing the bill, therefore, all purchases within one bill coincide by date variable). Then, let’s name the resulting column “total_bill”, that is, the check amount and calculate the average and median.

df = pd.read_excel ('invoice_data.xlsx')
df.columns = ['user', 'total_price', 'date']
groupped_df = pd.DataFrame (df.groupby (['user', 'date']). total_price.sum ())
groupped_df.columns = ['total_bill']
mean_bill = groupped_df.total_bill.mean ()
median_bill = groupped_df.total_bill.median ()

Now, as in the previous example, you need to plot the distribution of customer checks and boxplot, and also display the median and mean on each of them.

sns.set_palette('rainbow')
fig = plt.figure(figsize = (20, 15))
ax = fig.add_subplot(1, 1, 1)
sns.histplot(groupped_df, x = 'total_bill', binwidth=200, alpha=0.6, ax=ax)
plt.xlabel("Purchases", fontsize=20)
plt.ylabel("Total bill", fontsize=20)
plt.title("Distribution of total bills", fontsize=20)
plt.axvline(mean_bill, linewidth=2, color='r', alpha=1, linestyle='--', label = 'Mean = {:.0f}'.format(mean_bill))
plt.axvline(median_bill, linewidth=2, color='darkgreen', alpha=1, linestyle='--', label = 'Median = {:.0f}'.format(median_bill))
plt.legend(fontsize="xx-large")
plt.show()
fig = plt.figure(figsize = (10, 10))
sns.set_theme(style="whitegrid")
sns.set_palette(palette="pastel")

sns.boxplot(y = groupped_df['total_bill'], showfliers = False)

plt.scatter(0, 1800, marker='*', s=100, color = 'black', label = 'Outlier')
plt.scatter(0, 1850, marker='*', s=100, color = 'black')
plt.scatter(0, 1900, marker='*', s=100, color = 'black')
plt.scatter(0, mean_bill, marker='o', s=100, color = 'red', edgecolors = 'black', label = 'Mean')
plt.legend()

plt.ticklabel_format(axis='y', style='plain')
plt.ylabel("Total bill", fontsize=15)
plt.title("Boxplot of total bills", fontsize=15)
plt.show()

The graphs show that the distribution is different from normal, which means that the median and mean are not equal. The median value is smaller than the average by about 220 rubles.
Now, imagine that marketers have a task to increase the average buyer’s bill. A marketer may decide that since the average check is 601 rubles, the following promotion can be offered: “All buyers who make a purchase over 600 rubles, will get 20% discount on any good for 100 rubles.” In general, it is a reasonable offer, however, in reality, the average check is lower – 378 rubles. Thus, the majority of buyers will not be interested in the offer, since their purchase usually does not reach the proposed threshold. This means, that they will not take advantage of the offer and will not receive a discount, and the company will not be able to achieve its goal and increase the profit. The point is that the initial assumptions were wrong.

Conclusions

As you already understood, the mean often shows a more pleasant result, both for business and for research tasks, because it is always nicer to imagine the situation with the average check or the demographic situation in the country better than it really is. However, one must always remember about the shortcomings of mean in order to be able to correctly choose the appropriate analogue for assessing a particular situation.

 No comments    95   6 mon   analysis   matplotlib   pandas   python

Python and lyrics of Zemfira’s new album: capturing the spirit of her songs

Estimated read time – 16 min

Zemfira’s latest studio album, Borderline, was released in February, 8 years after the previous one. For this album, various people cooperated with her, including her relatives – the riff for the song “Таблетки” was written by her nephew from London. The album turned out to be diverse: for instance, the song “Остин” is dedicated to the main character of the Homescapes game by the Russian studio Playrix (by the way, check out the latest Business Secrets with the Bukhman brothers, they also mention it there). Zemfira liked the game a lot, thus, she contacted Playrix to create this song. Also, the song “Крым” was written as a soundtrack to a new film by Zemfira’s colleague Renata Litvinova.

Listen new album in Apple Music / Яндекс.Музыке / Spotify

Nevertheless, the spirit of the whole album is rather gloomy – the songs often repeat the words ‘боль’, ‘ад’, ‘бесишь’ and other synonyms. We decided to conduct an exploratory analysis of her album, and then, using the Word2Vec model and a cosine measure, look at the semantic closeness of the songs and calculate the general mood of the album.

For those who are bored with reading about data preparation and analysis steps, you can go directly to the results.

Data preparation

For starters, we write a data processing script. The purpose of the script is to collect a united csv-table from a set of text files, each of which contains a song. At the same time, we have to get rid of all punctuation marks and unnecessary words as we need to focus only on significant content.

import pandas as pd
import re
import string
import pymorphy2
from nltk.corpus import stopwords

Then we create a morphological analyzer and expand the list of everything that needs to be discarded:

morph = pymorphy2.MorphAnalyzer()
stopwords_list = stopwords.words('russian')
stopwords_list.extend(['куплет', 'это', 'я', 'мы', 'ты', 'припев', 'аутро', 'предприпев', 'lyrics', '1', '2', '3', 'то'])
string.punctuation += '—'

The names of the songs are given in English, so we have to create a dictionary for translation into Russian and a dictionary, from which we will later make a table:

result_dict = dict()

songs_dict = {
    'snow':'снег идёт',
    'crimea':'крым',
    'mother':'мама',
    'ostin':'остин',
    'abuse':'абьюз',
    'wait_for_me':'жди меня',
    'tom':'том',
    'come_on':'камон',
    'coat':'пальто',
    'this_summer':'этим летом',
    'ok':'ок',
    'pills':'таблетки'
}

Let’s define several necessary functions. The first one reads the entire song from the file and removes line breaks, the second clears the text from unnecessary characters and words, and the third one converts the words to normal form, using the pymorphy2 morphological analyzer. The pymorphy2 module does not always handle ambiguity well – additional processing is required for the words ‘ад’ and ‘рай’.

def read_song(filename):
    f = open(f'{filename}.txt', 'r').read()
    f = f.replace('\n', ' ')
    return f

def clean_string(text):
    text = re.split(' |:|\.|\(|\)|,|"|;|/|\n|\t|-|\?|\[|\]|!', text)
    text = ' '.join([word for word in text if word not in string.punctuation])
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in stopwords_list])
    return text

def string_to_normal_form(string):
    string_lst = string.split()
    for i in range(len(string_lst)):
        string_lst[i] = morph.parse(string_lst[i])[0].normal_form
        if (string_lst[i] == 'аду'):
            string_lst[i] = 'ад'
        if (string_lst[i] == 'рая'):
            string_lst[i] = 'рай'
    string = ' '.join(string_lst)
    return string

After all this preparation, we can get back to the data and process each song and read the file with the corresponding name:

name_list = []
text_list = []
for song, name in songs_dict.items():
    text = string_to_normal_form(clean_string(read_song(song)))
    name_list.append(name)
    text_list.append(text)

Then we combine everything into a DataFrame and save it as a csv-file.

df = pd.DataFrame()
df['name'] = name_list
df['text'] = text_list
df['time'] = [290, 220, 187, 270, 330, 196, 207, 188, 269, 189, 245, 244]
df.to_csv('borderline.csv', index=False)

Result:

Word cloud for the whole album

To begin with the analysis, we have to construct a word cloud, because it can display the most common words found in these songs. We import the required libraries, read the csv-file and set the configurations:

import nltk
from wordcloud import WordCloud
import pandas as pd
import matplotlib.pyplot as plt
from nltk import word_tokenize, ngrams

%matplotlib inline
nltk.download('punkt')
df = pd.read_csv('borderline.csv')

Now we create a new figure, set the design parameters and, using the word cloud library, display words with the size directly proportional to the frequency of the word. We additionally indicate the name of the song above the corresponding graph.

fig = plt.figure()
fig.patch.set_facecolor('white')
plt.subplots_adjust(wspace=0.3, hspace=0.2)
i = 1
for name, text in zip(df.name, df.text):
    tokens = word_tokenize(text)
    text_raw = " ".join(tokens)
    wordcloud = WordCloud(colormap='PuBu', background_color='white', contour_width=10).generate(text_raw)
    plt.subplot(4, 3, i, label=name,frame_on=True)
    plt.tick_params(labelsize=10)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(name,fontdict={'fontsize':7,'color':'grey'},y=0.93)
    plt.tick_params(labelsize=10)
    i += 1

EDA of the lyrics

Let us move to the next part and analyze the lyrics. To do this, we have to import special libraries to deal with data and visualization:

import plotly.graph_objects as go
import plotly.figure_factory as ff
from scipy import spatial
import collections
import pymorphy2
import gensim

morph = pymorphy2.MorphAnalyzer()

Firstly, we should count the overall number of words in each song, the number of unique words, and their percentage:

songs = []
total = []
uniq = []
percent = []

for song, text in zip(df.name, df.text):
    songs.append(song)
    total.append(len(text.split()))
    uniq.append(len(set(text.split())))
    percent.append(round(len(set(text.split())) / len(text.split()), 2) * 100)

All this information should be written in a DataFrame and additionally we want to count the number of words per minute for each song:

df_words = pd.DataFrame()
df_words['song'] = songs
df_words['total words'] = total
df_words['uniq words'] = uniq
df_words['percent'] = percent
df_words['time'] = df['time']
df_words['words per minute'] = round(total / (df['time'] // 60))
df_words = df_words[::-1]

It would be great to visualize the data, so let us build two bar charts: one for the number of words in the song, and the other one for the number of words per minute.

colors_1 = ['rgba(101,181,205,255)'] * 12
colors_2 = ['rgba(62,142,231,255)'] * 12

fig = go.Figure(data=[
    go.Bar(name='📝 Total number of words,
           text=df_words['total words'],
           textposition='auto',
           x=df_words.song,
           y=df_words['total words'],
           marker_color=colors_1,
           marker=dict(line=dict(width=0)),),
    go.Bar(name='🌀 Unique words',
           text=df_words['uniq words'].astype(str) + '<br>'+ df_words.percent.astype(int).astype(str) + '%' ,
           textposition='inside',
           x=df_words.song,
           y=df_words['uniq words'],
           textfont_color='white',
           marker_color=colors_2,
           marker=dict(line=dict(width=0)),),
])

fig.update_layout(barmode='group')

fig.update_layout(
    title = 
        {'text':'<b>The ratio of the number of unique words to the total</b><br><span style="color:#666666"></span>'},
    showlegend = True,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)',
)
fig.update_layout(legend=dict(
    yanchor="top",
    xanchor="right",
))

fig.show()
colors_1 = ['rgba(101,181,205,255)'] * 12
colors_2 = ['rgba(238,85,59,255)'] * 12

fig = go.Figure(data=[
    go.Bar(name='⏱️ Track length, min.',
           text=round(df_words['time'] / 60, 1),
           textposition='auto',
           x=df_words.song,
           y=-df_words['time'] // 60,
           marker_color=colors_1,
           marker=dict(line=dict(width=0)),
          ),
    go.Bar(name='🔄 Words per minute',
           text=df_words['words per minute'],
           textposition='auto',
           x=df_words.song,
           y=df_words['words per minute'],
           marker_color=colors_2,
           textfont_color='white',
           marker=dict(line=dict(width=0)),
          ),
])

fig.update_layout(barmode='overlay')

fig.update_layout(
    title = 
        {'text':'<b>Track length and words per minute</b><br><span style="color:#666666"></span>'},
    showlegend = True,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)'
)


fig.show()

Working with Word2Vec model

Using the gensim module, load the model pointing to a binary file:

model = gensim.models.KeyedVectors.load_word2vec_format('model.bin', binary=True)

Для материала мы использовали готовую обученную на Национальном Корпусе Русского Языка модель от сообщества RusVectōrēs

The Word2Vec model is based on neural networks and allows you to represent words in the form of vectors, taking into account the semantic component. It means that if we take two words – for instance, “mom” and “dad”, then represent them as two vectors and calculate the cosine, the values ​​will be close to 1. Similarly, two words that have nothing in common in their meaning have a cosine measure close to 0.

Now we will define the get_vector function: it will take a list of words, recognize a part of speech for each word, and then receive and summarize vectors, so that we can find vectors even for whole sentences and texts.

def get_vector(word_list):
    vector = 0
    for word in word_list:
        pos = morph.parse(word)[0].tag.POS
        if pos == 'INFN':
            pos = 'VERB'
        if pos in ['ADJF', 'PRCL', 'ADVB', 'NPRO']:
            pos = 'NOUN'
        if word and pos:
            try:
                word_pos = word + '_' + pos
                this_vector = model.word_vec(word_pos)
                vector += this_vector
            except KeyError:
                continue
    return vector

For each song, find a vector and select the corresponding column in the DataFrame:

vec_list = []
for word in df['text']:
    vec_list.append(get_vector(word.split()))
df['vector'] = vec_list

So, now we should compare these vectors with one another, calculating their cosine proximity. Those songs with a cosine metric higher than 0.5 will be saved separately – this way we will get the closest pairs of songs. We will write the information about the comparison of vectors into the two-dimensional list result.

similar = dict()
result = []
for song_1, vector_1 in zip(df.name, df.vector):
    sub_list = []
    for song_2, vector_2 in zip(df.name.iloc[::-1], df.vector.iloc[::-1]):
        res = 1 - spatial.distance.cosine(vector_1, vector_2)
        if res > 0.5 and song_1 != song_2 and (song_1 + ' / ' + song_2 not in similar.keys() and song_2 + ' / ' + song_1 not in similar.keys()):
            similar[song_1 + ' / ' + song_2] = round(res, 2)
        sub_list.append(round(res, 2))
    result.append(sub_list)

Moreover, we can construct the same bar chart:

df_top_sim = pd.DataFrame()
df_top_sim['name'] = list(similar.keys())
df_top_sim['value'] = list(similar.values())
df_top_sim.sort_values(by='value', ascending=False)

И построим такой же bar chart:

colors = ['rgba(101,181,205,255)'] * 5

fig = go.Figure([go.Bar(x=df_top_sim['name'],
                        y=df_top_sim['value'],
                        marker_color=colors,
                        width=[0.4,0.4,0.4,0.4,0.4],
                        text=df_top_sim['value'],
                        textfont_color='white',
                        textposition='auto')])

fig.update_layout(
    title = 
        {'text':'<b>Топ-5 closest songs</b><br><span style="color:#666666"></span>'},
    showlegend = False,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)',
    xaxis={'categoryorder':'total descending'}
)

fig.show()

Given the vector of each song, let us calculate the vector of the entire album – add the vectors of the songs. Then, for such a vector, using the model, we get the words that are the closest in spirit and meaning.

def get_word_from_tlist(lst):
    for word in lst:
        word = word[0].split('_')[0]
        print(word, end=' ')

vec_sum = 0
for vec in df.vector:
    vec_sum += vec
sim_word = model.similar_by_vector(vec_sum)
get_word_from_tlist(sim_word)

небо тоска тьма пламень плакать горе печаль сердце солнце мрак

This is probably the key result and the description of Zemfira’s album in just 10 words.

Finally, we build a general heat map, each cell of which is the result of comparing the texts of two tracks with a cosine measure.

colorscale=[[0.0, "rgba(255,255,255,255)"],
            [0.1, "rgba(229,232,237,255)"],
            [0.2, "rgba(216,222,232,255)"],
            [0.3, "rgba(205,214,228,255)"],
            [0.4, "rgba(182,195,218,255)"],
            [0.5, "rgba(159,178,209,255)"],
            [0.6, "rgba(137,161,200,255)"],
            [0.7, "rgba(107,137,188,255)"],
            [0.8, "rgba(96,129,184,255)"],
            [1.0, "rgba(76,114,176,255)"]]

font_colors = ['black']
x = list(df.name.iloc[::-1])
y = list(df.name)
fig = ff.create_annotated_heatmap(result, x=x, y=y, colorscale=colorscale, font_colors=font_colors)
fig.show()

Results and data interpretation

To give valuable conclusions, we would like to take another look at everything we got. First of all, let us focus on the word cloud. It is easy to see that the words ‘боль’, ‘невозможно’, ‘сорваться’, ‘растерзаны’, ‘сложно’, ‘терпеть’, ‘любить’ have a very decent size, because such words are often found throughout the entire lyrics:

Давайте ещё раз посмотрим на всё, что у нас получилось — начнём с облака слов. Нетрудно заметить, что у слов «боль», «невозможно», «сорваться», «растерзаны», «сложно», «терпеть», «любить» размер весьма приличный — всё потому, что такие слова встречаются часто на протяжении всего текста песен:

The song “Крым” turned out to be one of the most diverse songs – it contains 74% of unique words. Also, the song “Снег идет” contains very few words, so the majority, which is 82%, are unique. The largest song on the album in terms of amount of words is the track “Таблетки” – there are about 150 words in total.

As it was shown on the last chart, the most dynamic track is “Таблетки”, as much as 37 words per minute – nearly one word for every two seconds – and the longest track is “Абьюз”, and according to the previous chart, it also has the lowest percentage of unique words – 46%.

Top 5 most semantically similar text pairs:

We also got the vector of the entire album and found the closest words. Just take a look at them – ‘тьма’, ‘тоска’, ‘плакать’, ‘горе’, ‘печаль’, ‘сердце’ – this is the list of words that characterizes Zemfira’s lyrics!

небо тоска тьма пламень плакать горе печаль сердце солнце мрак

The final result is a heat map. From the visualization, it is noticeable that almost all songs are quite similar to each other – the cosine measure for many pairs exceeds the value of 0.4.

Conclusions

In the material, we carried out an EDA of the entire text of the new album and, using the pre-trained Word2Vec model, we proved the hypothesis – most of the “Borderline” songs are permeated with rather dark lyrics. However, this is normal, because we love Zemfira precisely for her sincerity and straightforwardness.

 No comments    154   8 mon   analysis   Analytics engineering   data analytics   plotly   python

PowerBI Dashboard Overview

Estimated read time – 5 min

We continue the series of materials on BI-systems and today we will have a look at the dashboard prepared in PowerBI using the SuperStore Sales dataset. We will cover how to connect the data to the system, set custom colors for visualizations and create new measures, implement switching between charts using bookmarks and we will discuss the challenges that we faced when building the dashboard.

This is the how the final dashboard looks like:

The most notable feature of the dashboard is data cards that show the company’s KPI. The cards compare the parameters to the same period in the previous year and show the previous year’s dynamics in the background.

Below we can see the chart that shows top-performing provinces. The bluer the rectangle the more profitable the province, the more orange the rectangle the more losses the province sustains. The size of the rectangle corresponds to the quantity of sales. We can click on rectangles to see more detailed information about profits and sales dynamics in the region on the graph on the left and their KPI at the top. On the graph, there are green and blue points that indicate the month of the current year and the previous year respectively. Hovering over these points, you can see a trend line.

The next part of the dashboard shows product and customer analysis. This part allows us to answer questions such as “which products were the most profitable or unprofitable” or “which customers contributed to most of the profits or most of the losses”.

Data collection

To connect the data we used an Excel file. PowerBI offers a number of sources to connect your data from such as Excel, csv, json files and various databases.

Configuring reports and visualizations

When building a dashboard in PowerBI we wanted to copy the color themes from Tableau. To do this, we have created a JSON file with the list of colors that we want to use. You can see the content of our file below. Then in the views tab, we clicked on the “browse for themes” button and uploaded our colors.

{
	"name":"Orange-Blue Diverging",
	"dataColors": [
		
		"#1c5998",
		"#1c73b1",
		"#3a87b7",
		"#67add4",
		"#7bc8e2",
		"#cacaca",
		"#fdab67",
		"#fd8938",
		"#f06511",
		"#d74401",
		"#a33202",
		"#7b3014",
		"#F07C28",
		"#2B5C8A",
		"#94C6E1",
		"#87d180",
	]
}

Then we have created a separate table called Calendar and populated it with all order dates. After that, we created a column with just a month and a year to create a filter based on it.

Creating necessary measures

When creating a dashboard with PowerBI we often need to create new measures. For the data cards, we created such measures as Total Profit, Total Sales, Total Orders, Total Clients and so on. The arrows that you can see in the data cards are also customized and a measure was created for each of them. To apply the color to arrows we formatted the color by rules and indicated red if the value is less than 0, green if the color is more than 0.

Adding bookmarks to switch between charts

To switch between charts, we added bookmarks for sales and profits. For the sales chart, the profits bookmark is hidden and vice versa. The button was downloaded from the internet and added to the respective bookmarks.

Interesting features and challenges we faced when building the dashboard

We have created custom data cards for KPI which are different from the default ones offered by PowerBI. The original features of cards include the background trend, the name and value while the arrows and changes are a custom feature. Another interesting feature that we used is cross filtration which allowed us to apply the filter to both the profits/sales chart and KPI cards.

One of the challenges that we have faced was the inability to build a bar chart with 2 categories. This feature was not implemented in PowerBI at the moment of writing this overview (maybe it is implemented now), so we had to create a table and add bar charts into it. Similarly, we inserted bar charts into the Top Customers table.

Conclusion

Our team has evaluated the dashboard and has given the following scores from 1-10 scale (10 being the highest) to this dashboard:

  1. Meets the tasks – 9.8
  2. Learning curve  – 3.0
  3. Tool functionality – 9.5
  4. Ease of use – 7.5
  5. Compliance with the layout – 9.5
  6. Visual evaluation – 8.8

Overall: 8.0 out of 10. Have a look at the final dashboard here.

 No comments    173   1 y   analysis   BI   BI-tools   powerbi

Kazakhstan Marketing Conference 2020

Estimated read time – 2 min

Yesterday I had a chance to address the largest marketing conference in Kazakhstan: Kazakhstan Marketing Conference 2020.

Almaty, as a city, has made a positive impression on me, whereas the conference itself turned out to be highly professional event, filled with plenty of smart, versatile and kind people.

A pleasant bonus for conference participants: presentation of my speech available on slideshare (careful, VPN!), so one can recall what it was about.

Apart from the speech, in the main forum’s section I was holding a masterclass on “How to construct a comprehensible technical specification on analytics?”.
And, within the framework of work with the audience, we managed to formulate points for a template of a technical specification.

Sharing the template, it will be helpful for those, who faced with difficulties in translating of a task from business language to the technical one.

 No comments    448   2020   analysis   conference   marketing   template

Looker Overview

Estimated read time – 10 min

Today we are going to talk about BI-platform Looker, on which I managed to work in 2019.

Here is the short content of the article for convenient and fast navigation:

  1. What is Looker?
  2. Which DBMS you can connect to via Looker and how?
  3. Building of Looker ML data model
  4. Explore Mode (data research on the model built)
  5. Building of reports and their saving in Look
  6. Examples of dashboards in Looker

What is Looker?

Creators of Looker position it as a software of business intelligence class and big data analytics platform, that helps to research, analyze and share business analytics in real time mode.
Looker — is a really convenient tool and one of a few BI products, that allows to work with pre-set data cubes in a real-time mode (actually, relational tables that are described in Look ML-model).
An engineer, working with Looker, needs to describe a data model on Look ML language (it’s something between CSS and SQL), publish this data model and then set reporting and dashboards.
Look ML itself is pretty simple, the nexus between the data objects are set by a data-engineer, which consequently allows to use the data without knowledge of SQL language (to be precise: Looker engine generates the code in SQL language itself on user’s behalf).

Just recently, in June 2019, Google announced acquisition of Looker platform for $2.6 billion.

Which DBMS you can connect to via Looker and how?

The selection of DBMS that Looker is working with is pretty wide. You can see the various connections on the screen shot below as of October, 2019:

Available DBMS for connection

You can easily set a connection to the database via web-interface:

Web-interface of connection to DBMS

With regard to connections to databases, I’d like to highlight the following two facts: first of all, unfortunately, Clickhouse support from Yandex is currently missing (as well as in the foreseeable future). Most likely, the support won’t appear, considering the fact that Looker was acquired by a competitor, Google.
updated: Actually, Looker supports Clickhouse from the December 2019
The second nuisance is that you can’t build one data model, that would apply to different DBMS. There is no inbuilt storage in Looker, that could combine the results of query (unlike the same Redash).
It means, that analytical architecture should be built within one DBMS (preferably with high action speed or on aggregated data).

Building of Looker ML data model

In order to build a report or a dashboard in Looker, you need to provisionally set a data model. Syntax of Look ML language is quite thoroughly described in the documentation. Personally, I can just add that model description doesn’t require long-time immersion for a specialist with SQL knowledge. Rather, one needs to rearrange the approach to data model preparation. Look ML language is very much alike CSS:

Console of Look ML model creation

In the data model the following is set up: links with tables, keys, granularity, information of some fields being facts, and other – measurements. For facts, the aggregation is written. Obviously, at model creation one can use various IF / CASE expressions.

Explore mode

Probably, it’s the main killer-feature of Looker, since it allows any business departments to get data without attraction of analysts / data engineers. And, guess that’s why use of accounts with Explore mode is billed separately.

In fact, Explore mode is an interface, that allows to use the set up Look ML data model, select the required metrics and measurements and build customized report / visualization.
For example, we want to understand how many actions of any kind were performed in Looker’s interface last week. In order to do it, using Explore mode, we select Date field and apply a filter to it: last week (in this sense, Looker is quite smart and it and it will be enough writing ‘Last week’ in the filter), thereafter we choose “Category” from the measurements, and Quantity as a metric. After pressing the button Run the ready report will be generated.

Building report in Looker

Then, using the data received in the table form, you can set up the visualization of any type.
For example, Pie chart:

Applying visualization to report

Building of reports and their saving in Look

Sometimes you can have a desire to save the set of data / visualization received in Explore and share it with colleagues, for this purpose Looker has a separate essense – Look. That is ready constructed report with selected filters / measurements / facts.

Example of the saved Look

Examples of dashboards in Looker

Systemizing the warehouse of Look created, oftentimes you want to receive a ready composition / overview of key metrics, that could be displayed on one list.
For these purposes dashboard creation fits perfectly. Dashboard is created either on the wing, or using previously created Look. One of the dashboard’s “tricks” is configuration of parameters, that are changed on all the dashboard and can be applied to all the Look at the same time.

Interesting features in one line

  • In Looker you can refer to other reports and, using such function, you can create a dynamic parameter, that is passed on by a link.
    For example, you’ve created a report with division of revenue by countries, and in this report you can refer to the dashboard on a separate country. Following the link, a user sees the dashboard on a specific country, that he clicked on.
  • On every Looker page there is a chat, where support service answers very promptly
  • Looker is not able to work with data merge on the level of various DBMS, however it can combine the data on the level of ready Look (in our case, this function works really weird).
  • Within the framework of work with various models, I have found out an extremely non-trivial use of SQL for calculation of unique values in a non-normalized data table, Looker calls it symmetric aggregates.
    SQL, indeed, looks very non-trivial:
SELECT 
 order_items.order_id AS "order_items.order_id",
 order_items.sale_price AS "order_items.sale_price",
 (COALESCE(CAST( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(users.age ,0)
 *(1000000*1.0)) AS DECIMAL(38,0))) + 
 CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0)) ) 
 - SUM(DISTINCT CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
 * 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))) ) 
 AS DOUBLE PRECISION) 
 / CAST((1000000*1.0) AS DOUBLE PRECISION), 0) 
 / NULLIF(COUNT(DISTINCT CASE WHEN users.age IS NOT NULL THEN users.id 
 ELSE NULL END), 0)) AS "users.average_age"
FROM order_items AS order_items
LEFT JOIN users AS users ON order_items.user_id = users.id

GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 500
  • At implementation of Looker to a purchase, JumpStart Kit is mandatory, which costs not less than $6k. Within this kit you receive support and consultation from Looker at tool implementation.
 No comments    472   2020   analysis   Analytics engineering   BI-tools   looker   sql
Earlier Ctrl + ↓