Later Ctrl + ↑

Dashboard for the first 8 months of a child’s life

Estimated read time – 4 min

In December 2020, I became a dad, which means that our family life with my wife has changed drastically. Of course, I am sharing this news with you for a reason, but in the context of the data that we will study and research today. They are very personal for me, and therefore have some special magic and value. Today I want to show how dramatically the life of a family is changing by the example of my own analysis of the life data of the first 8 months of a baby.

Data collection

Initial data: tracking the main elements of caring for a baby in the first 8 months: sleep, nursing, changing a diaper. The data was collected using BabyTracker app.
My wife is a great fellow, because during the first 7 months she carefully and regularly monitored all the important points. She forgot to turn off the timer for nursing the baby at night only a couple of times, but I quickly saw noticeable outliers in the data, and the dataset was cleared of them.
Initially, I had several data visualization ideas in my head, and I tried to immediately implement them into the projected dashboard. I wanted to show the baby’s sleep intervals in the form of a vertical Gantt chart, but the night’s sleep went through the midnight (0:00), and it was completely incomprehensible how this could be corrected in Tableau. After a number of unsuccessful independent attempts to find a solution to this problem, I decided to consult with Roman Bunin. Unfortunately, we came to the conclusion together that there is no way to solve this. Then I had to write a little Python code that splits such time intervals and adds new lines to the dataset.
However, while we were texting, Roma sent an example identical to my idea! This example claims that a woman collected data on her child’s sleep and wakefulness in the first year of child’s life, and then wrote the code with which it turned out to be embroidered towel with pattern datavis baby sleep. For me, this was surprising, since it turned out that this way of visualization is the main method that allows you to show how difficult life and sleep of parents is in the first months of the birth of a child.
In my dashboard on Tableau Public I got three semantic blocks and several “KPIs” about which I would like to tell you in detail and share the basic everyday wisdom. At the top of the dashboard, you can see the key averages of the daytime and nighttime sleep hours, nursing hours, frequency of nursing, and the number of diaper changes in the first three months. I have allocated exactly three months, because I think this is the most difficult period, because significant changes that require serious adaptation are taking place in your life .

Dream

The left diagram – called “Towel” – illustrates the baby’s sleeping periods. In this diagram, it is important to pay attention to white gaps, especially at night. These are the hours when the baby is awake, which means that the parents are also awake. Look at how the chart changes, especially in the early months, when we gave up the habit of going to bed at 1 or 2 in the morning and fell asleep earlier. Roughly speaking, in the first three months (until March 2021), the child could fall asleep at 2 or 3 in the morning, but we were lucky that our child’s night sleep was quite long.
The right graph clearly illustrates how the baby’s day and night sleep length changes over time, and the boxplots show the distribution of the hours of daytime and nighttime sleep. The graph confirms the conclusion: “This is temporary and will definitely get better soon!”

Nursing

From the left diagram, you can see how the number and duration of nursing change. This number is gradually decreasing, and the duration of nursing periods is shortened. Since mid-July, we have changed the way we track nursing periods, so they are not valid for this analysis.
From my point of view, the findings are a great opportunity for couples planning a pregnancy, not to create illusions about the opportunity to work or do any other business in the first months after giving birth. Pay attention to the frequency and duration of nursing, all this time the parent is completely busy with the child. However, do not be overly alarmed: over time, the number of nursing periods will decrease.

Diaper change

The left graph is the highlight of this dashboard. As you can imagine, this is a map of the most fun moments – changing a diaper. The stars represent the moments of the day when you need to change the diaper, and the light gray color below shows the number of changes per day. The graph on the right shows diaper changes counted by part of the day. In general, the diagram does not show any interesting dependencies, however, it prepares you for the fact that this process is frequent, regular and happens at any time of the day.

Conclusions

It seems to me that the use of real personal data and such visualization is sometimes much more revealing than a lot of videos or books about what this period will be like. That is why I decided to share my findings and observations with you here. The main conclusion that I wanted you to draw from the dataviz: children are great! ❤️

 No comments    85   7 mon   data analytics   python

Python and lyrics of Zemfira’s new album: capturing the spirit of her songs

Estimated read time – 16 min

Zemfira’s latest studio album, Borderline, was released in February, 8 years after the previous one. For this album, various people cooperated with her, including her relatives – the riff for the song “Таблетки” was written by her nephew from London. The album turned out to be diverse: for instance, the song “Остин” is dedicated to the main character of the Homescapes game by the Russian studio Playrix (by the way, check out the latest Business Secrets with the Bukhman brothers, they also mention it there). Zemfira liked the game a lot, thus, she contacted Playrix to create this song. Also, the song “Крым” was written as a soundtrack to a new film by Zemfira’s colleague Renata Litvinova.

Listen new album in Apple Music / Яндекс.Музыке / Spotify

Nevertheless, the spirit of the whole album is rather gloomy – the songs often repeat the words ‘боль’, ‘ад’, ‘бесишь’ and other synonyms. We decided to conduct an exploratory analysis of her album, and then, using the Word2Vec model and a cosine measure, look at the semantic closeness of the songs and calculate the general mood of the album.

For those who are bored with reading about data preparation and analysis steps, you can go directly to the results.

Data preparation

For starters, we write a data processing script. The purpose of the script is to collect a united csv-table from a set of text files, each of which contains a song. At the same time, we have to get rid of all punctuation marks and unnecessary words as we need to focus only on significant content.

import pandas as pd
import re
import string
import pymorphy2
from nltk.corpus import stopwords

Then we create a morphological analyzer and expand the list of everything that needs to be discarded:

morph = pymorphy2.MorphAnalyzer()
stopwords_list = stopwords.words('russian')
stopwords_list.extend(['куплет', 'это', 'я', 'мы', 'ты', 'припев', 'аутро', 'предприпев', 'lyrics', '1', '2', '3', 'то'])
string.punctuation += '—'

The names of the songs are given in English, so we have to create a dictionary for translation into Russian and a dictionary, from which we will later make a table:

result_dict = dict()

songs_dict = {
    'snow':'снег идёт',
    'crimea':'крым',
    'mother':'мама',
    'ostin':'остин',
    'abuse':'абьюз',
    'wait_for_me':'жди меня',
    'tom':'том',
    'come_on':'камон',
    'coat':'пальто',
    'this_summer':'этим летом',
    'ok':'ок',
    'pills':'таблетки'
}

Let’s define several necessary functions. The first one reads the entire song from the file and removes line breaks, the second clears the text from unnecessary characters and words, and the third one converts the words to normal form, using the pymorphy2 morphological analyzer. The pymorphy2 module does not always handle ambiguity well – additional processing is required for the words ‘ад’ and ‘рай’.

def read_song(filename):
    f = open(f'{filename}.txt', 'r').read()
    f = f.replace('\n', ' ')
    return f

def clean_string(text):
    text = re.split(' |:|\.|\(|\)|,|"|;|/|\n|\t|-|\?|\[|\]|!', text)
    text = ' '.join([word for word in text if word not in string.punctuation])
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in stopwords_list])
    return text

def string_to_normal_form(string):
    string_lst = string.split()
    for i in range(len(string_lst)):
        string_lst[i] = morph.parse(string_lst[i])[0].normal_form
        if (string_lst[i] == 'аду'):
            string_lst[i] = 'ад'
        if (string_lst[i] == 'рая'):
            string_lst[i] = 'рай'
    string = ' '.join(string_lst)
    return string

After all this preparation, we can get back to the data and process each song and read the file with the corresponding name:

name_list = []
text_list = []
for song, name in songs_dict.items():
    text = string_to_normal_form(clean_string(read_song(song)))
    name_list.append(name)
    text_list.append(text)

Then we combine everything into a DataFrame and save it as a csv-file.

df = pd.DataFrame()
df['name'] = name_list
df['text'] = text_list
df['time'] = [290, 220, 187, 270, 330, 196, 207, 188, 269, 189, 245, 244]
df.to_csv('borderline.csv', index=False)

Result:

Word cloud for the whole album

To begin with the analysis, we have to construct a word cloud, because it can display the most common words found in these songs. We import the required libraries, read the csv-file and set the configurations:

import nltk
from wordcloud import WordCloud
import pandas as pd
import matplotlib.pyplot as plt
from nltk import word_tokenize, ngrams

%matplotlib inline
nltk.download('punkt')
df = pd.read_csv('borderline.csv')

Now we create a new figure, set the design parameters and, using the word cloud library, display words with the size directly proportional to the frequency of the word. We additionally indicate the name of the song above the corresponding graph.

fig = plt.figure()
fig.patch.set_facecolor('white')
plt.subplots_adjust(wspace=0.3, hspace=0.2)
i = 1
for name, text in zip(df.name, df.text):
    tokens = word_tokenize(text)
    text_raw = " ".join(tokens)
    wordcloud = WordCloud(colormap='PuBu', background_color='white', contour_width=10).generate(text_raw)
    plt.subplot(4, 3, i, label=name,frame_on=True)
    plt.tick_params(labelsize=10)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(name,fontdict={'fontsize':7,'color':'grey'},y=0.93)
    plt.tick_params(labelsize=10)
    i += 1

EDA of the lyrics

Let us move to the next part and analyze the lyrics. To do this, we have to import special libraries to deal with data and visualization:

import plotly.graph_objects as go
import plotly.figure_factory as ff
from scipy import spatial
import collections
import pymorphy2
import gensim

morph = pymorphy2.MorphAnalyzer()

Firstly, we should count the overall number of words in each song, the number of unique words, and their percentage:

songs = []
total = []
uniq = []
percent = []

for song, text in zip(df.name, df.text):
    songs.append(song)
    total.append(len(text.split()))
    uniq.append(len(set(text.split())))
    percent.append(round(len(set(text.split())) / len(text.split()), 2) * 100)

All this information should be written in a DataFrame and additionally we want to count the number of words per minute for each song:

df_words = pd.DataFrame()
df_words['song'] = songs
df_words['total words'] = total
df_words['uniq words'] = uniq
df_words['percent'] = percent
df_words['time'] = df['time']
df_words['words per minute'] = round(total / (df['time'] // 60))
df_words = df_words[::-1]

It would be great to visualize the data, so let us build two bar charts: one for the number of words in the song, and the other one for the number of words per minute.

colors_1 = ['rgba(101,181,205,255)'] * 12
colors_2 = ['rgba(62,142,231,255)'] * 12

fig = go.Figure(data=[
    go.Bar(name='📝 Total number of words,
           text=df_words['total words'],
           textposition='auto',
           x=df_words.song,
           y=df_words['total words'],
           marker_color=colors_1,
           marker=dict(line=dict(width=0)),),
    go.Bar(name='🌀 Unique words',
           text=df_words['uniq words'].astype(str) + '<br>'+ df_words.percent.astype(int).astype(str) + '%' ,
           textposition='inside',
           x=df_words.song,
           y=df_words['uniq words'],
           textfont_color='white',
           marker_color=colors_2,
           marker=dict(line=dict(width=0)),),
])

fig.update_layout(barmode='group')

fig.update_layout(
    title = 
        {'text':'<b>The ratio of the number of unique words to the total</b><br><span style="color:#666666"></span>'},
    showlegend = True,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)',
)
fig.update_layout(legend=dict(
    yanchor="top",
    xanchor="right",
))

fig.show()
colors_1 = ['rgba(101,181,205,255)'] * 12
colors_2 = ['rgba(238,85,59,255)'] * 12

fig = go.Figure(data=[
    go.Bar(name='⏱️ Track length, min.',
           text=round(df_words['time'] / 60, 1),
           textposition='auto',
           x=df_words.song,
           y=-df_words['time'] // 60,
           marker_color=colors_1,
           marker=dict(line=dict(width=0)),
          ),
    go.Bar(name='🔄 Words per minute',
           text=df_words['words per minute'],
           textposition='auto',
           x=df_words.song,
           y=df_words['words per minute'],
           marker_color=colors_2,
           textfont_color='white',
           marker=dict(line=dict(width=0)),
          ),
])

fig.update_layout(barmode='overlay')

fig.update_layout(
    title = 
        {'text':'<b>Track length and words per minute</b><br><span style="color:#666666"></span>'},
    showlegend = True,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)'
)


fig.show()

Working with Word2Vec model

Using the gensim module, load the model pointing to a binary file:

model = gensim.models.KeyedVectors.load_word2vec_format('model.bin', binary=True)

Для материала мы использовали готовую обученную на Национальном Корпусе Русского Языка модель от сообщества RusVectōrēs

The Word2Vec model is based on neural networks and allows you to represent words in the form of vectors, taking into account the semantic component. It means that if we take two words – for instance, “mom” and “dad”, then represent them as two vectors and calculate the cosine, the values ​​will be close to 1. Similarly, two words that have nothing in common in their meaning have a cosine measure close to 0.

Now we will define the get_vector function: it will take a list of words, recognize a part of speech for each word, and then receive and summarize vectors, so that we can find vectors even for whole sentences and texts.

def get_vector(word_list):
    vector = 0
    for word in word_list:
        pos = morph.parse(word)[0].tag.POS
        if pos == 'INFN':
            pos = 'VERB'
        if pos in ['ADJF', 'PRCL', 'ADVB', 'NPRO']:
            pos = 'NOUN'
        if word and pos:
            try:
                word_pos = word + '_' + pos
                this_vector = model.word_vec(word_pos)
                vector += this_vector
            except KeyError:
                continue
    return vector

For each song, find a vector and select the corresponding column in the DataFrame:

vec_list = []
for word in df['text']:
    vec_list.append(get_vector(word.split()))
df['vector'] = vec_list

So, now we should compare these vectors with one another, calculating their cosine proximity. Those songs with a cosine metric higher than 0.5 will be saved separately – this way we will get the closest pairs of songs. We will write the information about the comparison of vectors into the two-dimensional list result.

similar = dict()
result = []
for song_1, vector_1 in zip(df.name, df.vector):
    sub_list = []
    for song_2, vector_2 in zip(df.name.iloc[::-1], df.vector.iloc[::-1]):
        res = 1 - spatial.distance.cosine(vector_1, vector_2)
        if res > 0.5 and song_1 != song_2 and (song_1 + ' / ' + song_2 not in similar.keys() and song_2 + ' / ' + song_1 not in similar.keys()):
            similar[song_1 + ' / ' + song_2] = round(res, 2)
        sub_list.append(round(res, 2))
    result.append(sub_list)

Moreover, we can construct the same bar chart:

df_top_sim = pd.DataFrame()
df_top_sim['name'] = list(similar.keys())
df_top_sim['value'] = list(similar.values())
df_top_sim.sort_values(by='value', ascending=False)

И построим такой же bar chart:

colors = ['rgba(101,181,205,255)'] * 5

fig = go.Figure([go.Bar(x=df_top_sim['name'],
                        y=df_top_sim['value'],
                        marker_color=colors,
                        width=[0.4,0.4,0.4,0.4,0.4],
                        text=df_top_sim['value'],
                        textfont_color='white',
                        textposition='auto')])

fig.update_layout(
    title = 
        {'text':'<b>Топ-5 closest songs</b><br><span style="color:#666666"></span>'},
    showlegend = False,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)',
    xaxis={'categoryorder':'total descending'}
)

fig.show()

Given the vector of each song, let us calculate the vector of the entire album – add the vectors of the songs. Then, for such a vector, using the model, we get the words that are the closest in spirit and meaning.

def get_word_from_tlist(lst):
    for word in lst:
        word = word[0].split('_')[0]
        print(word, end=' ')

vec_sum = 0
for vec in df.vector:
    vec_sum += vec
sim_word = model.similar_by_vector(vec_sum)
get_word_from_tlist(sim_word)

небо тоска тьма пламень плакать горе печаль сердце солнце мрак

This is probably the key result and the description of Zemfira’s album in just 10 words.

Finally, we build a general heat map, each cell of which is the result of comparing the texts of two tracks with a cosine measure.

colorscale=[[0.0, "rgba(255,255,255,255)"],
            [0.1, "rgba(229,232,237,255)"],
            [0.2, "rgba(216,222,232,255)"],
            [0.3, "rgba(205,214,228,255)"],
            [0.4, "rgba(182,195,218,255)"],
            [0.5, "rgba(159,178,209,255)"],
            [0.6, "rgba(137,161,200,255)"],
            [0.7, "rgba(107,137,188,255)"],
            [0.8, "rgba(96,129,184,255)"],
            [1.0, "rgba(76,114,176,255)"]]

font_colors = ['black']
x = list(df.name.iloc[::-1])
y = list(df.name)
fig = ff.create_annotated_heatmap(result, x=x, y=y, colorscale=colorscale, font_colors=font_colors)
fig.show()

Results and data interpretation

To give valuable conclusions, we would like to take another look at everything we got. First of all, let us focus on the word cloud. It is easy to see that the words ‘боль’, ‘невозможно’, ‘сорваться’, ‘растерзаны’, ‘сложно’, ‘терпеть’, ‘любить’ have a very decent size, because such words are often found throughout the entire lyrics:

Давайте ещё раз посмотрим на всё, что у нас получилось — начнём с облака слов. Нетрудно заметить, что у слов «боль», «невозможно», «сорваться», «растерзаны», «сложно», «терпеть», «любить» размер весьма приличный — всё потому, что такие слова встречаются часто на протяжении всего текста песен:

The song “Крым” turned out to be one of the most diverse songs – it contains 74% of unique words. Also, the song “Снег идет” contains very few words, so the majority, which is 82%, are unique. The largest song on the album in terms of amount of words is the track “Таблетки” – there are about 150 words in total.

As it was shown on the last chart, the most dynamic track is “Таблетки”, as much as 37 words per minute – nearly one word for every two seconds – and the longest track is “Абьюз”, and according to the previous chart, it also has the lowest percentage of unique words – 46%.

Top 5 most semantically similar text pairs:

We also got the vector of the entire album and found the closest words. Just take a look at them – ‘тьма’, ‘тоска’, ‘плакать’, ‘горе’, ‘печаль’, ‘сердце’ – this is the list of words that characterizes Zemfira’s lyrics!

небо тоска тьма пламень плакать горе печаль сердце солнце мрак

The final result is a heat map. From the visualization, it is noticeable that almost all songs are quite similar to each other – the cosine measure for many pairs exceeds the value of 0.4.

Conclusions

In the material, we carried out an EDA of the entire text of the new album and, using the pre-trained Word2Vec model, we proved the hypothesis – most of the “Borderline” songs are permeated with rather dark lyrics. However, this is normal, because we love Zemfira precisely for her sincerity and straightforwardness.

Clickhouse Training 101 by Altinity

Estimated read time – 5 min

Just recently I have completed a Clickhouse Training by Altinity (101 Series Training). For those who are just getting to know Clickhouse, Altinity offers free basic training: Data Warehouse Basics. I recommend starting with it if you are planning to dive into learning.

Certification by Altinity

I would like to share my experience on the training as well as the training notes.
The training costs $500 and lasts 4 days for 2 hours. It is carried out in the evenings Moscow time (starting from 19:00 GMT +3).

Session # 1.

The first day mostly revises everything covered in Data Warehouse Basics, but it has several new ideas on how to get useful information on queries from system tables.

For example, this query will show which commands are running and their status.

SELECT command, is_done
FROM system.mutations
WHERE table = 'ontime'

Besides, for me it was useful to learn about column compression with the use of codecs:

ALTER TABLE ontime
 MODIFY COLUMN TailNum LowCardinality(String) CODEC(ZSTD(1))

For those who are just starting with Clickhouse, the first day will be super useful as it will help in understanding table engines and syntax for their creation, partitions, inserting data (for example directly from S3).

INSERT INTO sdata
SELECT * FROM s3(
 'https://s3.us-east-1.amazonaws.com/d1-altinity/data/sdata*.csv.gz',
 'aws_access_key_id',
 'aws_secret_access_key',
 'Parquet',
 'DevId Int32, Type String, MDate Date, MDatetime
DateTime, Value Float64')

Session # 2.

I found the second day the most intense and useful as within this session Robert from Altinity talks about aggregate functions and materialized views ( detailed scheme for the creation of materialized views ) in Clickhouse in more detail.

It was super useful for me to learn about index types in Clickhouse.

Session # 3.

During the third day, colleagues share their knowledge on how to work with Kafka and JSON objects stored in the tables.
It was interesting to find out that working with arrays in Clickhouse is very similar to arrays in Python:

WITH [1, 2, 4] AS array
SELECT
 array[1] AS First,
 array[2] AS Second,
 array[3] AS Third,
 array[-1] AS Last,
 length(array) AS Length

When working with arrays, there is a great feature called ARRAY JOIN which “unrolls” arrays to rows.

Clickhouse allows you to efficiently interact with JSON objects stored in a table:

-- Get a JSON string value
SELECT JSONExtractString(row, 'request') AS request
FROM log_row LIMIT 3
-- Get a JSON numeric value
SELECT JSONExtractInt(row, 'status') AS status
FROM log_row LIMIT 3

This piece of code is an example of how to extract the elements of the JSON array “request” and “status” separately.

ALTER TABLE log_row
 ADD COLUMN
status Int16 DEFAULT
 JSONExtractInt(row, 'status')
ALTER TABLE log_row
UPDATE status = status WHERE 1 = 1

Session # 4.

The most difficult topic from my point of view was saved for the last day – building sharding and replication patterns and building queries on distributed Clickhouse servers.

Special respect to Altinity for an excellent collection of labs during the training.

Links:

 No comments    411   9 mon   clickhouse   sql

Data normalization with SQL

Estimated read time – 5 min

According to GIGO (garbage in, garbage out) principle, errors in input data lead to erroneous analysis results. The results of our work directly depend on the quality of data preparation.

For instance, when we need to prepare data to use in ML algorithms (like k-NN, k-means, logistic regression, etc.), features in the original dataset may vary in scale like the age and height of a person. This may lead to the incorrect performance of the algorithm. Thus, such data needs to be rescaled first.

In this tutorial, we will consider the ways to scale the data using SQL query: min-max normalization, min-max normalization for an arbitrary range, and z-score normalization. For each of these methods we have prepared two SQL query options: one using a SELECT subquery and another using a window function OVER().

We will work with the simple table students that contains the data on the height of the students:

name height
Ivan 174
Peter 181
Dan 199
Kate 158
Mike 179
Silvia 165
Giulia 152
Robert 188
Steven 177
Sophia 165

Min-max rescaling

Min-max scaling approach scales the data using the fixed range from 0 to 1. In this case, all the data is on the same scale which will exclude the impact of outliers on the conclusions.

The formula for the min-max scaling is given as:

We multiply the numerator by 1.0 in order to get a floating point number at the end.

SQL-query with a subquery:

SELECT height, 
       1.0 * (height-t1.min_height)/(t1.max_height - t1.min_height) AS scaled_minmax
  FROM students, 
      (SELECT min(height) as min_height, 
              max(height) as max_height 
         FROM students
      ) as t1;

SQL-query with a window function:

SELECT height, 
       (height - MIN(height) OVER ()) * 1.0 / (MAX(height) OVER () - MIN(height) OVER ()) AS scaled_minmax
  FROM students;

As a result, we get the values in [0, 1] range where 0 is the height of the shortest student and 1 is the height of the tallest one.

name height scaled_minmax
Ivan 174 0.46809
Peter 181 0.61702
Dan 199 1
Kate 158 0.12766
Mike 179 0.57447
Silvia 165 0.2766
Giulia 152 0
Robert 188 0.76596
Steven 177 0.53191
Sophia 165 0.2766

Rescaling within a given range

This is an option of min-max normalization between an arbitrary set of values. When it comes to data scaling, the values do not always need to be in the range between 0 and 1. In these cases, the following formula is applied.

This allows us to scale the data to an arbitrary scale. In our example, let’s assume a=10.0 and b=20.0.

SQL-query with subquery:

SELECT height, 
       ((height - min_height) * (20.0 - 10.0) / (max_height - min_height)) + 10 AS scaled_ab
  FROM students,
      (SELECT MAX(height) as max_height, 
              MIN(height) as min_height
         FROM students  
      ) t1;

SQL-query with a window function:

SELECT height, 
       ((height - MIN(height) OVER() ) * (20.0 - 10.0) / (MAX(height) OVER() - MIN(height) OVER())) + 10.0 AS scaled_ab
  FROM students;

We receive similar results as before, but with data spread between 10 and 20.

name height scaled_ab
Ivan 174 14.68085
Peter 181 16.17021
Dan 199 20
Kate 158 11.2766
Mike 179 15.74468
Silvia 165 12.76596
Giulia 152 10
Robert 188 17.65957
Steven 177 15.31915
Sophia 165 12.76596

Z-score normalization

Using Z-score normalization, the data will be scaled so that it has the properties of a standard normal distribution where the mean (μ) is equal to 0 and the standard deviation (σ) to 1.

Z-score is calculated using the formula:

SQL-query with a subquery:

SELECT height, 
       (height - t1.mean) * 1.0 / t1.sigma AS zscore
  FROM students,
      (SELECT AVG(height) AS mean, 
              STDDEV(height) AS sigma
         FROM students
        ) t1;

SQL-query with a window function:

SELECT height, 
       (height - AVG(height) OVER()) * 1.0 / STDDEV(height) OVER() AS z-score
  FROM students;

As a result, we can easily notice the outliers that exceed the standard deviation.

name height zscore
Ivan 174 0.01488
Peter 181 0.53582
Dan 199 1.87538
Kate 158 -1.17583
Mike 179 0.38698
Silvia 165 -0.65489
Giulia 152 -1.62235
Robert 188 1.05676
Steven 177 0.23814
Sophia 165 -0.65489
 No comments    276   12 mon   Analytics engineering   sql

SAP Analytics Cloud Dashboard Overview

Estimated read time – 8 min

Our last guide on BI tools was dedicated to QlikSense and today we will have a look at SAP Analytics Cloud. The dashboard has been prepared for us by the head BI consultant at SAPRUN Alexey Salynin.

In this overview, we will touch the following topics: creating a new data source in SAP, adding filters, making a Tree map,
setting up a preview on mobile devices, working with the Smart Insight assistant, multi language feature.

This dashboard has the same structure as our previous dashboards. At the top, we can see the KPI cards and a date filter. Below there is a heatmap that shows top provinces and a chart that shows profit dynamics on the right. In the second part, we can see the products and customers analysis.

Creating a new data source in SAP

To add a data source in SAP, we can either work with models or datasets. Models allow us to work with several data sources at the same time and carry out planning in order to solve complex problems. In our case, it’s a simple Excel file, so we used a normal dataset. As we have chosen only the orders sheet. Here is how the dataset looks:

Some of the columns were created, such as products hierarchy, geo hierarchy and some additional calculations.

Adding a filter

There are a number of ways to add a filter starting from default filtration methods to input fields to a timeline. If you click the edit button, you can see the filter format.

A tree map shows the top provinces by sales. Each province contains information on variation similarly to the KPI cards. The map allows filtering by provinces.

On the right of the tree map there is a profit dynamics chart. The chart has a dynamic axis which allows choosing the time period in time series charts. Time series graphs in SAP can show a forecast. The forecast is done automatically using built-in algorithms. In our case, triple exponential smoothing was chosen. When hovering over the forecasted point, we can see upper and lower confidence bounds.

In the next part of the dashboard we can see the products and customers analysis section.

Here you can see that products are displayed in a list without any hierarchy. To activate the hierarchy we need to change the dimension properties and set the hierarchy.

By default, the first level is displayed. We can change the level and see the products by category. Moreover, the bars are filters for the table on the right.

Smart Insight

One of the interesting features in SAP is smart insight. SAP automatically analyses the value and builds charts to provide more insights. For example, if I want to see what contributed to high level of furniture sales, I can click on smart insights and see the following results:

The feature is very handy as the insights are not simple graphs, but they are connected to each other and show a story. The insights allow us to interact with the data and explore it deeper. We can later add these insights into our visualization. This feature can also be used on a new dataset and empty dashboard to explore the data without building anything.

The whole dashboard was built in a responsive view mode so it will be automatically adapted to different screens. Some visualizations like tables are not adapted for the mobile screen, so we can hide them for mobile screens.

We can save the dashboard in the catalog and add it to favourites. Unfortunately, the functionality of the system doesn’t allow posting dashboards on public sources. But we hope that with new releases the feature will appear.

Last but not least, the dashboard has been created in Russian and when we needed to translate it into English SAP multi language feature came very handy. We can choose the target language and translate all the text. Then the dashboard will be displayed in a language based on the user’s location.

We have evaluated the dashboard together with Alexey (his scores in brackets) and received the following results:
Meets the tasks – 9.8 (10)
Learning curve  – 6.5 (7)
Tool functionality – 9.8 (9)
Ease of use – 8.2 (8)
Compliance with the layout – 9.3 (10)
Visual evaluation – 8.6 (9)

Overall: 8.71 out of 10. Have a look at the final results:

 No comments    312   12 mon   BI-tools   SAP
Earlier Ctrl + ↓