Graph of Telegram channels related to analytics

Reading time: 3 minutes

The authors of various Telegram blogs often publish selections of their favorite channels to share their professional picks with their audience. The idea, of course, is not new, but instead of simply compiling a rating of interesting analytics Telegram blogs, I decided to solve this problem analytically.

As part of my current course of studies, I am learning many modern approaches to data analysis and visualization. At the very beginning of the course, there was a warm-up exercise: object-oriented programming in Python for collecting data and iteratively building a graph with the TMDB API. This method is usually used to construct a connection graph of actors, where a connection is a role in the same film. However, I decided that I could apply it to another problem: building a graph of connections for the analytics community.

Since my time has been particularly limited recently, and I had already completed a similar task for the course, I decided to pass this knowledge on to someone else interested in analytics. Fortunately, at that moment a candidate for a junior data analyst position, Andrey, texted me in direct messages. He is still getting to grips with the intricacies of analytics, so we agreed on an internship in which Andrey parsed data from Telegram channels.

Andrey's main task was to collect all texts from the "Интернет-аналитика" Telegram channel, identify the channels that Aleksey Nikushin linked to, and then collect the texts and links from those channels as well. Here, a "link" means any mention of a channel: via @, via a URL, or via a repost. As a result of the parsing, Andrey produced two files: nodes and edges.
Now I will present to you the graph that I got based on this data and comment on the results.

I would like to take this opportunity to express my compliments to the karpov.courses team, as Andrey has excellent knowledge of the Python language!

As a result, the top 10 channels by degree (number of connections) look like this; a short sketch of how this ranking can be computed follows the list:

  1. Интернет-аналитика
  2. Reveal The Data
  3. Инжиниринг Данных
  4. Data Events
  5. Datalytics
  6. Чартомойка
  7. LEFT JOIN
  8. Epic Growth
  9. RTD: ссылки и репосты
  10. Дашбордец
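For reference, here is a minimal sketch of how such a degree ranking could be reproduced from the edges file with networkx. The file name and edge-column names are assumptions; the actual files Andrey produced may be structured differently.

import pandas as pd
import networkx as nx

# hypothetical file and column names: one row per "channel A mentioned channel B"
edges = pd.read_csv('edges.csv')

# undirected graph: a node is a channel, an edge is any mention, link, or repost
G = nx.from_pandas_edgelist(edges, source='source', target='target')

# degree = number of connections; print the ten best-connected channels
top_10 = sorted(G.degree, key=lambda pair: pair[1], reverse=True)[:10]
for channel, degree in top_10:
    print(channel, degree)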

In my opinion, the result turned out extremely exciting and visually interesting, and Andrey did a great job! By the way, he has also started his own channel, "Это разве аналитика?", where he publishes analytics news.

Looking ahead, this problem has a continuation: with the help of a Markov chain, we modeled where a user ends up if they iteratively follow the mentions from channel to channel. The results were unexpected, but we will tell you about them next time!


How and why should you export reports from Jupyter Notebook to PDF

Reading time: 5 minutes

If you are a data analyst who needs to present a report to a client, if you are looking for a job and want your test assignment to stand out, or if you have a lot of educational projects related to data analytics and visualization, this post will be very, very useful to you.
Looking at someone else's code in a Jupyter Notebook can be problematic, because the result is often lost among lines of data preparation, library imports, and a series of attempts to implement the idea. That is why exporting the results to a PDF file via LaTeX is a great option for the final presentation: it saves time and looks presentable. In academic circles, articles and reports are very often formatted with LaTeX, since it has a number of advantages:

  • Math equations and formulas look neater.
  • The bibliography is automatically generated based on all references used in the document.
  • The author can focus on the content (not on the appearance of the document), since the layout of the text and other data is set automatically by specifying the necessary parameters in the code.

Today we will talk in detail about how to export such beautiful reports from Jupyter Notebook to PDF using LaTeX.

Installing LaTeX

The key step in generating a report from a Jupyter Notebook in Python is exporting it to the final file. The main library you need is nbconvert, which converts your notebook into any convenient document format: PDF (as in our case), HTML, LaTeX, etc. Besides installing nbconvert itself, you also need to pre-install several other packages: Pandoc, TeX, and Chromium. The whole process is described in detail for each piece of software in the nbconvert documentation, so we will not dwell on it here.
Once you have completed all the preliminary steps, you need to install and import the library into your Jupyter Notebook.

!pip install nbconvert
import nbconvert

Export tables to Markdown format

Usually, tables look a bit odd in reports, as they can be difficult to read quickly, but sometimes it is still necessary to add a small table to the final document. For the table to look neat, you need to render it in Markdown format. This can be done manually, but if there is a lot of data in the table, it is better to come up with a more convenient method. We suggest using the simple pandas_df_to_markdown_table() function below, which converts any dataframe into a Markdown table. Note: after the conversion the index disappears, so if it is important (as in our example), it is worth saving it into the first column of the dataframe.

import pandas as pd
import plotly.express as px

data_g = px.data.gapminder()
summary = round(data_g.describe(), 2)
summary.insert(0, 'metric', summary.index)

# Function to convert a dataframe to a Markdown table
def pandas_df_to_markdown_table(df):
    from IPython.display import Markdown, display
    fmt = ['---' for i in range(len(df.columns))]
    df_fmt = pd.DataFrame([fmt], columns=df.columns)
    df_formatted = pd.concat([df_fmt, df])
    display(Markdown(df_formatted.to_csv(sep="|", index=False)))

pandas_df_to_markdown_table(summary)

Export image to report

In this example, we will build a bubble chart, whose construction was described in a recent post. Previously we used the Seaborn library, which showed that displaying data through the size of circles on the chart works correctly. The same charts can be created with the Plotly library.
To display the plot in the final report, you also need to complete an additional step. The point is that plt.show() will not display the graph when exporting. Therefore, you need to save the chart to the working directory and then, using the IPython.display library, show it with the Image() function.

from IPython.display import Image
import plotly.express as px

fig = px.scatter(data_g.query("year == 2007"), x="gdpPercap", y="lifeExp",
                 size="pop", color="continent",
                 log_x=True, size_max=70)
fig.write_image('figure_1.jpg')
Image(data='figure_1.jpg', width=1000)

Formation and export of the report

When all stages of the data analysis are completed, the report can be exported. If you need headings or text in the report, write them in notebook cells, changing the cell format from Code to Markdown. For the export, you can use the terminal, running the second line there without the exclamation mark, or you can run the code below in a cell of the Jupyter Notebook. We advise you not to clutter the report with code and to use the --TemplateExporter.exclude_input=True parameter so that the code cells are not exported. Also, when you run this cell in your notebook, the command produces standard output, so add %%capture at the beginning of the cell to suppress it.

%%capture
!jupyter nbconvert --to pdf --TemplateExporter.exclude_input=True ~/Desktop/VALIOTTI/Reports/Sample\ LaTeX\ Report.ipynb
!open ~/Desktop/VALIOTTI/Reports/Sample\ LaTeX\ Report.pdf

If you did everything correctly and methodically, then you will end up with a report similar to this one!
Present your data nicely :)


Dashboard for the first 8 months of a child’s life

Reading time: 4 minutes

In December 2020, I became a dad, which means that my wife's and my family life has changed drastically. Of course, I am sharing this news for a reason, but in the context of the data that we will study and explore today. It is very personal to me, and therefore has a special magic and value. Today I want to show how dramatically family life changes, using my own analysis of data from the first 8 months of a baby's life.

Data collection

Initial data: tracking of the main elements of caring for a baby during the first 8 months: sleep, nursing, and diaper changes. The data was collected with the BabyTracker app.
My wife did a great job: during the first 7 months she carefully and regularly logged all the important events. Only a couple of times did she forget to turn off the nursing timer at night, but those cases showed up as noticeable outliers, and the dataset was cleaned of them.
Initially, I had several data visualization ideas in my head, and I tried to implement them right away in the planned dashboard. I wanted to show the baby's sleep intervals as a vertical Gantt chart, but night sleep spans midnight (0:00), and it was not at all clear how to handle this in Tableau. After a number of unsuccessful attempts to solve the problem on my own, I decided to consult Roman Bunin. Unfortunately, we jointly came to the conclusion that Tableau alone cannot solve this, so I had to write a little Python code that splits such time intervals and adds new rows to the dataset.
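Just to make the idea concrete, here is a minimal sketch of such a split, assuming the export has 'start' and 'end' timestamp columns; the real BabyTracker column names may differ.

import pandas as pd

def split_overnight(sleep_df):
    # every interval that crosses 0:00 becomes two rows:
    # one ending at midnight and one starting at midnight
    rows = []
    for _, row in sleep_df.iterrows():
        start, end = row['start'], row['end']
        if start.date() != end.date():
            midnight = pd.Timestamp(end.date())
            rows.append({'start': start, 'end': midnight})
            rows.append({'start': midnight, 'end': end})
        else:
            rows.append({'start': start, 'end': end})
    return pd.DataFrame(rows)

After this transformation, every interval fits within a single calendar day, so the vertical Gantt chart can be drawn without wrapping.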
However, while we were texting, Roma sent an example identical to my idea! In it, a woman collected data on her child's sleep and wakefulness during the first year of life and then wrote code to turn it into an embroidered towel with a data-vis pattern of the baby's sleep. This surprised me, because it turns out this kind of visualization is the go-to way to show how hard the life and sleep of parents are in the first months after a child is born.
In my dashboard on Tableau Public, I ended up with three semantic blocks and several "KPIs", which I would like to describe in detail while sharing some basic everyday wisdom. At the top of the dashboard, you can see the key averages of daytime and nighttime sleep hours, nursing hours, nursing frequency, and the number of diaper changes in the first three months. I singled out exactly three months because I consider this the most difficult period: significant changes that require serious adaptation are taking place in your life.

Sleep

The left diagram – called “Towel” – illustrates the baby’s sleeping periods. In this diagram, it is important to pay attention to white gaps, especially at night. These are the hours when the baby is awake, which means that the parents are also awake. Look at how the chart changes, especially in the early months, when we gave up the habit of going to bed at 1 or 2 in the morning and fell asleep earlier. Roughly speaking, in the first three months (until March 2021), the child could fall asleep at 2 or 3 in the morning, but we were lucky that our child’s night sleep was quite long.
The right graph clearly illustrates how the baby’s day and night sleep length changes over time, and the boxplots show the distribution of the hours of daytime and nighttime sleep. The graph confirms the conclusion: “This is temporary and will definitely get better soon!”

Nursing

The left diagram shows how the number and duration of nursing sessions change. Their number gradually decreases, and the nursing periods get shorter. Since mid-July, we have changed the way we track nursing, so the data after that point is not valid for this analysis.
From my point of view, these findings are a useful reality check for couples planning a pregnancy: do not harbor illusions about being able to work or do much of anything else in the first months after the birth. Pay attention to the frequency and duration of nursing; all that time the parent is fully occupied with the child. However, do not be overly alarmed: over time, the number of nursing sessions decreases.

Diaper change

The left graph is the highlight of this dashboard. As you can imagine, this is a map of the most fun moments – changing a diaper. The stars represent the moments of the day when you need to change the diaper, and the light gray color below shows the number of changes per day. The graph on the right shows diaper changes counted by part of the day. In general, the diagram does not show any interesting dependencies, however, it prepares you for the fact that this process is frequent, regular and happens at any time of the day.

Conclusions

It seems to me that the use of real personal data and such visualization is sometimes much more revealing than a lot of videos or books about what this period will be like. That is why I decided to share my findings and observations with you here. The main conclusion that I wanted you to draw from the dataviz: children are great! ❤️


Python and lyrics of Zemfira’s new album: capturing the spirit of her songs

Reading time: 16 minutes

Zemfira's latest studio album, Borderline, was released in February, 8 years after the previous one. Various people collaborated with her on this album, including her relatives – the riff for the song "Таблетки" was written by her nephew from London. The album turned out to be diverse: for instance, the song "Остин" is dedicated to the main character of the Homescapes game by the Russian studio Playrix (by the way, check out the latest Business Secrets with the Bukhman brothers, they also mention it there). Zemfira liked the game a lot, so she contacted Playrix to create this song. Also, the song "Крым" was written as a soundtrack to a new film by Zemfira's colleague Renata Litvinova.

Listen to the new album on Apple Music / Яндекс.Музыка / Spotify

Nevertheless, the spirit of the whole album is rather gloomy – the songs often repeat the words ‘боль’, ‘ад’, ‘бесишь’ and other synonyms. We decided to conduct an exploratory analysis of her album, and then, using the Word2Vec model and a cosine measure, look at the semantic closeness of the songs and calculate the general mood of the album.

For those who are bored with reading about data preparation and analysis steps, you can go directly to the results.

Data preparation

For starters, we write a data processing script. Its purpose is to assemble a single CSV table from a set of text files, each containing one song. Along the way, we have to get rid of all punctuation marks and unnecessary words, since we need to focus only on the meaningful content.

import pandas as pd
import re
import string
import pymorphy2
from nltk.corpus import stopwords

Then we create a morphological analyzer and expand the list of everything that needs to be discarded:

morph = pymorphy2.MorphAnalyzer()
stopwords_list = stopwords.words('russian')
stopwords_list.extend(['куплет', 'это', 'я', 'мы', 'ты', 'припев', 'аутро', 'предприпев', 'lyrics', '1', '2', '3', 'то'])
string.punctuation += '—'

The names of the songs are given in English, so we have to create a dictionary for translation into Russian and a dictionary, from which we will later make a table:

result_dict = dict()

songs_dict = {
    'snow':'снег идёт',
    'crimea':'крым',
    'mother':'мама',
    'ostin':'остин',
    'abuse':'абьюз',
    'wait_for_me':'жди меня',
    'tom':'том',
    'come_on':'камон',
    'coat':'пальто',
    'this_summer':'этим летом',
    'ok':'ок',
    'pills':'таблетки'
}

Let’s define several necessary functions. The first one reads the entire song from the file and removes line breaks, the second clears the text from unnecessary characters and words, and the third one converts the words to normal form, using the pymorphy2 morphological analyzer. The pymorphy2 module does not always handle ambiguity well – additional processing is required for the words ‘ад’ and ‘рай’.

def read_song(filename):
    f = open(f'{filename}.txt', 'r').read()
    f = f.replace('\n', ' ')
    return f

def clean_string(text):
    text = re.split(' |:|\.|\(|\)|,|"|;|/|\n|\t|-|\?|\[|\]|!', text)
    text = ' '.join([word for word in text if word not in string.punctuation])
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in stopwords_list])
    return text

def string_to_normal_form(string):
    string_lst = string.split()
    for i in range(len(string_lst)):
        string_lst[i] = morph.parse(string_lst[i])[0].normal_form
        if (string_lst[i] == 'аду'):
            string_lst[i] = 'ад'
        if (string_lst[i] == 'рая'):
            string_lst[i] = 'рай'
    string = ' '.join(string_lst)
    return string

After all this preparation, we can get back to the data and process each song and read the file with the corresponding name:

name_list = []
text_list = []
for song, name in songs_dict.items():
    text = string_to_normal_form(clean_string(read_song(song)))
    name_list.append(name)
    text_list.append(text)

Then we combine everything into a DataFrame and save it as a csv-file.

df = pd.DataFrame()
df['name'] = name_list
df['text'] = text_list
df['time'] = [290, 220, 187, 270, 330, 196, 207, 188, 269, 189, 245, 244]
df.to_csv('borderline.csv', index=False)

Result:

Word cloud for the whole album

To begin the analysis, we will construct a word cloud, since it displays the most common words found in the songs. We import the required libraries, read the csv-file and set the configurations:

import nltk
from wordcloud import WordCloud
import pandas as pd
import matplotlib.pyplot as plt
from nltk import word_tokenize, ngrams

%matplotlib inline
nltk.download('punkt')
df = pd.read_csv('borderline.csv')

Now we create a new figure, set the design parameters and, using the word cloud library, display words with the size directly proportional to the frequency of the word. We additionally indicate the name of the song above the corresponding graph.

fig = plt.figure()
fig.patch.set_facecolor('white')
plt.subplots_adjust(wspace=0.3, hspace=0.2)
i = 1
for name, text in zip(df.name, df.text):
    tokens = word_tokenize(text)
    text_raw = " ".join(tokens)
    wordcloud = WordCloud(colormap='PuBu', background_color='white', contour_width=10).generate(text_raw)
    plt.subplot(4, 3, i, label=name,frame_on=True)
    plt.tick_params(labelsize=10)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(name,fontdict={'fontsize':7,'color':'grey'},y=0.93)
    plt.tick_params(labelsize=10)
    i += 1

EDA of the lyrics

Let us move to the next part and analyze the lyrics. To do this, we have to import special libraries to deal with data and visualization:

import plotly.graph_objects as go
import plotly.figure_factory as ff
from scipy import spatial
import collections
import pymorphy2
import gensim

morph = pymorphy2.MorphAnalyzer()

Firstly, we should count the overall number of words in each song, the number of unique words, and their percentage:

songs = []
total = []
uniq = []
percent = []

for song, text in zip(df.name, df.text):
    songs.append(song)
    total.append(len(text.split()))
    uniq.append(len(set(text.split())))
    percent.append(round(len(set(text.split())) / len(text.split()), 2) * 100)

All this information should be written in a DataFrame and additionally we want to count the number of words per minute for each song:

df_words = pd.DataFrame()
df_words['song'] = songs
df_words['total words'] = total
df_words['uniq words'] = uniq
df_words['percent'] = percent
df_words['time'] = df['time']
df_words['words per minute'] = round(total / (df['time'] // 60))
df_words = df_words[::-1]

It would be great to visualize the data, so let us build two bar charts: one for the number of words in the song, and the other one for the number of words per minute.

colors_1 = ['rgba(101,181,205,255)'] * 12
colors_2 = ['rgba(62,142,231,255)'] * 12

fig = go.Figure(data=[
    go.Bar(name='📝 Total number of words',
           text=df_words['total words'],
           textposition='auto',
           x=df_words.song,
           y=df_words['total words'],
           marker_color=colors_1,
           marker=dict(line=dict(width=0)),),
    go.Bar(name='🌀 Unique words',
           text=df_words['uniq words'].astype(str) + '<br>'+ df_words.percent.astype(int).astype(str) + '%' ,
           textposition='inside',
           x=df_words.song,
           y=df_words['uniq words'],
           textfont_color='white',
           marker_color=colors_2,
           marker=dict(line=dict(width=0)),),
])

fig.update_layout(barmode='group')

fig.update_layout(
    title = 
        {'text':'<b>The ratio of the number of unique words to the total</b><br><span style="color:#666666"></span>'},
    showlegend = True,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)',
)
fig.update_layout(legend=dict(
    yanchor="top",
    xanchor="right",
))

fig.show()
colors_1 = ['rgba(101,181,205,255)'] * 12
colors_2 = ['rgba(238,85,59,255)'] * 12

fig = go.Figure(data=[
    go.Bar(name='⏱️ Track length, min.',
           text=round(df_words['time'] / 60, 1),
           textposition='auto',
           x=df_words.song,
           y=-df_words['time'] // 60,
           marker_color=colors_1,
           marker=dict(line=dict(width=0)),
          ),
    go.Bar(name='🔄 Words per minute',
           text=df_words['words per minute'],
           textposition='auto',
           x=df_words.song,
           y=df_words['words per minute'],
           marker_color=colors_2,
           textfont_color='white',
           marker=dict(line=dict(width=0)),
          ),
])

fig.update_layout(barmode='overlay')

fig.update_layout(
    title = 
        {'text':'<b>Track length and words per minute</b><br><span style="color:#666666"></span>'},
    showlegend = True,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)'
)


fig.show()

Working with Word2Vec model

Using the gensim module, load the model pointing to a binary file:

model = gensim.models.KeyedVectors.load_word2vec_format('model.bin', binary=True)

For this article, we used a ready-made model from the RusVectōrēs community, trained on the Russian National Corpus.

The Word2Vec model is based on neural networks and allows you to represent words as vectors that take the semantic component into account. It means that if we take two words – for instance, "mom" and "dad" – represent them as two vectors and calculate the cosine between them, the value will be close to 1. Similarly, two words that have nothing in common in meaning will have a cosine measure close to 0.
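As a quick illustration of this idea with the loaded model (RusVectōrēs keys are POS-tagged, the words are assumed to be in the model's vocabulary, and the exact numbers depend on the particular model file):

# semantically related words → similarity close to 1,
# unrelated words → similarity close to 0
print(model.similarity('мама_NOUN', 'папа_NOUN'))
print(model.similarity('мама_NOUN', 'стол_NOUN'))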

Now we will define the get_vector function: it will take a list of words, recognize a part of speech for each word, and then receive and summarize vectors, so that we can find vectors even for whole sentences and texts.

def get_vector(word_list):
    vector = 0
    for word in word_list:
        pos = morph.parse(word)[0].tag.POS
        if pos == 'INFN':
            pos = 'VERB'
        if pos in ['ADJF', 'PRCL', 'ADVB', 'NPRO']:
            pos = 'NOUN'
        if word and pos:
            try:
                word_pos = word + '_' + pos
                this_vector = model.word_vec(word_pos)
                vector += this_vector
            except KeyError:
                continue
    return vector

For each song, we find a vector and add the corresponding column to the DataFrame:

vec_list = []
for word in df['text']:
    vec_list.append(get_vector(word.split()))
df['vector'] = vec_list

So, now we should compare these vectors with one another, calculating their cosine proximity. Those songs with a cosine metric higher than 0.5 will be saved separately – this way we will get the closest pairs of songs. We will write the information about the comparison of vectors into the two-dimensional list result.

similar = dict()
result = []
for song_1, vector_1 in zip(df.name, df.vector):
    sub_list = []
    for song_2, vector_2 in zip(df.name.iloc[::-1], df.vector.iloc[::-1]):
        res = 1 - spatial.distance.cosine(vector_1, vector_2)
        if res > 0.5 and song_1 != song_2 and (song_1 + ' / ' + song_2 not in similar.keys() and song_2 + ' / ' + song_1 not in similar.keys()):
            similar[song_1 + ' / ' + song_2] = round(res, 2)
        sub_list.append(round(res, 2))
    result.append(sub_list)

Let's also collect the closest pairs into a DataFrame:

df_top_sim = pd.DataFrame()
df_top_sim['name'] = list(similar.keys())
df_top_sim['value'] = list(similar.values())
df_top_sim = df_top_sim.sort_values(by='value', ascending=False)

And build a bar chart from it:

colors = ['rgba(101,181,205,255)'] * 5

fig = go.Figure([go.Bar(x=df_top_sim['name'],
                        y=df_top_sim['value'],
                        marker_color=colors,
                        width=[0.4,0.4,0.4,0.4,0.4],
                        text=df_top_sim['value'],
                        textfont_color='white',
                        textposition='auto')])

fig.update_layout(
    title = 
        {'text':'<b>Top 5 closest songs</b><br><span style="color:#666666"></span>'},
    showlegend = False,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)',
    xaxis={'categoryorder':'total descending'}
)

fig.show()

Given the vector of each song, let us calculate the vector of the entire album – add the vectors of the songs. Then, for such a vector, using the model, we get the words that are the closest in spirit and meaning.

def get_word_from_tlist(lst):
    for word in lst:
        word = word[0].split('_')[0]
        print(word, end=' ')

vec_sum = 0
for vec in df.vector:
    vec_sum += vec
sim_word = model.similar_by_vector(vec_sum)
get_word_from_tlist(sim_word)

небо тоска тьма пламень плакать горе печаль сердце солнце мрак

This is probably the key result and the description of Zemfira’s album in just 10 words.

Finally, we build a general heat map, each cell of which is the result of comparing the texts of two tracks with a cosine measure.

colorscale=[[0.0, "rgba(255,255,255,255)"],
            [0.1, "rgba(229,232,237,255)"],
            [0.2, "rgba(216,222,232,255)"],
            [0.3, "rgba(205,214,228,255)"],
            [0.4, "rgba(182,195,218,255)"],
            [0.5, "rgba(159,178,209,255)"],
            [0.6, "rgba(137,161,200,255)"],
            [0.7, "rgba(107,137,188,255)"],
            [0.8, "rgba(96,129,184,255)"],
            [1.0, "rgba(76,114,176,255)"]]

font_colors = ['black']
x = list(df.name.iloc[::-1])
y = list(df.name)
fig = ff.create_annotated_heatmap(result, x=x, y=y, colorscale=colorscale, font_colors=font_colors)
fig.show()

Results and data interpretation

To draw meaningful conclusions, we would like to take another look at everything we got. First of all, let us focus on the word cloud. It is easy to see that the words 'боль', 'невозможно', 'сорваться', 'растерзаны', 'сложно', 'терпеть', 'любить' are rendered quite large, because such words occur frequently throughout the album's lyrics:


The song "Крым" turned out to be one of the most lexically diverse: 74% of its words are unique. The song "Снег идёт", on the other hand, contains very few words, so most of them – 82% – are unique. The largest song on the album in terms of word count is "Таблетки", with about 150 words in total.

As the last chart showed, the most dynamic track is "Таблетки", with as many as 37 words per minute – more than one word every two seconds. The longest track is "Абьюз", which, according to the previous chart, also has the lowest share of unique words – 46%.

Top 5 most semantically similar text pairs:

We also got the vector of the entire album and found the closest words. Just take a look at them – ‘тьма’, ‘тоска’, ‘плакать’, ‘горе’, ‘печаль’, ‘сердце’ – this is the list of words that characterizes Zemfira’s lyrics!

небо тоска тьма пламень плакать горе печаль сердце солнце мрак

The final result is a heat map. From the visualization, it is noticeable that almost all songs are quite similar to each other – the cosine measure for many pairs exceeds the value of 0.4.

Conclusions

In this article, we carried out an EDA of the new album's lyrics and, using a pre-trained Word2Vec model, confirmed the hypothesis that most of the "Borderline" songs are permeated with rather dark lyrics. However, this is normal, because we love Zemfira precisely for her sincerity and straightforwardness.


Building frequency counts and bigrams using the posts of traders

Reading time: 9 minutes

Stocktwits is the largest social network for investors and traders of all levels, and it lets us see what is happening in the financial markets. Today we will build frequency counts and bigrams from users' posts and divide the users into groups by number of followers. This will let us see how the posts of different types of traders differ.

This is what the feed for the CCIV ticker looks like on Stocktwits:

Some users have the status of officials:

Scraping the posts

Stocktwits has an API that returns 30 posts at a time. The API request returns a JSON file, so we will write a get_30_messages function that reads the JSON and appends all the entries to a list called rows. The post data already contains the user information, so we will not create separate tables and will store everything in one DataFrame. For this purpose, we create a list with the column names and initialize an empty list called rows, to which we will append all the scraped posts.

Some posts don’t have a “likes” key in the JSON file which results in KeyError. To avoid the error, we will assign 0 to the “likes” in such posts.

cols = ['post_id', 'text', 'created_at', 'user_id', 'likes', 'sentiment', 'symbol', 'identity', 'followers', 'following', 'ideas', 'watchlist_stocks_count', 'like_count', 'plus_tier']
rows = []
 
def get_30_messages(data):
    for p in data['messages']:
        try:
            likes = p['likes']['total']
        except KeyError:
            likes = 0
        rows.append({'post_id': p['id'],   # key named to match the cols list
                    'text': p['body'],
                    'created_at': p['created_at'],
                    'user_id': p['user']['id'],
                    'likes': likes,
                    'sentiment': p['entities']['sentiment'],
                    'symbol': symbol,
                    'identity': p['user']['identity'],
                    'followers': p['user']['followers'],
                    'following': p['user']['following'],
                    'ideas': p['user']['ideas'],
                    'watchlist_stocks_count': p['user']['watchlist_stocks_count'],
                    'like_count': p['user']['like_count'],
                    # presumably meant to read the user's plus_tier field rather than like_count again
                    'plus_tier': p['user'].get('plus_tier')
                    })

We will scrape the posts from the pages of the 16 most trending securities.

symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']

As the API request returns only 30 most recent posts, to get older posts, we need to save the id of the last post into a dictionary and insert it as the max parameter during the next request. Unfortunately, the API allows us to make only 200 requests per hour, so in order to stay within the limits, we will run the for loop for each security only 11 times.

import requests
import json
import time
import pandas as pd

last_id_values = dict()

for symbol in symbols:
    file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json")
    data = json.loads(file.content)

    for i in range(10):
        get_30_messages(data)

        last_id = data['cursor']['max']
        last_id_values[symbol] = last_id

        file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id}")
        data = json.loads(file.content)

    get_30_messages(data)

Thus, we have collected only about 6,000 posts, which is not enough for the analysis. That's why we will set up a timer that reruns the same code every 1 hour and 5 minutes, for 11 cycles.

def get_older_posts():
    for symbol in symbols:
        for i in range(12):
            file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id_values[symbol]}")
            data = json.loads(file.content)        
            get_30_messages(data)
 
            last_id = data['cursor']['max']
            last_id_values[symbol] = last_id
 
for i in range(11):
    time.sleep(3900)
    get_older_posts()

After all the data is collected, let’s create a DataFrame.

df = pd.DataFrame(rows, columns = cols)

The resulting table will look like this:

It is important to check that post_id doesn't have duplicate values. By comparing the number of unique values with the total number of values in post_id, we can see that there are about 10,000 duplicates.

df.post_id.nunique(), len(df.post_id)

This happened because some posts get posted on multiple pages. So the last step will be dropping the duplicate values.

df.drop_duplicates(subset="posts_id", inplace=True)

Frequency counts and bigrams

First of all, let’s create a frequency count for posts without dividing them into groups.

df.text.str.split(expand=True).stack().value_counts()

We can see that articles, conjunctions, and prepositions prevail over the other words:

Thus, we need to remove them from the dataset. However, even after the dataset is cleaned, the results still look like this. Apart from the fact that "39" (likely a leftover from HTML-encoded apostrophes, &#39;) is the most frequent token, the data is not very informative and it's difficult to draw any conclusions from it.

In this case, we need to build bigrams. A bigram is a sequence of two elements, that is, two words standing next to each other; a toy example is shown below. There are many algorithms for building n-grams with different levels of optimization, and we will use nltk's built-in function to create bigrams for one group. To prepare, we will import the additional libraries, download the stop words for English, clean the data, and then add more stop words, including the names of the stock tickers that appear in every post.
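To make the term concrete, this is what nltk.bigrams returns for a short phrase:

import nltk

list(nltk.bigrams('buy the dip next week'.split()))
# [('buy', 'the'), ('the', 'dip'), ('dip', 'next'), ('next', 'week')]

With the term clear, let's import the libraries and prepare the extended stop-word list: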

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
import unicodedata
import collections

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

english_stopwords = stopwords.words("english")
symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']
symbols_lower = [sym.lower() for sym in symbols]
append_stopword = ['https', 'www', 'btc', 'x', 's', 't', 'p', 'amp', 'utm', 'm', 'gon', 'na', '’', '2021', '04', 'stocktwits', 'com', 'spx', 'ndx', 'gld', 'slv', 'es', 'f', '...', '--', 'cqginc', 'cqgthom', 'gt']
english_stopwords.extend(symbols_lower)
english_stopwords.extend(append_stopword)

Let's define a function to prepare the text: it converts all words to lowercase, lemmatizes them to their base form, and removes stop words and punctuation.

wordnet_lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens if token not in english_stopwords\
              and token != " " \
              and token.strip() not in punctuation]

    text = " ".join(tokens)

    return text

df.text = df.text.apply(preprocess_text)

For example, let’s take the group of the least popular users with less than 300 followers, build bigrams and output the most frequent ones.

non_pop_df = df[(df['followers'] < 300)]
 
non_pop_counts = collections.Counter()
for sent in non_pop_df.text:
    words = nltk.word_tokenize(sent)
    non_pop_counts.update(nltk.bigrams(words))
non_pop_counts.most_common()
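The remaining follower groups can be processed in exactly the same way; here is a sketch, with the boundaries mirroring the groups discussed in the results below:

groups = {
    'from 300 to 3000': df[(df['followers'] >= 300) & (df['followers'] < 3000)],
    'from 3000 to 30000': df[(df['followers'] >= 3000) & (df['followers'] < 30000)],
    'more than 30000': df[df['followers'] >= 30000],
}

for name, group_df in groups.items():
    counts = collections.Counter()
    for sent in group_df.text:
        counts.update(nltk.bigrams(nltk.word_tokenize(sent)))
    print(name, counts.most_common(12))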

Results of the bigrams study

Users with less than 300 followers mostly write about their personal plans for making money. This is shown by collocations like short term, long term, and make money.
Less than 300 followers:
1. look like, 439
2. next week, 422
3. let 39, 364
4. capital gain, 306
5. long term, 274
6. let go, 261
7. stock market, 252
8. buy dip, 252
9. gain tax, 221
10. make money, 203
11. short term, 201
12. buy buy, 192

More popular users with 300 to 3000 followers discuss more abstract issues like sweep premium, stock price and artificial intelligence.
From 300 to 3000 followers:
1. sweep premium, 166
2. price target, 165
3. total day, 140
4. stock market, 139
5. ask premium, 132
6. stock price, 129
7. current stock, 117
8. money trade, 114
9. trade option, 114
10. activity alert, 113
11. trade volume, 113
12. artificial intelligence, 113

Popular users with 3,000 to 30,000 followers share their observations as well as promote their accounts or articles.
From 3000 to 30000 followers:
1. unusual option, 632
2. print size, 613
3. option activity, 563
4. large print, 559
5. activity alerted, 355
6. observed unusual, 347
7. sweepcast observed, 343
8. |🎯 see, 311
9. see profile, 253
10. profile link, 241
11. call expiring, 235
12. new article, 226

Very popular traders with more than 30000 followers mostly act as information sources and post about changes at the stock market. This is indicated by the frequent up and down arrows and collocations like “stock x-day” or “moving average”.
Users with more than 30000 followers:
1. dow stock, 69
2. elliottwave trading, 53
3. ⇩ indexindicators.com, 51
4. ⇧ indexindicators.com, 50
5. u stock, 47
6. stock 5-day, 36
7. moving average, 29
8. stock moving, 28
9. stock x-day, 27
10. ⇧ 10-day, 26
11. stock daily, 25
12. daily rsi, 25

We also built the bigrams for officials, but the results turned out to be very similar to those of the most popular users.
