8 posts tagged

plotly

Building a scatter plot for Untappd Breweries

⏱ Reading time: 7 minutes

Today we are going to build a scatter plot for Russian breweries that shows the relationship between the number of reviews and the average rating over the past 30 days. The data comes from check-ins left by Untappd users who rated beers. Each brewery is drawn as a marker with its own color and size. The color depends on the brewery’s registration date and so shows how long it has been on Untappd, while the size of the marker reflects the range of beers it offers. This article is the first part of our series dedicated to building dashboards with Plotly.

Writing a Clickhouse query

First, we need to process the data before using it in our dashboard. Here, we are using public data collected from Untappd. You can find more about this in our previous articles: “Handling website buttons in Selenium” and “Example of using dictionaries in Clickhouse with Untappd”.

from datetime import datetime, timedelta
from clickhouse_driver import Client
import plotly.graph_objects as go
import pandas as pd
import numpy as np
client = Client(host='ec1-2-34-567-89.us-east-2.compute.amazonaws.com', user='default', password='', port='9000', database='default')

Our scatter plot is built by the get_scatter_plot(n_days, top_n) function, which takes two arguments: the time span in days and the number of breweries to display. Let’s write a SQL query to calculate the Brewery Pure Average: sum the rating scores of each beer over the period, divide each sum by the brewery’s total number of reviews, and add up the results across the brewery’s beers. We also fetch the brewery name and its beer range from our dictionary using the dictGet function. The HAVING clause keeps only the breweries with at least 150 reviews in the period.

brewery_pure_average = client.execute(f"""
SELECT
       t1.brewery_id,
       sum(t1.beer_pure_average_mult_count / t2.count_for_that_brewery) AS brewery_pure_average,
       t2.count_for_that_brewery,
       dictGet('breweries', 'brewery_name', toUInt64(t1.brewery_id)),
       dictGet('breweries', 'beer_count', toUInt64(t1.brewery_id)),
       t3.stats_age_on_service / 365
   FROM
   (
       SELECT
           beer_id,
           brewery_id,
           sum(rating_score) AS beer_pure_average_mult_count
       FROM beer_reviews
       WHERE created_at >= today()-{n_days}
       GROUP BY
           beer_id,
           brewery_id
   ) AS t1
   ANY LEFT JOIN
   (
       SELECT
           brewery_id,
           count(rating_score) AS count_for_that_brewery
       FROM beer_reviews
       WHERE created_at >= today()-{n_days}
       GROUP BY brewery_id
   ) AS t2 ON t1.brewery_id = t2.brewery_id
   ANY LEFT JOIN
   (
       SELECT
           brewery_id,
           stats_age_on_service
       FROM brewery_info
   ) AS t3 ON t1.brewery_id = t3.brewery_id
   GROUP BY
       t1.brewery_id,
       t2.count_for_that_brewery,
       t3.stats_age_on_service
   HAVING t2.count_for_that_brewery >= 150
   ORDER BY brewery_pure_average
   LIMIT {top_n}
    """)

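# The query returns six columns; the last one is the brewery's age on the service in years
scatter_plot_df_with_age = pd.DataFrame(brewery_pure_average, columns=['brewery_id', 'brewery_pure_average', 'rating_count', 'brewery_name', 'beer_count', 'age_on_service'])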
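To make the arithmetic of the query easier to follow, here is a minimal sketch in plain Python that mirrors the t1 and t2 subqueries on a handful of made-up check-ins. The `reviews` list and the function name are assumptions for illustration only and are not part of the original pipeline.

```python
from collections import defaultdict

# Hypothetical check-ins: (brewery_id, beer_id, rating_score)
reviews = [
    (1, 10, 4.0),
    (1, 10, 3.5),
    (1, 11, 4.5),
    (2, 20, 3.0),
    (2, 21, 4.0),
]

def brewery_pure_average(reviews):
    beer_sums = defaultdict(float)     # like t1: sum of ratings per (brewery, beer)
    brewery_counts = defaultdict(int)  # like t2: number of ratings per brewery
    for brewery_id, beer_id, rating in reviews:
        beer_sums[(brewery_id, beer_id)] += rating
        brewery_counts[brewery_id] += 1

    # Outer query: add up each beer's sum divided by the brewery's rating count
    averages = defaultdict(float)
    for (brewery_id, _beer_id), rating_sum in beer_sums.items():
        averages[brewery_id] += rating_sum / brewery_counts[brewery_id]
    return dict(averages)

result = brewery_pure_average(reviews)
```

Because every beer's sum is divided by the same brewery-level count, the result is equal to the plain mean of all ratings the brewery received in the period.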

Working with a DataFrame

Add two dotted lines that pass through the median value of each axis. That way we can see which breweries are above the median on both measures: the best ones end up in the upper right area.

dict_list = []
dict_list.append(dict(type="line",
                     line=dict(
                         color="#666666",
                         dash="dot"),
                     x0=0,
                     y0=np.median(scatter_plot_df_with_age.brewery_pure_average),
                     x1=7000,
                     y1=np.median(scatter_plot_df_with_age.brewery_pure_average),
                     line_width=1,
                     layer="below"))
dict_list.append(dict(type="line",
                     line=dict(
                         color="#666666",
                         dash="dot"),
                     x0=np.median(scatter_plot_df_with_age.rating_count),
                     y0=0,
                     x1=np.median(scatter_plot_df_with_age.rating_count),
                     y1=5,
                     line_width=1,
                     layer="below"))
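The two shape dictionaries differ only in which axis carries the median, so they could also be produced by a small helper. The sketch below is one possible way to do it; the `median_line` name, its parameters, and the sample values are assumptions, and the line width is set inside the `line` dictionary rather than through a separate key.

```python
from statistics import median

def median_line(values, axis, span):
    """Return a Plotly shape dict for a dotted line through the median of `values`.

    axis: 'y' draws a horizontal line at the median, 'x' draws a vertical one.
    span: (start, end) extent of the line along the other axis.
    """
    m = median(values)
    start, end = span
    if axis == 'y':
        coords = dict(x0=start, y0=m, x1=end, y1=m)
    else:
        coords = dict(x0=m, y0=start, x1=m, y1=end)
    return dict(type="line",
                line=dict(color="#666666", dash="dot", width=1),
                layer="below",
                **coords)

# Hypothetical sample values standing in for the DataFrame columns
ratings = [3.4, 3.7, 3.9]
counts = [200, 450, 1200]
median_shapes = [
    median_line(ratings, axis='y', span=(0, 7000)),
    median_line(counts, axis='x', span=(0, 5)),
]
```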

Add annotations that label the median values next to the lines:

annotations_list = []
annotations_list.append(
    dict(
        x=8000,
        y=np.median(scatter_plot_df_with_age.brewery_pure_average) - 0.1,
        xref="x",
        yref="y",
        text=f"Median value: {round(np.median(scatter_plot_df_with_age.brewery_pure_average), 2)}",
        showarrow=False,
        font={
            'family':'Roboto, light',
            'color':'#666666',
            'size':12
        }
    )
)
annotations_list.append(
    dict(
        x=np.median(scatter_plot_df_with_age.rating_count) + 180,
        y=0.8,
        xref="x",
        yref="y",
        text=f"Median value: {round(np.median(scatter_plot_df_with_age.rating_count), 2)}",
        showarrow=False,
        font={
            'family':'Roboto, light',
            'color':'#666666',
            'size':12
        },
        textangle=-90
    )
)

Let’s make our plot more informative by splitting the breweries into 4 groups according to their beer range. The first group includes breweries with fewer than 10 brands, the second those with 10–30 brands, the third those with 31–50 brands, and the last one large breweries with more than 50 brands. The marker size for each brewery is stored in the bucket_beer_count list.

bucket_beer_count = []
for beer_count in scatter_plot_df_with_age.beer_count:
   if beer_count < 10:
       bucket_beer_count.append(7)
   elif 10 <= beer_count <= 30:
       bucket_beer_count.append(9)
   elif 31 <= beer_count <= 50:
       bucket_beer_count.append(11)
   else:
       bucket_beer_count.append(13)
scatter_plot_df_with_age['bucket_beer_count'] = bucket_beer_count
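The same mapping can be written as a lookup over sorted boundaries. This is only a sketch of an alternative, assuming that beer counts are whole numbers; the constant and function names are made up for the example.

```python
from bisect import bisect_right

# Lower bounds of the second, third and fourth groups
BEER_COUNT_BOUNDS = [10, 31, 51]
# Marker sizes for the four groups, smallest range first
MARKER_SIZES = [7, 9, 11, 13]

def marker_size(beer_count):
    # bisect_right counts how many lower bounds the value has reached,
    # which is exactly the index of its group
    return MARKER_SIZES[bisect_right(BEER_COUNT_BOUNDS, beer_count)]

sizes = [marker_size(count) for count in (3, 10, 30, 31, 50, 120)]
```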

The next step is to split the breweries by their age on the service:

bucket_age = []
for age in scatter_plot_df_with_age.age_on_service:
   if age < 4:
       bucket_age.append(0)
   elif 4 <= age <= 6:
       bucket_age.append(1)
   elif 6 < age < 8:
       bucket_age.append(2)
   else:
       bucket_age.append(3)
scatter_plot_df_with_age['bucket_age'] = bucket_age

Let’s divide our DataFrame into 4 parts, one per age group, so that each can be plotted as a separate trace with its own color.

scatter_plot_df_0 = scatter_plot_df_with_age[scatter_plot_df_with_age.bucket_age == 0]
scatter_plot_df_1 = scatter_plot_df_with_age[scatter_plot_df_with_age.bucket_age == 1]
scatter_plot_df_2 = scatter_plot_df_with_age[scatter_plot_df_with_age.bucket_age == 2]
scatter_plot_df_3 = scatter_plot_df_with_age[scatter_plot_df_with_age.bucket_age == 3]
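The plotting code in the next section reads its hover text from a name_count column, which is not created anywhere in the code shown here. The sketch below is a guess at how such a summary string might be assembled; the exact wording and layout of the text are assumptions, as is the `hover_summary` helper.

```python
def hover_summary(brewery_name, brewery_pure_average, rating_count, beer_count):
    # <br> produces a line break inside Plotly hover text
    return (f"<b>{brewery_name}</b><br>"
            f"Average rating: {round(brewery_pure_average, 2)}<br>"
            f"Number of reviews: {rating_count}<br>"
            f"Beer range: {beer_count}")

# Hypothetical row with the same fields as the DataFrame columns
rows = [
    {'brewery_name': 'Example Brewery', 'brewery_pure_average': 3.876,
     'rating_count': 420, 'beer_count': 35},
]
name_count = [hover_summary(**row) for row in rows]
```

A column like this would need to be added to the DataFrame before it is split into the four parts, so that every part carries it.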

Plotting

Now we are ready to build the plot. We add our 4 brewery groups one by one, setting the key parameters of each: name, marker color and size, opacity, and hover text.

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=scatter_plot_df_0.rating_count,
    y=scatter_plot_df_0.brewery_pure_average,
    name='< 4',
    mode='markers',
    opacity=0.85,
    text=scatter_plot_df_0.name_count,
    marker_color='rgb(114, 183, 178)',
    marker_size=scatter_plot_df_0.bucket_beer_count,
    textfont={"family":"Roboto, light",
              "color":"black"
             }
))

fig.add_trace(go.Scatter(
    x=scatter_plot_df_1.rating_count,
    y=scatter_plot_df_1.brewery_pure_average,
    name='4 – 6',
    mode='markers',
    opacity=0.85,
    marker_color='rgb(76, 120, 168)',
    text=scatter_plot_df_1.name_count,
    marker_size=scatter_plot_df_1.bucket_beer_count,
    textfont={"family":"Roboto, light",
              "color":"black"
             }
))

fig.add_trace(go.Scatter(
    x=scatter_plot_df_2.rating_count,
    y=scatter_plot_df_2.brewery_pure_average,
    name='6 – 8',
    mode='markers',
    opacity=0.85,
    marker_color='rgb(245, 133, 23)',
    text=scatter_plot_df_2.name_count,
    marker_size=scatter_plot_df_2.bucket_beer_count,
    textfont={"family":"Roboto, light",
              "color":"black"
             }
))

fig.add_trace(go.Scatter(
    x=scatter_plot_df_3.rating_count,
    y=scatter_plot_df_3.brewery_pure_average,
    name='8+',
    mode='markers',
    opacity=0.85,
    marker_color='rgb(228, 87, 86)',
    text=scatter_plot_df_3.name_count,
    marker_size=scatter_plot_df_3.bucket_beer_count,
    textfont={"family":"Roboto, light",
              "color":"black"
             }
))

fig.update_layout(
    title=f"The ratio between the number of reviews and the average brewery rating for the past <br> {n_days} days, top {top_n} breweries",
    font={
            'family':'Roboto, light',
            'color':'black',
            'size':14
        },
    plot_bgcolor='rgba(0,0,0,0)',
    yaxis_title="Average rating",
    xaxis_title="Number of reviews",
    legend_title_text='Registration period<br> on Untappd in years:',
    height=750,
    shapes=dict_list,
    annotations=annotations_list
)
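The four add_trace calls differ only in the legend label, the color, and the rows they plot, so the traces could also be generated in a loop. Here is a minimal sketch that builds them as plain dictionaries; the `build_traces` function and the sample rows are assumptions, while the labels and colors are copied from the traces in this article.

```python
# Legend label and marker color for each age bucket
GROUPS = [
    ('< 4', 'rgb(114, 183, 178)'),
    ('4 – 6', 'rgb(76, 120, 168)'),
    ('6 – 8', 'rgb(245, 133, 23)'),
    ('8+', 'rgb(228, 87, 86)'),
]

def build_traces(rows):
    """rows: list of dicts with the keys rating_count, brewery_pure_average,
    name_count, bucket_beer_count and bucket_age."""
    traces = []
    for bucket, (label, color) in enumerate(GROUPS):
        group = [row for row in rows if row['bucket_age'] == bucket]
        traces.append(dict(
            type='scatter',
            mode='markers',
            name=label,
            opacity=0.85,
            x=[row['rating_count'] for row in group],
            y=[row['brewery_pure_average'] for row in group],
            text=[row['name_count'] for row in group],
            marker=dict(color=color,
                        size=[row['bucket_beer_count'] for row in group]),
        ))
    return traces

# Hypothetical rows for illustration
sample_rows = [
    {'rating_count': 300, 'brewery_pure_average': 3.8, 'name_count': 'A',
     'bucket_beer_count': 7, 'bucket_age': 0},
    {'rating_count': 900, 'brewery_pure_average': 4.1, 'name_count': 'B',
     'bucket_beer_count': 13, 'bucket_age': 3},
]
traces = build_traces(sample_rows)
```

Plotly accepts trace dictionaries that carry a type key, so a list like this can be passed as go.Figure(data=traces). With a DataFrame, the rows could come from to_dict('records').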

Voilà, the scatter plot is done! Each point is a separate brewery. The color shows how long the brewery has been registered on Untappd, the marker size shows its beer range, and hovering over a point displays a summary with the average rating for the past 30 days, the number of reviews, the brewery name, and the beer range. The dotted lines pass through the median values we calculated with NumPy, so the best breweries are in the upper right. In our next article, we are going to create a breweries dashboard with dynamic parameters.

2020 dash plotly python untappd

Sentiment analysis of Russians on Constitutional Amendments

⏱ Reading time: 11 minutes

In today’s article, we are going to use public data from vk.com to interpret and classify users’ attitudes about the 2020 amendments to the Constitution of Russia.

API Overview

First off, we need to fetch the data with the newsfeed.search method, which returns up to one thousand of the latest news feed posts that match a keyword.
The response contains many fields, such as post ids, user or community ids, text, like counts, comments, apps, and geolocation. We only need the ids and the text.
Some additional information about the authors will also be useful for our analysis: city, gender, and age, all of which can be retrieved with the users.get method.

Create Clickhouse Tables

The received data has to be stored somewhere. We chose ClickHouse, an open-source column-oriented DBMS. Let’s create two tables to store users and their posts. The first table will hold post ids and text data, the second one will hold user data, such as ids, age, and city. The ReplacingMergeTree() engine removes rows with a duplicate sorting key during background merges.

The article assumes that you’re familiar with how to install ClickHouse on AWS, create external dictionaries and materialized views.

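-- Columns follow the order in which the collection script inserts values,
-- including the search keyword that it appends last
CREATE TABLE vk_posts(
   post_date DateTime,
   post_id UInt64,
   owner_id UInt64,
   from_id UInt64,
   text String,
   query_string String
) ENGINE ReplacingMergeTree()
ORDER BY post_date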

CREATE TABLE vk_users(
   user_id UInt64,
   user_sex Nullable(UInt8),
   user_city String,
   user_age Nullable(UInt16)
) ENGINE ReplacingMergeTree()
ORDER BY user_id

Collecting user posts with the VK API

Let’s get to writing our script, import the libraries, and create several variables with constant values:

If you don’t have an access token yet and want to create one, refer to this step by step guide: “Collecting Data on Ad Campaigns from VK.com”

from clickhouse_driver import Client
from datetime import datetime
import re
import requests
import pandas as pd
import time

token = 'your_token'
version = 5.103
client = Client(host='ec1-23-456-789-1011.us-east-2.compute.amazonaws.com', user='default', password='', port='9000', database='default')      
data_list = []
start_from = 0
query_string = 'конституция' #constitution

Define the get_and_insert_info_by_user function, which takes a list of user ids, requests expanded information about those users, and writes it to the vk_users table. Since the user_ids parameter expects a comma-separated string, the list has to be converted to a string without the square brackets.
Most users prefer to conceal their gender, age, or city. In such cases we store Nullable values. A user’s age is the current year minus the birth year, and a regular expression checks whether the birth date actually includes a year.

get_and_insert_info_by_user() function

def get_and_insert_info_by_user(users):
    try:
        r = requests.get('https://api.vk.com/method/users.get', params={
            'access_token': token,
            'v': version,
            # user_ids expects a comma-separated string, not a Python list
            'user_ids': ','.join(str(user_id) for user_id in users),
            'fields': 'sex, city, bdate'
        }).json()['response']
        for user in r:
            # Skip users that are already stored
            if client.execute(f"SELECT count(1) FROM vk_users WHERE user_id={user['id']}")[0][0] == 0:
                print(user['id'])
                # Hidden fields are missing from the response: store NULL (None) or an empty string
                sex = user.get('sex')
                city = user.get('city', {}).get('title', '')
                # The last part of bdate is the year only when the user has not hidden it
                year = user.get('bdate', '').split('.')[-1]
                age = datetime.now().year - int(year) if re.fullmatch(r'\d{4}', year) else None
                # Pass the row as data so that None is written as NULL
                client.execute('INSERT INTO vk_users VALUES', [(user['id'], sex, city, age)])
    except KeyError:
        pass
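The age calculation is easy to get wrong, so it can help to isolate it in a small helper that is testable without calling the API. The sketch below assumes that the birth date arrives as a dot-separated string whose last part is the year when the year is visible; the `age_from_bdate` name and the explicit `current_year` parameter are choices made for this example.

```python
import re

def age_from_bdate(bdate, current_year):
    """Return the age in years, or None when the birth year is not available."""
    if not bdate:
        return None
    year = bdate.split('.')[-1]
    # Only a four-digit last part is treated as a year
    if re.fullmatch(r'\d{4}', year):
        return current_year - int(year)
    return None

ages = [age_from_bdate(b, 2020) for b in ('15.3.1990', '15.3', '', None)]
```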

Our script runs in a while loop to keep the data up to date, since each pass can only reach the thousand most recent posts. The newsfeed.search method returns 200 posts per call, so we need to call it five times to collect all of them.

While loop to collect new posts

while True:
    for i in range(5):
        r = requests.get('https://api.vk.com/method/newsfeed.search', params={
            'access_token':token,
            'v':version,
            'q':query_string,
            'count':200,
            'start_from': start_from
        })
        data_list.append(r.json()['response'])
        try:
            start_from = r.json()['response']['next_from']
        except KeyError:
            pass

Now the received data can be parsed. VK users always have a positive id, while community ids are negative. We only need user posts for our analysis, that is, those where from_id > 0. The next step is to check whether a post contains any text. Finally, we collect the user ids and store unique entries for each user. The script pauses for 180 seconds after each iteration to wait for new posts and to stay within the VK API limits.

Adding new data to Clickhouse

    user_ids = []
    for data in data_list:
        for data_item in data['items']:
            if data_item['from_id'] > 0:
                post_list = []
                if not data_item['text']:
                    continue
                if client.execute(f"SELECT count(1) FROM vk_posts WHERE post_id={data_item['id']} AND from_id={data_item['from_id']}")[0][0] == 0:
                    user_ids.append(data_item['from_id'])
                    date = datetime.fromtimestamp(data_item['date'])
                    date = datetime.strftime(date, '%Y-%m-%d %H:%M:%S')
                    post_list.append(date)
                    post_list.append(data_item['id'])
                    post_list.append(data_item['owner_id'])
                    post_list.append(data_item['from_id'])
                    post_list.append(data_item['text'].replace("'","").replace('"','').replace("\n",""))
                    post_list.append(query_string)
                    post_tuple = tuple(post_list)
                    print(post_list)
                    try:
                        client.execute(f'INSERT INTO vk_posts VALUES {post_tuple}')
                    except Exception as E:
                        print('!!!!! try to insert into vk_post but got', E)
    try:
        get_and_insert_info_by_user(user_ids)
    except Exception as E:
        print("Try to insert user list:", user_ids, "but got:", E)
    time.sleep(180)
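The filtering rules described above can be checked without a database or an API token. The following sketch applies them to a few hand-written items; the `select_user_posts` helper and the sample data are assumptions, and only the from_id and text checks come from the script itself.

```python
def select_user_posts(items):
    """Keep posts written by users (from_id > 0) that contain text.

    Returns the kept posts and the unique author ids in order of appearance.
    """
    posts = []
    user_ids = []
    for item in items:
        if item['from_id'] <= 0 or not item['text']:
            continue  # community post or empty text
        posts.append(item)
        if item['from_id'] not in user_ids:
            user_ids.append(item['from_id'])
    return posts, user_ids

# Hypothetical items shaped like entries of response['items']
items = [
    {'id': 1, 'from_id': 100, 'text': 'first post'},
    {'id': 2, 'from_id': -55, 'text': 'community post'},
    {'id': 3, 'from_id': 100, 'text': 'second post'},
    {'id': 4, 'from_id': 200, 'text': ''},
]
posts, user_ids = select_user_posts(items)
```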

Dostoevsky for sentiment analysis

Over one week our script collected almost 20,000 posts from VK users that mention the keyword “constitution” (or “конституция” in Russian). It’s time to write our second script for data analysis and visualization. First, create a DataFrame with the data received, and evaluate the sentiment of each post, identifying whether it’s positive, negative, or neutral. We are going to use the Dostoevsky library to analyze the emotion behind a text.

from dostoevsky.tokenization import RegexTokenizer
from dostoevsky.models import FastTextSocialNetworkModel
from clickhouse_driver import Client
import pandas as pd
client = Client(host='ec1-23-456-789-1011.us-east-2.compute.amazonaws.com', user='default', password='', port='9000', database='default')

Assign all the contents of our table to the vk_posts variable with a simple query. Iterate through all the posts, select those with text data and populate our DataFrame.

vk_posts = client.execute('SELECT * FROM vk_posts')
list_of_posts = []
list_of_ids = []
for post in vk_posts:
    if str(post[-2]).replace(" ", ""):
        list_of_posts.append(str(post[-2]).replace("\n",""))
        list_of_ids.append(int(post[2]))
df_posts = pd.DataFrame()
df_posts['post'] = list_of_posts
df_posts['id'] = list_of_ids

Instantiate our model and iterate through the posts to evaluate the sentiment of each entry.

tokenizer = RegexTokenizer()
model = FastTextSocialNetworkModel(tokenizer=tokenizer)
sentiment_list = []
results = model.predict(list_of_posts, k=2)
for sentiment in results:
    sentiment_list.append(sentiment)

Add several columns to our DataFrame that hold the model’s score for each sentiment: neutral, negative, and positive. A label that is missing from a prediction is recorded as 0.

neutral_list = []
negative_list = []
positive_list = []
speech_list = []
skip_list = []
for sentiment in sentiment_list:
    neutral = sentiment.get('neutral')
    negative = sentiment.get('negative')
    positive = sentiment.get('positive')
    if neutral is None:
        neutral_list.append(0)
    else:
        neutral_list.append(sentiment.get('neutral'))
    if negative is None:
        negative_list.append(0)
    else:
        negative_list.append(sentiment.get('negative'))
    if positive is None:
        positive_list.append(0)
    else:
        positive_list.append(sentiment.get('positive'))
df_posts['neutral'] = neutral_list
df_posts['negative'] = negative_list
df_posts['positive'] = positive_list
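Since each prediction is a dictionary that may lack some labels, the three lists can also be built with a default value in one pass. This is a compact sketch of the same idea; the `scores_by_label` helper and the sample predictions are made up for illustration.

```python
LABELS = ('neutral', 'negative', 'positive')

def scores_by_label(predictions, labels=LABELS):
    # A label that is absent from a prediction gets a score of 0
    return {label: [p.get(label, 0) for p in predictions] for label in labels}

# Hypothetical predictions shaped like the dictionaries in sentiment_list
predictions = [
    {'neutral': 0.71, 'negative': 0.22},
    {'negative': 0.93, 'skip': 0.05},
    {'positive': 0.64, 'neutral': 0.31},
]
columns = scores_by_label(predictions)
```

Each list in `columns` could then be assigned to the DataFrame column of the same name.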

That’s how the DataFrame looks now:

Let’s examine the most negative posts:

df_posts[df_posts.negative > 0.9]

Now, let’s add data about the authors of these posts by merging two tables together on the id column.

vk_users = client.execute('SELECT * FROM vk_users')
vk_user_ids_list = []
vk_user_sex_list = []
vk_user_city_list = []
vk_user_age_list = []
for user in vk_users:
    vk_user_ids_list.append(user[0])
    vk_user_sex_list.append(user[1])
    vk_user_city_list.append(user[2])
    vk_user_age_list.append(user[3])
df_users = pd.DataFrame()
df_users['id'] = vk_user_ids_list
df_users['sex'] = vk_user_sex_list
df_users['city'] = vk_user_city_list
df_users['age'] = vk_user_age_list
df = df_posts.merge(df_users, on='id')

And the table now looks like this:

Analysing data with Plotly

Check out our previous article on data visualization with Plotly: “Building an interactive waterfall chart in Python”

Let’s find the percentage of posts in each group: positive, negative, and neutral. Iterate over these three columns and count the posts whose score is greater than zero. Then do the same for different age categories and genders.
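The counting step might look like the sketch below, which works on a list of row dictionaries instead of the DataFrame. The `sentiment_shares` function and the sample rows are assumptions; the original charts may have been computed differently.

```python
def sentiment_shares(rows, labels=('positive', 'negative', 'neutral')):
    """Percentage of posts whose score for a label is greater than zero."""
    total = len(rows)
    if total == 0:
        return {label: 0.0 for label in labels}
    return {
        label: round(100 * sum(1 for row in rows if row[label] > 0) / total, 1)
        for label in labels
    }

# Hypothetical rows with the three score columns
rows = [
    {'positive': 0, 'negative': 0.9, 'neutral': 0.1},
    {'positive': 0, 'negative': 0, 'neutral': 0.8},
    {'positive': 0.6, 'negative': 0, 'neutral': 0.3},
    {'positive': 0, 'negative': 0.7, 'neutral': 0},
]
shares = sentiment_shares(rows)
```

Note that a post can have a non-zero score for more than one label, so shares computed this way do not have to add up to 100%.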

According to our chart, 45% of recent user posts matching the keyword “constitution” are negative, while another 52% are neutral. Later we will see how much opinions on the Internet differ from the voting results.

It’s noticeable that among men the share of positive posts is below 2%, while among women it’s 3.5%. However, the share of negative posts is almost the same in both groups: 47% and 43% respectively.

According to our analysis, posts by younger users aged 18 to 25 are the most positive, with 6% positive sentiment. Users under 18 leave mostly negative posts. This may be because most users under 18 prefer to hide their real age, which makes it difficult to obtain accurate data for this group.
The share of negative posts is almost equal across all groups, at about 44%.
As you can see, the distribution is similar in all three charts. This means that almost half of the posts by VK users that matched the keyword “constitution” over the past week have a negative sentiment.

2020 Analytics engineering data analytics plotly

Building an interactive waterfall chart in Python

⏱ Reading time: 4 minutes

Back in 2014, we built a waterfall chart in Excel, a chart type widely known in the consulting world, for one of our Ulmart presentations about the e-commerce market. It’s been a while, and today we are going to draw one in Python with the Plotly library. This type of chart is often used to illustrate how a value changes as new positive or negative factors appear. In our previous article about data visualization, we explained how to build a beautiful Bar Chart with bars that resemble thermometers, which is especially useful when we want to compare planned targets with actual values.

We are using the Ulmart data on e-commerce market growth from 2013 to 2014. The X-axis holds the chart captions, and the Y-axis holds the initial value and the changes. We calculate the final value with the sum() function and append it to the end of the list. The <br> tag in x_list inserts a line break in the caption.

import plotly.graph_objects as go

x_list = ['2013','The Russian <br>Macroeconomy', 'Decline in working age<br>population','Internet usage growth','Development of<br>cross-border trade', 'National companies', '2014']
y_list = [738.5, 48.7, -7.4, 68.7, 99.7, 48.0]
total = round(sum(y_list))
y_list.append(total)
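A waterfall chart places each relative bar on top of the running total of the values before it. As a quick check of the numbers, the running totals can be computed with itertools.accumulate; the `changes` and `running` names below are introduced only for this sketch.

```python
from itertools import accumulate

# Initial 2013 value followed by the five changes, as in y_list before the total is appended
changes = [738.5, 48.7, -7.4, 68.7, 99.7, 48.0]

# Level reached after each step, rounded to one decimal to hide float noise
running = [round(level, 1) for level in accumulate(changes)]
```

The last running total, rounded to a whole number, is the value that the script appends to y_list as the 2014 column.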

Let’s create a list of bar labels called text_list. The values are taken from y_list, but first we need to transform them. Convert all numerical values into strings and, unless it’s the first or the last column, add a plus sign to positive changes for clarity. Positive changes are colored green and negative ones red. Highlight the first and the last values with the <b> tag:

text_list = []
for index, item in enumerate(y_list):
    if item > 0 and index != 0 and index != len(y_list) - 1:
        text_list.append(f'+{str(y_list[index])}')
    else:
        text_list.append(str(y_list[index]))
for index, item in enumerate(text_list):
    if item[0] == '+' and index != 0 and index != len(text_list) - 1:
        text_list[index] = '<span style="color:#2ca02c">' + text_list[index] + '</span>'
    elif item[0] == '-' and index != 0 and index != len(text_list) - 1:
        text_list[index] = '<span style="color:#d62728">' + text_list[index] + '</span>'
    if index == 0 or index == len(text_list) - 1:
        text_list[index] = '<b>' + text_list[index] + '</b>'

Let’s set the parameters for the dotted lines we want to add. Create a list of dictionaries and fill it with gray dotted lines, one every 200 units along the Y-axis:

dict_list = []
for i in range(0, 1200, 200):
    dict_list.append(dict(
            type="line",
            line=dict(
                 color="#666666",
                 dash="dot"
            ),
            x0=-0.5,
            y0=i,
            x1=6,
            y1=i,
            line_width=1,
            layer="below"))

Now, create a graph object with go.Waterfall(). Each column in our chart can be of a certain type: total or absolute (both hold final values) or relative (holds an intermediate change). Then we set the colors: the connecting line is transparent, positive changes are green, negative ones are red, and the final columns are purple. Here we are using the Open Sans font.

Learn more about how to choose the right fonts for your data visualization from this article: “Choosing Fonts for Your Data Visualization”

fig = go.Figure(go.Waterfall(
    name = "e-commerce", orientation = "v",
    measure = ["absolute", "relative", "relative", "relative", "relative", "relative", "total"],
    x = x_list,
    y = y_list,
    text = text_list,
    textposition = "outside",
    connector = {"line":{"color":'rgba(0,0,0,0)'}},
    increasing = {"marker":{"color":"#2ca02c"}},
    decreasing = {"marker":{"color":"#d62728"}},
    totals={'marker':{"color":"#9467bd"}},
    textfont={"family":"Open Sans, light",
              "color":"black"
             }
))

Finally, add the title with the description, hide the legend, set the Y label and add dashed lines to our chart.

fig.update_layout(
    title = 
        {'text':'<b>Waterfall chart</b><br><span style="color:#666666">E-commerce market growth from 2013 to 2014</span>'},
    showlegend = False,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)',
    yaxis_title="bn RUB",  # billions of rubles ("млрд руб." in the original)
    shapes=dict_list
)
fig.update_xaxes(tickangle=-45, tickfont=dict(family='Open Sans, light', color='black', size=14))
fig.update_yaxes(tickangle=0, tickfont=dict(family='Open Sans, light', color='black', size=14))

fig.show()

And here it is:

2020 data analytics plotly python visualisation

LEFT JOIN: blog on analytics, visualisation & data science