7 posts tagged

plotly

How to build Animated Charts like Hans Rosling in Plotly

Estimated read time – 12 min

Hans Rosling’s work on world countries economic growth presented in 2007 at TEDTalks can be attributed to one of the most iconic data visualizations, ever created. Just check out this video, in case you don’t know what we’re talking about:

Sometimes we want to compare standards of living in other countries. One way to do this is to refer to the Big Mac index, which the Economist magazine has kept track of since 1986. The key idea this index represents is to measure purchasing power parity (PPP) in different countries, considering costs of domestic production. To make a standard burger, one would need the following ingredients: cheese, meat, bread and vegetables. Considering that all these ingredients can be produced locally, we can compare the production cost of one Big Mac in different countries, and measure purchasing power. Plus, McDonald’s is the world’s most popular franchise network, its restaurants are almost everywhere around the globe.

In today’s material, we will build a Motion Chart for the Big Mac index using Plotly. Following Hann Rosling’s idea, the chart will display country population along the X-axis and GDP per capita in US dollars along the Y. The size of the dots is going to be proportional to the Big Mac Index for a given country. And the color of the dots will represent the continent where the country is located.

Preparing Data

Even though The Economist has been updating it for over 30 years and sharing its observations publicly, the dataset contains many missing values. It also lacks continents names, but we can handle it by supplementing the data with some more datasets that can be found in our repo.

Let’s start by importing the libraries:

import pandas as pd
from pandas.errors import ParserError
import plotly.graph_objects as go
import numpy as np
import requests
import io

We can access the dataset directly from GitHub. Just use the following function to send a GET request to a CSV file and create a Pandas DataFrame. However, in some cases, this may raise a  ParseError because of the caption title, so we will add a try block:

def read_raw_file(link):
    raw_csv = requests.get(link).content
    try:
        df = pd.read_csv(io.StringIO(raw_csv.decode('utf-8')))
    except ParserError:
        df = pd.read_csv(io.StringIO(raw_csv.decode('utf-8')), skiprows=3)
    return df

bigmac_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/big-mac.csv')
population_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/population.csv')
dgp_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/gdp.csv')
continents_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/continents.csv')

From The Economist dataset we will need these columns: country name, local price, dollar exchange rate, country code (iso_a3) and record date. Take the timeline from 2005 to 2020, as the records are most complete for this span. And divide the local price by the exchange rate to calculate the price of one Big Mac in US dollars.

bigmac_df = bigmac_df[['name', 'local_price', 'dollar_ex', 'iso_a3', 'date']]
bigmac_df = bigmac_df[bigmac_df['date'] >= '2005-01-01']
bigmac_df = bigmac_df[bigmac_df['date'] < '2020-01-01']
bigmac_df['date'] = pd.DatetimeIndex(bigmac_df['date']).year
bigmac_df = bigmac_df.drop_duplicates(['date', 'name'])
bigmac_df = bigmac_df.reset_index(drop=True)
bigmac_df['dollar_price'] = bigmac_df['local_price'] / bigmac_df['dollar_ex']

Take a look at the result:

Next, let’s try adding a new column called continents. To ease the task, leave only two columns containing country code and continent name. Then we need to iterate through the bigmac_df[‘iso_a3’] column, adding a continent name for the corresponding values. However some cases may raise an error, because it’s not really clear, whether a country belongs to Europe or Asia, we will consider such cases as Europe by default.

continents_df = continents_df[['Continent_Name', 'Three_Letter_Country_Code']]
continents_list = []
for country in bigmac_df['iso_a3']:
    try:
        continents_list.append(continents_df.loc[continents_df['Three_Letter_Country_Code'] == country]['Continent_Name'].item())
    except ValueError:
        continents_list.append('Europe')
bigmac_df['continent'] = continents_list

Now we can drop unnecessary columns, apply sorting by country names and date, convert values in the date column into integers, and view the current result:

bigmac_df = bigmac_df.drop(['local_price', 'iso_a3', 'dollar_ex'], axis=1)
bigmac_df = bigmac_df.sort_values(by=['name', 'date'])
bigmac_df['date'] = bigmac_df['date'].astype(int)

Then we need to fill up missing values for The Big Mac index with zeros and remove the Republic of China, since this partially recognized state is not included in the World Bank datasets. The UAE occurs several times, this can lead to issues.

countries_list = list(bigmac_df['name'].unique())
years_set = {i for i in range(2005, 2020)}
for country in countries_list:
    if len(bigmac_df[bigmac_df['name'] == country]) < 15:
        this_continent = bigmac_df[bigmac_df['name'] == country].continent.iloc[0]
        years_of_country = set(bigmac_df[bigmac_df['name'] == country]['date'])
        diff = years_set - years_of_country
        dict_to_df = pd.DataFrame({
                      'name':[country] * len(diff),
                      'date':list(diff),
                      'dollar_price':[0] * len(diff),
                      'continent': [this_continent] * len(diff)
                     })
        bigmac_df = bigmac_df.append(dict_to_df)
bigmac_df = bigmac_df[bigmac_df['name'] != 'Taiwan']
bigmac_df = bigmac_df[bigmac_df['name'] != 'United Arab Emirates']

Next, let’s augment the data with GDP per capita and population from other datasets. Both datasets have differences in country names, so we need to specify such cases explicitly and replace them.

years = [str(i) for i in range(2005, 2020)]

countries_replace_dict = {
    'Russian Federation': 'Russia',
    'Egypt, Arab Rep.': 'Egypt',
    'Hong Kong SAR, China': 'Hong Kong',
    'United Kingdom': 'Britain',
    'Korea, Rep.': 'South Korea',
    'United Arab Emirates': 'UAE',
    'Venezuela, RB': 'Venezuela'
}
for key, value in countries_replace_dict.items():
    population_df['Country Name'] = population_df['Country Name'].replace(key, value)
    gdp_df['Country Name'] = gdp_df['Country Name'].replace(key, value)

Finally, extract population data and GDP for the given years, adding the data to the bigmac_df DataFrame:

countries_list = list(bigmac_df['name'].unique())

population_list = []
gdp_list = []
for country in countries_list:
    population_for_country_df = population_df[population_df['Country Name'] == country][years]
    population_list.extend(list(population_for_country_df.values[0]))
    gdp_for_country_df = gdp_df[gdp_df['Country Name'] == country][years]
    gdp_list.extend(list(gdp_for_country_df.values[0]))
    
bigmac_df['population'] = population_list
bigmac_df['gdp'] = gdp_list
bigmac_df['gdp_per_capita'] = bigmac_df['gdp'] / bigmac_df['population']

And here is our final dataset:

Creating a chart in Plotly

The population in China or India, on average, is 10 times more than in other countries. That’s why we need to transform X-axis to Log Scale, to make the chart easier for interpreting. The log-transformation is a common way to address skewness in data.

fig_dict = {
    "data": [],
    "layout": {},
    "frames": []
}

fig_dict["layout"]["xaxis"] = {"title": "Population", "type": "log"}
fig_dict["layout"]["yaxis"] = {"title": "GDP per capita (in $)", "range":[-10000, 120000]}
fig_dict["layout"]["hovermode"] = "closest"
fig_dict["layout"]["updatemenus"] = [
    {
        "buttons": [
            {
                "args": [None, {"frame": {"duration": 500, "redraw": False},
                                "fromcurrent": True, "transition": {"duration": 300,
                                                                    "easing": "quadratic-in-out"}}],
                "label": "Play",
                "method": "animate"
            },
            {
                "args": [[None], {"frame": {"duration": 0, "redraw": False},
                                  "mode": "immediate",
                                  "transition": {"duration": 0}}],
                "label": "Pause",
                "method": "animate"
            }
        ],
        "direction": "left",
        "pad": {"r": 10, "t": 87},
        "showactive": False,
        "type": "buttons",
        "x": 0.1,
        "xanchor": "right",
        "y": 0,
        "yanchor": "top"
    }
]

We will also add a slider to filter data within a certain range:

sliders_dict = {
    "active": 0,
    "yanchor": "top",
    "xanchor": "left",
    "currentvalue": {
        "font": {"size": 20},
        "prefix": "Year: ",
        "visible": True,
        "xanchor": "right"
    },
    "transition": {"duration": 300, "easing": "cubic-in-out"},
    "pad": {"b": 10, "t": 50},
    "len": 0.9,
    "x": 0.1,
    "y": 0,
    "steps": []
}

By default, the chart will display data for 2005 before we click on the “Play” button.

continents_list_from_df = list(bigmac_df['continent'].unique())
year = 2005
for continent in continents_list_from_df:
    dataset_by_year = bigmac_df[bigmac_df["date"] == year]
    dataset_by_year_and_cont = dataset_by_year[dataset_by_year["continent"] == continent]
    
    data_dict = {
        "x": dataset_by_year_and_cont["population"],
        "y": dataset_by_year_and_cont["gdp_per_capita"],
        "mode": "markers",
        "text": dataset_by_year_and_cont["name"],
        "marker": {
            "sizemode": "area",
            "sizeref": 200000,
            "size":  np.array(dataset_by_year_and_cont["dollar_price"]) * 20000000
        },
        "name": continent,
        "customdata": np.array(dataset_by_year_and_cont["dollar_price"]).round(1),
        "hovertemplate": '<b>%{text}</b>' + '<br>' +
                         'GDP per capita: %{y}' + '<br>' +
                         'Population: %{x}' + '<br>' +
                         'Big Mac price: %{customdata}$' +
                         '<extra></extra>'
    }
    fig_dict["data"].append(data_dict)

Next, we need to fill up the frames field, which will be used for animating the data. Each frame represents a certain data point from 2005 to 2019.

for year in years:
    frame = {"data": [], "name": str(year)}
    for continent in continents_list_from_df:
        dataset_by_year = bigmac_df[bigmac_df["date"] == int(year)]
        dataset_by_year_and_cont = dataset_by_year[dataset_by_year["continent"] == continent]

        data_dict = {
            "x": list(dataset_by_year_and_cont["population"]),
            "y": list(dataset_by_year_and_cont["gdp_per_capita"]),
            "mode": "markers",
            "text": list(dataset_by_year_and_cont["name"]),
            "marker": {
                "sizemode": "area",
                "sizeref": 200000,
                "size": np.array(dataset_by_year_and_cont["dollar_price"]) * 20000000
            },
            "name": continent,
            "customdata": np.array(dataset_by_year_and_cont["dollar_price"]).round(1),
            "hovertemplate": '<b>%{text}</b>' + '<br>' +
                             'GDP per capita: %{y}' + '<br>' +
                             'Population: %{x}' + '<br>' +
                             'Big Mac price: %{customdata}$' +
                             '<extra></extra>'
        }
        frame["data"].append(data_dict)

    fig_dict["frames"].append(frame)
    slider_step = {"args": [
        [year],
        {"frame": {"duration": 300, "redraw": False},
         "mode": "immediate",
         "transition": {"duration": 300}}
    ],
        "label": year,
        "method": "animate"}
    sliders_dict["steps"].append(slider_step)

Just a few finishing touches left, instantiate the chart, set colors, fonts and title.

fig_dict["layout"]["sliders"] = [sliders_dict]

fig = go.Figure(fig_dict)

fig.update_layout(
    title = 
        {'text':'<b>Motion chart</b><br><span style="color:#666666">The Big Mac index from 2005 to 2019</span>'},
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)'
)
fig.update_yaxes(nticks=4)
fig.update_xaxes(tickfont=dict(family='Open Sans, light', color='black', size=12), nticks=4, gridcolor='lightgray', gridwidth=0.5)
fig.update_yaxes(tickfont=dict(family='Open Sans, light', color='black', size=12), nticks=4, gridcolor='lightgray', gridwidth=0.5)

fig.show()

Bingo! The Motion Chart is done:

View the code on GitHub

 No comments    266   9 mon   Analytics engineering   data analytics   plotly

How to build a dashboard with Bootstrap 4 from scratch (Part 2)

Estimated read time – 12 min

Previously we shared how to use Bootstrap components in building dashboard layout and designed a simple yet flexible dashboard with a scatter plot and Russian map. In today’s material, we will continue adding more information, explore how to make Bootstrap tables responsive, and cover some complex callbacks for data acquisition.

Constructing Data Tables

All the code for populating our tables with data will be stored in get_tables.py , while the layout components areoutlined in  application.py. This article will cover the process of creating the table with top Russian Breweries, however, you can find the code for creating the other three on Github.

Data in the Top Breweries table can be filtered by city name in the dropdown menu, but the data collected in Untappd is not equally structured. Some city names are written in Latin, others in Cyrillic. So the challenge is to make the names equal for SQL queries, and here is where Google Translate comes to the rescue. Though we sill have to manually create a dictionary of city names, since for example “Москва” can be written as “Moskva” and not “Moscow”. This dictionary will be used later for mapping our DataFrame before transforming it into a Bootstrap table.

import pandas as pd
import dash_bootstrap_components as dbc
from clickhouse_driver import Client
import numpy as np
from googletrans import Translator

translator = Translator()

client = Client(host='12.34.56.78', user='default', password='', port='9000', database='')

city_names = {
   'Moskva': 'Москва',
   'Moscow': 'Москва',
   'СПБ': 'Санкт-Петербург',
   'Saint Petersburg': 'Санкт-Петербург',
   'St Petersburg': 'Санкт-Петербург',
   'Nizhnij Novgorod': 'Нижний Новгород',
   'Tula': 'Тула',
   'Nizhniy Novgorod': 'Нижний Новгород',
}

Top Breweries Table

This table displays top 10 Russian breweries and their position change according to the rating. Simply put, we need to compare data for two periods, that’s [30 days ago; today] and [60 days ago; 30 days ago]. With this in mind, we will need the following headers: ranking, brewery name, position change, and number of check-ins.
Create the  get_top_russian_breweries function that would make queries to the Clickhouse DB, sort the data and return a refined Pandas DataFrame. Let’s send the following queries to obtain data for the past 30 and 60 days, ordering the results by the number of check-ins.


Querying data from the Database

def get_top_russian_breweries(checkins_n=250):
   top_n_brewery_today = client.execute(f'''
      SELECT  rt.brewery_id,
              rt.brewery_name,
              beer_pure_average_mult_count/count_for_that_brewery as avg_rating,
              count_for_that_brewery as checkins FROM (
      SELECT           
              brewery_id,
              dictGet('breweries', 'brewery_name', toUInt64(brewery_id)) as brewery_name,
              sum(rating_score) AS beer_pure_average_mult_count,
              count(rating_score) AS count_for_that_brewery
          FROM beer_reviews t1
          ANY LEFT JOIN venues AS t2 ON t1.venue_id = t2.venue_id
          WHERE isNotNull(venue_id) AND (created_at >= (today() - 30)) AND (venue_country = 'Россия') 
          GROUP BY           
              brewery_id,
              brewery_name) rt
      WHERE (checkins>={checkins_n})
      ORDER BY avg_rating DESC
      LIMIT 10
      '''
   )

top_n_brewery_n_days = client.execute(f'''
  SELECT  rt.brewery_id,
          rt.brewery_name,
          beer_pure_average_mult_count/count_for_that_brewery as avg_rating,
          count_for_that_brewery as checkins FROM (
  SELECT           
          brewery_id,
          dictGet('breweries', 'brewery_name', toUInt64(brewery_id)) as brewery_name,
          sum(rating_score) AS beer_pure_average_mult_count,
          count(rating_score) AS count_for_that_brewery
      FROM beer_reviews t1
      ANY LEFT JOIN venues AS t2 ON t1.venue_id = t2.venue_id
      WHERE isNotNull(venue_id) AND (created_at >= (today() - 60) AND created_at <= (today() - 30)) AND (venue_country = 'Россия')
      GROUP BY           
          brewery_id,
          brewery_name) rt
  WHERE (checkins>={checkins_n})
  ORDER BY avg_rating DESC
  LIMIT 10
  '''
)

Creating two DataFrames with the received data:

top_n = len(top_n_brewery_today)
column_names = ['brewery_id', 'brewery_name', 'avg_rating', 'checkins']

top_n_brewery_today_df = pd.DataFrame(top_n_brewery_today, columns=column_names).replace(np.nan, 0)
top_n_brewery_today_df['brewery_pure_average'] = round(top_n_brewery_today_df.avg_rating, 2)
top_n_brewery_today_df['brewery_rank'] = list(range(1, top_n + 1))

top_n_brewery_n_days = pd.DataFrame(top_n_brewery_n_days, columns=column_names).replace(np.nan, 0)
top_n_brewery_n_days['brewery_pure_average'] = round(top_n_brewery_n_days.avg_rating, 2)
top_n_brewery_n_days['brewery_rank'] = list(range(1, len(top_n_brewery_n_days) + 1))

And then calculate the position change over the period of time for each brewery received. With the try-except block, we will handle exceptions, in case, if a brewery was not yet in our database 60 days ago.

rank_was_list = []
for brewery_id in top_n_brewery_today_df.brewery_id:
   try:
       rank_was_list.append(
           top_n_brewery_n_days[top_n_brewery_n_days.brewery_id == brewery_id].brewery_rank.item())
   except ValueError:
       rank_was_list.append('–')
top_n_brewery_today_df['rank_was'] = rank_was_list

Now we iterate over the columns with current and former positions. If there is no hyphen contained in, we will append an up or down arrow depending on the change.

diff_rank_list = []
for rank_was, rank_now in zip(top_n_brewery_today_df['rank_was'], top_n_brewery_today_df['brewery_rank']):
   if rank_was != '–':
       difference = rank_was - rank_now
       if difference > 0:
           diff_rank_list.append(f'↑ +{difference}')
       elif difference < 0:
           diff_rank_list.append(f'↓ {difference}')
       else:
           diff_rank_list.append('–')
   else:
       diff_rank_list.append(rank_was)

Finally, replace DataFrame headers, inserting the column with current ranking positions, where the top 3 will be displayed with the trophy emoji.

df = top_n_brewery_today_df[['brewery_name', 'avg_rating', 'checkins']].round(2)
df.insert(2, 'Position change', diff_rank_list)
df.columns = ['NAME', 'RATING', 'POSITION CHANGE', 'CHECK-INS']
df.insert(0, 'RANKING', list('🏆 ' + str(i) if i in [1, 2, 3] else str(i) for i in range(1, len(df) + 1)))

return df

Filtering data by city name

One of the main tasks we set before creating this dashboard was to find out what are the most liked breweries in a certain city. The user chooses a city in the dropdown menu and gets the results. Sound pretty simple, but is it that easy?
Our next step is to write a script that would update data for each city and store it in separate CSV files. As we mentioned earlier, the city names are not equally structured, so we need to use Google Translator within the if-else block, and since it may not convert some names to Cyrillic we need to explicitly specify such cases:

en_city = venue_city
if en_city == 'Nizhnij Novgorod':
      ru_city = 'Нижний Новгород'
elif en_city == 'Perm':
      ru_city = 'Пермь'
elif en_city == 'Sergiev Posad':
      ru_city = 'Сергиев Посад'
elif en_city == 'Vladimir':
      ru_city = 'Владимир'
elif en_city == 'Yaroslavl':
      ru_city = 'Ярославль'
else:
      ru_city = translator.translate(en_city, dest='ru').text

Then we need to add both city names in English and Russian to the SQL query, to receive all check-ins sent from this city.

WHERE (rt.venue_city='{ru_city}' OR rt.venue_city='{en_city}')

Finally, we export received data into a CSV file in the following directory – data/cities.

df = top_n_brewery_today_df[['brewery_name', 'venue_city', 'avg_rating', 'checkins']].round(2)
df.insert(3, 'Position Change', diff_rank_list)
df.columns = ['NAME', 'CITY', 'RATING', 'POSITION CHANGE', 'CHECK-INS']
# MAPPING
df['CITY'] = df['CITY'].map(lambda x: city_names[x] if (x in city_names) else x)
# TRANSLATING
df['CITY'] = df['CITY'].map(lambda x: translator.translate(x, dest='en').text)
df.to_csv(f'data/cities/{en_city}.csv', index=False)
print(f'{en_city}.csv updated!')

Scheduling Updates

We will use the apscheduler library to automatically run the script and refresh data for each city in all_cities every day at 10:30 am (UTC).

from apscheduler.schedulers.background import BackgroundScheduler
from get_tables import update_best_breweries

all_cities = sorted(['Vladimir', 'Voronezh', 'Ekaterinburg', 'Kazan', 'Red Pakhra', 'Krasnodar',
             'Kursk', 'Moscow', 'Nizhnij Novgorod', 'Perm', 'Rostov-on-Don', 'Saint Petersburg',
             'Sergiev Posad', 'Tula', 'Yaroslavl'])

scheduler = BackgroundScheduler()
@scheduler.scheduled_job('cron', hour=10, misfire_grace_time=30)
def update_data():
   for city in all_cities:
       update_best_breweries(city)
scheduler.start()

Table from DataFrame

get_top_russian_breweries_table(venue_city, checkins_n=250)  will accept venue_city and checkins_n generating a Bootstrap Table with the top breweries. The second parameter value, checkins_n can be changed with the slider. If the city name is not specified, the function will return top Russian breweries table.

if venue_city == None: 
      selected_df = get_top_russian_breweries(checkins_n)
else: 
      en_city = venue_city

In other case the DataFrame will be constructed from a CSV file stored in data/cities/. Since the city column still may contain different names we should apply mapping and use a lambda expression with the map() method. The lambda function will compare values in the column against keys in city_names and if there is a match, the column value will be overwritten.
For instance, if df[‘CITY’] contains “СПБ”, a frequent acronym for Saint Petersburg, the value will be replaced, while for “Воронеж” it will remain unchanged.
And last but not least, we need to remove all duplicate rows from the table, add a column with a ranking position and return the first 10 rows. These would be the most liked breweries in a selected city.

df = pd.read_csv(f'data/cities/{en_city}.csv')     
df = df.loc[df['CHECK-INS'] >= checkins_n]
df.drop_duplicates(subset=['NAME', 'CITY'], keep='first', inplace=True)  
df.insert(0, 'RANKING', list('🏆 ' + str(i) if i in [1, 2, 3] else str(i) for i in range(1, len(df) + 1)))
selected_df = df.head(10)

After all DataFrame manipulations, the function returns a simply styled Bootstrap table of top breweries.


Bootstrap table layout in DBC

table = dbc.Table.from_dataframe(selected_df, striped=False,
                                bordered=False, hover=True,
                                size='sm',
                                style={'background-color': '#ffffff',
                                       'font-family': 'Proxima Nova Regular',
                                       'text-align':'center',
                                       'fontSize': '12px'},
                                className='table borderless'
                                )

return table

Layout structure

Add a Slider and a Dropdown menu with city names in application.py

To learn more about the Dashboard layout structure, please refer to our previous guide

checkins_slider_tab_1 = dbc.CardBody(
                           dbc.FormGroup(
                               [
                                   html.H6('Number of check-ins', style={'text-align': 'center'})),
                                   dcc.Slider(
                                       id='checkin_n_tab_1',
                                       min=0,
                                       max=250,
                                       step=25,
                                       value=250,  
                                       loading_state={'is_loading': True},
                                       marks={i: i for i in list(range(0, 251, 25))}
                                   ),
                               ],
                           ),
                           style={'max-height': '80px', 
                                  'padding-top': '25px'
                                  }
                       )

top_breweries = dbc.Card(
       [
           dbc.CardBody(
               [
                   dbc.FormGroup(
                       [
                           html.H6('Filter by city', style={'text-align': 'center'}),
                           dcc.Dropdown(
                               id='city_menu',
                               options=[{'label': i, 'value': i} for i in all_cities],
                               multi=False,
                               placeholder='Select city',
                               style={'font-family': 'Proxima Nova Regular'}
                           ),
                       ],
                   ),
                   html.P(id="tab-1-content", className="card-text"),
               ],
           ),
   ],
)

We’ll also need to add a callback function to update the table by dropdown menu and slider values:

@app.callback(
   Output("tab-1-content", "children"), [Input("city_menu", "value"),
                                         Input("checkin_n_tab_1", "value")]
)
def table_content(city, checkin_n):
   return get_top_russian_breweries_table(city, checkin_n)

Tada, the main table is ready! The dashboard can be used to receive up-to-date info about best Russian breweries, beers, and its rating across different regions, and help to make a better choice for an enjoyable tasting experience.

View the code on GitHub

 No comments    243   9 mon   BI-tools   bootstrap   dash   plotly   python

VIsualizing COVID-19 in Russia with Plotly

Estimated read time – 9 min

Maps are widely used in data visualization, it’s a great tool to display statistics for certain areas, regions, and cities. Before displaying the map we need to encode each region or any other administrative unit. Choropleth map gets divided into polygons and multipolygons with latitude and longitude coordinates. Plotly has a built-in solution for plotting choropleth map for America and Europe regions, however, Russia is not included yet. So we decided to use an existing GeoJSON file to map administrative regions of Russia and display the latest COVID-19 stats with Plotly.

from urllib.request import urlopen
import json
import requests
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import plotly.graph_objects as go

Modifying GeoJSON

First, we need to download a public GeoJSON file with the boundaries for the Federal subjects of Russia. The file already contains some information, such as region names, but it’s still doesn’t fit the required format and missing region identifiers.

with urlopen('https://raw.githubusercontent.com/codeforamerica/click_that_hood/master/public/data/russia.geojson') as response:
    counties = json.load(response)

Besides that, there are slight differences in the namings. For example, Bashkortostan on стопкоронавирус.рф, the site we are going to scrape data from, it’s listed as “The Republic of Bashkortostan”, while in our GeoJSON file it’s simply named “Bashkortostan”. These differences should be eliminated to avoid possible confusion. Also, the names should start with a capital.

regions_republic_1 = ['Бурятия', 'Тыва', 'Адыгея', 'Татарстан', 'Марий Эл',
                      'Чувашия', 'Северная Осетия – Алания', 'Алтай',
                      'Дагестан', 'Ингушетия', 'Башкортостан']
regions_republic_2 = ['Удмуртская республика', 'Кабардино-Балкарская республика',
                      'Карачаево-Черкесская республика', 'Чеченская республика']
for k in range(len(counties['features'])):
    counties['features'][k]['id'] = k
    if counties['features'][k]['properties']['name'] in regions_republic_1:
        counties['features'][k]['properties']['name'] = 'Республика ' + counties['features'][k]['properties']['name']
    elif counties['features'][k]['properties']['name'] == 'Ханты-Мансийский автономный округ - Югра':
        counties['features'][k]['properties']['name'] = 'Ханты-Мансийский АО'
    elif counties['features'][k]['properties']['name'] in regions_republic_2:
        counties['features'][k]['properties']['name'] = counties['features'][k]['properties']['name'].title()

It’s time to create a DataFrame from the resulting GeoJSON file with the regions of Russia, we’ll take the identifiers and names.

region_id_list = []
regions_list = []
for k in range(len(counties['features'])):
    region_id_list.append(counties['features'][k]['id'])
    regions_list.append(counties['features'][k]['properties']['name'])
df_regions = pd.DataFrame()
df_regions['region_id'] = region_id_list
df_regions['region_name'] = regions_list

As a result, our DataFrame looks like the following:

Data Scraping

We need to scrape the data stored in this table:

Let’s use the Selenium library for this task. We need to navigate to the webpage and convert it into a BeautifulSoup object

driver = webdriver.Chrome()
driver.get('https://стопкоронавирус.рф/information/')
source_data = driver.page_source
soup = bs(source_data, 'lxml')

The region names are wrapped with <th> tags, while the latest data is stored in table cells, each one is defined with a <td> tag.

divs_data = soup.find_all('td')

The divs_data list should return something like this:

The data is grouped in one line, this includes both new cases and active ones. It is noticeable that each region corresponds to five values, for Moscow these are the first five, for Moscow Region the next five and so on. We can use this pattern to create five lists and populate with values according to the index. The first value will be appended to the list with active cases, the second value to the list of new ones, etc. After every five values, the index will be reset to zero.

count = 1
for td in divs_data:
    if count == 1:
        sick_list.append(int(td.text))
    elif count == 2:
        new_list.append(int(td.text))
    elif count == 3:
        cases_list.append(int(td.text))
    elif count == 4:
        healed_list.append(int(td.text))
    elif count == 5:
        died_list.append(int(td.text))
        count = 0
    count += 1

The next step is to extract the region names from the table, they are stored within the col-region class. We also need to clean up the data by eliminating extra white spaces and line breaks.

divs_region_names = soup.find_all('th', {'class':'col-region'})
region_names_list = []
for i in range(1, len(divs_region_names)):
    region_name = divs_region_names[i].text
    region_name = region_name.replace('\n', '').replace('  ', '')
    region_names_list.append(region_name)

Create a DataFrame:

df = pd.DataFrame()
df['region_name'] = region_names_list
df['sick'] = sick_list
df['new'] = new_list
df['cases'] = cases_list
df['healed'] = healed_list
df['died'] = died_list

After reviewing our data once again we detected white space under the index 10. This should be fixed immediately, otherwise, we may run into problems.

df.loc[10, 'region_name'] = df[df.region_name == 'Челябинская область '].region_name.item().strip(' ')

Finally, we can merge our DataFrame on the region_name column, so that the resulted table will include a column with region id, which is required to make a choropleth map.

df = df.merge(df_regions, on='region_name')

Creating a choropleth map with Plotly

Let’s create a new figure and pass a choroplethmapbox object to it. The geojson parameter will accept the counties variable with the GeoJSON file, assign the region_id to locations. The z parameter represents the data to be color-coded, in this example we’re passing the number of new cases for each region. Assign the region names to text. The colorscale parameter accepts lists with values ranging from 0 to 1 and RGB color codes. Here, the palette changes from green to yellow and then red, depending on the number of active cases. By passing the values stored in customdata we can change our hovertemplate.

fig = go.Figure(go.Choroplethmapbox(geojson=counties,
                           locations=df['region_id'],
                           z=df['new'],
                           text=df['region_name'],
                           colorscale=[[0, 'rgb(34, 150, 79)'],
                                       [0.2, 'rgb(249, 247, 174)'],
                                       [0.8, 'rgb(253, 172, 99)'],
                                       [1, 'rgb(212, 50, 44)']],
                           colorbar_thickness=20,
                           customdata=np.stack([df['cases'], df['died'], df['sick'], df['healed']], axis=-1),
                           hovertemplate='<b>%{text}</b>'+ '<br>' +
                                         'New cases: %{z}' + '<br>' +
                                         'Active cases: %{customdata[0]}' + '<br>' +
                                         'Deaths: %{customdata[1]}' + '<br>' +
                                         'Total cases: %{customdata[2]}' + '<br>' +
                                         'Recovered: %{customdata[3]}' +
                                         '<extra></extra>',
                           hoverinfo='text, z'))

Let’s customize the map, we will use a ready-to-go neutral template, called carto-positron. Set the parameters and display the map:
mapbox_zoom: responsible for zooming;
mapbox_center: centers the map;
marker_line_width: border width (we removed the borders by setting this parameter to 0);
margin: usually accepts 0 values to make the map wider.

fig.update_layout(mapbox_style="carto-positron",
                  mapbox_zoom=1, mapbox_center = {"lat": 66, "lon": 94})
fig.update_traces(marker_line_width=0)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

And here is our map. According to the plot, we can say that the highest number of cases per day is happening in Moscow – 608 new cases. It’s really high compared to the other regions, and especially to Nenets Autonomous Okrug, where this number is surprisingly low.

View the code on GitHub

 No comments    1014   11 mon   dash   data analytics   plotly   python

Building a Plotly Dashboard with dynamic sliders in Python

Estimated read time – 2 min

Recently we discussed how to use Plotly and built a scatter plot to display the ratio between the number of reviews and the average rating for Russian Breweries registered on Untappd. Each marker on the plot has two properties, the registration period and the beer range. And today we are going to introduce you to Dash, a Python framework for building analytical web applications. First, create a new file name app.py with a get_scatter_plot(n_days, top_n) function from the previous article.

import dash
import dash_core_components as dcc
import dash_html_components as html
from get_plots import get_scatter_plot

After importing the necessary libraries we need to load CSS styles and initiate our web app:

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

Create a dashboard structure:

app.layout = html.Div(children=[
       html.Div([
           dcc.Graph(id='fig1'),
       ]) ,
       html.Div([
           html.H6('Time period (days)'),
           dcc.Slider(
               id='slider-day1',
               min=0,
               max=100,
               step=1,
               value=30,
               marks={i: str(i) for i in range(0, 100, 10)}
           ),
           html.H6('Number of breweries from the top'),
           dcc.Slider(
               id='slider-top1',
               min=0,
               max=500,
               step=50,
               value=500,
               marks={i: str(i) for i in range(0, 500, 50)})
       ])
])

Now we have a plot and two sliders, each with its id and parameters: minimum value, maximum value, step, and initial value. Since the sliders data will be displayed in the plot we need to create a callback. Output is the first argument that displays our plot, the following Input parameters accept values on which the plot depends.

@app.callback(
   dash.dependencies.Output('fig1', 'figure'),
   [dash.dependencies.Input('slider-day1', 'value'),
    dash.dependencies.Input('slider-top1', 'value')])
def output_fig(n_days, top_n):
    get_scatter_plot(n_days, top_n)

At the end of our script we will add the following line to run our code :

if __name__ == '__main__':
   app.run_server(debug=True)

Now, whenever the script is running our local IP address will be displayed in the terminal. Let’s open it in a web browser to view our interactive dashboard, it’s updated automatically when moving the sliders.

 No comments    487   1 y   dash   data analytics   plotly   python   untappd

Building a scatter plot for Untappd Breweries

Estimated read time – 7 min

Today we are going to build a scatter plot for Russian Breweries that would display the ratio between the number of reviews and their average ratings for the past 30 days. Data will be taken from check-ins left by Untappd users who rated beers. To make a plot we need markers with specified color and size. The color will depend on a brewery registration date, thus displaying it’s registration period on Untappd, while the size of a marker correlates with the range of beers represented. This article is the first part of our series dedicated to building dashboards with Plotly.

Writing a Clickhouse query

First, we need to process the data before using it in our dashboard. Here, we are using public data collected from Untappd. You can find more about this in our previous articles: Handling website buttons in Selenium and Example of using dictionaries in Clickhouse with Untappd.

from datetime import datetime, timedelta
from clickhouse_driver import Client
import plotly.graph_objects as go
import pandas as pd
import numpy as np
client = Client(host='ec1-2-34-567-89.us-east-2.compute.amazonaws.com', user='default', password='', port='9000', database='default')

Our scatter plot will depend on the  get_scatter_plot(n_days, top_n) function, which takes two arguments denoting a time span and a number of breweries to display. Let’s write a SQL query to calculate the Brewery Pure Average. It can be presented the following: multiply the beer rating by the total number of ratings and divide it by the number of brewery reviews. We will also pass a brewery name and its beer range to the query, these parameters can be fetched from our dictionary using the  dictGet function. We are only interested in those breweries that have Brewery Pure Average > 0 and the number of reviews > 100.

brewery_pure_average = client.execute(f"""
SELECT
       t1.brewery_id,
       sum(t1.beer_pure_average_mult_count / t2.count_for_that_brewery) AS brewery_pure_average,
       t2.count_for_that_brewery,
       dictGet('breweries', 'brewery_name', toUInt64(t1.brewery_id)),
       dictGet('breweries', 'beer_count', toUInt64(t1.brewery_id)),
       t3.stats_age_on_service / 365
   FROM
   (
       SELECT
           beer_id,
           brewery_id,
           sum(rating_score) AS beer_pure_average_mult_count
       FROM beer_reviews
       WHERE created_at >= today()-{n_days}
       GROUP BY
           beer_id,
           brewery_id
   ) AS t1
   ANY LEFT JOIN
   (
       SELECT
           brewery_id,
           count(rating_score) AS count_for_that_brewery
       FROM beer_reviews
       WHERE created_at >= today()-{n_days}
       GROUP BY brewery_id
   ) AS t2 ON t1.brewery_id = t2.brewery_id
   ANY LEFT JOIN
   (
       SELECT
           brewery_id,
           stats_age_on_service
       FROM brewery_info
   ) AS t3 ON t1.brewery_id = t3.brewery_id
   GROUP BY
       t1.brewery_id,
       t2.count_for_that_brewery,
       t3.stats_age_on_service
   HAVING t2.count_for_that_brewery >= 150
   ORDER BY brewery_pure_average
   LIMIT {top_n}
    """)

scatter_plot_df_with_age = pd.DataFrame(brewery_pure_average, columns=['brewery_id', 'brewery_pure_average', 'rating_count', 'brewery_name', 'beer_count'])

Working with a DataFrame

Add two dotted lines that will pass through the median values of each axis. That way we can find out which breweries are above average, the best ones will be in the upper right area.

dict_list = []
dict_list.append(dict(type="line",
                     line=dict(
                         color="#666666",
                         dash="dot"),
                     x0=0,
                     y0=np.median(scatter_plot_df_with_age.brewery_pure_average),
                     x1=7000,
                     y1=np.median(scatter_plot_df_with_age.brewery_pure_average),
                     line_width=1,
                     layer="below"))
dict_list.append(dict(type="line",
                     line=dict(
                         color="#666666",
                         dash="dot"),
                     x0=np.median(scatter_plot_df_with_age.rating_count),
                     y0=0,
                     x1=np.median(scatter_plot_df_with_age.rating_count),
                     y1=5,
                     line_width=1,
                     layer="below"))

Add annotations to display median values by hovering:

annotations_list = []
annotations_list.append(
    dict(
        x=8000,
        y=np.median(scatter_plot_df_with_age.brewery_pure_average) - 0.1,
        xref="x",
        yref="y",
        text=f"Median value: {round(np.median(scatter_plot_df_with_age.brewery_pure_average), 2)}",
        showarrow=False,
        font={
            'family':'Roboto, light',
            'color':'#666666',
            'size':12
        }
    )
)
annotations_list.append(
    dict(
        x=np.median(scatter_plot_df_with_age.rating_count) + 180,
        y=0.8,
        xref="x",
        yref="y",
        text=f"Median value: {round(np.median(scatter_plot_df_with_age.rating_count), 2)}",
        showarrow=False,
        font={
            'family':'Roboto, light',
            'color':'#666666',
            'size':12
        },
        textangle=-90
    )
)

Let’s make our plot more informative by splitting breweries into 4 groups according to the beer range. The first group will include breweries with less than 10 brands, the second group for those holding 10-30 brands, the third one for 30-50 brands, and the last one for large breweries with >50 brands. We stored marker sizes in the bucket_beer_count list.

bucket_beer_count = []
for beer_count in scatter_plot_df_with_age.beer_count:
   if beer_count < 10:
       bucket_beer_count.append(7)
   elif 10 <= beer_count <= 30:
       bucket_beer_count.append(9)
   elif 31 <= beer_count <= 50:
       bucket_beer_count.append(11)
   else:
       bucket_beer_count.append(13)
scatter_plot_df_with_age['bucket_beer_count'] = bucket_beer_count

Next step is to perform age-based splitting

bucket_age = []
for age in scatter_plot_df_with_age.age_on_service:
   if age < 4:
       bucket_age.append(0)
   elif 4 <= age <= 6:
       bucket_age.append(1)
   elif 6 < age < 8:
       bucket_age.append(2)
   else:
       bucket_age.append(3)
scatter_plot_df_with_age['bucket_age'] = bucket_age

Let’s divide our DataFrame into 4 parts to build separate scatter plots with its own color and size.

scatter_plot_df_0 = scatter_plot_df[scatter_plot_df.bucket == 0]
scatter_plot_df_1 = scatter_plot_df[scatter_plot_df.bucket == 1]
scatter_plot_df_2 = scatter_plot_df[scatter_plot_df.bucket == 2]
scatter_plot_df_3 = scatter_plot_df[scatter_plot_df.bucket == 3]

Plotting

Now we are ready to build the plot, add our 4 brewery groups one by one, setting its key parameters: name, marker color, annotation transparency and text.

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=scatter_plot_df_0.rating_count,
    y=scatter_plot_df_0.brewery_pure_average,
    name='< 4',
    mode='markers',
    opacity=0.85,
    text=scatter_plot_df_0.name_count,
    marker_color='rgb(114, 183, 178)',
    marker_size=scatter_plot_df_0.bucket_beer_count,
    textfont={"family":"Roboto, light",
              "color":"black"
             }
))

fig.add_trace(go.Scatter(
    x=scatter_plot_df_1.rating_count,
    y=scatter_plot_df_1.brewery_pure_average,
    name='4 – 6',
    mode='markers',
    opacity=0.85,
    marker_color='rgb(76, 120, 168)',
    text=scatter_plot_df_1.name_count,
    marker_size=scatter_plot_df_1.bucket_beer_count,
    textfont={"family":"Roboto, light",
              "color":"black"
             }
))

fig.add_trace(go.Scatter(
    x=scatter_plot_df_2.rating_count,
    y=scatter_plot_df_2.brewery_pure_average,
    name='6 – 8',
    mode='markers',
    opacity=0.85,
    marker_color='rgb(245, 133, 23)',
    text=scatter_plot_df_2.name_count,
    marker_size=scatter_plot_df_2.bucket_beer_count,
    textfont={"family":"Roboto, light",
              "color":"black"
             }
))

fig.add_trace(go.Scatter(
    x=scatter_plot_df_3.rating_count,
    y=scatter_plot_df_3.brewery_pure_average,
    name='8+',
    mode='markers',
    opacity=0.85,
    marker_color='rgb(228, 87, 86)',
    text=scatter_plot_df_3.name_count,
    marker_size=scatter_plot_df_3.bucket_beer_count,
    textfont={"family":"Roboto, light",
              "color":"black"
             }
))

fig.update_layout(
    title=f"The ratio between the number of reviews and the average brewery rating for the past <br> {n_days} days, top {top_n} breweries",
    font={
            'family':'Roboto, light',
            'color':'black',
            'size':14
        },
    plot_bgcolor='rgba(0,0,0,0)',
    yaxis_title="Average rating",
    xaxis_title="Number of reviews",
    legend_title_text='Registration period<br> on Untappd in years:',
    height=750,
    shapes=dict_list,
    annotations=annotations_list
)

Voila, the scatter plot is done! Each point is a separate brewery. The color shows the brewery beer range and when hovering we will see a summary including the average rating for the past 30 days, number of reviews, brewery name, and beer range. The dotted lines are passing through the median values we calculated with NumPy, they’re showing us the best breweries in the upper right. In our next article, we are going to create a breweries dashboard with dynamic parameters.

 No comments    379   1 y   dash   plotly   python   untappd
Earlier Ctrl + ↓