LEFT JOIN: blog on analytics, visualisation & data science

Tableau Dashboard Overview

Estimated read time – 7 min

In the previous article, we focused on the problem statement, designed a layout, shared our goal to build a Tableau Dashboard for Superstore dataset. The dashboard should provide insights on most profitable regions, products, customer segments and estimate key performance indicators (KPIs) over the past time.

The data in SuperStore Sales reflect sales and profit of the retail chain in Canada. It includes information about customer orders, refunds, sales and geodata. But we’re mostly interested in sales data, as our main goal is to create an executive dashboard to understand company’s operating margin, find most and least lucrative product categories, and customer segments.

So here’s how the dashboard looks like:

All dashboard elements are placed into containers, we can easily resize or change their hierarchy, this enables to optimize the dashboard and make it more mobile/tablet friendly. We can also filter the data by time periods and choose a specific month and year in the top right corner, and all the charts will be redrawn automatically.

The next field shows key factoids on the company performance: profit, sales, orders count, average discount, customers and sales per customer. Each of the indicators displays YOY, a statistical measure to evaluate a company’s financial progress over time. If the indicator shows positive change, a green arrow will be added, if negative – red.

Below are two core charts, displaying regions (colored based on profit) and profit dynamics. We can click on a specific one to view its stats in-depth.

The green dot on the right chart represents data for a selected month this year, while the blue dot displays the same month last year. When hovering these points you can see a trend line, that facilitates assessing how the company’s doing today.

Let’s move to the second part, here we placed company’s products and customers onto 3 charts. The first one, starting from the left, called bar in bar chart, where you can easily explore product efficiency. For instance, Tables is one of the most inefficient categories, with Breford CR4500 that resulted in significant losses.

Bar in bar chart implementation

Then goes the chart with company’s customers, by default they are sorted in descending order by profitability. The chart is linked with Top Performing Provinces, so if we want to discover best or worst customers for the selected province, the data will be redrawn automatically.

Dashboard Evaluation

We evaluated this dashboard according to the criteria below. On a scale of 1 – 10, 10 being the highest, it gets the following scores from our team :

  1. Meets the tasks – 10,0

  2. Learning curve  – 5,5

  3. Tool functionality – 9,0

  4. Ease of use – 8,5

  5. Compliance of the result – 10,0

  6. Visual evaluation – 9,7

This Tableau Dashboard scored 8.8 out of 10 from the team! In our perspective, the dashboard fully meets the requirements and facilitates understanding of business performance over a reporting period. We can assess profit dynamics in general or for the selected region, and effectively leverage products and customers data in measuring monetary results. The final version is available through this link.

Please let us know your thoughts in the comments down below, how would you rate this dashboard?

 No comments    122   4 mon   BI   BI-tools   guide   tableau

How to build Animated Charts like Hans Rosling in Plotly

Estimated read time – 12 min

Hans Rosling’s work on world countries economic growth presented in 2007 at TEDTalks can be attributed to one of the most iconic data visualizations, ever created. Just check out this video, in case you don’t know what we’re talking about:

Sometimes we want to compare standards of living in other countries. One way to do this is to refer to the Big Mac index, which the Economist magazine has kept track of since 1986. The key idea this index represents is to measure purchasing power parity (PPP) in different countries, considering costs of domestic production. To make a standard burger, one would need the following ingredients: cheese, meat, bread and vegetables. Considering that all these ingredients can be produced locally, we can compare the production cost of one Big Mac in different countries, and measure purchasing power. Plus, McDonald’s is the world’s most popular franchise network, its restaurants are almost everywhere around the globe.

In today’s material, we will build a Motion Chart for the Big Mac index using Plotly. Following Hann Rosling’s idea, the chart will display country population along the X-axis and GDP per capita in US dollars along the Y. The size of the dots is going to be proportional to the Big Mac Index for a given country. And the color of the dots will represent the continent where the country is located.

Preparing Data

Even though The Economist has been updating it for over 30 years and sharing its observations publicly, the dataset contains many missing values. It also lacks continents names, but we can handle it by supplementing the data with some more datasets that can be found in our repo.

Let’s start by importing the libraries:

import pandas as pd
from pandas.errors import ParserError
import plotly.graph_objects as go
import numpy as np
import requests
import io

We can access the dataset directly from GitHub. Just use the following function to send a GET request to a CSV file and create a Pandas DataFrame. However, in some cases, this may raise a  ParseError because of the caption title, so we will add a try block:

def read_raw_file(link):
    raw_csv = requests.get(link).content
        df = pd.read_csv(io.StringIO(raw_csv.decode('utf-8')))
    except ParserError:
        df = pd.read_csv(io.StringIO(raw_csv.decode('utf-8')), skiprows=3)
    return df

bigmac_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/big-mac.csv')
population_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/population.csv')
dgp_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/gdp.csv')
continents_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/continents.csv')

From The Economist dataset we will need these columns: country name, local price, dollar exchange rate, country code (iso_a3) and record date. Take the timeline from 2005 to 2020, as the records are most complete for this span. And divide the local price by the exchange rate to calculate the price of one Big Mac in US dollars.

bigmac_df = bigmac_df[['name', 'local_price', 'dollar_ex', 'iso_a3', 'date']]
bigmac_df = bigmac_df[bigmac_df['date'] >= '2005-01-01']
bigmac_df = bigmac_df[bigmac_df['date'] < '2020-01-01']
bigmac_df['date'] = pd.DatetimeIndex(bigmac_df['date']).year
bigmac_df = bigmac_df.drop_duplicates(['date', 'name'])
bigmac_df = bigmac_df.reset_index(drop=True)
bigmac_df['dollar_price'] = bigmac_df['local_price'] / bigmac_df['dollar_ex']

Take a look at the result:

Next, let’s try adding a new column called continents. To ease the task, leave only two columns containing country code and continent name. Then we need to iterate through the bigmac_df[‘iso_a3’] column, adding a continent name for the corresponding values. However some cases may raise an error, because it’s not really clear, whether a country belongs to Europe or Asia, we will consider such cases as Europe by default.

continents_df = continents_df[['Continent_Name', 'Three_Letter_Country_Code']]
continents_list = []
for country in bigmac_df['iso_a3']:
        continents_list.append(continents_df.loc[continents_df['Three_Letter_Country_Code'] == country]['Continent_Name'].item())
    except ValueError:
bigmac_df['continent'] = continents_list

Now we can drop unnecessary columns, apply sorting by country names and date, convert values in the date column into integers, and view the current result:

bigmac_df = bigmac_df.drop(['local_price', 'iso_a3', 'dollar_ex'], axis=1)
bigmac_df = bigmac_df.sort_values(by=['name', 'date'])
bigmac_df['date'] = bigmac_df['date'].astype(int)

Then we need to fill up missing values for The Big Mac index with zeros and remove the Republic of China, since this partially recognized state is not included in the World Bank datasets. The UAE occurs several times, this can lead to issues.

countries_list = list(bigmac_df['name'].unique())
years_set = {i for i in range(2005, 2020)}
for country in countries_list:
    if len(bigmac_df[bigmac_df['name'] == country]) < 15:
        this_continent = bigmac_df[bigmac_df['name'] == country].continent.iloc[0]
        years_of_country = set(bigmac_df[bigmac_df['name'] == country]['date'])
        diff = years_set - years_of_country
        dict_to_df = pd.DataFrame({
                      'name':[country] * len(diff),
                      'dollar_price':[0] * len(diff),
                      'continent': [this_continent] * len(diff)
        bigmac_df = bigmac_df.append(dict_to_df)
bigmac_df = bigmac_df[bigmac_df['name'] != 'Taiwan']
bigmac_df = bigmac_df[bigmac_df['name'] != 'United Arab Emirates']

Next, let’s augment the data with GDP per capita and population from other datasets. Both datasets have differences in country names, so we need to specify such cases explicitly and replace them.

years = [str(i) for i in range(2005, 2020)]

countries_replace_dict = {
    'Russian Federation': 'Russia',
    'Egypt, Arab Rep.': 'Egypt',
    'Hong Kong SAR, China': 'Hong Kong',
    'United Kingdom': 'Britain',
    'Korea, Rep.': 'South Korea',
    'United Arab Emirates': 'UAE',
    'Venezuela, RB': 'Venezuela'
for key, value in countries_replace_dict.items():
    population_df['Country Name'] = population_df['Country Name'].replace(key, value)
    gdp_df['Country Name'] = gdp_df['Country Name'].replace(key, value)

Finally, extract population data and GDP for the given years, adding the data to the bigmac_df DataFrame:

countries_list = list(bigmac_df['name'].unique())

population_list = []
gdp_list = []
for country in countries_list:
    population_for_country_df = population_df[population_df['Country Name'] == country][years]
    gdp_for_country_df = gdp_df[gdp_df['Country Name'] == country][years]
bigmac_df['population'] = population_list
bigmac_df['gdp'] = gdp_list
bigmac_df['gdp_per_capita'] = bigmac_df['gdp'] / bigmac_df['population']

And here is our final dataset:

Creating a chart in Plotly

The population in China or India, on average, is 10 times more than in other countries. That’s why we need to transform X-axis to Log Scale, to make the chart easier for interpreting. The log-transformation is a common way to address skewness in data.

fig_dict = {
    "data": [],
    "layout": {},
    "frames": []

fig_dict["layout"]["xaxis"] = {"title": "Population", "type": "log"}
fig_dict["layout"]["yaxis"] = {"title": "GDP per capita (in $)", "range":[-10000, 120000]}
fig_dict["layout"]["hovermode"] = "closest"
fig_dict["layout"]["updatemenus"] = [
        "buttons": [
                "args": [None, {"frame": {"duration": 500, "redraw": False},
                                "fromcurrent": True, "transition": {"duration": 300,
                                                                    "easing": "quadratic-in-out"}}],
                "label": "Play",
                "method": "animate"
                "args": [[None], {"frame": {"duration": 0, "redraw": False},
                                  "mode": "immediate",
                                  "transition": {"duration": 0}}],
                "label": "Pause",
                "method": "animate"
        "direction": "left",
        "pad": {"r": 10, "t": 87},
        "showactive": False,
        "type": "buttons",
        "x": 0.1,
        "xanchor": "right",
        "y": 0,
        "yanchor": "top"

We will also add a slider to filter data within a certain range:

sliders_dict = {
    "active": 0,
    "yanchor": "top",
    "xanchor": "left",
    "currentvalue": {
        "font": {"size": 20},
        "prefix": "Year: ",
        "visible": True,
        "xanchor": "right"
    "transition": {"duration": 300, "easing": "cubic-in-out"},
    "pad": {"b": 10, "t": 50},
    "len": 0.9,
    "x": 0.1,
    "y": 0,
    "steps": []

By default, the chart will display data for 2005 before we click on the “Play” button.

continents_list_from_df = list(bigmac_df['continent'].unique())
year = 2005
for continent in continents_list_from_df:
    dataset_by_year = bigmac_df[bigmac_df["date"] == year]
    dataset_by_year_and_cont = dataset_by_year[dataset_by_year["continent"] == continent]
    data_dict = {
        "x": dataset_by_year_and_cont["population"],
        "y": dataset_by_year_and_cont["gdp_per_capita"],
        "mode": "markers",
        "text": dataset_by_year_and_cont["name"],
        "marker": {
            "sizemode": "area",
            "sizeref": 200000,
            "size":  np.array(dataset_by_year_and_cont["dollar_price"]) * 20000000
        "name": continent,
        "customdata": np.array(dataset_by_year_and_cont["dollar_price"]).round(1),
        "hovertemplate": '<b>%{text}</b>' + '<br>' +
                         'GDP per capita: %{y}' + '<br>' +
                         'Population: %{x}' + '<br>' +
                         'Big Mac price: %{customdata}$' +

Next, we need to fill up the frames field, which will be used for animating the data. Each frame represents a certain data point from 2005 to 2019.

for year in years:
    frame = {"data": [], "name": str(year)}
    for continent in continents_list_from_df:
        dataset_by_year = bigmac_df[bigmac_df["date"] == int(year)]
        dataset_by_year_and_cont = dataset_by_year[dataset_by_year["continent"] == continent]

        data_dict = {
            "x": list(dataset_by_year_and_cont["population"]),
            "y": list(dataset_by_year_and_cont["gdp_per_capita"]),
            "mode": "markers",
            "text": list(dataset_by_year_and_cont["name"]),
            "marker": {
                "sizemode": "area",
                "sizeref": 200000,
                "size": np.array(dataset_by_year_and_cont["dollar_price"]) * 20000000
            "name": continent,
            "customdata": np.array(dataset_by_year_and_cont["dollar_price"]).round(1),
            "hovertemplate": '<b>%{text}</b>' + '<br>' +
                             'GDP per capita: %{y}' + '<br>' +
                             'Population: %{x}' + '<br>' +
                             'Big Mac price: %{customdata}$' +

    slider_step = {"args": [
        {"frame": {"duration": 300, "redraw": False},
         "mode": "immediate",
         "transition": {"duration": 300}}
        "label": year,
        "method": "animate"}

Just a few finishing touches left, instantiate the chart, set colors, fonts and title.

fig_dict["layout"]["sliders"] = [sliders_dict]

fig = go.Figure(fig_dict)

    title = 
        {'text':'<b>Motion chart</b><br><span style="color:#666666">The Big Mac index from 2005 to 2019</span>'},
        'family':'Open Sans, light',
fig.update_xaxes(tickfont=dict(family='Open Sans, light', color='black', size=12), nticks=4, gridcolor='lightgray', gridwidth=0.5)
fig.update_yaxes(tickfont=dict(family='Open Sans, light', color='black', size=12), nticks=4, gridcolor='lightgray', gridwidth=0.5)


Bingo! The Motion Chart is done:

View the code on GitHub

Animating sports data in Tableau

Estimated read time – 4 min

Previously we shared how to visualize your sports data from the SwingVision app in Tableau , using custom background and shapes. This time we are going to animate our dashboard to watch how landing locations of tennis shots changed over the match. Such an animation can be exported into a video file for later use. That’s what our result looked like in Tableau earlier:

The chart shows landing coordinates of tennis shots on the court. Forehand shots are marked in red, backhands are in orange, the x marks for shots went into the net. We can also use filtering and get expanded tooltip info on hover.
Tableau enables us to create pages to flip through members of a field, changing and animating the analysis. In this case, all we need to is simply drag-and-drop the Shots table to the Pages shelf and click on the Play button.

Let’s switch to the dashboard and try adding the Pages shelf, just click on Worksheet -> Show cards and apply to the current page.

Next, create a new vertical container, drag the panel and minimize the view:

Now after clicking on the Play button, the first part is done:

If you’re a macOS user, it won’t be a problem to make a video from this animation by pressing ⌘ + Shift + 5 and choosing a specific part of your screen. In other cases, you may need to download third-party software for screen recording.

 No comments    49   4 mon   animation   BI-tools   dashboard   tableau

Custom visualization of sports data in Tableau

Estimated read time – 8 min

Being a tennis fan, I recently discovered a new app created to help players to assess their game skills – SwingVision. The app can recognize tennis shots in real-time and display its landing coordinates. The author of this app is Swupnil Sahai, currently he is a Lecturer at UC Berkley.

My tennis stats, shown by the app

SwingVision also allows you to view your “rallies” and specific shots, assess the average shot speed and error rate. Moreover one can easily export its stats as an Excel Table.

Example of exported table

In today’s material, we are going to create a custom Tableau chart that would reproduce stats from SwingVision and display the landing location of my shots on the court. First, we need to find a suitable tennis court image (top view), like this one.

Next, we need to import the data stored as an Excel Table into Tableau, set values for both coordinates using the Shot Placement (x), and Shot Placement (y) columns, and remove the aggregation of measures to get something like this:

After filtering shots by player the chart somewhat resembles the upside-down version of the actual image:

To reverse the image, we need to change the values of current x and y from positive to negative by creating new measures, add some color and everything will start to line up:

The X marks on the chart represent all shots that hit the net, we can hide them from view and set a constant value for Y =- 11,89, which corresponds to the length of a half-court.
Then when we try adding the background image, however, this will cause a warning, because the image is not scaled properly:

This means that we need to calculate the ratio of our image to the real-size court. In our case, for instance, the image is 913px in width, while the court itself is 10.97 meters wide, so by calculating 913 over 10.97, the ratio for x will be 83.227.

The middle of the court will be considered as the origin (0, 0), and will divide the court vertically into halves of 456.5px.
Remember that the image itself has margins, both to the right and left that are equal to 143.3px each. Just create new measures for x and y, substituting with the following values:

After these steps, our image should be as follows:

As finishing touches, we set a custom icon for each point on the chart and add filtering options.

To sum up, the dashboard displays everything we need: landing location of shots, their speed, types of strokes and expanded tooltip info on hover:

 No comments    47   4 mon   BI-tools   dashboard   tableau

Calculating User Retention by 24-hour windows and calendar days in SQL

Estimated read time – 3 min

Yesterday I got a message from one of our blog readers, asking:

Let’s say today’s Monday, and an app was downloaded 187 times. If I want to find out what’s the Retention rate was on day 1, what day of the week should I start taking the count with?

He is referring to the blog post on Сalculating Retention Rate. I would like to clarify that since retention rate can be calculated by 24-hour windows, as well as by calendar days. In our case, day 0 will be Monday, and day 1 will be Tuesday. However, there is one little hiccup...

For example, if we started promoting our product on Monday, October 12 at 23:59, then all downloads of this day will have the retention rate of day 1. It’s a problem of performing calendar calculations. To address this, some data analysts calculate retention rate not only by calendar days but also by 24-hour windows.

Let’s apply this idea to the above case:

  • The Retention rate for day 0 can be calculated using the number of downloads from October 5, 23:59 to October 6, 23:59.
  • The Retention rate for day 1: from October 6, 23:59 to October 7, 23:59
  • And so on rolling 24-hour window.

How to calculate Retention Rate by 24-hour windows in SQL?

Let’s recall one of the queries from our previous post. It was written to calculate the difference between the download date and the user activity date. We need to change the query so that user activity is calculated by 24-hour windows. To accomplish this just change the calculation for datediff to 24-hour windows, updating the lines in bold.

SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
   floor((cast(cs.created_at as int)-cast(installed_at as int))/(24*3600)) as date_diff,
   ndv(user.id) AS ret_base
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND floor((cast(cs.created_at as int)-cast(installed_at as int))/(24*3600)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1,2

Updated query:

       cohort.date_diff AS day_difference,
       avg(reg.users) AS cohort_size,
       avg(cohort.ret_base) AS retention_base,
       avg(cohort.ret_base)/avg(reg.users)*100 AS retention_rate
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1) reg
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
    floor((cast(cs.created_at as int)-cast(installed_at as int))/(24*3600)) as date_diff,
          ndv(user.id) AS ret_base
   LEFT JOIN client_session cs ON cs.user_id=user.id
    WHERE 1=1
     AND floor((cast(cs.created_at as int)-cast(installed_at as int))/(24*3600)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1,2
  ) cohort ON reg.reg_date=cohort.reg_date
    GROUP BY 1        
    ORDER BY 1

Final output:

Compare it with the previous one:

And as you see, the retention rate calculated using 24-hour windows is slightly lower in the first days.

 No comments    102   4 mon   Q&A   redash   retention
Earlier Ctrl + ↓