LEFT JOIN: blog on analytics, visualisation & data science

Data normalization with SQL

Estimated read time – 5 min

According to GIGO (garbage in, garbage out) principle, errors in input data lead to erroneous analysis results. The results of our work directly depend on the quality of data preparation.

For instance, when we need to prepare data to use in ML algorithms (like k-NN, k-means, logistic regression, etc.), features in the original dataset may vary in scale like the age and height of a person. This may lead to the incorrect performance of the algorithm. Thus, such data needs to be rescaled first.

In this tutorial, we will consider the ways to scale the data using SQL query: min-max normalization, min-max normalization for an arbitrary range, and z-score normalization. For each of these methods we have prepared two SQL query options: one using a SELECT subquery and another using a window function OVER().

We will work with the simple table students that contains the data on the height of the students:

name height
Ivan 174
Peter 181
Dan 199
Kate 158
Mike 179
Silvia 165
Giulia 152
Robert 188
Steven 177
Sophia 165

Min-max rescaling

Min-max scaling approach scales the data using the fixed range from 0 to 1. In this case, all the data is on the same scale which will exclude the impact of outliers on the conclusions.

The formula for the min-max scaling is given as:

We multiply the numerator by 1.0 in order to get a floating point number at the end.

SQL-query with a subquery:

SELECT height, 
       1.0 * (height-t1.min_height)/(t1.max_height - t1.min_height) AS scaled_minmax
  FROM students, 
      (SELECT min(height) as min_height, 
              max(height) as max_height 
         FROM students
      ) as t1;

SQL-query with a window function:

SELECT height, 
       (height - MIN(height) OVER ()) * 1.0 / (MAX(height) OVER () - MIN(height) OVER ()) AS scaled_minmax
  FROM students;

As a result, we get the values in [0, 1] range where 0 is the height of the shortest student and 1 is the height of the tallest one.

name height scaled_minmax
Ivan 174 0.46809
Peter 181 0.61702
Dan 199 1
Kate 158 0.12766
Mike 179 0.57447
Silvia 165 0.2766
Giulia 152 0
Robert 188 0.76596
Steven 177 0.53191
Sophia 165 0.2766

Rescaling within a given range

This is an option of min-max normalization between an arbitrary set of values. When it comes to data scaling, the values do not always need to be in the range between 0 and 1. In these cases, the following formula is applied.

This allows us to scale the data to an arbitrary scale. In our example, let’s assume a=10.0 and b=20.0.

SQL-query with subquery:

SELECT height, 
       ((height - min_height) * (20.0 - 10.0) / (max_height - min_height)) + 10 AS scaled_ab
  FROM students,
      (SELECT MAX(height) as max_height, 
              MIN(height) as min_height
         FROM students  
      ) t1;

SQL-query with a window function:

SELECT height, 
       ((height - MIN(height) OVER() ) * (20.0 - 10.0) / (MAX(height) OVER() - MIN(height) OVER())) + 10.0 AS scaled_ab
  FROM students;

We receive similar results as before, but with data spread between 10 and 20.

name height scaled_ab
Ivan 174 14.68085
Peter 181 16.17021
Dan 199 20
Kate 158 11.2766
Mike 179 15.74468
Silvia 165 12.76596
Giulia 152 10
Robert 188 17.65957
Steven 177 15.31915
Sophia 165 12.76596

Z-score normalization

Using Z-score normalization, the data will be scaled so that it has the properties of a standard normal distribution where the mean (μ) is equal to 0 and the standard deviation (σ) to 1.

Z-score is calculated using the formula:

SQL-query with a subquery:

SELECT height, 
       (height - t1.mean) * 1.0 / t1.sigma AS zscore
  FROM students,
      (SELECT AVG(height) AS mean, 
              STDDEV(height) AS sigma
         FROM students
        ) t1;

SQL-query with a window function:

SELECT height, 
       (height - AVG(height) OVER()) * 1.0 / STDDEV(height) OVER() AS z-score
  FROM students;

As a result, we can easily notice the outliers that exceed the standard deviation.

name height zscore
Ivan 174 0.01488
Peter 181 0.53582
Dan 199 1.87538
Kate 158 -1.17583
Mike 179 0.38698
Silvia 165 -0.65489
Giulia 152 -1.62235
Robert 188 1.05676
Steven 177 0.23814
Sophia 165 -0.65489
 No comments    103   2 mon   Analytics engineering   sql

SAP Analytics Cloud Dashboard Overview

Estimated read time – 8 min

Our last guide on BI tools was dedicated to QlikSense and today we will have a look at SAP Analytics Cloud. The dashboard has been prepared for us by the head BI consultant at SAPRUN Alexey Salynin.

In this overview, we will touch the following topics: creating a new data source in SAP, adding filters, making a Tree map,
setting up a preview on mobile devices, working with the Smart Insight assistant, multi language feature.

This dashboard has the same structure as our previous dashboards. At the top, we can see the KPI cards and a date filter. Below there is a heatmap that shows top provinces and a chart that shows profit dynamics on the right. In the second part, we can see the products and customers analysis.

Creating a new data source in SAP

To add a data source in SAP, we can either work with models or datasets. Models allow us to work with several data sources at the same time and carry out planning in order to solve complex problems. In our case, it’s a simple Excel file, so we used a normal dataset. As we have chosen only the orders sheet. Here is how the dataset looks:

Some of the columns were created, such as products hierarchy, geo hierarchy and some additional calculations.

Adding a filter

There are a number of ways to add a filter starting from default filtration methods to input fields to a timeline. If you click the edit button, you can see the filter format.

A tree map shows the top provinces by sales. Each province contains information on variation similarly to the KPI cards. The map allows filtering by provinces.

On the right of the tree map there is a profit dynamics chart. The chart has a dynamic axis which allows choosing the time period in time series charts. Time series graphs in SAP can show a forecast. The forecast is done automatically using built-in algorithms. In our case, triple exponential smoothing was chosen. When hovering over the forecasted point, we can see upper and lower confidence bounds.

In the next part of the dashboard we can see the products and customers analysis section.

Here you can see that products are displayed in a list without any hierarchy. To activate the hierarchy we need to change the dimension properties and set the hierarchy.

By default, the first level is displayed. We can change the level and see the products by category. Moreover, the bars are filters for the table on the right.

Smart Insight

One of the interesting features in SAP is smart insight. SAP automatically analyses the value and builds charts to provide more insights. For example, if I want to see what contributed to high level of furniture sales, I can click on smart insights and see the following results:

The feature is very handy as the insights are not simple graphs, but they are connected to each other and show a story. The insights allow us to interact with the data and explore it deeper. We can later add these insights into our visualization. This feature can also be used on a new dataset and empty dashboard to explore the data without building anything.

The whole dashboard was built in a responsive view mode so it will be automatically adapted to different screens. Some visualizations like tables are not adapted for the mobile screen, so we can hide them for mobile screens.

We can save the dashboard in the catalog and add it to favourites. Unfortunately, the functionality of the system doesn’t allow posting dashboards on public sources. But we hope that with new releases the feature will appear.

Last but not least, the dashboard has been created in Russian and when we needed to translate it into English SAP multi language feature came very handy. We can choose the target language and translate all the text. Then the dashboard will be displayed in a language based on the user’s location.

We have evaluated the dashboard together with Alexey (his scores in brackets) and received the following results:
Meets the tasks – 9.8 (10)
Learning curve  – 6.5 (7)
Tool functionality – 9.8 (9)
Ease of use – 8.2 (8)
Compliance with the layout – 9.3 (10)
Visual evaluation – 8.6 (9)

Overall: 8.71 out of 10. Have a look at the final results:

 No comments    71   2 mon   BI-tools   SAP

QlikSense Dashboard Overview

Estimated read time – 4 min

Our last overview on BI instruments was dedicated to Redash, but today we will deal with the dashboard in QlikSense using the SuperStore Sales dataset. The dashboard was built for us by Alexey Grinenko, a head of BI section, Business Solutions Development Department at Softline.

Data Sources

To load data in QlikSense we need to use load scripts. This is how the script looks:

Basically, the script lists the columns that we need to import and the link to a file. [Calendar_Order] script makes a custom calendar as all our data is before 2012 and the default calendar creates measures such as *current month* or *previous month* which we don’t need.

Dashboard Structure

As you can see, the dashboard has the same structure as in Tableau with some peculiarities of QlikSense.

At the top, there is a line with KPI cards. The cards compare KPI values in the chosen year with the values in the previous one and show the change in percentage. If the change is positive, the percentage is displayed in green, if the change is negative, values are red.

Then you can see a treemap which shows provinces by profit. The color shows profit distribution in the provinces. As all charts in QlikSense, the treemap is clickable and we can choose several provinces and see only the data related to them. On the right of the treemap, you can see area graphs that show profits by month in 4 different years.

At the top right corner of the dashboard, we have a drop down menu where we can choose a month and a year. As QlikSense doesn’t have this feature as default, we had to customly create such a calendar in a load script. First, we loaded the Month and the Year columns and created a third column where we concatenated Month and Year with a dash. This measure will later help us in switching the periods.

In the next part of the dashboard, we can see the analysis of products and customers. On the right, there is a bar chart that shows profit and sales by categories. On the left, the bar chart displays the Top 15 products by profit. In the settings, we can change this number and display fewer products if we wish.

Last but not least, we have a bubble chart which shows the distribution of customers by profit and sales. Each bubble represents a customer and the size of the bubble shows the number of orders. The color indicates the size of the discount.

Conclusions

Together with Alexey (his scores are in brackets) we have evaluated the dashboard on 10 points scale (10 being the highest score) and received the following results:

  1. Meets the tasks – 10
  2. Learning curve  – 6.9 (8)
  3. Tool functionality – 9.0 (7)
  4. Ease of use – 7.3 (8)
  5. Compliance with the layout – 9.8 (9)
  6. Visual evaluation – 7.5 (10)

Overall: 8.4 out of 10.

 No comments    64   2 mon   BI-tools   dashboard   QlikSense

Redash Dashboard Overview

Estimated read time – 7 min

Redash is an open-source tool that is available in two versions: self-hosted and cloud. If you use the first one, Redash is free of charge, but for a cloud option you’ll have to pay. In both versions you can connect to a numerous amount of databases (including Clickhouse) or other sources like Google Sheets by API.

Redash is a SQL query editor that allows building visualizations. To make a dashboard, we first need to run a SQL query and then build a visualization. After that, we can add the queries with their visualizations to the dashboard. The process makes data investigations easy and simple.

Data preparation

To start working with data we need to click settings and create a new data source.

When working with Superstore dataset we could not directly connect a .xlsx file as redash works only with databases. Thus, we uploaded our .xlsx into a MySQL database.

Building reports and visualizations

In the beginning of the dashboard, you can see the filters. This is the only way to apply filters to a dashboard in Redash and to choose parameters like province interactively.

First of all, we added KPI cards to the dashboard. The functionality of Redash is extremely limited in terms of visualizations, so the only way to build a KPI card was by using so-called counters. A counter is a type of visualization that allows displaying a current and a target value, however, in our case we used the previous year value instead of the target.

For the KPI cards, we used the query below.

It’s a simple query that returns the sum of the profit in current month and the sum of the profit in the previous one. In each query we need to define Year and Month and Province in the WHERE statement, so that all our visualizations are filtered based on the chosen year, month and province. The results are two numbers, curr and prev. Then we click on New Visualization, choose counter as a type and assign curr to the counter value and prev to the target value. We can also change format by adding a dollar sign prefix.

In our previous reviews of BI-tools, we displayed all KPI cards in a single line. In Redash, however, the numbers were too small when displayed in a single line and unnecessary information like the query name and the last update time were cluttering the view. This is another disadvantage of Redash, so we had to make bigger cards and display them in two lines.

To display top performing provinces we used word clouds. The query returned the sum of sales by provinces. Then the sum of sales was used as a frequency column to define the size of the province names.

For Profit Dynamics visualization we used a simple line graph. The query below returned a table with total profits for each month as well as two additional columns that display profit in the current month and the previous one.

select date_format(orders.OrderDate, '%Y-%m-01') as month, sum(orders.Profit) as profit, curr.curr_profit, prev.prev_profit
from orders
left join (
    select date_format(OrderDate, '%Y-%m-01') as month, sum(Profit) as curr_profit
    from orders
    where MONTH(OrderDate)=MONTH('{{Year and Month}}') and YEAR(OrderDate)=YEAR('{{Year and Month}}')
        and ('{{Province}}'='0. All' or Province = '{{Province}}')
    group by 1
) curr on curr.month=date_format(orders.OrderDate, '%Y-%m-01')
left join (
    select date_format(OrderDate, '%Y-%m-01') as month, sum(Profit) as prev_profit
    from orders
    where MONTH(OrderDate)=MONTH('{{Year and Month}}') and YEAR(OrderDate)=YEAR('{{Year and Month}}')-1
        and ('{{Province}}'='0. All' or Province = '{{Province}}')
    group by 1
) prev on prev.month=date_format(orders.OrderDate, '%Y-%m-01')
where ('{{Province}}'='0. All' or Province = '{{Province}}')
group by 1,3,4
order by 1

We then used the a line graph to display the profit column and a scatter plot to display curr_profit and prev_profit columns as both of them had only one observation.

Profit and Sales by Category visualization shows a SQL query table that returns profits and sales by category and sub-category of products.

Last but not least, we have pivot tables for top products and top customers by profit. Pivot tables in Redash allow grouping elements by using aggregate functions. In our case we grouped products by profit. I do not recommend using this feature for a large amount of data as if you change the aggregations on the fly in the browser, the browser might slow down and even crash.

Conclusions

You can find the final dashboard here.
Our team has evaluated the dashboard and the following scores on 1-10 scale 10 being the highest were given:

  1. Meets the tasks – 7.3
  2. Learning curve  – 7.5
  3. Tool functionality – 5.5
  4. Ease of use – 7.5
  5. Compliance with the layout – 6.0
  6. Visual evaluation – 5.2

Overall: 6.5 out of 10.

 No comments    72   2 mon   BI-tools   redash

PowerBI Dashboard Overview

Estimated read time – 5 min

We continue the series of materials on BI-systems and today we will have a look at the dashboard prepared in PowerBI using the SuperStore Sales dataset. We will cover how to connect the data to the system, set custom colors for visualizations and create new measures, implement switching between charts using bookmarks and we will discuss the challenges that we faced when building the dashboard.

This is the how the final dashboard looks like:

The most notable feature of the dashboard is data cards that show the company’s KPI. The cards compare the parameters to the same period in the previous year and show the previous year’s dynamics in the background.

Below we can see the chart that shows top-performing provinces. The bluer the rectangle the more profitable the province, the more orange the rectangle the more losses the province sustains. The size of the rectangle corresponds to the quantity of sales. We can click on rectangles to see more detailed information about profits and sales dynamics in the region on the graph on the left and their KPI at the top. On the graph, there are green and blue points that indicate the month of the current year and the previous year respectively. Hovering over these points, you can see a trend line.

The next part of the dashboard shows product and customer analysis. This part allows us to answer questions such as “which products were the most profitable or unprofitable” or “which customers contributed to most of the profits or most of the losses”.

Data collection

To connect the data we used an Excel file. PowerBI offers a number of sources to connect your data from such as Excel, csv, json files and various databases.

Configuring reports and visualizations

When building a dashboard in PowerBI we wanted to copy the color themes from Tableau. To do this, we have created a JSON file with the list of colors that we want to use. You can see the content of our file below. Then in the views tab, we clicked on the “browse for themes” button and uploaded our colors.

{
	"name":"Orange-Blue Diverging",
	"dataColors": [
		
		"#1c5998",
		"#1c73b1",
		"#3a87b7",
		"#67add4",
		"#7bc8e2",
		"#cacaca",
		"#fdab67",
		"#fd8938",
		"#f06511",
		"#d74401",
		"#a33202",
		"#7b3014",
		"#F07C28",
		"#2B5C8A",
		"#94C6E1",
		"#87d180",
	]
}

Then we have created a separate table called Calendar and populated it with all order dates. After that, we created a column with just a month and a year to create a filter based on it.

Creating necessary measures

When creating a dashboard with PowerBI we often need to create new measures. For the data cards, we created such measures as Total Profit, Total Sales, Total Orders, Total Clients and so on. The arrows that you can see in the data cards are also customized and a measure was created for each of them. To apply the color to arrows we formatted the color by rules and indicated red if the value is less than 0, green if the color is more than 0.

Adding bookmarks to switch between charts

To switch between charts, we added bookmarks for sales and profits. For the sales chart, the profits bookmark is hidden and vice versa. The button was downloaded from the internet and added to the respective bookmarks.

Interesting features and challenges we faced when building the dashboard

We have created custom data cards for KPI which are different from the default ones offered by PowerBI. The original features of cards include the background trend, the name and value while the arrows and changes are a custom feature. Another interesting feature that we used is cross filtration which allowed us to apply the filter to both the profits/sales chart and KPI cards.

One of the challenges that we have faced was the inability to build a bar chart with 2 categories. This feature was not implemented in PowerBI at the moment of writing this overview (maybe it is implemented now), so we had to create a table and add bar charts into it. Similarly, we inserted bar charts into the Top Customers table.

Conclusion

Our team has evaluated the dashboard and has given the following scores from 1-10 scale (10 being the highest) to this dashboard:

  1. Meets the tasks – 9.8
  2. Learning curve  – 3.0
  3. Tool functionality – 9.5
  4. Ease of use – 7.5
  5. Compliance with the layout – 9.5
  6. Visual evaluation – 8.8

Overall: 8.0 out of 10. Have a look at the final dashboard here.

 No comments    43   2 mon   analysis   BI   BI-tools   powerbi
Earlier Ctrl + ↓