LEFT JOIN: blog on analytics, visualisation & data science

QlikSense Dashboard Overview

Estimated read time – 4 min

Our last overview on BI instruments was dedicated to Redash, but today we will deal with the dashboard in QlikSense using the SuperStore Sales dataset. The dashboard was built for us by Alexey Grinenko, a head of BI section, Business Solutions Development Department at Softline.

Data Sources

To load data in QlikSense we need to use load scripts. This is how the script looks:

Basically, the script lists the columns that we need to import and the link to a file. [Calendar_Order] script makes a custom calendar as all our data is before 2012 and the default calendar creates measures such as *current month* or *previous month* which we don’t need.

Dashboard Structure

As you can see, the dashboard has the same structure as in Tableau with some peculiarities of QlikSense.

At the top, there is a line with KPI cards. The cards compare KPI values in the chosen year with the values in the previous one and show the change in percentage. If the change is positive, the percentage is displayed in green, if the change is negative, values are red.

Then you can see a treemap which shows provinces by profit. The color shows profit distribution in the provinces. As all charts in QlikSense, the treemap is clickable and we can choose several provinces and see only the data related to them. On the right of the treemap, you can see area graphs that show profits by month in 4 different years.

At the top right corner of the dashboard, we have a drop down menu where we can choose a month and a year. As QlikSense doesn’t have this feature as default, we had to customly create such a calendar in a load script. First, we loaded the Month and the Year columns and created a third column where we concatenated Month and Year with a dash. This measure will later help us in switching the periods.

In the next part of the dashboard, we can see the analysis of products and customers. On the right, there is a bar chart that shows profit and sales by categories. On the left, the bar chart displays the Top 15 products by profit. In the settings, we can change this number and display fewer products if we wish.

Last but not least, we have a bubble chart which shows the distribution of customers by profit and sales. Each bubble represents a customer and the size of the bubble shows the number of orders. The color indicates the size of the discount.

Conclusions

Together with Alexey (his scores are in brackets) we have evaluated the dashboard on 10 points scale (10 being the highest score) and received the following results:

  1. Meets the tasks – 10
  2. Learning curve  – 6.9 (8)
  3. Tool functionality – 9.0 (7)
  4. Ease of use – 7.3 (8)
  5. Compliance with the layout – 9.8 (9)
  6. Visual evaluation – 7.5 (10)

Overall: 8.4 out of 10.

 No comments    7   2 d   BI-tools   dashboard   QlikSense

Redash Dashboard Overview

Estimated read time – 7 min

Redash is an open-source tool that is available in two versions: self-hosted and cloud. If you use the first one, Redash is free of charge, but for a cloud option you’ll have to pay. In both versions you can connect to a numerous amount of databases (including Clickhouse) or other sources like Google Sheets by API.

Redash is a SQL query editor that allows building visualizations. To make a dashboard, we first need to run a SQL query and then build a visualization. After that, we can add the queries with their visualizations to the dashboard. The process makes data investigations easy and simple.

Data preparation

To start working with data we need to click settings and create a new data source.

When working with Superstore dataset we could not directly connect a .xlsx file as redash works only with databases. Thus, we uploaded our .xlsx into a MySQL database.

Building reports and visualizations

In the beginning of the dashboard, you can see the filters. This is the only way to apply filters to a dashboard in Redash and to choose parameters like province interactively.

First of all, we added KPI cards to the dashboard. The functionality of Redash is extremely limited in terms of visualizations, so the only way to build a KPI card was by using so-called counters. A counter is a type of visualization that allows displaying a current and a target value, however, in our case we used the previous year value instead of the target.

For the KPI cards, we used the query below.

It’s a simple query that returns the sum of the profit in current month and the sum of the profit in the previous one. In each query we need to define Year and Month and Province in the WHERE statement, so that all our visualizations are filtered based on the chosen year, month and province. The results are two numbers, curr and prev. Then we click on New Visualization, choose counter as a type and assign curr to the counter value and prev to the target value. We can also change format by adding a dollar sign prefix.

In our previous reviews of BI-tools, we displayed all KPI cards in a single line. In Redash, however, the numbers were too small when displayed in a single line and unnecessary information like the query name and the last update time were cluttering the view. This is another disadvantage of Redash, so we had to make bigger cards and display them in two lines.

To display top performing provinces we used word clouds. The query returned the sum of sales by provinces. Then the sum of sales was used as a frequency column to define the size of the province names.

For Profit Dynamics visualization we used a simple line graph. The query below returned a table with total profits for each month as well as two additional columns that display profit in the current month and the previous one.

select date_format(orders.OrderDate, '%Y-%m-01') as month, sum(orders.Profit) as profit, curr.curr_profit, prev.prev_profit
from orders
left join (
    select date_format(OrderDate, '%Y-%m-01') as month, sum(Profit) as curr_profit
    from orders
    where MONTH(OrderDate)=MONTH('{{Year and Month}}') and YEAR(OrderDate)=YEAR('{{Year and Month}}')
        and ('{{Province}}'='0. All' or Province = '{{Province}}')
    group by 1
) curr on curr.month=date_format(orders.OrderDate, '%Y-%m-01')
left join (
    select date_format(OrderDate, '%Y-%m-01') as month, sum(Profit) as prev_profit
    from orders
    where MONTH(OrderDate)=MONTH('{{Year and Month}}') and YEAR(OrderDate)=YEAR('{{Year and Month}}')-1
        and ('{{Province}}'='0. All' or Province = '{{Province}}')
    group by 1
) prev on prev.month=date_format(orders.OrderDate, '%Y-%m-01')
where ('{{Province}}'='0. All' or Province = '{{Province}}')
group by 1,3,4
order by 1

We then used the a line graph to display the profit column and a scatter plot to display curr_profit and prev_profit columns as both of them had only one observation.

Profit and Sales by Category visualization shows a SQL query table that returns profits and sales by category and sub-category of products.

Last but not least, we have pivot tables for top products and top customers by profit. Pivot tables in Redash allow grouping elements by using aggregate functions. In our case we grouped products by profit. I do not recommend using this feature for a large amount of data as if you change the aggregations on the fly in the browser, the browser might slow down and even crash.

Conclusions

You can find the final dashboard here.
Our team has evaluated the dashboard and the following scores on 1-10 scale 10 being the highest were given:

  1. Meets the tasks – 7.3
  2. Learning curve  – 7.5
  3. Tool functionality – 5.5
  4. Ease of use – 7.5
  5. Compliance with the layout – 6.0
  6. Visual evaluation – 5.2

Overall: 6.5 out of 10.

 No comments    9   6 d   BI-tools   redash

PowerBI Dashboard Overview

Estimated read time – 5 min

We continue the series of materials on BI-systems and today we will have a look at the dashboard prepared in PowerBI using the SuperStore Sales dataset. We will cover how to connect the data to the system, set custom colors for visualizations and create new measures, implement switching between charts using bookmarks and we will discuss the challenges that we faced when building the dashboard.

This is the how the final dashboard looks like:

The most notable feature of the dashboard is data cards that show the company’s KPI. The cards compare the parameters to the same period in the previous year and show the previous year’s dynamics in the background.

Below we can see the chart that shows top-performing provinces. The bluer the rectangle the more profitable the province, the more orange the rectangle the more losses the province sustains. The size of the rectangle corresponds to the quantity of sales. We can click on rectangles to see more detailed information about profits and sales dynamics in the region on the graph on the left and their KPI at the top. On the graph, there are green and blue points that indicate the month of the current year and the previous year respectively. Hovering over these points, you can see a trend line.

The next part of the dashboard shows product and customer analysis. This part allows us to answer questions such as “which products were the most profitable or unprofitable” or “which customers contributed to most of the profits or most of the losses”.

Data collection

To connect the data we used an Excel file. PowerBI offers a number of sources to connect your data from such as Excel, csv, json files and various databases.

Configuring reports and visualizations

When building a dashboard in PowerBI we wanted to copy the color themes from Tableau. To do this, we have created a JSON file with the list of colors that we want to use. You can see the content of our file below. Then in the views tab, we clicked on the “browse for themes” button and uploaded our colors.

{
	"name":"Orange-Blue Diverging",
	"dataColors": [
		
		"#1c5998",
		"#1c73b1",
		"#3a87b7",
		"#67add4",
		"#7bc8e2",
		"#cacaca",
		"#fdab67",
		"#fd8938",
		"#f06511",
		"#d74401",
		"#a33202",
		"#7b3014",
		"#F07C28",
		"#2B5C8A",
		"#94C6E1",
		"#87d180",
	]
}

Then we have created a separate table called Calendar and populated it with all order dates. After that, we created a column with just a month and a year to create a filter based on it.

Creating necessary measures

When creating a dashboard with PowerBI we often need to create new measures. For the data cards, we created such measures as Total Profit, Total Sales, Total Orders, Total Clients and so on. The arrows that you can see in the data cards are also customized and a measure was created for each of them. To apply the color to arrows we formatted the color by rules and indicated red if the value is less than 0, green if the color is more than 0.

Adding bookmarks to switch between charts

To switch between charts, we added bookmarks for sales and profits. For the sales chart, the profits bookmark is hidden and vice versa. The button was downloaded from the internet and added to the respective bookmarks.

Interesting features and challenges we faced when building the dashboard

We have created custom data cards for KPI which are different from the default ones offered by PowerBI. The original features of cards include the background trend, the name and value while the arrows and changes are a custom feature. Another interesting feature that we used is cross filtration which allowed us to apply the filter to both the profits/sales chart and KPI cards.

One of the challenges that we have faced was the inability to build a bar chart with 2 categories. This feature was not implemented in PowerBI at the moment of writing this overview (maybe it is implemented now), so we had to create a table and add bar charts into it. Similarly, we inserted bar charts into the Top Customers table.

Conclusion

Our team has evaluated the dashboard and has given the following scores from 1-10 scale (10 being the highest) to this dashboard:

  1. Meets the tasks – 9.8
  2. Learning curve  – 3.0
  3. Tool functionality – 9.5
  4. Ease of use – 7.5
  5. Compliance with the layout – 9.5
  6. Visual evaluation – 8.8

Overall: 8.0 out of 10. Have a look at the final dashboard here.

 No comments    8   8 d   analysis   BI   BI-tools   powerbi

Free math courses for analysts and data engineers

Estimated read time – 4 min

Nowadays, the Internet offers a huge number of paid courses that promise to make you a data analyst. Some of them are really great and you get a valuable set of skills upon completion. However, most of them don’t focus on fundamental math and programming skills that are crucial to make it in the field.

Some people believe that an analyst doesn’t need SQL or Python. Others argue that an analyst can solve problems without deep knowledge in math using only hard skills. In my opinion, that’s a big delusion. Apart from hard skills, a good data analyst should have a strong background in math and computer science. If you think that’s a big deal, read on as I have a solution for you.

In my opinion, it’s difficult to reflect on the probability of the outflow without understanding the probability theory. It’s difficult to discuss the mean and the normal distribution without understanding statistics. It’s impossible to grasp SVD without knowledge in linear algebra or to find a gradient without understanding calculus. Some people may argue that an analyst doesn’t need this. Tools like Python / R / Matlab allow building models without taking care of the math. In the beginning, this might even work. You can use a ready-made algorithm, add a couple of commands, and voila, you have built a regression model. But what do you do next? How do you change specific parameters of the model without understanding the math behind it?

Nowadays, the Internet provides us with an incredible opportunity to get an ivy league level education free of charge. A beginner data analyst should benefit from this opportunity before buying online data analysis courses. Just recently, I have completed university-level math and programming courses and I want to share them with you. Although I took advanced math 15 years ago at the university, it was still worth revising (we tend to forget a great deal in 15 years). An additional benefit of such courses is the development of the highly desired analytical thinking skills.

Here is the list of the free online courses from the eminent US universities that I want to share with you. These courses will definitely help you to start your learning journey in data analytics.

Calculus (M.I.T.)

This is an amazing set of courses both by content and interpretation offered by MIT in three parts:

  1. Differentiation
  2. Integration
  3. Coordinate systems and infinite series

Linear Algebra (Georgia Tech)

A course in four parts from one of the leading world universities in Computer Science: Georgia Tech.

  1. Linear equations
  2. Matrix algebra
  3. Determinants and eigenvalues
  4. Orthogonality, symmetric matrices and SVD

Probability theory and mathematical statistics (Georgia Tech)

A course in four parts from one of the leading world universities in Computer Science: Georgia Tech. (в русскоязычной версии тоже эта строчка полностью совпадает с курсом выше)

  1. A gentle introduction to probability
  2. Random variables
  3. A gentle introduction to statistics
  4. Confidence intervals and hypothesis tests

Calculations in Python (Harvard)

A course in 7 parts from a Harvard professor

  1. R basics
  2. Visualization
  3. Probability theory
  4. Inference and modeling
  5. Productivity tools
  6. Wrangling
  7. Linear regression
  8. Machine learning
  9. Capstone

Building frequency counts and bigrams using the posts of traders

Estimated read time – 9 min

Stocktwits is the largest social network for investors and traders of all levels which allows us to see what is happening in the financial markets. Today we will build a frequency dictionary and bigrams of the users’ posts and divide them by the number of followers. This will allow us to see the difference between the posts of different types of traders.

This is how the feed on the CCIV security looks at Stocktwits:

Some users have the status of officials:

Scraping the posts

Stocktwits has an API that allows getting 30 posts at a time. The API request returns a JSON file, so we will write a get_30_messages function that reads the JSON file and writes all the entries into the list called rows. The information about posts already contains the information about users, so we will not create separate tables and will save everything in one DataFrame. For this purpose, we will create a list with the names of columns and initiate an empty list called rows where we will append all the scraped posts.

Some posts don’t have a “likes” key in the JSON file which results in KeyError. To avoid the error, we will assign 0 to the “likes” in such posts.

cols = ['post_id', 'text', 'created_at', 'user_id', 'likes', 'sentiment', 'identity','followers', 'following', 'ideas', 'watchlist_stocks_count', 'like_count', 'plus_tier']
rows = []
 
def get_30_messages(data):
    for p in data['messages']:
        try:
            likes = p['likes']['total']
        except KeyError:
            likes = 0
        rows.append({'id': p['id'], 
                    'text': p['body'], 
                    'created_at': p['created_at'], 
                    'user_id': p['user']['id'], 
                    'likes': likes,
                    'sentiment': p['entities']['sentiment'], 
                    'symbol': symbol,
                    'identity': p['user']['identity'],
                    'followers': p['user']['followers'], 
                    'following': p['user']['following'], 
                    'ideas': p['user']['ideas'], 
                    'watchlist_stocks_count': p['user']['watchlist_stocks_count'], 
                    'like_count': p['user']['like_count'], 
                    'plus_tier': p['user']['like_count']
                    })

We will scrap the posts from the pages of 16 most trending securities.

symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']

As the API request returns only 30 most recent posts, to get older posts, we need to save the id of the last post into a dictionary and insert it as the max parameter during the next request. Unfortunately, the API allows us to make only 200 requests per hour, so in order to stay within the limits, we will run the for loop for each security only 11 times.

last_id_values = dict()
        
for symbol in symbols:
    file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json")
    data = json.loads(file.content)
    
    for i in range(10):
        get_30_messages(data)
            
        last_id = data['cursor']['max']
        last_id_values[symbol] = last_id
        
        file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id}")
        data = json.loads(file.content)
    
    get_30_messages(data)

Thus, we have collected only about 6000 posts, which is not enough for the analysis. That’s why, we will create a timer to run the same code after 1 hour and 5 minutes for 11 cycles.

def get_older_posts():
    for symbol in symbols:
        for i in range(12):
            file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id_values[symbol]}")
            data = json.loads(file.content)        
            get_30_messages(data)
 
            last_id = data['cursor']['max']
            last_id_values[symbol] = last_id
 
for i in range(11):
    time.sleep(3900)
    get_older_posts()

After all the data is collected, let’s create a DataFrame.

df = pd.DataFrame(rows, columns = cols)

The resulting table will look like this:

It is important to check that the post_id doesn’t have duplicate values. By looking at the number of unique values and the number of total values in posts_id we can notice that we have about 10000 duplicate values.

df.posts_id.nunique(), len(df.posts_id)

This happened because some posts get posted on multiple pages. So the last step will be dropping the duplicate values.

df.drop_duplicates(subset="posts_id", inplace=True)

Frequency counts and bigrams

First of all, let’s create a frequency count for posts without dividing them into groups.

df.text.str.split(expand=True).stack().value_counts()

We can see that articles, conjunctions, and prepositions prevail over the other words:

Thus, we need to remove them from the dataset. However, even if the dataset is cleaned, the results will look like this. Apart from the fact that 39 is the most frequent word, the data is not very informative and it’s difficult to make any conclusions based on it.

In this case, we will need to build bigrams. One bigram is a sequence of two elements, that is two words standing next to each other. There are many algorithms for building n-grams with different optimization levels. We will use a built-in function in nltk to create a bigram for one group. First, let’s import the additional libraries, download stop words for the English language, and clean the data. Then we will add more stop words including the names of the stock tickers that are used in every post.

import nltk
from nltk.corpus import stopwords
from string import punctuation
import unicodedata
import collections
import nltk
from nltk.stem import WordNetLemmatizer
 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

english_stopwords = stopwords.words("english")
symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']
symbols_lower = [sym.lower() for sym in symbols]
append_stopword = ['https', 'www', 'btc', 'x', 's', 't', 'p', 'amp', 'utm', 'm', 'gon', 'na', '’', '2021', '04', 'stocktwits', 'com', 'spx', 'ndx', 'gld', 'slv', 'es', 'f', '...', '--', 'cqginc', 'cqgthom', 'gt']
english_stopwords.extend(symbols_lower)
english_stopwords.extend(append_stopword)

Let’s define a function to prepare the text that will translate all the words to lowercase, bring them to the base form and remove stop words and punctuation.

wordnet_lemmatizer = WordNetLemmatizer()
 
def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens if token not in english_stopwords\
              and token != " " \
              and token.strip() not in punctuation]
    
    text = " ".join(tokens)
    
    return text
    
    df.text = df.text.apply(process_text)

For example, let’s take the group of the least popular users with less than 300 followers, build bigrams and output the most frequent ones.

non_pop_df = df[(df['followers'] < 300)]
 
non_pop_counts = collections.Counter()
for sent in non_pop_df.text:
    words = nltk.word_tokenize(sent)
    non_pop_counts.update(nltk.bigrams(words))
non_pop_counts.most_common()

Results of the bigrams study

Users with less than 300 followers mostly write about their personal plans on making money. This is shown by the collocations like short term, long term, and make money.
Less than 300 followers:
1. look like, 439
2. next week, 422
3. let 39, 364
4. capital gain, 306
5. long term, 274
6. let go, 261
7. stock market, 252
8. buy dip, 252
9. gain tax, 221
10. make money, 203
11. short term, 201
12. buy buy, 192

More popular users with 300 to 3000 followers discuss more abstract issues like sweep premium, stock price and artificial intelligence.
From 300 to 3000 followers:
1. sweep premium, 166
2. price target, 165
3. total day, 140
4. stock market, 139
5. ask premium, 132
6. stock price, 129
7. current stock, 117
8. money trade, 114
9. trade option, 114
10. activity alert, 113
11. trade volume, 113
12. artificial intelligence, 113

Popular users that have below 30000 followers discuss their observations as well as promote their accounts or articles.
From 3000 to 30000 followers:
1. unusual option, 632
2. print size, 613
3. option activity, 563
4. large print, 559
5. activity alerted, 355
6. observed unusual, 347
7. sweepcast observed, 343
8. |🎯 see, 311
9. see profile, 253
10. profile link, 241
11. call expiring, 235
12. new article, 226

Very popular traders with more than 30000 followers mostly act as information sources and post about changes at the stock market. This is indicated by the frequent up and down arrows and collocations like “stock x-day” or “moving average”.
Users with more than 30000 followers:
1. dow stock, 69
2. elliottwave trading, 53
3. ⇩ indexindicators.com, 51
4. ⇧ indexindicators.com, 50
5. u stock, 47
6. stock 5-day, 36
7. moving average, 29
8. stock moving, 28
9. stock x-day, 27
10. ⇧ 10-day, 26
11. stock daily, 25
12. daily rsi, 25

We have also built the bigrams of officials, but the results turned out to be very similar to the most popular users.

 No comments    19   15 d   data analytics   nltk
Earlier Ctrl + ↓