
Redash Dashboard Overview

Estimated read time – 7 min

Redash is an open-source tool available in two versions: self-hosted and cloud. The self-hosted version is free of charge, while the cloud option is paid. Both versions can connect to numerous databases (including ClickHouse) as well as other sources like Google Sheets via API.

Redash is a SQL query editor that also allows building visualizations. To make a dashboard, we first run a SQL query and then build a visualization on top of it. After that, we can add the queries with their visualizations to the dashboard. This process makes data investigations simple and straightforward.

Data preparation

To start working with the data, we need to open the settings and create a new data source.

When working with the Superstore dataset, we could not connect the .xlsx file directly because Redash works only with databases. Thus, we uploaded our .xlsx file into a MySQL database.

Building reports and visualizations

At the top of the dashboard, you can see the filters. This is the only way to apply filters to a dashboard in Redash and to choose parameters, such as the province, interactively.

First of all, we added KPI cards to the dashboard. Redash is extremely limited in terms of visualizations, so the only way to build a KPI card was to use so-called counters. A counter is a visualization type that displays a current and a target value; in our case, we used the previous year's value instead of the target.

For the KPI cards, we used a query along the following lines (a sketch reconstructed from the description below, reusing the table, column, and parameter names from the Profit Dynamics query later in this post):
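select
    sum(case when YEAR(OrderDate) = YEAR('{{Year and Month}}')
              and MONTH(OrderDate) = MONTH('{{Year and Month}}')
             then Profit end) as curr,
    sum(case when YEAR(OrderDate) = YEAR('{{Year and Month}}') - 1
              and MONTH(OrderDate) = MONTH('{{Year and Month}}')
             then Profit end) as prev
from orders
where ('{{Province}}'='0. All' or Province = '{{Province}}')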

It’s a simple query that returns the sum of the profit for the selected month (curr) and for the same month of the previous year (prev). In each query we reference Year and Month and Province in the WHERE clause, so that all our visualizations are filtered by the chosen year, month, and province. The results are two numbers, curr and prev. We then click New Visualization, choose Counter as the type, and assign curr to the counter value and prev to the target value. We can also change the format by adding a dollar-sign prefix.

In our previous reviews of BI tools, we displayed all KPI cards in a single line. In Redash, however, the numbers were too small when displayed in a single line, and unnecessary information like the query name and the last update time was cluttering the view. This is another disadvantage of Redash, so we had to make bigger cards and display them in two lines.

To display the top-performing provinces, we used a word cloud. The query returned the sum of sales by province; the sum of sales was then used as the frequency column that defines the size of each province name. A minimal sketch of such a query (assuming the sales column in our MySQL table is called Sales):
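select Province as word, sum(Sales) as frequency
from orders
group by 1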

For the Profit Dynamics visualization, we used a simple line graph. The query below returns a table with the total profit for each month, as well as two additional columns with the profit for the selected month of the current year and of the previous one.

select date_format(orders.OrderDate, '%Y-%m-01') as month, sum(orders.Profit) as profit, curr.curr_profit, prev.prev_profit
from orders
left join (
    select date_format(OrderDate, '%Y-%m-01') as month, sum(Profit) as curr_profit
    from orders
    where MONTH(OrderDate)=MONTH('{{Year and Month}}') and YEAR(OrderDate)=YEAR('{{Year and Month}}')
        and ('{{Province}}'='0. All' or Province = '{{Province}}')
    group by 1
) curr on curr.month=date_format(orders.OrderDate, '%Y-%m-01')
left join (
    select date_format(OrderDate, '%Y-%m-01') as month, sum(Profit) as prev_profit
    from orders
    where MONTH(OrderDate)=MONTH('{{Year and Month}}') and YEAR(OrderDate)=YEAR('{{Year and Month}}')-1
        and ('{{Province}}'='0. All' or Province = '{{Province}}')
    group by 1
) prev on prev.month=date_format(orders.OrderDate, '%Y-%m-01')
where ('{{Province}}'='0. All' or Province = '{{Province}}')
group by 1,3,4
order by 1

We then used a line graph to display the profit column and a scatter plot to display the curr_profit and prev_profit columns, as each of them has only one observation.

The Profit and Sales by Category visualization is a table returned by a SQL query with profit and sales by product category and sub-category. A sketch of that query (the ProductCategory and ProductSubCategory column names are assumptions based on the Superstore dataset):
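select ProductCategory, ProductSubCategory,
       sum(Profit) as profit, sum(Sales) as sales
from orders
where ('{{Province}}'='0. All' or Province = '{{Province}}')
group by 1, 2
order by profit desc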

Last but not least, we have pivot tables for the top products and top customers by profit. Pivot tables in Redash group elements using aggregate functions; in our case, we grouped products by profit. I do not recommend using this feature on large amounts of data: the aggregations are recalculated on the fly in the browser, which can slow it down or even crash it.

Conclusions

You can find the final dashboard here.
Our team has evaluated the dashboard and given the following scores on a 1–10 scale (10 being the highest):

  1. Meets the tasks – 7.3
  2. Learning curve  – 7.5
  3. Tool functionality – 5.5
  4. Ease of use – 7.5
  5. Compliance with the layout – 6.0
  6. Visual evaluation – 5.2

Overall: 6.5 out of 10.


PowerBI Dashboard Overview

Estimated read time – 5 min

We continue the series of materials on BI systems, and today we will have a look at a dashboard prepared in PowerBI using the SuperStore Sales dataset. We will cover how to connect the data to the system, set custom colors for visualizations, create new measures, and implement switching between charts using bookmarks, and we will discuss the challenges that we faced when building the dashboard.

This is how the final dashboard looks:

The most notable feature of the dashboard is the data cards that show the company's KPIs. The cards compare each parameter to the same period of the previous year and show the previous year's dynamics in the background.

Below it we can see the chart that shows the top-performing provinces. The bluer a rectangle, the more profitable the province; the more orange it is, the more losses the province sustains. The size of a rectangle corresponds to the sales volume. We can click on a rectangle to see more detailed information about profit and sales dynamics in that region on the graph on the left and its KPIs at the top. On the graph, green and blue points indicate the months of the current year and the previous year respectively. Hovering over these points, you can see a trend line.

The next part of the dashboard shows product and customer analysis. This part allows us to answer questions such as “which products were the most profitable or unprofitable” or “which customers contributed to most of the profits or most of the losses”.

Data collection

To connect the data, we used an Excel file. PowerBI offers a number of sources to connect your data from, such as Excel, CSV, and JSON files, as well as various databases.

Configuring reports and visualizations

When building the dashboard in PowerBI, we wanted to copy the color theme from Tableau. To do this, we created a JSON file with the list of colors that we wanted to use; you can see its content below. Then, in the View tab, we clicked the “Browse for themes” button and uploaded our colors.

{
	"name": "Orange-Blue Diverging",
	"dataColors": [
		"#1c5998",
		"#1c73b1",
		"#3a87b7",
		"#67add4",
		"#7bc8e2",
		"#cacaca",
		"#fdab67",
		"#fd8938",
		"#f06511",
		"#d74401",
		"#a33202",
		"#7b3014",
		"#F07C28",
		"#2B5C8A",
		"#94C6E1",
		"#87d180"
	]
}

Then we created a separate table called Calendar and populated it with all order dates. After that, we added a column containing just the month and year, in order to build a filter on it. A minimal sketch of how this can be done in DAX (the orders table and OrderDate column names are assumptions about how the Excel data was imported):
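Calendar = CALENDAR(MIN(orders[OrderDate]), MAX(orders[OrderDate]))

MonthYear = FORMAT('Calendar'[Date], "MMM YYYY")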

Creating necessary measures

When creating a dashboard in PowerBI, we often need to create new measures. For the data cards, we created measures such as Total Profit, Total Sales, Total Orders, Total Clients, and so on. The arrows that you can see on the data cards are also customized, and a measure was created for each of them. To color the arrows, we formatted the color by rules: red if the value is less than 0, green if it is greater than 0. As an illustration, here are sketches of two such measures (the names and table references are ours, not necessarily the ones used in the dashboard):
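Total Profit = SUM(orders[Profit])

Profit vs PY =
[Total Profit]
    - CALCULATE([Total Profit], SAMEPERIODLASTYEAR('Calendar'[Date]))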

Adding bookmarks to switch between charts

To switch between charts, we added bookmarks for sales and profits. For the sales chart, the profit bookmark is hidden, and vice versa. The switch button was downloaded from the internet and wired to the respective bookmarks.

Interesting features and challenges we faced when building the dashboard

We created custom data cards for the KPIs, which differ from the default ones offered by PowerBI. The stock features of the cards include the background trend, the name, and the value, while the arrows and change indicators are a custom feature. Another interesting feature that we used is cross-filtering, which allowed us to apply a filter to both the profit/sales chart and the KPI cards.

One of the challenges we faced was the inability to build a bar chart with two categories. This feature was not implemented in PowerBI at the time of writing this overview (maybe it is implemented now), so we had to create a table and embed bar charts into it. Similarly, we inserted bar charts into the Top Customers table.

Conclusion

Our team has evaluated the dashboard and given it the following scores on a 1–10 scale (10 being the highest):

  1. Meets the tasks – 9.8
  2. Learning curve  – 3.0
  3. Tool functionality – 9.5
  4. Ease of use – 7.5
  5. Compliance with the layout – 9.5
  6. Visual evaluation – 8.8

Overall: 8.0 out of 10. Have a look at the final dashboard here.


Free math courses for analysts and data engineers

Estimated read time – 4 min

Nowadays, the Internet offers a huge number of paid courses that promise to make you a data analyst. Some of them are really great and you get a valuable set of skills upon completion. However, most of them don’t focus on fundamental math and programming skills that are crucial to make it in the field.

Some people believe that an analyst doesn’t need SQL or Python. Others argue that an analyst can solve problems using only hard skills, without deep knowledge of math. In my opinion, that’s a big delusion. Apart from hard skills, a good data analyst should have a strong background in math and computer science. If you think that’s a big deal, read on, as I have a solution for you.

In my opinion, it’s difficult to reason about the probability of churn without understanding probability theory. It’s difficult to discuss the mean and the normal distribution without understanding statistics. It’s impossible to grasp SVD without knowledge of linear algebra, or to find a gradient without understanding calculus. Some people may argue that an analyst doesn’t need this: tools like Python / R / Matlab allow building models without taking care of the math. In the beginning, this might even work. You can take a ready-made algorithm, add a couple of commands, and voilà, you have built a regression model. But what do you do next? How do you tune specific parameters of the model without understanding the math behind it?

Nowadays, the Internet provides us with an incredible opportunity to get an Ivy League-level education free of charge. A beginner data analyst should take advantage of this opportunity before buying online data analysis courses. Just recently, I completed university-level math and programming courses, and I want to share them with you. Although I took advanced math 15 years ago at university, it was still worth revising (we tend to forget a great deal in 15 years). An additional benefit of such courses is the development of highly desired analytical thinking skills.

Here is the list of free online courses from eminent US universities that I want to share with you. These courses will definitely help you start your learning journey in data analytics.

Calculus (M.I.T.)

This is an amazing set of courses, both in content and in presentation, offered by MIT in three parts:

  1. Differentiation
  2. Integration
  3. Coordinate systems and infinite series

Linear Algebra (Georgia Tech)

A course in four parts from one of the leading world universities in Computer Science: Georgia Tech.

  1. Linear equations
  2. Matrix algebra
  3. Determinants and eigenvalues
  4. Orthogonality, symmetric matrices and SVD

Probability theory and mathematical statistics (Georgia Tech)

A course in four parts from one of the leading world universities in Computer Science: Georgia Tech.

  1. A gentle introduction to probability
  2. Random variables
  3. A gentle introduction to statistics
  4. Confidence intervals and hypothesis tests

Calculations in Python (Harvard)

A course in nine parts from a Harvard professor:

  1. R basics
  2. Visualization
  3. Probability theory
  4. Inference and modeling
  5. Productivity tools
  6. Wrangling
  7. Linear regression
  8. Machine learning
  9. Capstone

Building frequency counts and bigrams using the posts of traders

Estimated read time – 9 min

Stocktwits is the largest social network for investors and traders of all levels, and it lets us see what is happening in the financial markets. Today we will build frequency counts and bigrams from users’ posts, splitting the users into groups by their number of followers. This will allow us to see the difference between the posts of different types of traders.

This is how the feed for the CCIV security looks on Stocktwits:

Some users have the status of officials:

Scraping the posts

Stocktwits has an API that returns 30 posts at a time. The API request returns JSON, so we will write a get_30_messages function that reads the JSON and writes all the entries into a list called rows. The post payload already contains the user information, so we will not create separate tables and will save everything in one DataFrame. For this purpose, we will create a list with the column names and initialize an empty list called rows, to which we will append all the scraped posts.

Some posts don’t have a “likes” key in the JSON, which results in a KeyError. To avoid the error, we will assign 0 likes to such posts.

import requests
import json
import time
import pandas as pd

cols = ['post_id', 'text', 'created_at', 'user_id', 'likes', 'sentiment', 'symbol', 'identity', 'followers', 'following', 'ideas', 'watchlist_stocks_count', 'like_count', 'plus_tier']
rows = []
 
def get_30_messages(data):
    # parse one page of the API response and append each post to rows
    for p in data['messages']:
        try:
            likes = p['likes']['total']
        except KeyError:
            # some posts have no "likes" key at all
            likes = 0
        rows.append({'post_id': p['id'],
                    'text': p['body'],
                    'created_at': p['created_at'],
                    'user_id': p['user']['id'],
                    'likes': likes,
                    'sentiment': p['entities']['sentiment'],
                    'symbol': symbol,
                    'identity': p['user']['identity'],
                    'followers': p['user']['followers'],
                    'following': p['user']['following'],
                    'ideas': p['user']['ideas'],
                    'watchlist_stocks_count': p['user']['watchlist_stocks_count'],
                    'like_count': p['user']['like_count'],
                    'plus_tier': p['user']['plus_tier']
                    })

We will scrape the posts from the pages of the 16 most trending securities.

symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']

As an API request returns only the 30 most recent posts, to get older ones we need to save the id of the last post into a dictionary and pass it as the max parameter in the next request. Unfortunately, the API allows only 200 requests per hour, so to stay within the limit we will run the for loop only 11 times for each security.

last_id_values = dict()

for symbol in symbols:
    # first page of the feed for this symbol
    file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json")
    data = json.loads(file.content)

    for i in range(10):
        get_30_messages(data)

        # remember the id of the oldest post fetched so far for this symbol
        last_id = data['cursor']['max']
        last_id_values[symbol] = last_id

        # request the next (older) page of the feed
        file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id}")
        data = json.loads(file.content)

    get_30_messages(data)

Thus, we have collected only about 6,000 posts, which is not enough for the analysis. That’s why we will set a timer to rerun the same code every 1 hour and 5 minutes, for 11 cycles.

def get_older_posts():
    for symbol in symbols:
        for i in range(12):
            file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id_values[symbol]}")
            data = json.loads(file.content)        
            get_30_messages(data)
 
            last_id = data['cursor']['max']
            last_id_values[symbol] = last_id
 
for i in range(11):
    time.sleep(3900)
    get_older_posts()

After all the data is collected, let’s create a DataFrame.

df = pd.DataFrame(rows, columns = cols)

The resulting table will look like this:

It is important to check that post_id doesn’t have duplicate values. Comparing the number of unique values with the total number of values in post_id, we can see that there are about 10,000 duplicates.

df.post_id.nunique(), len(df.post_id)

This happens because some posts appear on the pages of several securities. So the last step is dropping the duplicates.

df.drop_duplicates(subset="post_id", inplace=True)

Frequency counts and bigrams

First of all, let’s create a frequency count for posts without dividing them into groups.

df.text.str.split(expand=True).stack().value_counts()

We can see that articles, conjunctions, and prepositions prevail over the other words:

Thus, we need to remove them from the dataset. However, even after the dataset is cleaned, the results look like this. Apart from the fact that “39” is the most frequent token (a remnant of HTML-encoded apostrophes, &#39;, in the post bodies), the data is not very informative, and it’s difficult to draw any conclusions from it.

In this case, we need to build bigrams. A bigram is a sequence of two adjacent elements, in our case two words standing next to each other. There are many algorithms for building n-grams with different levels of optimization; we will use the built-in nltk function to build bigrams for one group at a time. First, let's import the additional libraries, download the stop words for the English language, and clean the data. Then we will add more stop words, including the names of the stock tickers, which appear in every post.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
import unicodedata
import collections

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

english_stopwords = stopwords.words("english")
symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']
symbols_lower = [sym.lower() for sym in symbols]
append_stopword = ['https', 'www', 'btc', 'x', 's', 't', 'p', 'amp', 'utm', 'm', 'gon', 'na', '’', '2021', '04', 'stocktwits', 'com', 'spx', 'ndx', 'gld', 'slv', 'es', 'f', '...', '--', 'cqginc', 'cqgthom', 'gt']
english_stopwords.extend(symbols_lower)
english_stopwords.extend(append_stopword)

Let’s define a text-preparation function that converts all words to lowercase, lemmatizes them to their base form, and removes stop words and punctuation.

wordnet_lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # lowercase, tokenize, lemmatize, and drop stop words and punctuation
    tokens = nltk.word_tokenize(text.lower())
    tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens if token not in english_stopwords
              and token != " "
              and token.strip() not in punctuation]
    return " ".join(tokens)

df.text = df.text.apply(preprocess_text)

As an example, let’s take the group of the least popular users, those with fewer than 300 followers, build the bigrams, and output the most frequent ones.

non_pop_df = df[(df['followers'] < 300)]
 
non_pop_counts = collections.Counter()
for sent in non_pop_df.text:
    words = nltk.word_tokenize(sent)
    non_pop_counts.update(nltk.bigrams(words))
non_pop_counts.most_common()

Results of the bigrams study

Users with fewer than 300 followers mostly write about their personal plans for making money. This is shown by collocations like short term, long term, and make money.
Less than 300 followers:
1. look like, 439
2. next week, 422
3. let 39, 364
4. capital gain, 306
5. long term, 274
6. let go, 261
7. stock market, 252
8. buy dip, 252
9. gain tax, 221
10. make money, 203
11. short term, 201
12. buy buy, 192

More popular users with 300 to 3000 followers discuss more abstract topics like sweep premium, stock price, and artificial intelligence.
From 300 to 3000 followers:
1. sweep premium, 166
2. price target, 165
3. total day, 140
4. stock market, 139
5. ask premium, 132
6. stock price, 129
7. current stock, 117
8. money trade, 114
9. trade option, 114
10. activity alert, 113
11. trade volume, 113
12. artificial intelligence, 113

Popular users with 3000 to 30000 followers share their observations as well as promote their accounts or articles.
From 3000 to 30000 followers:
1. unusual option, 632
2. print size, 613
3. option activity, 563
4. large print, 559
5. activity alerted, 355
6. observed unusual, 347
7. sweepcast observed, 343
8. |🎯 see, 311
9. see profile, 253
10. profile link, 241
11. call expiring, 235
12. new article, 226

Very popular traders with more than 30000 followers mostly act as information sources and post about changes in the stock market. This is indicated by the frequent up and down arrows and collocations like “stock x-day” or “moving average”.
Users with more than 30000 followers:
1. dow stock, 69
2. elliottwave trading, 53
3. ⇩ indexindicators.com, 51
4. ⇧ indexindicators.com, 50
5. u stock, 47
6. stock 5-day, 36
7. moving average, 29
8. stock moving, 28
9. stock x-day, 27
10. ⇧ 10-day, 26
11. stock daily, 25
12. daily rsi, 25

We also built the bigrams of users with official status, but the results turned out to be very similar to those of the most popular users. Selecting that group works the same way as for the other groups; here is a sketch (the 'Official' label for the identity field is our assumption about the API payload):
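# Officials group, selected by the identity column we collected earlier.
# The exact label 'Official' is an assumption about the API payload.
official_df = df[df['identity'] == 'Official']

official_counts = collections.Counter()
for sent in official_df.text:
    words = nltk.word_tokenize(sent)
    official_counts.update(nltk.bigrams(words))
official_counts.most_common()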


Comparing Tableau and PowerBI training programs

Estimated read time – 7 min

This year I succeeded in becoming a Tableau Desktop Certified Associate. When I was thinking about how to prepare for the exam, I came across e-learning courses from Tableau that turned out to be free for 90 days.

I decided not to waste such an opportunity and completed all three Fundamentals modules at a fast pace. Once certified, I started wondering what training programs other BI-tool vendors offer. First things first, I decided to study the training materials on PowerBI. In this small article, I would like to compare the Tableau and PowerBI training programs.

Disclaimer: in the end, I formed an admittedly prejudiced, positive attitude towards Tableau, so PowerBI supporters may not like this article and may find it biased (in all fairness, there are also words of praise for PowerBI).

After having studied the training materials, I can finally state the reasons why I am definitely in favor of Tableau as a tool for data analysis and visualization.

First of all, there is a huge gap in the approach to the materials and to assessing how well they are understood. Although the Tableau training materials are more technical and pay less attention to design, by working through their videos you can learn to build excellent visualizations. After completing all three stages of the Tableau training, you are left with a strong desire to create stunning new reports with LOD Expressions and Filter Actions and to build convenient interfaces. After watching all the materials on Power BI, however, the only question that remains is: why did I waste my time?

Emotions aside, there are several key points that turned out to be important after having studied the material.

This is a good dashboard according to Microsoft

The quality of content and training examples

If you look at the way the training videos are presented in Tableau, and at the quiz questions posed at the end of each covered topic, you start to understand the ideas behind the software. In the case of Power BI, you will be totally disappointed. Have a look, for instance, at the material on identifying outliers, where Microsoft suggests building a scatter plot and spotting all the outliers visually.

Design of reports and dashboards

There is some fair criticism of the Tableau training materials on the topic of chart design and control elements, but they are still neatly and beautifully made. Now have a look at the dreadful thing that Microsoft suggests as the result of an analyst's work. And this is a well-built dashboard, according to Microsoft.

Assessment of the knowledge gained during the training

During the Tableau training, immediately after each small lecture, you learn by applying part of the studied material in practice: you need to click specific buttons in the interface to solve a problem. Power BI offers “labs” that are supposed to be launched on a remote machine. I didn't manage to start a single lab; I wrote to support three times, they couldn't solve my problem, and so I never got to experiment with the PowerBI tasks.

The results of the analyst’s work according to Microsoft.

Other points are mostly related to the software rather than the training program.

Cross-platform support

I have been working with Tableau for a long time, and 4 years ago I switched to a Mac. After the transition from Windows, my experience of using Tableau did not change: Tableau kept evolving and I evolved with it, but the team did not change the key elements of the interface. I have been experimenting with building reports in PowerBI, but I was put off by various Microsoft archaisms, like publishing through some share portal where you need an MS account and have to configure things through an administrator. All of this was a terrible headache.

However, what struck me most was that I could not use PowerBI on a Mac. There is absolutely no way to do it, and this is a principled stance of Microsoft that is not expected to change. From my point of view, software like this belongs to the B2B analytics segment and assumes connections to all kinds of DBMSs, yet it denies the existence of an alternative operating system used by a number of potential consultants who could adopt and promote PowerBI as an analytical tool.

Most certainly, there are rational reasons why Microsoft software doesn't work very well on a Mac, but the simple truth is that the software remains inaccessible to me. Nevertheless, I wasn't looking for an easy way out, and I installed PowerBI through Parallels in order to honestly reconsider the tools in light of the training materials.

Visualization options

Both Tableau and PowerBI offer stunning visualization options; in this regard, PowerBI even provides a little more than usual. So, on this matter, the tools are presented equally well.

Functionality

Here I want to give credit to the functionality of PowerBI: the variety of tools is extremely wide even without connecting third-party libraries, for example automatic clustering, a decomposition tree, a data profiler, and setting filters on a graph.

Internal language syntax

To work with PowerBI you need to learn DAX. It is not a programming language but a functional expression language. You won't be able to write your own arbitrary code; however, you won't even need to, since all the functions are already implemented and you only have to learn how to use them. Microsoft covers DAX quite well in the manual. The definition of a new measure in DAX looks like this:

Revenue YoY % =
DIVIDE(
	[Revenue]
		- CALCULATE(
			[Revenue],
			SAMEPERIODLASTYEAR('Date'[Date])
		),
	CALCULATE(
		[Revenue],
		SAMEPERIODLASTYEAR('Date'[Date])
	)
)

Preparing data for the analysis

PowerBI has a built-in Unpivot feature that brings data spread across time-period columns into a form that is convenient to use in pivot tables.

However, Tableau Prep, the ETL tool for data cleaning and wrangling in the Tableau ecosystem, also implements this feature. For readers more at home in code than in BI interfaces, the sketch below shows the same unpivot operation in pandas (the data and column names are made up for the illustration):
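import pandas as pd

# A wide table with one column per time period, as it often arrives from Excel
wide = pd.DataFrame({
    'Province': ['Alberta', 'Ontario'],
    '2019': [100, 150],
    '2020': [120, 170],
})

# Unpivot: turn the period columns into (year, sales) rows
long = wide.melt(id_vars='Province', var_name='year', value_name='sales')
print(long)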

Conclusions:

1) The training programs are built in completely different ways; the methodology of immersion into the Tableau tools is more elaborate and efficient. There is an opportunity to gain practical problem-solving experience and to get feedback (albeit automated).
2) The design of the reports and dashboards in Microsoft's training materials hardly looks professional, while Tableau's implementation is much better.
3) Knowledge assessment at Microsoft is implemented at an abysmal level (absolutely perfunctory tests, like in a bad school), while at Tableau it is much better: you dive into the problem, think about the answer, and solve it.
4) Cross-platform support is not PowerBI's strongest point, whereas for Tableau it is an excellent competitive advantage.
5) The functionality and capabilities of both tools are certainly top-level, and on some points PowerBI wins.

Have a look at our dashboard reviews in Tableau and other BI tools.
