21 post tagged

Later Ctrl + ↑

# Building an interactive waterfall chart in Python

Estimated read time – 4 min

Back in 2014, we built a waterfall chart in Excel, widely known in the consulting world, for one of our presentations about the e-commerce market in Ulmart. It’s been a while and today we are going to draw one in Python and the Plotly library. This type of charts is oftentimes used to illustrate changes with the appearance of a new positive or negative factor. In the latter article about data visualization, we explained how to build a beautiful Bar Chart with bars that resemble thermometers, it’s especially useful when we want to compare planned targets with actual values.

We are using the Ulmart data on the e-commerce market growth from 2013 to 2014. Data on the X-axis is chart captions, on the Y-axis we displayed the initial and final values, as well as their change. With the sum() function calculate the total and add it to the end of our list. The <br> tag in the x_list shows a line break in text.

import plotly.graph_objects as go

x_list = ['2013','The Russian <br>Macroeconomy', 'Decline in working age<br>population','Internet usage growth','Development of<br>cross-border trade', 'National companies', '2014']
y_list = [738.5, 48.7, -7.4, 68.7, 99.7, 48.0]
total = round(sum(y_list))
y_list.append(total)

Let’s create a list with column values, we called it text_list. The values will be taken from the y_list, but first we need to transform them. Convert all numerical values into strings and if it’s not the first or the last column, add a plus sign for clarity. In case it’s a positive change, the color will be green, otherwise red. Highlight the first and the last values with the <b> tag;

text_list = []
for index, item in enumerate(y_list):
if item > 0 and index != 0 and index != len(y_list) - 1:
text_list.append(f'+{str(y_list[index])}')
else:
text_list.append(str(y_list[index]))
for index, item in enumerate(text_list):
if item[0] == '+' and index != 0 and index != len(text_list) - 1:
text_list[index] = '<span style="color:#2ca02c">' + text_list[index] + '</span>'
elif item[0] == '-' and index != 0 and index != len(text_list) - 1:
text_list[index] = '<span style="color:#d62728">' + text_list[index] + '</span>'
if index == 0 or index == len(text_list) - 1:
text_list[index] = '<b>' + text_list[index] + '</b>'

Let’s set parameters for the dashed lines we want to add. Create a list of dictionaries and fill it with light-gray dashed lines, passing the following:

dict_list = []
for i in range(0, 1200, 200):
dict_list.append(dict(
type="line",
line=dict(
color="#666666",
dash="dot"
),
x0=-0.5,
y0=i,
x1=6,
y1=i,
line_width=1,
layer="below"))

Now, create a graph object with the Waterfall() method. Each column in our table can be of a certain type: total, absolute (both with final values) or relative (holds intermediate values). Then we need to set colors, make the connecting line transparent, positive changes will be green, while negative ones are red, and the final columns are purple. Here we are using the Open Sans font.

fig = go.Figure(go.Waterfall(
name = "e-commerce", orientation = "v",
measure = ["absolute", "relative", "relative", "relative", "relative", "relative", "total"],
x = x_list,
y = y_list,
text = text_list,
textposition = "outside",
connector = {"line":{"color":'rgba(0,0,0,0)'}},
increasing = {"marker":{"color":"#2ca02c"}},
decreasing = {"marker":{"color":"#d62728"}},
totals={'marker':{"color":"#9467bd"}},
textfont={"family":"Open Sans, light",
"color":"black"
}
))

Finally, add the title with the description, hide the legend, set the Y label and add dashed lines to our chart.

fig.update_layout(
title =
{'text':'<b>Waterfall chart</b><br><span style="color:#666666">E-commerce market growth from 2013 to 2014</span>'},
showlegend = False,
height=650,
font={
'family':'Open Sans, light',
'color':'black',
'size':14
},
plot_bgcolor='rgba(0,0,0,0)',
yaxis_title="млрд руб.",
shapes=dict_list
)
fig.update_xaxes(tickangle=-45, tickfont=dict(family='Open Sans, light', color='black', size=14))
fig.update_yaxes(tickangle=0, tickfont=dict(family='Open Sans, light', color='black', size=14))

fig.show()

And here it is:

No comments    166   11 mon   data analytics   plotly   python   visualisation

# Example of using dictionaries in Clickhouse with Untappd

Estimated read time – 12 min

In Clickhouse we can use internal dictionaries as well as external dictionaries, they can be an alternative to JSON that doesn’t always work fine. DIctionaries store information in memory and can be invoked with the dictGet method. Let’s review how we can create one in Clickhouse and use it for our queries.

We will illustrate an example of data using the Untappd API. Untappd is a social network for everyone who loves craft beer. We are going to use сheck-ins of Russian-based craft breweries and start collecting information about them to analyze this data later on and to draw some conclusions. in today’s article, we will analyze how to receive metadata on Russian breweries with Untappd and store it in a Clickhouse dictionary.

## Collecting data with Untappd

First off, we need to create a new app to receive client_id and  client_secret_key to make API calls. Follow  this link and fill in the fields:

Usually, it takes about 1 to 3 weeks to wait for approval.

import requests
import pandas as pd
import time

We’ll be using the requests library to make API calls, view results in a Pandas DataFrame, and save them in a CSV file before sending it to a Clickhouse dictionary. Untappd has strict limits on the number of requests, prohibiting us to make more than 100 calls per hour. Therefore, we need to make our script wait for 38 seconds using the Python time module.

client_id = 'your_client_id'
client_secret = 'your_client_secret'
all_brewery_of_russia = []

We want to get data for one thousand Russian breweries. One request to the Brewery Search method enables us to view up to 50 breweries. The website gave us 3369 breweries when searching the word “Russia” manually.

Let’s check this, scroll down to the bottom, and open the page code.

Each brewery received is stored in the beer-item class. This means we can the number of references to beer-item:

And as it turned out, we have exactly 1000 breweries, not 3369. When searching the word “Russia” manually, the results also contain some American breweries. So, we need to make 20 calls, getting 50 breweries at a time:

for offset in range(0, 1000, 50):
try:
print('offset = ', offset)
print('remained:', 1000 - offset, '\n')
response = requests.get(f'https://api.untappd.com/v4/search/brewery?client_id={client_id}&client_secret={client_secret}',
params={
'q':'Russia',
'offset':offset,
'limit':50
})
item = response.json()
print(item, '\n')
all_brewery_of_russia.append(item)
time.sleep(37)
except Exception:
print(Exception)
continue

The Brewery Search method includes several parameters, q – a string with a country name (specify specify “Russia” to get all the breweries based in Russia), offset – allows us to shift by 50 lines in the search to get the next list of breweries, limit – restricts the number of breweries received and can not be more than 50. Convert the answer to JSON and append data sotred in the item object to the  all_brewery_of_russia list.

Our data may also include breweries from other countries. That’s why we need to filter the data. Iterate through the all_brewery_of_russia list and keep only those breweires, which country_name is Russia.

brew_list = []
for element in all_brewery_of_russia:
brew = element['response']['brewery']
for i in range(brew['count']):
if brew['items'][i]['brewery']['country_name'] == 'Russia':
brew_list.append(brew['items'][i])

Print out the first element in our brew_list:

print(brew_list[0])

Create a DataFrame with the following columns: brewery_id, beer_count, brewery_name, brewery_slug, brewery_page_url, brewery_city, lat и  lng. And several lists to sort out the data stored in the brewery_list:

df = pd.DataFrame()
brewery_id_list = []
beer_count_list = []
brewery_name_list = []
brewery_slug_list = []
brewery_page_url_list = []
brewery_location_city = []
brewery_location_lat = []
brewery_location_lng = []
for brewery in brew_list:
brewery_id_list.append(brewery['brewery']['brewery_id'])
beer_count_list.append(brewery['brewery']['beer_count'])
brewery_name_list.append(brewery['brewery']['brewery_name'])
brewery_slug_list.append(brewery['brewery']['brewery_slug'])
brewery_page_url_list.append(brewery['brewery']['brewery_page_url'])
brewery_location_city.append(brewery['brewery']['location']['brewery_city'])
brewery_location_lat.append(brewery['brewery']['location']['lat'])
brewery_location_lng.append(brewery['brewery']['location']['lng'])

Assign them as column values:

df['brewery_id'] = brewery_id_list
df['beer_count'] = beer_count_list
df['brewery_name'] = brewery_name_list
df['brewery_slug'] = brewery_slug_list
df['brewery_page_url'] = brewery_page_url_list
df['brewery_city'] = brewery_location_city
df['brewery_lat'] = brewery_location_lat
df['brewery_lng'] = brewery_location_lng

And view our DataFrame:

Let’s sort the values by brewery_id and store our DataFrame as a CSV file without index column and headings:

df = df.sort_values(by='brewery_id')

## Creating a Clickhouse dictionary

You can create Clickouse dictionaries in many different ways. We will try to structure it in an XML file, configure the server files, and access it through our client. The XML file structure will be the following:

<yandex>
<dictionary>
<name>breweries</name>
<source>
<file>
<path>/home/ubuntu/brewery_data.csv</path>
<format>CSV</format>
</file>
</source>
<layout>
<flat />
</layout>
<structure>
<id>
<name>brewery_id</name>
</id>
<attribute>
<name>beer_count</name>
<type>UInt64</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>brewery_name</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>brewery_slug</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>brewery_page_url</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>brewery_city</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>lat</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
<attribute>
<name>lng</name>
<type>String</type>
<null_value>Null</null_value>
</attribute>
</structure>
</dictionary>
</yandex>

name is a dictionary name, attribute holds the properties of the columns, id is a key field, file stores file path and format. We are going to store our file in this directory: /home/ubuntu.

Let’s upload our CSV and XML files to the server, it can be done using an FTP like FileZilla. We explained how to deploy Clickhouse on an Amazon instance in our previous article, this time need to do the same. Open your FileZilla client and go to SFTP settings to add a private key:

Connect to your server address, it can be found in the EC2 management console. Specify SFTP as a protocol, your Host, and Ubuntu as a username.

After connecting we will wind up in this location /home/ubuntu. Let’s put the files in that folder and connect via SSH using Termius. Then we need to move the files to /etc/clickhouse-server to view them in Clickhouse:

Learn how you can connect to an AWS server using SSH client from our previous material Installing Clickhouse on AWS

sudo mv breweries_dictionary.xml /etc/clickhouse server/

Go to the config file:

cd /etc/clickhouse-server
sudo nano config.xml

We need the  tag, it’s the path to a file that describes the dictionaries structure. Specify the path to our XML file:

<dictionaries_config>/etc/clickhouse-server/breweries_dictionary.xml</dictionaries_config>

Save our file and run the Clickhouse client:

clickhouse client

Let’s check that the dictionary really loaded:

SELECT * FROM system.dictionaries\G

In case of success you will get the following:

Now, let’s write a query with the  dictGet function to get the name of the brewery with ID 999. Pass in the dictionary name, as the first argument, then the filed name and ID.

SELECT dictGet('breweries', 'brewery_name', toUInt64(999))

And our query returns this:

Similarly, we could use this function to get a beer name, when the table contains only IDs.

No comments    601   1 y   Amazon Web Services   Analytics engineering   clickhouse   data analytics   python

# QR code recognition for sales receipts with Skimage

Estimated read time – 7 min

When we want to scan a QR code the image quality matters, and oftentimes the image may look blurred and defocused. To address this problem and suppress unwanted distortions we can use image pre-processing. In today’s article we will discover how to improve QR code recognition with the help of scikit-image library.

from matplotlib import pyplot as plt
import skimage
from skimage import util, exposure, io, measure, feature
from scipy import ndimage as ndi
import numpy as np
import cv2

## Subject

Let’s try to scan this till receipt from our preceding article Collecting data from hypermarket receipts on Python. Use the  imread() function to read our image and then display it.

plt.imshow(img)

It seems hardly possible to read any letter from this blurred image. Let’s try to do this again with a predefined function from the  opencv library:

detector = cv2.QRCodeDetector()
data, bbox, _ = detector.detectAndDecode(img)
if data:
print(data)
else:
print('Ooops! Nothing here...')

Scan our image once again:

Ooops! Nothing here...

That’s not surprising, the abundance of pixels makes it difficult for the scanner to recognize the QR code. Nevertheless, we can simplify the task by specifying the edges of the QR image.

## Solution

First, let us remove all the unnecessary pixels, find the coordinates of the QR code and pass it to the qr_reader function. First off, remove noise in the image using the median filter and convert our RGB image to grayscale, as QR-codes are composed of only two colors.

image = ndi.median_filter(util.img_as_float(img), size=9)
image = skimage.color.rgb2gray(image)
plt.imshow(image, cmap='gray')

The median filter blurred our image, and the scattered pixels have become less clear, while the QR code looks much better now. Apply the  adjust_gamma function to our image. This function exponentiates the gamma value of each pixel, less gamma means that the pixel will be closer to white color. We will set gamma to 0.5.

plt.imshow(pores_gamma, cmap='gray')

We can see clear improvements, the QR code is now much distinct than previously. Let’s take advantage of it and set all pixels with a value of less than 0.3 to 0, while others to 1.

thresholded = (pores_gamma <= 0.3)
plt.imshow(thresholded, cmap='gray')

Now, let’s apply the Canny filter to our thresholded image. This filter smoothes the image and calculate the gradients, the edges are where the gradient at maximum. With the increasing sigma parameter, the canny filter stops discerning less clear edges.

edge = feature.canny(thresholded, sigma=6)
plt.imshow(edge)

Outline the QR code with the coordinates of the edges. We can calculate them with the find_contours method and draw them atop the image. Coordinates are stored in the contours array.

contours = measure.find_contours(edge, 0.5)
plt.imshow(edge)
for contour in contours:
plt.plot(contour[:,1], contour[:,0], linewidth=2)

We will take minimum and maximum coordinates for X and Y axes, thus drawing a visible rectangle.

positions = np.concatenate(contours, axis=0)
min_pos_x = int(min(positions[:,1]))
max_pos_x = int(max(positions[:,1]))
min_pos_y = int(min(positions[:,0]))
max_pos_y = int(max(positions[:,0]))

Having the coordinates, we can ensquare the code area:

start = (min_pos_x, min_pos_y)
end = (max_pos_x, max_pos_y)
cv2.rectangle(img, start, end, (255, 0, 0), 5)
io.imshow(img)

Let’s try to cut this area according to our coordinates:

new_img = img[min_pos_y:max_pos_y, min_pos_x:max_pos_x]
plt.imshow(new_img)

Pass the new image to the qr_reader function:

And it returns this:

t=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1

That’s exactly what we need! Of course, the script is not universal and every image is unique, some may have too much noise or low contrast, while others may not. The sequence of actions depends on the case. Next time, we will show the subsequent stage of image processing using a well-established python library.

No comments    140   1 y   data analytics   python   skimage

# Predicting category of products by name from Russian Food Stores

Estimated read time – 9 min

This article is a continuation of our series about analyzing data on consumer products: «Collecting data from hypermarket receipts on Python» and «Parsing the data of site’s catalog, using Beautiful Soup and Selenium». We are going to build a model that would classify products by name in a till receipt. Till receipts contain data for each product bought, but it doesn’t provide us a summary of how much were spent on Sweets or Dairy Foods in total.

## Data Wrangling

Load data from our .csv file to a Pandas DataFrame and see how it looks:

Did you know that we can emulate human behavior to parse data from a web-catalog? More details about it are in this article: «Parsing the data of site’s catalog, using Beautiful Soup and Selenium»

import pandas as pd

As you can see, the DataFrame contains even more than we need for predicting the category of products by name. So we can drop() columns with prices and weights, and rename() the remaining ones:

sku.drop(columns=['Unnamed: 0', 'Weight','Price'],inplace=True)
sku.rename(columns={"SKU": "SKU", "Category": "Group"},inplace=True)

Group the products by its category and count them up with the following methods:

sku.groupby('Group').agg(['count'])

We will train our predictive model on this data so that it could identify the product category by name. Since the DataFrame includes product names mainly in Russian, the model won’t make predictions properly. The Russian language contains a lot of prepositions, conjunctions, and specific speech patterns. We want our model to distinguish that «Мангал с ребрами жесткости» («Brazier with strengthening ribs» ) and «Мангал с 6 шампурами» («Brazier with 6 skewers») belongs to the same category. With this is my we need to clean up all the product names, removing conjunctions, preposition, interjections, particles and keep only word bases with the help of stemming.

A stemmer is a tool that operates on the principle of recognizing “stem” words embedded in other words.

import nltk
from nltk.corpus import stopwords
from pymystem3 import Mystem
from string import punctuation

In our case will be using the pymystem3 library developed by Yandex. Product names in our DataFrame may vary from those ones you could find in supermarkets today. So first, let’s improve the list of stop words that our predictive model will ignore.

mystem = Mystem()
russian_stopwords = stopwords.words("russian")
russian_stopwords.extend(['лента','ассорт','разм','арт','что', 'это', 'так', 'вот', 'быть', 'как', 'в', '—', 'к', 'на'])

Write a function that would preprocess our data and extract the word base, remove punctuation, numerical signs, and stop words. The following code snippet belongs to one Kaggle kernel.

def preprocess_text(text):
text = str(text)
tokens = mystem.lemmatize(text.lower())
tokens = [token for token in tokens if token not in russian_stopwords\
and token != " " \
and len(token)>=3 \
and token.strip() not in punctuation \
and token.isdigit()==False]
text = " ".join(tokens)
return text

See how it works:

An extract from Borodino (Russian: Бородино), a poem by Russian poet Mikhail Lermontov which describes the Battle of Borodino.

preprocess_text("Мой дядя самых честных правил, Когда не в шутку занемог, Он уважать себя заставил И лучше выдумать не мог.")

Transformed into:

'дядя самый честный правило шутка занемогать уважать заставлять выдумывать мочь'

Everything works as expected – the result includes only word stems in lower case with no punctuation, prepositions or conjunctions. Let’s apply this function to a product name from our DataFrame:

print(‘Before:’, sku['SKU'][0])
print(‘After:’, preprocess_text(sku['SKU'][0]))

Preprocessed text:

Before: Фисташки соленые жареные ТМ 365 дней
After: фисташка соленый жареный день

The function works fine and now we can apply it to the whole column, and create a new one with processed names:

sku['processed']=sku['SKU'].apply(preprocess_text)

## Building our Predictive Model

We will be using CountVectorizer to predict the product category, and Naive Bayes Classifier.
CountVectorizer will tokenize our text and build a vocabulary of known words, while Naive Bayes Classifier allows us to train our model on a DataFrame with multiple classes. We will also need TfidfTransformer for computing words count (term frequency). As we want to chain these steps, let’s import the Pipeline library:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.pipeline import Pipeline

Separate our targets, Y (categories) from the predictors, X (processed product names). And split the DataFrame into Test and Training sets, allocating 33% of samples for testing.

x = sku.processed
y = sku.Group
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

Add the following methods to our pipeline:

• CountVectorizer() – returns a matrix of token counts
• TfidfTransformer() – transforms a matrix into a normalized tf-idf representation
• MultinomialNB() – an algorithm for predicting product category
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2))),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB())])

Fit our model to the Training Dataset and make predictions for the Test Dataset:

text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

Evaluate our predictive model:

print('Score:', text_clf.score(X_test, y_test))

The model predicts correctly 90% of the time.

Score: 0.923949864498645

## Validate our model with the real-world data

Let’s test how good our model performs on real-world data. We’ll refer to the DataFrame from our previous article, «Collecting data from hypermarket receipts on Python», and preprocess the product names:

my_products['processed']=my_products['name'].apply(preprocess_text)

Pass the processed text to the model and create a new column that would hold our predictions:

prediction = text_clf.predict(my_products['processed'])
my_products['prediction']=prediction
my_products[['name', 'prediction']]

Now, the DataFrame looks the following:

Calculate the spendings for each product category:

my_products.groupby('prediction').sum()

Overall, the model seems to be robust in predicting that sausages fall under meat products, quark is a dairy product, baguette belongs to bread and pastries. But sometimes it misclassifies kiwi as a dairy product and pear as an eco-product. This is probably because these categories include many products are «with the taste of pear» or «with the taste of kiwi», and the algorithm makes predictions based on the prevailing group of products. This is a well-known issue of unbalanced classes, but it can be addressed by resampling the DataSet or choosing proper weights for our model.

No comments    165   1 y   data analytics   machine learning   python

# Beautiful Bar Charts with Python and Matplotlib

Estimated read time – 5 min

The Matplotlib library provides a wide range of tools for Data Visualisation, allowing us to create compelling, expressive visualizations. But why then so many plots look so bland and boring? Back in 2011 we built a simple yet decent diagram for a telecommunication company report and named it ‘Thermometer’. Later this type of bars was exposed to a wide audience on  Chandoo, which was a popular blog on Excel. By the way, here’s what it looks like:

Times change, and today we’ll recall the way to plot this type of diagrams with the help of Matplotlib

When should one use this type of diagram?
The best way to plot this type of diagrams is when comparing target values with actual values because it reflects underfulfilment and overfulfilment of planned targets. A diagram may reflect data in percentages as well as in real figures. Let’s view an example using the latter.

We’ll use data stored in an excel file and already familiar python libraries for data analysis:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read our file as a DataFrame:

That’s what it looks like:

Now, we need to extract columns from the table. The first column called «Sales» will be displayed under each bar. Some values may be of a string type if there is a comma between two values. We need to convert these type of values by replacing a comma with a dot and converting them to floats.

xticks = df.iloc[:,0]
try:
bars2 = df.iloc[:,1].str.replace(',','.').astype('float')
except AttributeError:
bars2 = df.iloc[:,1].astype('float')
try:
bars1 = df.iloc[:,2].str.replace(',','.').astype('float')
except AttributeError:
bars1 = df.iloc[:,2].astype('float')

As we don’t know for sure if the table includes such values, our actions may cause an  AttributeError . Fortunatelly for us, Python has a built-in try – except
method for handling such errors.

Let’s plot a simple side-by-side bar graph, setting a distance between two related values using a NumPy array:

barWidth = 0.2
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]

plt.bar(r1, bars1, width=barWidth)
plt.bar(r2, bars2, width=barWidth)

And see what happens:

Obviously, this is not what we expected. Let’s try to set a different bar width to make bars overlapping each other.

barWidth1 = 0.065
barWidth2 = 0.032
x_range = np.arange(len(bars1) / 8, step=0.125)

We can plot the bars and set its coordinates, color, width, legend and signatures in advance:

plt.bar(x_range, bars1, color='#dce6f2', width=barWidth1/2, edgecolor='#c3d5e8', label='Target')
plt.bar(x_range, bars2, color='#ffc001', width=barWidth2/2, edgecolor='#c3d5e8', label='Actual Value')
for i, bar in enumerate(bars2):
plt.text(i / 8 - 0.015, bar + 1, bar, fontsize=14)

Add some final touches – remove the frames, ticks, add a grey line under the bars, adjust font size and layout, make a plot a bit wider and save it as a .png file.

plt.xticks(x_range, xticks)
plt.tick_params(
bottom=False,
left=False,
labelsize=15
)
plt.rcParams['figure.figsize'] = [25, 7]
plt.axhline(y=0, color='gray')
plt.legend(frameon=False, loc='lower center', bbox_to_anchor=(0.25, -0.3, 0.5, 0.5), prop={'size':20})
plt.box(False)
plt.savefig('plt', bbox_inches = "tight")
plt.show()

And here’s the final result:

No comments    396   1 y   data analytics   matplotlib   python   visualisation
Earlier Ctrl + ↓