27 posts tagged

python

Later Ctrl + ↑

Building an interactive waterfall chart in Python

⏱ Время чтения текста – 4 минуты

Back in 2014, we built a waterfall chart in Excel, widely known in the consulting world, for one of our presentations about the e-commerce market in Ulmart. It’s been a while and today we are going to draw one in Python and the Plotly library. This type of charts is oftentimes used to illustrate changes with the appearance of a new positive or negative factor. In the latter article about data visualization, we explained how to build a beautiful Bar Chart with bars that resemble thermometers, it’s especially useful when we want to compare planned targets with actual values.

We are using the Ulmart data on the e-commerce market growth from 2013 to 2014. Data on the X-axis is chart captions, on the Y-axis we displayed the initial and final values, as well as their change. With the sum() function calculate the total and add it to the end of our list. The <br> tag in the x_list shows a line break in text.

import plotly.graph_objects as go

x_list = ['2013','The Russian <br>Macroeconomy', 'Decline in working age<br>population','Internet usage growth','Development of<br>cross-border trade', 'National companies', '2014']
y_list = [738.5, 48.7, -7.4, 68.7, 99.7, 48.0]
total = round(sum(y_list))
y_list.append(total)

Let’s create a list with column values, we called it text_list. The values will be taken from the y_list, but first we need to transform them. Convert all numerical values into strings and if it’s not the first or the last column, add a plus sign for clarity. In case it’s a positive change, the color will be green, otherwise red. Highlight the first and the last values with the <b> tag;

text_list = []
for index, item in enumerate(y_list):
    if item > 0 and index != 0 and index != len(y_list) - 1:
        text_list.append(f'+{str(y_list[index])}')
    else:
        text_list.append(str(y_list[index]))
for index, item in enumerate(text_list):
    if item[0] == '+' and index != 0 and index != len(text_list) - 1:
        text_list[index] = '<span style="color:#2ca02c">' + text_list[index] + '</span>'
    elif item[0] == '-' and index != 0 and index != len(text_list) - 1:
        text_list[index] = '<span style="color:#d62728">' + text_list[index] + '</span>'
    if index == 0 or index == len(text_list) - 1:
        text_list[index] = '<b>' + text_list[index] + '</b>'

Let’s set parameters for the dashed lines we want to add. Create a list of dictionaries and fill it with light-gray dashed lines, passing the following:

dict_list = []
for i in range(0, 1200, 200):
    dict_list.append(dict(
            type="line",
            line=dict(
                 color="#666666",
                 dash="dot"
            ),
            x0=-0.5,
            y0=i,
            x1=6,
            y1=i,
            line_width=1,
            layer="below"))

Now, create a graph object with the Waterfall() method. Each column in our table can be of a certain type: total, absolute (both with final values) or relative (holds intermediate values). Then we need to set colors, make the connecting line transparent, positive changes will be green, while negative ones are red, and the final columns are purple. Here we are using the Open Sans font.

Learn more about how to choose the right fonts for your data visualization from this article: “Choosing Fonts for Your Data Visualization”

fig = go.Figure(go.Waterfall(
    name = "e-commerce", orientation = "v",
    measure = ["absolute", "relative", "relative", "relative", "relative", "relative", "total"],
    x = x_list,
    y = y_list,
    text = text_list,
    textposition = "outside",
    connector = {"line":{"color":'rgba(0,0,0,0)'}},
    increasing = {"marker":{"color":"#2ca02c"}},
    decreasing = {"marker":{"color":"#d62728"}},
    totals={'marker':{"color":"#9467bd"}},
    textfont={"family":"Open Sans, light",
              "color":"black"
             }
))

Finally, add the title with the description, hide the legend, set the Y label and add dashed lines to our chart.

fig.update_layout(
    title = 
        {'text':'<b>Waterfall chart</b><br><span style="color:#666666">E-commerce market growth from 2013 to 2014</span>'},
    showlegend = False,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)',
    yaxis_title="млрд руб.",
    shapes=dict_list
)
fig.update_xaxes(tickangle=-45, tickfont=dict(family='Open Sans, light', color='black', size=14))
fig.update_yaxes(tickangle=0, tickfont=dict(family='Open Sans, light', color='black', size=14))

fig.show()

And here it is:

No comments 986 2020 data analytics plotly python visualisation

Handling website buttons in Selenium

⏱ Время чтения текста – 8 минут

In our previous article, “Parsing the data of site’s catalogue, using Beautiful Soup and Selenium” we have addressed the problem of working with dynamic pages, but sometimes this method doesn’t work, as with “Show more” buttons. Today we will show how you can imitate button click with Selenium to load a whole page, collect beer IDs, ratings, and send the data to Clickhouse.

Webpage structure

Let’s take a random brewery that has 105 check-ins, or customer feedbacks. One page with check-ins displays up to 25 records and looks like this:

If we try to scroll down to the bottom, we will encounter the same button that prevents us from getting all 105 records at once:

First off, to address this task, let’s find out the button class and just click it until it works. Since Selenium launches the browser and the next “Show more” button may not be loaded in time, that’s why we set 2-second intervals between the clicks. As soon as the page is loaded we will take its content and parse the relevant data.
Let’s view the source code and find the button, it’s assigned to the more_checkins class.

The button has style attributes, such as display. When the button is displayed this attribute takes the block value. But when we scroll the page to the buttom and there is nothing left to display, the attribute takes the none value and we can stop clicking.

Writing our code

Let’s import the necessary libraries

import time
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import re
from datetime import datetime
from clickhouse_driver import Client

Chromedriver is used to run Selenium tests on Chrome and can be downloaded from the official website

Connect to the database and create cookies:

client = Client(host='ec1-23-456-789-10.us-east-2.compute.amazonaws.com', user='', password='', port='9000', database='')
count = 0
cookies = {
    'domain':'untappd.com',
    'expiry':1594072726,
    'httpOnly':True,
    'name':'untappd_user_v3_e',
    'path':'/',
    'secure':False,
    'value':'your_value'
}

You can find out more about working with cookies in Selenium from “Parsing the data of site’s catalogue, using Beautiful Soup and Selenium”. We will need the untappd_user_v3_e parameter.

As we are going to work with pages that have more than hundreds of thousands of records, it’s pretty heavy and our instance may be overloaded. To prevent this, we will shut down unnecessary parts and then enable authentication cookie:

options = webdriver.ChromeOptions()
prefs = {'profile.default_content_setting_values': {'images': 2, 
                            'plugins': 2, 'fullscreen': 2}}
options.add_experimental_option('prefs', prefs)
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(options=options)
driver.get('https://untappd.com/TooSunnyBrewery')
driver.add_cookie(cookies)

We will need a function that would take a link, open it in the browser, load a whole page and return a soup object to be parsed. Get the display attribute, assign it to the more_checkins: variable and click the button until the attribute is none. Let’s set 2-second intervals between the clicks, to wait for the page to load. As soon as we received the page, converth it into a soup object using the bs4 library.

def get_html_page(url):
    driver.get(url)
    driver.maximize_window()
    more_checkins = driver.execute_script("var more_checkins=document.getElementsByClassName('more_checkins_logged')[0].style.display;return more_checkins;")
    print(more_checkins)
    while more_checkins != "none":
        driver.execute_script("document.getElementsByClassName('more_checkins_logged')[0].click()")
        time.sleep(2)
        more_checkins = driver.execute_script("var more_checkins=document.getElementsByClassName('more_checkins_logged')[0].style.display;return more_checkins;")
        print(more_checkins)
    source_data = driver.page_source
    soup = bs(source_data, 'lxml')
    return soup

Write the following function that will take a page url, pass it in the get_html_page and receive a soup object to parse. The function returns zipped lists with beer IDs and ratings.

See how you can use Beautiful Soup to retrieve data from a website catalogue

def parse_html_page(url):
    soup = get_html_page(url)
    brewery_id = soup.find_all('a', {'class':'label',
                                     'href':re.compile('https://untappd.com/brewery/*')})[0]['href'][28:]
    items = soup.find_all('div', {'class':'item',
                                  'id':re.compile('checkin_*')})
    checkin_rating_list = []
    beer_id_list = []
    count = 0
    print('Filling the lists')
    for checkin in items:
        print(count, '/', len(items))
        try:
            checkin_rating_list.append(float(checkin.find('div', {'class':'caps'})['data-rating']))
        except Exception:
            checkin_rating_list.append('cast(Null as Nullable(Float32))')
        try:
            beer_id_list.append(int(checkin.find('a', {'class':'label'})['href'][-7:]))
        except Exception:
            beer_id_list.append('cast(Null as Nullable(UInt64))')
        count += 1 
    return zip(checkin_rating_list, beer_id_list)

Finally, write a function call for the breweries. We’ve covered how to receive a list of Russian brewery IDs in this article: Example of using dictionaries in Clickhouse with Untappd.
Let’s fetch it from the Clickhouse table.

brewery_list = client.execute('SELECT brewery_id FROM brewery_info')

If we print out the brewery_list, we will find out that the data is stored in a list of tuples.

Let’s make it a bit prettier with the lambda expression:

flatten = lambda lst: [item for sublist in lst for item in sublist]
brewery_list = flatten(brewery_list)

That’s much better:

Create a url for each brewery in the list, it includes a standard link and a brewery ID in the end. Pass it to the parse_html_page function that fetches the get_html_page and return lists with beer_id and rating_score. Since the lists are zipped, we can iterate throught them, create a tuple and send it to Clickhouse.

for brewery_id in brewery_list:
    print('Fetching the brewery with id', brewery_id, count, '/', len(brewery_list))
    url = 'https://untappd.com/brewery/' + str(brewery_id)
    returned_checkins = parse_html_page(url)
    for rating, beer_id in returned_checkins:
        tuple_to_insert = (rating, beer_id)
        try:
            client.execute(f'INSERT INTO beer_reviews VALUES {tuple_to_insert}')
        except errors.ServerException as E:
            print(E)
    count += 1

That’s it about the way we can handle “Show more” buttons. Over time we will form a large dataset for further analysis, to work with in our next series.

No comments 527 2020 Amazon Web Services Analytics engineering AWS clickhouse python

Example of using dictionaries in Clickhouse with Untappd

⏱ Время чтения текста – 12 минут

In Clickhouse we can use internal dictionaries as well as external dictionaries, they can be an alternative to JSON that doesn’t always work fine. DIctionaries store information in memory and can be invoked with the dictGet method. Let’s review how we can create one in Clickhouse and use it for our queries.

We will illustrate an example of data using the Untappd API. Untappd is a social network for everyone who loves craft beer. We are going to use сheck-ins of Russian-based craft breweries and start collecting information about them to analyze this data later on and to draw some conclusions. in today’s article, we will analyze how to receive metadata on Russian breweries with Untappd and store it in a Clickhouse dictionary.

Collecting data with Untappd

First off, we need to create a new app to receive client_id and client_secret_key to make API calls. Follow this link and fill in the fields:

Usually, it takes about 1 to 3 weeks to wait for approval.

import requests
import pandas as pd
import time

We’ll be using the requests library to make API calls, view results in a Pandas DataFrame, and save them in a CSV file before sending it to a Clickhouse dictionary. Untappd has strict limits on the number of requests, prohibiting us to make more than 100 calls per hour. Therefore, we need to make our script wait for 38 seconds using the Python time module.

client_id = 'your_client_id'
client_secret = 'your_client_secret'
all_brewery_of_russia = []

We want to get data for one thousand Russian breweries. One request to the Brewery Search method enables us to view up to 50 breweries. The website gave us 3369 breweries when searching the word “Russia” manually.

Let’s check this, scroll down to the bottom, and open the page code.

Each brewery received is stored in the beer-item class. This means we can the number of references to beer-item:

And as it turned out, we have exactly 1000 breweries, not 3369. When searching the word “Russia” manually, the results also contain some American breweries. So, we need to make 20 calls, getting 50 breweries at a time:

for offset in range(0, 1000, 50):
    try:
        print('offset = ', offset)
        print('remained:', 1000 - offset, '\n')
        response = requests.get(f'https://api.untappd.com/v4/search/brewery?client_id={client_id}&client_secret={client_secret}',
                               params={
                                   'q':'Russia',
                                   'offset':offset,
                                   'limit':50
                               })
        item = response.json()
        print(item, '\n')
        all_brewery_of_russia.append(item)
        time.sleep(37)
    except Exception:
        print(Exception)
        continue

The Brewery Search method includes several parameters, q – a string with a country name (specify specify “Russia” to get all the breweries based in Russia), offset – allows us to shift by 50 lines in the search to get the next list of breweries, limit – restricts the number of breweries received and can not be more than 50. Convert the answer to JSON and append data sotred in the item object to the all_brewery_of_russia list.

Our data may also include breweries from other countries. That’s why we need to filter the data. Iterate through the all_brewery_of_russia list and keep only those breweires, which country_name is Russia.

brew_list = []
for element in all_brewery_of_russia:
    brew = element['response']['brewery']
    for i in range(brew['count']):
        if brew['items'][i]['brewery']['country_name'] == 'Russia':
            brew_list.append(brew['items'][i])

Print out the first element in our brew_list:

print(brew_list[0])

Create a DataFrame with the following columns: brewery_id, beer_count, brewery_name, brewery_slug, brewery_page_url, brewery_city, lat и lng. And several lists to sort out the data stored in the brewery_list:

df = pd.DataFrame()
brewery_id_list = []
beer_count_list = []
brewery_name_list = []
brewery_slug_list = []
brewery_page_url_list = []
brewery_location_city = []
brewery_location_lat = []
brewery_location_lng = []
for brewery in brew_list:
    brewery_id_list.append(brewery['brewery']['brewery_id'])
    beer_count_list.append(brewery['brewery']['beer_count'])
    brewery_name_list.append(brewery['brewery']['brewery_name'])
    brewery_slug_list.append(brewery['brewery']['brewery_slug'])
    brewery_page_url_list.append(brewery['brewery']['brewery_page_url'])
 brewery_location_city.append(brewery['brewery']['location']['brewery_city'])
    brewery_location_lat.append(brewery['brewery']['location']['lat'])
    brewery_location_lng.append(brewery['brewery']['location']['lng'])

Assign them as column values:

df['brewery_id'] = brewery_id_list
df['beer_count'] = beer_count_list
df['brewery_name'] = brewery_name_list
df['brewery_slug'] = brewery_slug_list
df['brewery_page_url'] = brewery_page_url_list
df['brewery_city'] = brewery_location_city
df['brewery_lat'] = brewery_location_lat
df['brewery_lng'] = brewery_location_lng

And view our DataFrame:

df.head()

Let’s sort the values by brewery_id and store our DataFrame as a CSV file without index column and headings:

df = df.sort_values(by='brewery_id')
df.to_csv('brewery_data.csv', index=False, header=False)

Creating a Clickhouse dictionary

You can create Clickouse dictionaries in many different ways. We will try to structure it in an XML file, configure the server files, and access it through our client. The XML file structure will be the following:

Learn more about other ways you can create Clickhouse dictionaries in the documentation

<yandex>
<dictionary>
        <name>breweries</name>
        <source>
                <file>
                        <path>/home/ubuntu/brewery_data.csv</path>
                        <format>CSV</format>
                </file>
        </source>
        <layout>
                <flat />
        </layout>
        <structure>
                <id>
                        <name>brewery_id</name>
                </id>
                <attribute>
                        <name>beer_count</name>
                        <type>UInt64</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>brewery_name</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>brewery_slug</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>brewery_page_url</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>brewery_city</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>lat</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
                <attribute>
                        <name>lng</name>
                        <type>String</type>
                        <null_value>Null</null_value>
                </attribute>
        </structure>
        <lifetime>300</lifetime>
</dictionary>
</yandex>

name is a dictionary name, attribute holds the properties of the columns, id is a key field, file stores file path and format. We are going to store our file in this directory: /home/ubuntu.

Let’s upload our CSV and XML files to the server, it can be done using an FTP like FileZilla. We explained how to deploy Clickhouse on an Amazon instance in our previous article, this time need to do the same. Open your FileZilla client and go to SFTP settings to add a private key:

Connect to your server address, it can be found in the EC2 management console. Specify SFTP as a protocol, your Host, and Ubuntu as a username.

Your Public DNS may change in case of overload

After connecting we will wind up in this location /home/ubuntu. Let’s put the files in that folder and connect via SSH using Termius. Then we need to move the files to /etc/clickhouse-server to view them in Clickhouse:

Learn how you can connect to an AWS server using SSH client from our previous material “Installing Clickhouse on AWS”

sudo mv breweries_dictionary.xml /etc/clickhouse server/

Go to the config file:

cd /etc/clickhouse-server
sudo nano config.xml

We need the tag, it’s the path to a file that describes the dictionaries structure. Specify the path to our XML file:

<dictionaries_config>/etc/clickhouse-server/breweries_dictionary.xml</dictionaries_config>

Save our file and run the Clickhouse client:

clickhouse client

Let’s check that the dictionary really loaded:

SELECT * FROM system.dictionaries\G

In case of success you will get the following:

Now, let’s write a query with the dictGet function to get the name of the brewery with ID 999. Pass in the dictionary name, as the first argument, then the filed name and ID.

SELECT dictGet('breweries', 'brewery_name', toUInt64(999))

And our query returns this:

Similarly, we could use this function to get a beer name, when the table contains only IDs.

No comments 2106 2020 Amazon Web Services Analytics engineering clickhouse data analytics python

QR code recognition for sales receipts with Skimage

⏱ Время чтения текста – 7 минут

When we want to scan a QR code the image quality matters, and oftentimes the image may look blurred and defocused. To address this problem and suppress unwanted distortions we can use image pre-processing. In today’s article we will discover how to improve QR code recognition with the help of scikit-image library.

from matplotlib import pyplot as plt
import skimage
from skimage import util, exposure, io, measure, feature
from scipy import ndimage as ndi
import numpy as np
import cv2

Subject

Let’s try to scan this till receipt from our preceding article “Collecting data from hypermarket receipts on Python”. Use the imread() function to read our image and then display it.

img = plt.imread('чек.jpg')
plt.imshow(img)

It seems hardly possible to read any letter from this blurred image. Let’s try to do this again with a predefined function from the opencv library:

def qr_reader(img):
    detector = cv2.QRCodeDetector()
    data, bbox, _ = detector.detectAndDecode(img)
    if data:
        print(data)
    else:
        print('Ooops! Nothing here...')

Scan our image once again:

qr_reader(img)

Ooops! Nothing here...

That’s not surprising, the abundance of pixels makes it difficult for the scanner to recognize the QR code. Nevertheless, we can simplify the task by specifying the edges of the QR image.

Solution

First, let us remove all the unnecessary pixels, find the coordinates of the QR code and pass it to the qr_reader function. First off, remove noise in the image using the median filter and convert our RGB image to grayscale, as QR-codes are composed of only two colors.

image = ndi.median_filter(util.img_as_float(img), size=9)
image = skimage.color.rgb2gray(image)
plt.imshow(image, cmap='gray')

The median filter blurred our image, and the scattered pixels have become less clear, while the QR code looks much better now. Apply the adjust_gamma function to our image. This function exponentiates the gamma value of each pixel, less gamma means that the pixel will be closer to white color. We will set gamma to 0.5.

pores_gamma = exposure.adjust_gamma(image, gamma=0.5)
plt.imshow(pores_gamma, cmap='gray')

We can see clear improvements, the QR code is now much distinct than previously. Let’s take advantage of it and set all pixels with a value of less than 0.3 to 0, while others to 1.

thresholded = (pores_gamma <= 0.3)
plt.imshow(thresholded, cmap='gray')

Now, let’s apply the Canny filter to our thresholded image. This filter smoothes the image and calculate the gradients, the edges are where the gradient at maximum. With the increasing sigma parameter, the canny filter stops discerning less clear edges.

edge = feature.canny(thresholded, sigma=6)
plt.imshow(edge)

Outline the QR code with the coordinates of the edges. We can calculate them with the find_contours method and draw them atop the image. Coordinates are stored in the contours array.

contours = measure.find_contours(edge, 0.5)
plt.imshow(edge)
for contour in contours:
    plt.plot(contour[:,1], contour[:,0], linewidth=2)

We will take minimum and maximum coordinates for X and Y axes, thus drawing a visible rectangle.

positions = np.concatenate(contours, axis=0)
min_pos_x = int(min(positions[:,1]))
max_pos_x = int(max(positions[:,1]))
min_pos_y = int(min(positions[:,0]))
max_pos_y = int(max(positions[:,0]))

Having the coordinates, we can ensquare the code area:

start = (min_pos_x, min_pos_y)
end = (max_pos_x, max_pos_y)
cv2.rectangle(img, start, end, (255, 0, 0), 5)
io.imshow(img)

Let’s try to cut this area according to our coordinates:

new_img = img[min_pos_y:max_pos_y, min_pos_x:max_pos_x]
plt.imshow(new_img)

Pass the new image to the qr_reader function:

qr_reader(new_img)

And it returns this:

t=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1

That’s exactly what we need! Of course, the script is not universal and every image is unique, some may have too much noise or low contrast, while others may not. The sequence of actions depends on the case. Next time, we will show the subsequent stage of image processing using a well-established python library.

No comments 421 2020 data analytics python skimage

Predicting category of products by name from Russian Food Stores

⏱ Время чтения текста – 9 минут

This article is a continuation of our series about analyzing data on consumer products: «Collecting data from hypermarket receipts on Python» and «Parsing the data of site’s catalog, using Beautiful Soup and Selenium». We are going to build a model that would classify products by name in a till receipt. Till receipts contain data for each product bought, but it doesn’t provide us a summary of how much were spent on Sweets or Dairy Foods in total.

Data Wrangling

Load data from our .csv file to a Pandas DataFrame and see how it looks:

Did you know that we can emulate human behavior to parse data from a web-catalog? More details about it are in this article: «Parsing the data of site’s catalog, using Beautiful Soup and Selenium»

import pandas as pd
sku = pd.read_csv('SKU_igoods.csv',sep=';')
sku.head()

As you can see, the DataFrame contains even more than we need for predicting the category of products by name. So we can drop() columns with prices and weights, and rename() the remaining ones:

sku.drop(columns=['Unnamed: 0', 'Weight','Price'],inplace=True)
sku.rename(columns={"SKU": "SKU", "Category": "Group"},inplace=True)
sku.head()

Group the products by its category and count them up with the following methods:

sku.groupby('Group').agg(['count'])

We will train our predictive model on this data so that it could identify the product category by name. Since the DataFrame includes product names mainly in Russian, the model won’t make predictions properly. The Russian language contains a lot of prepositions, conjunctions, and specific speech patterns. We want our model to distinguish that «Мангал с ребрами жесткости» («Brazier with strengthening ribs» ) and «Мангал с 6 шампурами» («Brazier with 6 skewers») belongs to the same category. With this is my we need to clean up all the product names, removing conjunctions, preposition, interjections, particles and keep only word bases with the help of stemming.

A stemmer is a tool that operates on the principle of recognizing “stem” words embedded in other words.

import nltk
from nltk.corpus import stopwords
from pymystem3 import Mystem
from string import punctuation
nltk.download('stopwords')

In our case will be using the pymystem3 library developed by Yandex. Product names in our DataFrame may vary from those ones you could find in supermarkets today. So first, let’s improve the list of stop words that our predictive model will ignore.

mystem = Mystem() 
russian_stopwords = stopwords.words("russian")
russian_stopwords.extend(['лента','ассорт','разм','арт','что', 'это', 'так', 'вот', 'быть', 'как', 'в', '—', 'к', 'на'])

Write a function that would preprocess our data and extract the word base, remove punctuation, numerical signs, and stop words. The following code snippet belongs to one Kaggle kernel.

def preprocess_text(text):
    text = str(text)
    tokens = mystem.lemmatize(text.lower())
    tokens = [token for token in tokens if token not in russian_stopwords\
              and token != " " \
              and len(token)>=3 \
              and token.strip() not in punctuation \
              and token.isdigit()==False]
    text = " ".join(tokens)
    return text

See how it works:

An extract from Borodino (Russian: Бородино), a poem by Russian poet Mikhail Lermontov which describes the Battle of Borodino.

preprocess_text("Мой дядя самых честных правил, Когда не в шутку занемог, Он уважать себя заставил И лучше выдумать не мог.")

Transformed into:

'дядя самый честный правило шутка занемогать уважать заставлять выдумывать мочь'

Everything works as expected – the result includes only word stems in lower case with no punctuation, prepositions or conjunctions. Let’s apply this function to a product name from our DataFrame:

print(‘Before:’, sku['SKU'][0])
print(‘After:’, preprocess_text(sku['SKU'][0]))

Preprocessed text:

Before: Фисташки соленые жареные ТМ 365 дней
After: фисташка соленый жареный день

The function works fine and now we can apply it to the whole column, and create a new one with processed names:

sku['processed']=sku['SKU'].apply(preprocess_text)
sku.head()

Building our Predictive Model

We will be using CountVectorizer to predict the product category, and Naive Bayes Classifier.
CountVectorizer will tokenize our text and build a vocabulary of known words, while Naive Bayes Classifier allows us to train our model on a DataFrame with multiple classes. We will also need TfidfTransformer for computing words count (term frequency). As we want to chain these steps, let’s import the Pipeline library:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.pipeline import Pipeline

Separate our targets, Y (categories) from the predictors, X (processed product names). And split the DataFrame into Test and Training sets, allocating 33% of samples for testing.

x = sku.processed
y = sku.Group
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

Add the following methods to our pipeline:

CountVectorizer() – returns a matrix of token counts
TfidfTransformer() – transforms a matrix into a normalized tf-idf representation
MultinomialNB() – an algorithm for predicting product category

text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2))),
                     ('tfidf', TfidfTransformer()), 
                    ('clf', MultinomialNB())])

Fit our model to the Training Dataset and make predictions for the Test Dataset:

text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

Evaluate our predictive model:

print('Score:', text_clf.score(X_test, y_test))

The model predicts correctly 90% of the time.

Score: 0.923949864498645

Validate our model with the real-world data

Let’s test how good our model performs on real-world data. We’ll refer to the DataFrame from our previous article, «Collecting data from hypermarket receipts on Python», and preprocess the product names:

my_products['processed']=my_products['name'].apply(preprocess_text)
my_products.head()

Pass the processed text to the model and create a new column that would hold our predictions:

prediction = text_clf.predict(my_products['processed'])
my_products['prediction']=prediction
my_products[['name', 'prediction']]

Now, the DataFrame looks the following:

Calculate the spendings for each product category:

my_products.groupby('prediction').sum()

Overall, the model seems to be robust in predicting that sausages fall under meat products, quark is a dairy product, baguette belongs to bread and pastries. But sometimes it misclassifies kiwi as a dairy product and pear as an eco-product. This is probably because these categories include many products are «with the taste of pear» or «with the taste of kiwi», and the algorithm makes predictions based on the prevailing group of products. This is a well-known issue of unbalanced classes, but it can be addressed by resampling the DataSet or choosing proper weights for our model.

No comments 548 2020 data analytics machine learning python

Earlier Ctrl + ↓

Your password

LEFT JOIN: blog on analytics, visualisation & data science