3 posts tagged

data science

Parsing the data of site’s catalogue, using Beautiful Soup and Selenium (part 2)

Reading time: 6 minutes

This is a follow-up to the previous article on collecting data from a well-known online catalogue of goods.
If you look closely at how the catalogue page behaves, you will notice that the goods are loaded dynamically: as you scroll down, the page fetches a new set of goods. The code from the previous article is therefore not enough for this task.

Dynamic loading of goods on the page

Python has a solution for such cases as well: the Selenium library, which drives a real browser and emulates a user's behaviour.

In the first part of the script we will assemble the category tree as in the previous article, but this time using Selenium.

import time
from selenium import webdriver
from bs4 import BeautifulSoup as bs

browser = webdriver.Chrome()
browser.get("https://i****.ru/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true")
cookies_1= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': '_igooods_session_cross_domain', 'path': '/', 'secure': False, 'value': 'WWJFaU8wMTBMSE9uVlR2YnRLKzlvdHE3MVgyTjVlS1JKVm1qMjVNK2JSbEYxcVZNQk9OR3A4VU1LUzZwY1lCeVlTNDVsSkFmUFNSRWt3cXdUYytxQlhnYk5BbnVoZktTMUJLRWQyaWxFeXRsR1ZCVzVnSGJRU0tLVVR0MjRYR2hXbXpaZnRnYWRzV0VnbmpjdjA5T1RzZEFkallmMEVySVA3ZkV3cjU5dVVaZjBmajU5bDIxVkEwbUQvSUVyWGdqaTc5WEJyT2tvNTVsWWx1TEZhQXB1L3dKUXl5aWpOQllEV245VStIajFDdXphWFQxVGVpeGJDV3JseU9lbE1vQmxhRklLa3BsRm9XUkNTakIrWXlDc3I5ZjdZOGgwYmplMFpGRGRxKzg3QTJFSGpkNWh5RmdxZzhpTXVvTUV5SFZnM2dzNHVqWkJRaTlwdmhkclEyNVNDSHJsVkZzeVpBaGc1ZmQ0NlhlSG43YnVHRUVDL0ZmUHVIelNhRkRZSVFYLS05UkJqM24yM0d4bjFBRWFVQjlYSzJnPT0%3D--e17089851778bedd374f240c353f399027fe0fb1'}
cookies_2= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_coordinates_cross_domain', 'path': '/', 'secure': False, 'value': '%5B59.91815364%2C30.305578%5D'}
cookies_3= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_cross_domain', 'path': '/', 'secure': False, 'value': '%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3'}
browser.add_cookie(cookies_1)
browser.add_cookie(cookies_2)
browser.add_cookie(cookies_3)
browser.get("https://i****.ru/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true")
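source_data = browser.page_source
soup = bs(source_data, 'html.parser')  # name the parser explicitly
# Each top-level category is a <div class="with-children">
categories = soup.find_all('div', {'class': ['with-children']})
tree = {}
for x in categories:
    # category name -> relative URL of the category page
    tree[x.find_next('span').text] = x.find_next('a').get('href')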

As before, this snippet requests the desired page with a GET request with parameters and downloads the data; we then get a bs object and perform the same operations on it. The result is the dictionary tree, which stores the page URL for each category. We will iterate over this dictionary later in a loop.

Let's start collecting data on the goods. To do so, we import the pandas library and create a new dataframe with four columns.

import pandas as pd
df = pd.DataFrame(columns=['SKU', 'Weight', 'Price','Category'])

Next, we use the tree dictionary to fetch the page for each category; the code is below. We still need to set the cookies that a user's browser would have, and to run a few tricky commands in the browser engine that emulate scrolling down the page.

for cat, link in tree.items():
    browser.maximize_window()
    browser.get('https://i****.ru'+link)
    cookies_1= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': '_i****_session_cross_domain', 'path': '/', 'secure': False, 'value': 'WWJFaU8wMTBMSE9uVlR2YnRLKzlvdHE3MVgyTjVlS1JKVm1qMjVNK2JSbEYxcVZNQk9OR3A4VU1LUzZwY1lCeVlTNDVsSkFmUFNSRWt3cXdUYytxQlhnYk5BbnVoZktTMUJLRWQyaWxFeXRsR1ZCVzVnSGJRU0tLVVR0MjRYR2hXbXpaZnRnYWRzV0VnbmpjdjA5T1RzZEFkallmMEVySVA3ZkV3cjU5dVVaZjBmajU5bDIxVkEwbUQvSUVyWGdqaTc5WEJyT2tvNTVsWWx1TEZhQXB1L3dKUXl5aWpOQllEV245VStIajFDdXphWFQxVGVpeGJDV3JseU9lbE1vQmxhRklLa3BsRm9XUkNTakIrWXlDc3I5ZjdZOGgwYmplMFpGRGRxKzg3QTJFSGpkNWh5RmdxZzhpTXVvTUV5SFZnM2dzNHVqWkJRaTlwdmhkclEyNVNDSHJsVkZzeVpBaGc1ZmQ0NlhlSG43YnVHRUVDL0ZmUHVIelNhRkRZSVFYLS05UkJqM24yM0d4bjFBRWFVQjlYSzJnPT0%3D--e17089851778bedd374f240c353f399027fe0fb1'}
    cookies_2= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_coordinates_cross_domain', 'path': '/', 'secure': False, 'value': '%5B59.91815364%2C30.305578%5D'}
    cookies_3= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_cross_domain', 'path': '/', 'secure': False, 'value': '%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3'}
    browser.add_cookie(cookies_1)
    browser.add_cookie(cookies_2)
    browser.add_cookie(cookies_3)
    browser.get('https://i****.ru'+link)
    
    # Scroll to the bottom of the page every 3 seconds until the page height
    # stops growing, i.e. no new goods are being loaded.
    scroll_js = "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;"
    lenOfPage = browser.execute_script(scroll_js)
    match = False
    while not match:
        lastCount = lenOfPage
        time.sleep(3)
        lenOfPage = browser.execute_script(scroll_js)
        if lastCount == lenOfPage:
            match = True
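The scrolling loop can also be wrapped in a small helper with an upper bound on the number of rounds, so that a page which keeps growing cannot hang the script. This is only a sketch: the function name scroll_to_bottom and the parameters pause and max_rounds are assumptions for illustration, not part of the original script. It relies only on the execute_script method of the Selenium driver, which is already used above.

```python
import time

# Scroll to the bottom and report the resulting page height
SCROLL_JS = ("window.scrollTo(0, document.body.scrollHeight);"
             "return document.body.scrollHeight;")


def scroll_to_bottom(driver, pause=3.0, max_rounds=100):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the last observed page height.
    """
    height = driver.execute_script(SCROLL_JS)
    for _ in range(max_rounds):
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script(SCROLL_JS)
        if new_height == height:
            break  # nothing new was loaded, we are at the end
        height = new_height
    return height
```

Inside the loop over categories it would be called as scroll_to_bottom(browser) right after the second browser.get.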

We have now reached the end of the page and can hand the data over to Beautiful Soup.

    # Collecting data from the page (still inside the loop over categories)
    source_data = browser.page_source
    soup = bs(source_data, 'html.parser')
    skus = soup.find_all('div', {'class': ['b-product-small-card']})
    last_value = len(df)  # label of the next free row
    for i, x in enumerate(skus):
        df.loc[last_value + i] = [
            x.find_next('a').contents[0],                                        # name
            x.find_next('div', {'class': 'product-weight'}).contents[0],         # weight
            x.find_next('div', {'class': 'g-cart-action small'})['data-price'],  # price
            cat,
        ]
browser.close()

In the fragment above we find all <div> elements with the class b-product-small-card and then, for every product found, collect its name, weight and price.

Source code of the product card site.
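Growing a dataframe one row at a time with df.loc gets slow on large catalogues, because pandas has to enlarge the frame on every assignment. A common alternative is to accumulate plain dictionaries in a list and build the dataframe once at the end. The sketch below illustrates the idea; the product names, weights and prices are made-up sample values, not data from the site.

```python
import pandas as pd

# Made-up values standing in for what is extracted from the product cards
extracted = [
    ('Apples', '1 kg', '89.90'),
    ('Bananas', '1 kg', '64.50'),
]

rows = []
for name, weight, price in extracted:
    # One dictionary per product card
    rows.append({'SKU': name, 'Weight': weight,
                 'Price': price, 'Category': 'Fruit'})

# Build the dataframe once, with the same four columns as in the article
df = pd.DataFrame(rows, columns=['SKU', 'Weight', 'Price', 'Category'])
print(df)
```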

We start the script and go for a cup of coffee. Voilà, we now have a pandas dataframe with data on all the goods:

DataFrame with goods, collected from the site.

We now have good data for training an NLP model: the names of goods and the categories they belong to.

 No comments    176   2019   analysis   data science   python

Parsing the data of site’s catalogue, using Beautiful Soup (part 1)

Reading time: 6 minutes

We continue exploring Python and receipt data. We already have receipt records obtained from the Tax Service site, which is good. Now we need to learn how to determine the category a product belongs to. We will do this automatically, after first building a machine learning model.

The main difficulty is that we need a so-called training set. Looking ahead, I can say that I have already tested such a training set on real data from one of the food chains. Unfortunately, the results were not encouraging, so we will try to assemble another decent training set from open sources on the Internet.

For this we will take a popular site that delivers food from hypermarkets. Today we will use Beautiful Soup, a simple, handy and functional Python library for parsing data from .html pages.

The target page with the data we are interested in looks as follows:

Page of the site’s catalogue on food delivery from hypermarkets

However, if we visit the site for the first time, we will be redirected to the main page, since we have not yet selected the nearest hypermarket.

The main page of the site on food delivery from hypermarkets

The problem is clear; most likely it can be solved by setting cookies like those a user's browser stores locally.

Now it is important to define the course of action and the structure of data collection.

Main tasks:

  1. Collect the names of all top-level categories and the corresponding site URLs.
  2. Using the list obtained in step 1, collect data on the goods from every category page.

Today we will deal with the first task. Let's begin. In theory, the following commands should be enough:

from bs4 import BeautifulSoup  # import the BeautifulSoup class
import requests
r = requests.get('https://i****.ru/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true')
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly

However, if we look at the contents of soup, we find that we have received the source code of the main page rather than the page we wanted. The main page is of no use for our analysis and data collection.

We’ve received the source code of the site’s main page

So we will use the Session class of the requests library, whose requests can carry cookies as a parameter. Our code then looks as follows:

import requests
s = requests.Session()
r = s.get('https://i****.ru/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true', \
       cookies = {'_igooods_session_cross_domain':'WWJFaU8wMTBMSE9uVlR2YnRLKzlvdHE3MVgyTjVlS1JKVm1qMjVNK2JSbEYxcVZNQk9OR3A4VU1LUzZwY1lCeVlTNDVsSkFmUFNSRWt3cXdUYytxQlhnYk5BbnVoZktTMUJLRWQyaWxFeXRsR1ZCVzVnSGJRU0tLVVR0MjRYR2hXbXpaZnRnYWRzV0VnbmpjdjA5T1RzZEFkallmMEVySVA3ZkV3cjU5dVVaZjBmajU5bDIxVkEwbUQvSUVyWGdqaTc5WEJyT2tvNTVsWWx1TEZhQXB1L3dKUXl5aWpOQllEV245VStIajFDdXphWFQxVGVpeGJDV3JseU9lbE1vQmxhRklLa3BsRm9XUkNTakIrWXlDc3I5ZjdZOGgwYmplMFpGRGRxKzg3QTJFSGpkNWh5RmdxZzhpTXVvTUV5SFZnM2dzNHVqWkJRaTlwdmhkclEyNVNDSHJsVkZzeVpBaGc1ZmQ0NlhlSG43YnVHRUVDL0ZmUHVIelNhRkRZSVFYLS05UkJqM24yM0d4bjFBRWFVQjlYSzJnPT0%3D--e17089851778bedd374f240c353f399027fe0fb1', \
               'sa_current_city_coordinates_cross_domain' : '%5B59.91815364%2C30.305578%5D', \
                  'sa_current_city_cross_domain' : '%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3',\
                 'lazy_loader_last_url' : '/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true'})
soup = BeautifulSoup(r.text, 'html.parser')

We set 4 cookies that emulate a user's behaviour and the selected hypermarket (we picked them empirically from the cookies the browser sets when the hypermarket is selected):

Cookies, affecting the display of the site’s main page
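Cookies passed through the cookies= argument of a single get call apply to that request only. If several pages are going to be fetched, the cookies can instead be stored on the session itself, so that every later request sends them. A minimal sketch of that variant, using two of the cookie values from the code above (the session cookie is left out for brevity, and no request is actually made here):

```python
import requests

session = requests.Session()

# Store the cookies on the session; they are sent with every later request
session.cookies.update({
    'sa_current_city_coordinates_cross_domain': '%5B59.91815364%2C30.305578%5D',
    'lazy_loader_last_url': '/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true',
})

# The jar can be inspected like a dictionary
print(session.cookies.get('sa_current_city_coordinates_cross_domain'))
```

After this, session.get(url) can be called without repeating the cookies= argument.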

Great, all that is left is to collect the categories we need and the links to them. To do so, we write the following code:

categories=soup.find_all('div', {'class':['with-children']})
tree = {}
for x in categories:
    tree[x.findNext('span').text]=x.findNext('a').get('href')

As before, in this snippet we request the desired page with a GET request with parameters (cookies) and download the data, after which we have an object of the BeautifulSoup class.
We use the command:

categories=soup.find_all('div', {'class':['with-children']})

and find all the <div> elements with the class with-children; here is the same thing in the site's code:

Elements, containing the name of the category.

Then we create an empty dict object and, for each of the elements found above, fill it in a loop:

tree[x.findNext('span').text]=x.findNext('a').get('href')

In effect this means: we take the text of the <span> element that follows the found <div>, and the address of the link (<a>) that follows the same <div>.
That is exactly what we needed. We now have a dictionary of the form {Category: URL of the site}:

Directory of categories and URL relevant thereto
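To see how the lookup behaves, here is a self-contained sketch on a tiny HTML fragment. The markup is made up and only mimics the structure described above (a <div class="with-children"> followed by a link and a span); the real page is more complex. The sketch uses find_next, the current name of the findNext method used in the article.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the structure described in the article
html = """
<div class="with-children"><a href="/products?category_id=1-fruit"><span>Fruit</span></a></div>
<div class="with-children"><a href="/products?category_id=2-dairy"><span>Dairy</span></a></div>
"""

soup = BeautifulSoup(html, 'html.parser')
tree = {}
for div in soup.find_all('div', class_='with-children'):
    # find_next returns the first matching element after the <div> in document order
    tree[div.find_next('span').text] = div.find_next('a').get('href')

print(tree)
```

Note that find_next looks through everything that follows the <div> in the document, not only inside it, so on a <div> without its own <span> it would pick up the one from the next category.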

The next article continues with collecting information from the product cards of the catalogue.

 No comments    178   2019   data science   python

Collecting data from hypermarket receipts on Python

Reading time: 6 minutes

Recently, while once again shopping in a hypermarket, I recalled that, under the Russian Federal Law FZ-54, any retailer that issues a receipt is obliged to send its data to the Tax Service.

Receipt from “Lenta” hypermarket. The QR-code of our interest is circled.

So what does this mean for us data analysts? It means that we can get to know ourselves and our needs better, and also obtain interesting data on our own purchases.

Over this series of blog posts, let's try to assemble a small prototype of an app that tracks the dynamics of our purchases. We start from the fact that each receipt has a QR code; if you decode it, you get the following line:

t=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1

This line comprises:

t – timestamp, the time when the purchase was made
s – total sum of the receipt
fn – fss code number, needed later in the API request
i – receipt number, needed later in the API request
fp – fiscalSign parameter, needed later in the API request

As the first step, we will parse the receipt data and collect it in a pandas dataframe using Python modules.

We will use the API that provides receipt data from the Tax Service website.

First, we obtain authentication data:

import requests
your_phone = '+7XXXYYYZZZZ'  # your phone number; the SMS with the password is sent to it
r = requests.post('https://proverkacheka.nalog.ru:9999/v1/mobile/users/signup', json = {"email":"email@email.com","name":"USERNAME","phone":your_phone})

As a result of the POST request, a password arrives by SMS at the given phone number. From here on we keep it in the variable pwd.

Now let's parse the line with the values from the QR code:

import re
qr_string='t=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1'
# Each pattern captures the word characters that follow the parameter name
t=re.findall(r't=(\w+)', qr_string)[0]
s=re.findall(r's=(\w+)', qr_string)[0]  # \w+ stops at the decimal point, so s is '5803'
fn=re.findall(r'fn=(\w+)', qr_string)[0]
i=re.findall(r'i=(\w+)', qr_string)[0]
fp=re.findall(r'fp=(\w+)', qr_string)[0]
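Since the QR line has the form of a URL query string, it can also be parsed with the standard library instead of regular expressions. The sketch below uses urllib.parse.parse_qs; unlike the \w+ patterns, it keeps the complete value of every field, so s comes out as '5803.00'. Which form of the sum the API expects is not stated here, so treat this as an alternative way of reading the line rather than a drop-in replacement.

```python
from urllib.parse import parse_qs

qr_string = 't=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1'

# parse_qs returns a list of values per key; every key occurs once here
fields = {key: values[0] for key, values in parse_qs(qr_string).items()}

t, s, fn, i, fp = (fields[k] for k in ('t', 's', 'fn', 'i', 'fp'))
print(t, s, fn, i, fp)
```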

We will use these variables to extract the data.
A post on Habr examines the error statuses returned when forming the API request quite thoroughly, so I won't repeat that information here.

First we need to check that data on this receipt exists, so we form a GET request.

headers = {'Device-Id':'', 'Device-OS':''}
payload = {'fiscalSign': fp, 'date': t,'sum':s}
check_request=requests.get('https://proverkacheka.nalog.ru:9999/v1/ofds/*/inns/*/fss/'+fn+'/operations/1/tickets/'+i,params=payload, headers=headers,auth=(your_phone, pwd))
print(check_request.status_code)

The request must include headers, even if they are empty. In my case the GET request returns status 406, from which I conclude that the receipt has been found (why it returns 406 remains a mystery to me, so I would be glad to get some hints in the comments). If the sum or the date is omitted, the GET request returns error 400, bad request.

Let's move on to the most interesting part, receiving the receipt data:

request_info=requests.get('https://proverkacheka.nalog.ru:9999/v1/inns/*/kkts/*/fss/'+fn+'/tickets/'+i+'?fiscalSign='+fp+'&sendToEmail=no',headers=headers,auth=(your_phone, pwd))
print(request_info.status_code)
products=request_info.json()

We should receive code 200 (the request succeeded), and the variable products will hold everything that relates to our receipt.
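The request URL above is glued together by hand. As a small sketch, the same address can be assembled by a helper that leaves the query string to urllib.parse.urlencode. The names BASE_URL and ticket_url are made up for this illustration; the path and the parameter names are copied from the request above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://proverkacheka.nalog.ru:9999/v1'  # taken from the request above


def ticket_url(fn, i, fp, send_to_email='no'):
    """Build the URL of the receipt request from the QR-code fields."""
    query = urlencode({'fiscalSign': fp, 'sendToEmail': send_to_email})
    return f'{BASE_URL}/inns/*/kkts/*/fss/{fn}/tickets/{i}?{query}'


print(ticket_url('9251440300007971', '141637', '4087570038'))
```

The result can be passed to requests.get together with the same headers and auth arguments as before.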

To work with this data further, let's use pandas and convert everything into a dataframe.

import pandas as pd
from datetime import datetime
my_products=pd.DataFrame(products['document']['receipt']['items'])
my_products['price']=my_products['price']/100
my_products['sum']=my_products['sum']/100
# Format codes: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
datetime_check = datetime.strptime(t, '%Y%m%dT%H%M')
my_products['date']=datetime_check
my_products.set_index('date',inplace=True)

We now have a working pandas dataframe with the receipt; it looks as follows:

“Header” of receipt data

You can plot a histogram of the purchases or view everything as a box plot:

import matplotlib.pyplot as plt
# Jupyter magic; remove this line when running as a plain script
%matplotlib inline
my_products['sum'].plot(kind='hist', bins=20)
plt.show()
my_products['sum'].plot(kind='box')
plt.show()

Box plot of the receipt sums

Finally, we get descriptive statistics as text with the .describe() method:

my_products.describe()

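It is convenient to save the data to a .csv file so that the statistics can be extended next time:

import os
# Append to the file; write the header row only when the file is first created
write_header = not os.path.exists('hyper_receipts.csv')
my_products.to_csv('hyper_receipts.csv', mode='a', header=write_header)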
 No comments    703   2019   analysis   data science   machine learning   pandas   python   web-crawling