Parsing a site's catalogue data with Beautiful Soup (part 1)

Reading time: 6 minutes

We continue exploring Python and receipt data. We already have receipt records obtained from the Tax Service site, which is a good start. Now we need to learn how to determine the category each product belongs to. We will do this automatically, by first building a machine learning model.

The key pain point is that we need a training set. Looking ahead, I'll say that I have already tested such a training set on real data from one of the food retail chains. Unfortunately, the results were not encouraging, so we'll try to assemble a better training set from open sources on the Internet.

For this task we'll take a popular site for food delivery from hypermarkets. Today we'll use Beautiful Soup, a simple, handy and functional Python library for parsing data out of .html pages.

The target page, with the data we are interested in, looks like this:

The catalogue page of the food delivery site

However, if we visit the site for the first time, we will be redirected to the main page, since we haven’t selected the closest hypermarket.

The main page of the food delivery site

The problem is clear: most likely it can be solved by setting the same cookies that a real user has saved locally.

Now it’s important to define the course of action and the structure of data collection.

Main task:

  1. Collect the names of all top-level categories and the site URLs that correspond to them.
  2. Using the list obtained in step 1, collect the product data from every category's page (a rough sketch of this two-step plan follows below).
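
Before diving in, here is a rough outline of the whole plan; the function bodies are placeholders (collect_categories is what we build in this article, collect_goods comes in part 2):

def collect_categories():
    """Step 1 (this article): return a {category name: URL} dictionary."""
    raise NotImplementedError  # implemented below with Beautiful Soup

def collect_goods(url):
    """Step 2 (next article): return the product data for one category page."""
    raise NotImplementedError

def scrape_catalogue():
    goods = {}
    for name, url in collect_categories().items():
        goods[name] = collect_goods(url)  # walk every category from step 1
    return goods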

Today we will deal with the first part. Let's begin; in theory, the following commands should be enough:

from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 package
import requests

r = requests.get('https://i****.ru/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true')
soup = BeautifulSoup(r.text, 'html.parser')  # an explicit parser avoids a bs4 warning

However, if we inspect the contents of soup, we will discover that we've received the source code of the main page rather than the page we are after. The main page is useless for our analysis and data collection.

We’ve received the source code of the site’s main page
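
One quick way to confirm what we actually received is to look at the final URL and the page title; the exact title text depends on the site, so treat this as an illustrative check:

print(r.url)      # the final URL after any redirects requests followed
print(r.history)  # the chain of redirect responses, if there were any
print(soup.title.text if soup.title else '<no title>')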

So we'll use the Session object of the requests library, which lets us pass cookies along with the request. Our code now looks like this:

from bs4 import BeautifulSoup
import requests

s = requests.Session()
r = s.get('https://i****.ru/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true',
          cookies={'_igooods_session_cross_domain': 'WWJFaU8wMTBMSE9uVlR2YnRLKzlvdHE3MVgyTjVlS1JKVm1qMjVNK2JSbEYxcVZNQk9OR3A4VU1LUzZwY1lCeVlTNDVsSkFmUFNSRWt3cXdUYytxQlhnYk5BbnVoZktTMUJLRWQyaWxFeXRsR1ZCVzVnSGJRU0tLVVR0MjRYR2hXbXpaZnRnYWRzV0VnbmpjdjA5T1RzZEFkallmMEVySVA3ZkV3cjU5dVVaZjBmajU5bDIxVkEwbUQvSUVyWGdqaTc5WEJyT2tvNTVsWWx1TEZhQXB1L3dKUXl5aWpOQllEV245VStIajFDdXphWFQxVGVpeGJDV3JseU9lbE1vQmxhRklLa3BsRm9XUkNTakIrWXlDc3I5ZjdZOGgwYmplMFpGRGRxKzg3QTJFSGpkNWh5RmdxZzhpTXVvTUV5SFZnM2dzNHVqWkJRaTlwdmhkclEyNVNDSHJsVkZzeVpBaGc1ZmQ0NlhlSG43YnVHRUVDL0ZmUHVIelNhRkRZSVFYLS05UkJqM24yM0d4bjFBRWFVQjlYSzJnPT0%3D--e17089851778bedd374f240c353f399027fe0fb1',
                   'sa_current_city_coordinates_cross_domain': '%5B59.91815364%2C30.305578%5D',
                   'sa_current_city_cross_domain': '%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3',
                   'lazy_loader_last_url': '/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true'})
soup = BeautifulSoup(r.text, 'html.parser')

We set four cookies that emulate a real user and the selected hypermarket (we identified them empirically from the cookies the browser sets when the corresponding hypermarket is selected):

Cookies affecting the display of the site's main page
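
Since the session cookie will expire sooner or later, it may be more convenient to keep the captured values in a separate file instead of hardcoding them. A minimal sketch, assuming a cookies.json file holding the same name-value pairs:

import json
import requests

def make_session(cookie_path='cookies.json'):
    """Build a requests.Session preloaded with cookies from a JSON file.

    The file is assumed to hold a plain {"cookie name": "value"} object
    with the four values captured from the browser.
    """
    s = requests.Session()
    with open(cookie_path, encoding='utf-8') as f:
        s.cookies.update(json.load(f))  # a name -> value dict is accepted here
    return s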

Great. All that's left is to collect the categories we need and the links to them. To do that, we write the following code:

categories = soup.find_all('div', {'class': ['with-children']})  # top-level category containers
tree = {}
for x in categories:
    # category name -> relative URL of the category page
    tree[x.find_next('span').text] = x.find_next('a').get('href')

As before, in this snippet we request the desired page with a GET request carrying our parameters (the cookies) and download the data, after which we have a BeautifulSoup object.
We use the command:

categories = soup.find_all('div', {'class': ['with-children']})

It finds all the <div> elements that have the class with-children; here is the same thing in the site's source code:

Elements containing the category names
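
Incidentally, the same query can also be expressed as a CSS selector, which Beautiful Soup supports via select():

# Equivalent query with a CSS selector; select() returns a list of matching tags
categories = soup.select('div.with-children')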

Then we create an empty dict and, for each element found above, fill it in a loop with:

tree[x.find_next('span').text] = x.find_next('a').get('href')

In effect this means: we take the text of the <span> element that follows each <div> we found (the category name) and the href attribute of the <a> element that follows it (the link).
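
To make the find_next calls concrete, here is a tiny standalone example; the HTML is invented to mirror the structure visible in the screenshot above:

from bs4 import BeautifulSoup

html = '''
<div class="with-children">
  <span>Овощи, фрукты</span>
  <a href="/products?category_id=1-ovoschi-frukty"></a>
</div>
'''

doc = BeautifulSoup(html, 'html.parser')
div = doc.find('div', {'class': ['with-children']})
print(div.find_next('span').text)      # Овощи, фрукты
print(div.find_next('a').get('href'))  # /products?category_id=1-ovoschi-frukty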
That's exactly what we needed. As a result, we have a dictionary of the form {category: URL}:

Dictionary of categories and their corresponding URLs
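
For a quick sanity check, and so that part 2 can pick up where we left off, we can print a few entries and save the dictionary to disk (the file name here is just an illustration):

import json

for name, url in list(tree.items())[:5]:
    print(name, '->', url)  # a handful of category/URL pairs

with open('categories.json', 'w', encoding='utf-8') as f:
    json.dump(tree, f, ensure_ascii=False, indent=2)  # keep Cyrillic readable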

In the next article, we'll continue with collecting information from the product cards of the catalogue.
