Parsing the data of site’s catalogue, using Beautiful Soup and Selenium (part 2)

Время чтения текста – 6 минут

Follow-up of the previous article on data collection from the famous online catalogue of goods.
If analyzing the behaviour of page with goods thoroughly, one can notice, that the goods are uploaded dynamically, i.e. scrolling the page down, you will receive a new set of goods, and, thus, the code from the previous article will turn out to be useless for this task.

Dynamic uploading of goods on the page

For these cases, there is also a solution in python – Selenium library, it launches the browser’s engine and emulates human’s behaviour.

In the first part of the script we will assemble a tree of categories similarly to the previous article, but already using Selenium.

import time
from selenium import webdriver
from bs4 import BeautifulSoup as bs

browser = webdriver.Chrome()
browser.get("https://i****.ru/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true")
cookies_1= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': '_igooods_session_cross_domain', 'path': '/', 'secure': False, 'value': 'WWJFaU8wMTBMSE9uVlR2YnRLKzlvdHE3MVgyTjVlS1JKVm1qMjVNK2JSbEYxcVZNQk9OR3A4VU1LUzZwY1lCeVlTNDVsSkFmUFNSRWt3cXdUYytxQlhnYk5BbnVoZktTMUJLRWQyaWxFeXRsR1ZCVzVnSGJRU0tLVVR0MjRYR2hXbXpaZnRnYWRzV0VnbmpjdjA5T1RzZEFkallmMEVySVA3ZkV3cjU5dVVaZjBmajU5bDIxVkEwbUQvSUVyWGdqaTc5WEJyT2tvNTVsWWx1TEZhQXB1L3dKUXl5aWpOQllEV245VStIajFDdXphWFQxVGVpeGJDV3JseU9lbE1vQmxhRklLa3BsRm9XUkNTakIrWXlDc3I5ZjdZOGgwYmplMFpGRGRxKzg3QTJFSGpkNWh5RmdxZzhpTXVvTUV5SFZnM2dzNHVqWkJRaTlwdmhkclEyNVNDSHJsVkZzeVpBaGc1ZmQ0NlhlSG43YnVHRUVDL0ZmUHVIelNhRkRZSVFYLS05UkJqM24yM0d4bjFBRWFVQjlYSzJnPT0%3D--e17089851778bedd374f240c353f399027fe0fb1'}
cookies_2= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_coordinates_cross_domain', 'path': '/', 'secure': False, 'value': '%5B59.91815364%2C30.305578%5D'}
cookies_3= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_cross_domain', 'path': '/', 'secure': False, 'value': '%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3'}
browser.add_cookie(cookies_1)
browser.add_cookie(cookies_2)
browser.add_cookie(cookies_3)
browser.get("https://i****.ru/products?category_id=1-ovoschi-frukty-griby-yagody&from_category=true")
source_data = browser.page_source
soup = bs(source_data)
categories=soup.find_all('div', {'class':['with-children']})
tree = {}
for x in categories:
    tree[x.findNext('span').text]=x.findNext('a').get('href')

In this snippet, the same way as before, by a get-request with parameters, we call the desired browser page and download the data, then we get an object of bs class, with which we make the similar operations. Thus, we receive a dictionary tree, where URL pages are stored for each category. Subsequently, we will need this dictionary for item-by-item examination in the cycle.

Let’s initiate the data collection for goods. In order to do it, we import the library pandas and create a new dataframe with four columns.

import pandas as pd
df = pd.DataFrame(columns=['SKU', 'Weight', 'Price','Category'])

Thereafter, we’ll use our dictionary tree and obtain page’s data for each category. You can see the code below. We still want to install cookie, that a user has installed, and also to perform some tricky commands for operation of browser’s engine, that can emulate cursor’s movement down the page.

for cat, link in tree.items():
    browser.maximize_window()
    browser.get('https://i****.ru'+link)
    cookies_1= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': '_i****_session_cross_domain', 'path': '/', 'secure': False, 'value': 'WWJFaU8wMTBMSE9uVlR2YnRLKzlvdHE3MVgyTjVlS1JKVm1qMjVNK2JSbEYxcVZNQk9OR3A4VU1LUzZwY1lCeVlTNDVsSkFmUFNSRWt3cXdUYytxQlhnYk5BbnVoZktTMUJLRWQyaWxFeXRsR1ZCVzVnSGJRU0tLVVR0MjRYR2hXbXpaZnRnYWRzV0VnbmpjdjA5T1RzZEFkallmMEVySVA3ZkV3cjU5dVVaZjBmajU5bDIxVkEwbUQvSUVyWGdqaTc5WEJyT2tvNTVsWWx1TEZhQXB1L3dKUXl5aWpOQllEV245VStIajFDdXphWFQxVGVpeGJDV3JseU9lbE1vQmxhRklLa3BsRm9XUkNTakIrWXlDc3I5ZjdZOGgwYmplMFpGRGRxKzg3QTJFSGpkNWh5RmdxZzhpTXVvTUV5SFZnM2dzNHVqWkJRaTlwdmhkclEyNVNDSHJsVkZzeVpBaGc1ZmQ0NlhlSG43YnVHRUVDL0ZmUHVIelNhRkRZSVFYLS05UkJqM24yM0d4bjFBRWFVQjlYSzJnPT0%3D--e17089851778bedd374f240c353f399027fe0fb1'}
    cookies_2= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_coordinates_cross_domain', 'path': '/', 'secure': False, 'value': '%5B59.91815364%2C30.305578%5D'}
    cookies_3= {'domain': '.i****.ru', 'expiry': 1962580137, 'httpOnly': False, 'name': 'sa_current_city_cross_domain', 'path': '/', 'secure': False, 'value': '%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3'}
    browser.add_cookie(cookies_1)
    browser.add_cookie(cookies_2)
    browser.add_cookie(cookies_3)
    browser.get('https://i****.ru'+link)
    
    # Script, that searches the end of the page every 3 seconds and is performed until the receipt of new data is finished.
    lenOfPage = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    match=False
    while(match==False):
        lastCount = lenOfPage
        time.sleep(3)
        lenOfPage = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
        if lastCount==lenOfPage:
             match=True

Now, we have made it to the end of the page and can collect the data for work of the library beautifulsoup.

# Collecting data from the page
    source_data = browser.page_source
    soup = bs(source_data)
    skus=soup.find_all('div', {'class':['b-product-small-card']})
    last_value=len(df)+1 if len(df)>0 else 0
    for i,x in enumerate(skus):
        df.loc[last_value+i]=[x.findNext('a').contents[0],\
                   x.findNext('div',{'class':'product-weight'}).contents[0],\
                   x.findNext('div',{'class':'g-cart-action small'})['data-price'],\
                   cat]
browser.close()

In the code fragment, presented above, we look for all the elements <div>, that have class – b-product-small-card, and then, for every found good, we collect the values of the fields of weight and price.

Source code of the product card site.

We launch the script’s performance and go to have a cup of coffee. Voila, now we have pandas dataframe with data of all the goods:

DataFrame with goods, collected from the site.

Now we possess great data for training of the NLP model – names of goods and their affiliation to various categories.

Share
Send
Pin
 173   2019   analysis   data science   python
Popular