Gathering fresh proxies with Python for Free

Reading time – 4 minutes

Sometimes, when we try to parse a website with Selenium, our IP address gets blacklisted. That’s why it’s better to use a proxy. Today we are going to write a script that scrapes fresh proxies, checks them, and, if they work, passes them to Selenium.

Scraping fresh proxies

Let’s start by importing the libraries: we will need modules for sending requests, scraping, and storing data.

import requests_html
from bs4 import BeautifulSoup
import pickle
import requests

All proxies will be stored in the px_list set and written to proxis.pickle. If the file is not empty, the data will be loaded from it.

px_list = set()
try:
    # Load previously saved proxies, if any
    with open('proxis.pickle', 'rb') as f:
        px_list = pickle.load(f)
except (FileNotFoundError, EOFError, pickle.UnpicklingError):
    pass

The scrap_proxy() function will navigate to free-proxy-list.net and gather the latest 20 proxies; the list on the site is updated every minute.

We will need to extract an IP address and a port, so let's inspect the elements on the page.

The data we need is laid out as table cells. We will take the first 20 rows with a for loop and pull the IP address and port out of each with XPath. Finally, the function will write the fresh proxies to the pickle file and return them as a set.

def scrap_proxy():
    global px_list
    px_list = set()

    # Render the page so the proxy table is available
    session = requests_html.HTMLSession()
    r = session.get('https://free-proxy-list.net/')
    r.html.render()
    # The first 20 table rows hold the freshest proxies
    for i in range(1, 21):
        add = r.html.xpath('/html/body/section[1]/div/div[2]/div/div[2]/div/table/tbody/tr[{}]/td[1]/text()'.format(i))[0]
        port = r.html.xpath('/html/body/section[1]/div/div[2]/div/div[2]/div/table/tbody/tr[{}]/td[2]/text()'.format(i))[0]
        px_list.add(':'.join([add, port]))

    print("---New proxies scraped, {} in total".format(len(px_list)))
    with open('proxis.pickle', 'wb') as f:
        pickle.dump(px_list, f)
    return px_list
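
Note that BeautifulSoup was imported above but scrap_proxy() relies entirely on requests_html and its render() call. If the proxy table turns out to be present in the static HTML, a lighter-weight variant can skip rendering altogether. Here is a minimal sketch under that assumption; scrap_proxy_bs is a hypothetical name, and "the first table on the page is the proxy list" is an assumption too.

def scrap_proxy_bs():
    # Assumption: the proxy table is in the static HTML, so no rendering is needed
    r = requests.get('https://free-proxy-list.net/')
    soup = BeautifulSoup(r.text, 'html.parser')
    proxies = set()
    table = soup.find('table')  # assumption: the first table is the proxy list
    for row in table.find_all('tr')[1:21]:  # skip the header row, take 20 rows
        cells = row.find_all('td')
        if len(cells) >= 2:
            proxies.add(cells[0].get_text() + ':' + cells[1].get_text())
    return proxies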

Checking new proxies

Oftentimes the gathered proxies might not be working, so we need a function that checks them by sending a GET request to Google: if the request fails, the function returns False; if the proxy works, it returns True.

def check_proxy(px):
    # Try a quick request through the proxy; any exception means it's dead
    try:
        requests.get("https://www.google.com/", proxies={"https": "https://" + px}, timeout=3)
    except Exception as x:
        print('--' + px + ' is dead: ' + x.__class__.__name__)
        return False
    return True
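
As a quick usage sketch, assuming px_list has already been filled by scrap_proxy(), the whole set can be filtered down to working proxies in one pass:

# Keep only the proxies that respond (names here are just for illustration)
alive = {px for px in px_list if check_proxy(px)}
print('{} of {} proxies are alive'.format(len(alive), len(px_list)))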

Main function

We will pass the scrap parameter to our main function, which is False by default. New proxies will be gathered if scrap == True or len(px_list) < 6. Then, in a while loop, we pop an arbitrary proxy from the set and check it; if check_proxy returns True, the remaining proxies are written to the pickle file and the function returns the working IP address and port.

def get_proxy(scrap=False):
    global px_list
    # Refill the pool if a rescrape was requested or we are running low
    if scrap or len(px_list) < 6:
        px_list = scrap_proxy()
    while True:
        if len(px_list) < 6:
            px_list = scrap_proxy()
        px = px_list.pop()
        if check_proxy(px):
            break
    print('-' + px + ' is alive. ({} left)'.format(len(px_list)))
    with open('proxis.pickle', 'wb') as f:
        pickle.dump(px_list, f)
    return px
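
Outside of Selenium, the returned value plugs straight into requests the same way check_proxy uses it. A minimal usage sketch (the target URL is just an example):

px = get_proxy()
r = requests.get('https://www.google.com/', proxies={'https': 'https://' + px}, timeout=5)
print(px, '->', r.status_code)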

Changing proxies in Selenium

Check out our previous articles on Selenium about handling website buttons and scraping an online store catalog.

Import the get_proxy function to configure proxies in Selenium and run a while loop. The PROXY variable will receive our freshly grabbed proxy and be added to the browser options. Now we can create a new webdriver instance with the updated options and try to access the website and add some cookies; if everything works fine, the while loop will break. Otherwise, the loop runs until a working proxy is found.

import os
from selenium import webdriver
from px_scrap import get_proxy

# Placeholder cookie; replace with the cookies your target site needs
cookies = {'name': 'example', 'value': 'example'}

while True:
    PROXY = get_proxy(scrap=True)
    # Fresh options each iteration so proxy arguments don't pile up
    options = webdriver.ChromeOptions()
    options.add_argument('--proxy-server=%s' % PROXY)
    driver = webdriver.Chrome(chrome_options=options, executable_path=os.path.abspath("chromedriver"))
    try:
        driver.get('https://google.com')
        driver.add_cookie(cookies)
        break  # the proxy works, stop searching
    except:
        print('Captcha!')
        driver.quit()
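
To confirm that traffic really goes through the proxy, a quick sanity check is to load an IP echo service and compare the reported address with PROXY. A sketch, assuming the Selenium 3 API used above (api.ipify.org is just one of many such services):

# The echoed IP should match the proxy address, not your own
driver.get('https://api.ipify.org')
print('Current IP:', driver.find_element_by_tag_name('body').text)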