Gathering Fresh Proxies with Python for Free
Sometimes, when we parse a website with Selenium, our IP address gets blacklisted. That's why it's better to use a proxy. Today we are going to write a script that scrapes fresh proxies, checks them, and, if they work, passes them to Selenium.
Scraping fresh proxies
Let's start by importing libraries: we will need modules for sending requests, scraping, and storing data.
import requests_html           # HTMLSession for fetching and rendering the proxy list
from bs4 import BeautifulSoup  # HTML parsing (only needed for the alternative sketch below)
import pickle                  # persisting scraped proxies between runs
import requests                # checking proxies with plain GET requests
All proxies will be stored in px_list and written to proxis.pickle. If this file already exists and is not empty, the data will be loaded from it.
px_list = set()
try:
    # Load previously scraped proxies, if the pickle file exists
    with open('proxis.pickle', 'rb') as f:
        px_list = pickle.load(f)
except Exception:
    # Start with an empty set if the file is missing or unreadable
    pass
The scrap_proxy() function will navigate to free-proxy-list.net and gather the latest 20 proxies, which are updated every minute on the site. We are interested in the proxy table at the top of the page: each row contains, among other things, an IP address and a port.
If we inspect the elements on that page, we can see that the data we need lives in table cells. We will iterate over the first 20 rows with a for loop and extract the IP address and port of each row with XPath. Finally, the function saves the fresh proxies to the pickle file and returns them.
def scrap_proxy():
    global px_list
    px_list = set()

    # Fetch the page and render its JavaScript so the table is available
    session = requests_html.HTMLSession()
    r = session.get('https://free-proxy-list.net/')
    r.html.render()

    # The first 20 rows of the table hold the freshest proxies:
    # the first cell is the IP address, the second one is the port
    for i in range(1, 21):
        add = r.html.xpath('/html/body/section[1]/div/div[2]/div/div[2]/div/table/tbody/tr[{}]/td[1]/text()'.format(i))[0]
        port = r.html.xpath('/html/body/section[1]/div/div[2]/div/div[2]/div/table/tbody/tr[{}]/td[2]/text()'.format(i))[0]
        px_list.add(':'.join([add, port]))
        print("---New proxy scraped, total: " + str(len(px_list)))

    # Persist the fresh set and return it
    with open('proxis.pickle', 'wb') as f:
        pickle.dump(px_list, f)
    return px_list
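Since BeautifulSoup is already imported, here is an optional sketch of the same extraction done with bs4 instead of XPath. It assumes the proxy table is present in the static HTML (i.e. without rendering JavaScript) and that the IP address and port are the first two cells of each row; if the site changes its markup, the selectors will need adjusting.
def scrap_proxy_bs4():
    # Alternative sketch: parse the table with BeautifulSoup.
    # Assumes the table rows are reachable in the raw HTML response.
    resp = requests.get('https://free-proxy-list.net/')
    soup = BeautifulSoup(resp.text, 'html.parser')
    proxies = set()
    for row in soup.select('table tbody tr')[:20]:
        cells = row.find_all('td')
        if len(cells) >= 2:
            proxies.add(cells[0].get_text() + ':' + cells[1].get_text())
    return proxies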
Checking new proxies
Oftentimes, the gathered proxies turn out to be dead, so we need a function that checks a proxy by sending a GET request to Google: if the request fails, it returns False; if the proxy works, it returns True.
def check_proxy(px):
    try:
        # A quick GET through the proxy; anything slower than 3 seconds counts as dead
        requests.get("https://www.google.com/", proxies={"https": "https://" + px}, timeout=3)
    except Exception as x:
        print('--' + px + ' is dead: ' + x.__class__.__name__)
        return False
    return True
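As a quick usage example, the same function can filter the whole cached set at once. This is a minimal sketch; with a 3-second timeout per proxy it can take up to a minute for 20 proxies:
alive = {px for px in px_list if check_proxy(px)}
print('Working proxies: {} of {}'.format(len(alive), len(px_list)))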
Main function
We will pass a scrap parameter to our main function; it is False by default. New proxies are gathered when scrap == True or len(px_list) < 6. Then, in a while loop, we pop a proxy from the set and check it; once check_proxy returns True, the remaining proxies are saved to the pickle file and the function returns the working IP address and port.
def get_proxy(scrap=False):
    global px_list
    # Re-scrape when explicitly asked to or when the cached set is running low
    if scrap or len(px_list) < 6:
        px_list = scrap_proxy()

    # Pop proxies one by one until a working one is found
    while True:
        if len(px_list) < 6:
            px_list = scrap_proxy()
        px = px_list.pop()
        if check_proxy(px):
            break

    print('-' + px + ' is alive. ({} left)'.format(str(len(px_list))))
    with open('proxis.pickle', 'wb') as f:
        pickle.dump(px_list, f)
    return px
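Before wiring this into Selenium, we can sanity-check the whole pipeline with plain requests. The snippet below is just an illustration and assumes httpbin.org is reachable; it echoes back the IP address the request came from:
px = get_proxy()
print('Got proxy:', px)
resp = requests.get('https://httpbin.org/ip', proxies={'https': 'https://' + px}, timeout=5)
print(resp.text)  # should show the proxy's IP, not ours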
Changing proxies in Selenium
Check out our previous articles on Selenium about handling website buttons and scraping an online store catalog.
Import the get_proxy function to configure proxies in Selenium and run a while loop. The PROXY variable receives a freshly grabbed proxy and is added to the browser options. Then we create a new webdriver instance with the updated options and try to access the website and add some cookies; if everything works, we break out of the while loop. Otherwise, the loop keeps running until a working proxy is found.
import os
from selenium import webdriver
from px_scrap import get_proxy

while True:
    PROXY = get_proxy(scrap=True)
    options = webdriver.ChromeOptions()
    options.add_argument('--proxy-server=%s' % PROXY)
    driver = webdriver.Chrome(chrome_options=options, executable_path=os.path.abspath("chromedriver"))
    try:
        driver.get('https://google.com')
        driver.add_cookie(cookies)  # 'cookies' is a dict you define for the target site
        break                       # the proxy works, keep this driver
    except:
        print('Captcha!')
        driver.quit()               # discard this browser and try the next proxy
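On subsequent runs we don't have to re-scrape every time: calling get_proxy() with the default scrap=False reuses the proxies cached in proxis.pickle and only scrapes again when fewer than 6 of them remain, so the browser can be restarted on a new proxy without hitting free-proxy-list.net on every attempt.
PROXY = get_proxy()  # scrap=False by default: reuse the pickled proxies first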