<?xml version="1.0" encoding="utf-8"?> 
<rss version="2.0">

<channel>

<title>LEFT JOIN: blog on analytics, visualisation &amp; data science, posts tagged: proxy</title>
<link>https://en.leftjoin.ru/tags/proxy/</link>
<description></description>
<generator>E2 (v3386; Aegea)</generator>

<item>
<title>Gathering fresh proxies with Python for Free</title>
<guid isPermaLink="false">36</guid>
<link>https://en.leftjoin.ru/all/gathering-fresh-proxies-with-python-for-free/</link>
<comments>https://en.leftjoin.ru/all/gathering-fresh-proxies-with-python-for-free/</comments>
<description>
&lt;p&gt;Sometimes, when we scrape a website with Selenium, our IP address gets blacklisted. That’s why it’s better to use a proxy. Today we are going to write a script that scrapes fresh proxies, checks them and, if a check succeeds, passes the working proxy to Selenium.&lt;/p&gt;
&lt;h2&gt;Scraping fresh proxies&lt;/h2&gt;
&lt;p&gt;Let’s start by importing the libraries: we will need modules for sending requests, scraping and storing data.&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;import requests_html
import pickle
import requests&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;All proxies will be stored in the &lt;span class="inline-code"&gt;px_list&lt;/span&gt; set and written to &lt;span class="inline-code"&gt;proxis.pickle&lt;/span&gt;. If this file already exists and can be read, the saved proxies are loaded from it.&lt;/p&gt;
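&lt;p&gt;As a quick illustration of that round trip, here is a minimal sketch that saves a set to a pickle file and loads it back. The file name &lt;span class="inline-code"&gt;demo.pickle&lt;/span&gt;, the temporary directory and the sample addresses are assumptions made for the example and are not part of the script.&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;import os
import pickle
import tempfile

# Hypothetical sample data: two made-up proxy addresses
sample = {'10.0.0.1:8080', '10.0.0.2:3128'}

# Work in a temporary directory so that no real file is overwritten
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.pickle')
    with open(path, 'wb') as f:
        pickle.dump(sample, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The set read back from disk equals the set that was written
print(restored == sample)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The script itself only needs the loading half at startup:&lt;/p&gt;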
&lt;pre class="e2-text-code"&gt;&lt;code&gt;px_list = set()
try:
    # Reuse the proxies saved by a previous run, if there are any
    with open('proxis.pickle', 'rb') as f:
        px_list = pickle.load(f)
except (OSError, EOFError, pickle.UnpicklingError):
    # No file yet, or the file is empty or corrupted: start with an empty set
    pass&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;span class="inline-code"&gt;scrap_proxy()&lt;/span&gt; function will navigate to free-proxy-list.net and gather the latest 20 proxies; the list on the site is updated every minute. Here’s the part of the page we are interested in:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/1-10.png" width="1190" height="513" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;We need to extract the IP address and the port. Let’s inspect the elements on this page:&lt;/p&gt;
&lt;div class="e2-text-picture"&gt;
&lt;img src="https://en.leftjoin.ru/pictures/2-10.png" width="671" height="260" alt="" /&gt;
&lt;/div&gt;
&lt;p&gt;The data we need to gather is stored in table cells. We will go through the first 20 rows with a for loop and extract the IP address and the port from each row with &lt;span class="inline-code"&gt;xpath&lt;/span&gt;. Finally, the function will write the fresh proxies to the pickle file and return them as a set.&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;def scrap_proxy():
    global px_list
    px_list = set()

    session = requests_html.HTMLSession()
    r = session.get('https://free-proxy-list.net/')
    r.html.render()  # run the page's JavaScript before querying it
    # Absolute path to a table row; {} is filled with the 1-based row number
    row = '/html/body/section[1]/div/div[2]/div/div[2]/div/table/tbody/tr[{}]'
    for i in range(1, 21):
        add = r.html.xpath(row.format(i) + '/td[1]/text()')[0]   # IP address
        port = r.html.xpath(row.format(i) + '/td[2]/text()')[0]  # port
        px_list.add(':'.join([add, port]))

    print('---New proxy scraped, left: ' + str(len(px_list)))
    with open('proxis.pickle', 'wb') as f:
        pickle.dump(px_list, f)
    return px_list&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Checking new proxies&lt;/h2&gt;
&lt;p&gt;Gathered proxies often do not work, so we need a function that checks them by sending a GET request to Google through the proxy. If the request raises an error, the function returns &lt;span class="inline-code"&gt;False&lt;/span&gt;. If the proxy works, it returns &lt;span class="inline-code"&gt;True&lt;/span&gt;.&lt;/p&gt;
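&lt;p&gt;The &lt;span class="inline-code"&gt;requests&lt;/span&gt; library takes proxies as a dictionary that maps a URL scheme to a proxy URL. Here is a minimal sketch of building that dictionary from an ip:port string. The helper name &lt;span class="inline-code"&gt;build_proxies&lt;/span&gt; is hypothetical, and the sketch assumes that the proxy itself is reached over plain HTTP.&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;def build_proxies(px):
    # Hypothetical helper: px is an 'ip:port' string.
    # Assumption: the proxy is reached over plain HTTP, so the same
    # http:// URL serves both plain and encrypted traffic.
    proxy_url = 'http://' + px
    return {'http': proxy_url, 'https': proxy_url}

# Made-up address, used only to show the shape of the result
print(build_proxies('10.0.0.1:8080'))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The check function used in this post is:&lt;/p&gt;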
&lt;pre class="e2-text-code"&gt;&lt;code&gt;def check_proxy(px):
    try:
        # Assumption: the proxy itself is reached over plain HTTP, so its URL
        # starts with http:// even though it is registered for https traffic
        requests.get('https://www.google.com/', proxies={'https': 'http://' + px}, timeout=3)
    except Exception as x:
        print('--' + px + ' is dead: ' + x.__class__.__name__)
        return False
    return True&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Main function&lt;/h2&gt;
&lt;p&gt;Our main function takes the &lt;span class="inline-code"&gt;scrap&lt;/span&gt; parameter, which is &lt;span class="inline-code"&gt;False&lt;/span&gt; by default. New proxies are gathered if either of two conditions is met: &lt;span class="inline-code"&gt;scrap == True&lt;/span&gt; or &lt;span class="inline-code"&gt;len(px_list) &amp;lt; 6&lt;/span&gt;. Then, in a &lt;span class="inline-code"&gt;while&lt;/span&gt; loop, we take one proxy out of the set and check it, gathering new proxies whenever fewer than six remain. Once &lt;span class="inline-code"&gt;check_proxy&lt;/span&gt; returns &lt;span class="inline-code"&gt;True&lt;/span&gt;, the remaining proxies are written to the pickle file and the function returns the working IP address and port.&lt;/p&gt;
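&lt;p&gt;This selection logic can be tried without any network access. The sketch below is a simplified stand-in, not the function used in this post: &lt;span class="inline-code"&gt;pick_working&lt;/span&gt;, &lt;span class="inline-code"&gt;fake_check&lt;/span&gt; and &lt;span class="inline-code"&gt;fake_refill&lt;/span&gt; are hypothetical names, the addresses are made up, and the checker and the refill step are passed in as plain functions.&lt;/p&gt;
&lt;pre class="e2-text-code"&gt;&lt;code&gt;def pick_working(pool, check, refill, min_size=6):
    # Pop candidates until one passes the check,
    # replacing the pool whenever it runs low
    while True:
        if len(pool) &amp;lt; min_size:
            pool = refill()
        px = pool.pop()
        if check(px):
            return px

# Hypothetical stand-in for check_proxy: only port 8080 counts as alive
def fake_check(px):
    return px.endswith(':8080')

# Hypothetical stand-in for scrap_proxy: seven live addresses and one dead one
def fake_refill():
    return {'10.0.0.%d:8080' % i for i in range(1, 8)} | {'10.0.0.9:3128'}

print(pick_working(set(), fake_check, fake_refill))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function from this post, which also saves the remaining proxies to the pickle file, is:&lt;/p&gt;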
&lt;pre class="e2-text-code"&gt;&lt;code&gt;def get_proxy(scrap=False):
    global px_list
    # Refresh the pool on request or when it is running low
    if scrap or len(px_list) &amp;lt; 6:
        px_list = scrap_proxy()
    while True:
        # Top up the pool so that pop() never hits an empty set
        if len(px_list) &amp;lt; 6:
            px_list = scrap_proxy()
        px = px_list.pop()  # removes and returns an arbitrary proxy
        if check_proxy(px):
            break
    print('-' + px + ' is alive. ({} left)'.format(len(px_list)))
    with open('proxis.pickle', 'wb') as f:
        pickle.dump(px_list, f)
    return px&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;Changing proxies in Selenium&lt;/h2&gt;
&lt;p class="note"&gt;Check out our previous articles on Selenium about &lt;a href="https://www.valiotti.com/leftjoin/all/handling-website-buttons-in-selenium/"&gt;handling website buttons&lt;/a&gt; and &lt;a href="https://www.valiotti.com/leftjoin/all/parse-website-with-python-p2/"&gt;scraping an online store catalog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To configure proxies in Selenium, import the &lt;span class="inline-code"&gt;get_proxy&lt;/span&gt; function and run a while loop. The &lt;span class="inline-code"&gt;PROXY&lt;/span&gt; variable receives a freshly grabbed proxy, which is added to the browser options. We then create a new webdriver instance with the updated options, try to open the website and add some cookies. If everything works, we &lt;span class="inline-code"&gt;break&lt;/span&gt; out of the while loop. Otherwise, the loop keeps running until a working proxy is found.&lt;/p&gt;
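&lt;pre class="e2-text-code"&gt;&lt;code&gt;import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from px_scrap import get_proxy

while True:
    PROXY = get_proxy(scrap=True)
    # Fresh options on every attempt, so that old proxies do not pile up
    options = webdriver.ChromeOptions()
    options.add_argument('--proxy-server=%s' % PROXY)
    # Selenium 4 style; Selenium 3 takes executable_path=... instead of service=...
    driver = webdriver.Chrome(service=Service(os.path.abspath('chromedriver')), options=options)
    try:
        driver.get('https://google.com')
        driver.add_cookie(cookies)  # cookies: a cookie dict defined elsewhere in your script
    except Exception:
        print('Captcha!')
        driver.quit()  # close the failed browser before trying the next proxy
    else:
        break  # the page loaded and the cookie was set: keep this driver&lt;/code&gt;&lt;/pre&gt;</description>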
<pubDate>Fri, 10 Jul 2020 15:11:47 +0300</pubDate>
</item>


</channel>
</rss>