Python Web Scraping Search Results



Published: August 4, 2020

Author: Harrison DeSantis


Obtaining visibility in search snippets is an excellent way to boost CTR and increase organic site traffic. As a quick refresher, below is a visual example of a SERP snippet for the query “paid search vs seo”:
Snippets take up serious SERP real estate and often drive more traffic than a #1 ranking. There are several great hacks on how to optimize for Google snippets, but not many hacks on how to find snippet opportunities. The process is very manual. It usually requires the following:


  1. Search for keywords individually on Google
  2. Record the SERP snippets per keyword
  3. Identify which existing content/pages can break into those snippets
  4. Edit the content to do so

Step 3 is the ultimate takeaway, but about 90% of the work really goes into Steps 1 & 2. However, what if you could automate finding the snippets and preserve your brainpower for winning the snippets?
That’s where Python comes in. As a novice Python coder (if I can even call myself that), I’ve quickly realized that Python can significantly reduce the time SERP research projects take. Snippet research is one of those instances. Rather than individually searching and recording Google snippets for hundreds of keywords, Python can do the grunt work so you can spend your time winning that valuable organic real estate.
Below, I’ll walk through the code you can use to execute a snippet SERP scrape, with step-by-step instructions on what each command actually does and how it works:

One caveat before we start: while search engine scraping is legal, Google can flag and deny any IP it suspects of bot-like behavior. Therefore, changing proxies is a prerequisite to scraping successfully. If you constantly use the same VPN and IP address for this practice, Google can store your information in its database of repeat offenders. While these bans are usually temporary, they still increase your likelihood of being denied again. This can be especially problematic if your work address ends up on a denylist and none of your coworkers can access Google as a result…
Translation: switch VPNs and don’t try this at work.


Part 1. Get Python to Read the Document:

First, we must have all the keywords we want to search listed in a text file. We choose text files because they’re minimal and easy for Python to deal with. Save the file somewhere easy to access, as Python will need to reach it through its file path on your computer.
*WARNING: IF YOU ARE SAVING AS A RICH TEXT DOCUMENT (RTF) ON A MAC, MAKE SURE TO EDIT THE FILE SO IT IS A .TXT. WE WANT THE SIMPLEST FILE POSSIBLE SO PYTHON DOESN’T WASTE ITS TIME READING HEADER ELEMENTS FROM A COMPLICATED WORD PROCESSOR.
Once we have that, we open our script with this:
Line 1: with open("/Users/Desktop/Text_Doc_Python1.txt", "r") as f:
We instruct Python to open the list of queries from the file’s absolute path on your computer. In this example, the file was on my desktop and was housed under the path “/Users/Desktop/”. The file with my keywords was titled “Text_Doc_Python1.txt”. Hence, our open instruction is to open “/Users/Desktop/Text_Doc_Python1.txt”
In that same line, we’re giving Python read permissions (‘r’). So now Python can both open and read the file.
Lastly, I’m going to define this whole operation of opening and reading the file as “f”. This way, I can refer back to it via a single letter rather than typing out that long file path whenever I want to use it.
Line 2: queries = [line.strip() for line in f]
Our keyword list might not be perfect, and there might be some stray spaces following our term. To account for this, we’re going to use line.strip() to remove any stray spaces from before or after the KW; this ensures that the term you think is getting Googled is actually getting Googled.
We’re going to define these cleaned-up line items from our handy ‘f’ document as “queries”. This will represent each query we’re going to get a snippet from.
Line 3: print(queries)
For good measure, we’ll also print the queries in our Console just so we can preview what exactly Python will be running through Google search:
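Putting those three lines together (Lines 2 and 3 are indented inside the with block, which the line-by-line listing doesn’t show), the opening of the script looks something like this; the example output assumes the text file contains the two health-insurance queries used later in this post:

with open("/Users/Desktop/Text_Doc_Python1.txt", "r") as f:
    queries = [line.strip() for line in f]
    print(queries)

# Example console output:
# ['How do you get health insurance in Vermont', 'How do you get health insurance in West Virginia']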
Looks good so far! We’ve now confirmed that Python can access the document, clean up the queries, and repeat them back to us. We have our input for what will be several automated Google searches ready.

Python web scraping search results definition

Part 2. Import Our Packages

In Python, we work with modules (aka packages). A module is a piece of software that serves a very specific functionality. In Python, you access modules through the “import” command. For this specific project, we’ll be importing these packages. I’ll explain why:
The selenium webdriver module is what’s going to allow Python to perform searches in Google.
Requests will supplement webdriver by allowing Python to request a specific search URL from the server.
BeautifulSoup will let Python analyze that SERP and scrape elements (i.e. the snippet).
Random generates a random number within a certain defined range. We use random so that each request has a different server request time. If we run hundreds of requests that have the same exact delay time in between each search, Google will assume you are a bot and likely block your IP.
Time is required to define the sleep period in between searches before Python performs another one. Time works in tandem with the random module in this project.
The csv module simply allows Python to interact with and write csv files. When Python is done crawling our list items in Google, we’ll want it to package up the results in a nice CSV document that is ready for analysis.
Lastly, even though delay_seconds isn’t a package, we’re going to take care of defining this variable now because we’ll need it later. What we’re doing is using our newly imported random package to give us a random integer between 100 and 500. We’re then dividing that number by 100. So what does that do? It gives us any decimal number between 1.00 and 5.00, which will be used as the amount of seconds our program will wait in between crawls. We’ll explain why that matters later.
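Assembled, the import block (plus the delay_seconds variable) looks something like this. The original doesn’t spell out the exact import statements, so this sketch assumes the conventional forms for selenium and BeautifulSoup:

from selenium import webdriver   # automates the browser
import requests                  # imported per the walkthrough, though the searches below go through driver.get()
from bs4 import BeautifulSoup    # parses the SERP HTML
import random                    # randomizes the delay between searches
import time                      # pauses between searches
import csv                       # writes the results file

# a random decimal between 1.00 and 5.00, used as the wait between searches
delay_seconds = random.randint(100, 500) / 100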

Part 3. Set Up Chrome Driver


For the selenium webdriver module to work, Python needs an application that it will run the searches through. To get this, you will need to install ChromeDriver from https://chromedriver.chromium.org.
Once you have ChromeDriver installed, you will need to find where it’s located on your computer. Much like we did with our open("/Users/Desktop/Text_Doc_Python1.txt", "r") command, where we needed to tell Python where the file we’re accessing was and what to do with it, we’re doing the same with ChromeDriver. We’re telling Python where the driver is and where these searches will be performed from.
In this example, the file was found in /Users/Downloads/ and the file was called “chromedriver 3”. Ergo, we are defining our chromedriver variable as “/Users/Downloads/chromedriver 3”
While we’ve defined which application we’ll use to search things, we haven’t yet set a command to search things. driver = webdriver.Chrome(chromedriver) commands Python to automate a search process via webdriver using the ChromeDriver executable we just assigned above. We’ve taken this process of automating a search and defined it as driver.
So just to recap, webdriver is our automation and chromedriver is our Google Chrome application where searches will be automated.
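In code, Part 3 is just two lines. The path below is the example path from above (yours will differ), and passing it straight to webdriver.Chrome() is the Selenium 3-style call this walkthrough uses; Selenium 4 expects a Service object instead:

chromedriver = "/Users/Downloads/chromedriver 3"   # example path; point this at your own download
driver = webdriver.Chrome(chromedriver)            # our automated Chrome session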

Part 4. Create a New File Where Our Scraped Results Will Go


Before we finally start automating searches, we want to make sure that this data is going to be packaged up in a file for us once it’s done. The data won’t do much good sitting in a Python console: we need it in a CSV file that we can manipulate and analyze.
Line 1: with open('innovators.csv', 'w') as file:
To do this, we’re going to pull the same open command we used to access our list of queries earlier. But there’s a core difference in how we’re using it. On the query list, we just wanted Python to read the file (hence the "r" in with open("/Users/Desktop/Text_Doc_Python1.txt", "r")). Now we want Python to write a file, so we’re going with 'w' instead. This whole process of writing to the file I’ve inexplicably named 'innovators.csv' is going to be defined as file.
Line 2: fields = ['Query', 'Snippet']
We’re going to give this file header values of ‘Query’ and ‘Snippet’. We want to easily show a third-party that “Column A is the search keyword, Column B is the snippet result”. These two headers are being packaged up into an easy variable named fields.
Line 3: writer = csv.writer(file)
I’m now inventing a variable called "writer", which is the csv writer object we’re going to use to write onto the file we defined before.
Line 4: writer.writerow(fields)
Now that we’re accessing the file, I can write my fields onto my csv document. When this script runs and writes a CSV file, my columns will have a header element now. So far, that’s all the document has though.
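Assembled, Part 4 looks something like this. The line-by-line listing above doesn’t show indentation, so this sketch assumes the rest of the script (Parts 5 through 7) stays nested inside the with block so the file is still open when we write each row:

with open('innovators.csv', 'w') as file:
    fields = ['Query', 'Snippet']
    writer = csv.writer(file)
    writer.writerow(fields)
    # Parts 5-7 continue here, inside the with block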

Part 5. Perform the Searches


Line 1: for item in queries:
In our first line, we’re now establishing an important Python construct called a “for loop”, which essentially repeats an operation for us. Here, we’re simply instructing our program that we’re about to perform an operation for each query (or item) from our queries variable (our full list of queries we defined earlier).
Line 2: updated_query = item.replace(" ", "+")
For each query in our doc, we’re now turning it into a Google search URL. We know that when you search for something in Google, the URL we get back follows a formula:
Query: “test query”
Google URL: https://www.google.com/search?q=test+query
Easy enough! First, we have to replace each space with a “+” sign. As you see in the Google URL above, “test query” is transformed into “test+query” when in the Google search URL. We then apply that formula to each query (or item) and redefine it as updated_query. To recap:
item = "test query"
updated_query = "test+query"
Line 3: google_url = "https://www.google.com/search?q=" + updated_query
Here, we’re now making another variable called google_url, which is simply the Google URL search prefix of https://www.google.com/search?q= followed by our URL-ready updated_query.
In these few steps, we’ve now changed “test query” into https://www.google.com/search?q=test+query for each one of our keywords.
Line 4: driver.get(google_url)
In Part 3, we defined the action of performing a search on our Chrome driver as driver. So now that we have both our Google search operation set up and the specific URL we need to be searched, we’re just instructing driver to perform its operation with our google_url.
Line 5: time.sleep(delay_seconds)
You may recall that in Part 2, we imported our packages but also made the variable delay_seconds to generate a decimal number between 1 and 5:
delay_seconds = random.randint(100, 500)/100
Each time this script runs, a different number is generated and assigned to delay_seconds, which time.sleep then uses as the number of seconds the program waits before performing another search. So after each search, the program will wait somewhere between 1.00 and 5.00 seconds before performing the next one.
Why do we do that? If our program waited the same, predictable amount of time between dozens of searches, Google would figure out pretty quickly that a bot is performing them, which increases the odds of getting blocked. The random delay makes the timing harder to fingerprint. (Note that because delay_seconds is set once per run, every search in that run uses the same delay; to vary the wait from search to search, you can recompute the random value inside the loop, as shown in the sketch below.)
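Assembled, the Part 5 loop looks something like this (it picks up one more level of indentation once it sits inside the with block from Part 4):

for item in queries:
    updated_query = item.replace(" ", "+")
    google_url = "https://www.google.com/search?q=" + updated_query
    driver.get(google_url)
    time.sleep(delay_seconds)
    # or, to wait a different amount before every single search, recompute the delay here instead:
    # time.sleep(random.randint(100, 500) / 100)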

Part 6. Pull the Snippets


Line 1: soup = BeautifulSoup(driver.page_source, 'lxml')
The BeautifulSoup package we imported earlier allows us to parse the HTML of a page. Meanwhile, driver has a built-in page_source attribute that hands our program the HTML of the page it just loaded ('lxml' is said parser). We’re defining this whole operation as soup.
Line 2: for s in soup.find_all(id="res"):
We’re running another for loop for every scraped Google result. For every scrape we perform (now defined as soup), we want to find all instances where the page code has an id value of "res" (hence id="res").
Why? Because the actual code within Google’s SERP defines “res” as the id of the DIV that wraps the results.
You could choose from a number of other ids found on Google’s SERP, but we went with “res” here.
Lines 3 & 4:
s = s.text.replace('Search ResultsFeatured snippet from the web', '').split("›")[0]
ns = s.replace('Search ResultsWeb results', 'No snippet : ')
Here we’re just cleaning up the result from our scrape (s) or labeling if our result has returned no snippets. ns is the variable that will live on because it is searching for a word combination that will ultimately tell us if a snippet exists (‘Search ResultsFeatured snippet from the web’) or doesn’t exist (‘Search ResultsWeb results’). If a snippet exists, we’ll get the scraped result back. If it doesn’t exist, the line for that query will read ‘No snippet : ‘.
Lines 5 & 6:
data = item, ns
print(data)
Lastly, we’re just making a variable that organizes the data we want to get back. As you may recall, item is from the very beginning of our Part 5 for loop and is the original query we just used for our scrape. ns is our scrape result (which will either yield the scrape result or “No Snippet”).
The print(data) command will then display for us what those results will be.
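Assembled, Part 6 looks something like this; these lines still sit inside the for item in queries: loop from Part 5 (hence the indentation), and the 'lxml' parser needs to be installed for BeautifulSoup to use it:

    soup = BeautifulSoup(driver.page_source, 'lxml')
    for s in soup.find_all(id="res"):
        s = s.text.replace('Search ResultsFeatured snippet from the web', '').split("›")[0]
        ns = s.replace('Search ResultsWeb results', 'No snippet : ')
        data = item, ns
        print(data)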


Part 7. Get the Scrape Results Written to Our File, Ready to Analyze


writer.writerow(data)
file.close()
The scraping is done! Now we just need to get it into a document that we can analyze. You may recall the following code from Part 4:
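writer = csv.writer(file)
writer.writerow(fields)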
So since writer has already been defined as the action of writing onto our original ‘innovators.csv’ doc, and since we already have ‘Query’ and ‘Snippet’ written onto the doc as column headers (via writer.writerow(fields)), we then invoke the writer.writerow command again to write our data result (item, which is essentially the query, and ns, which is the snippet result) directly below the appropriate headers in the correct columns.
Once all the loops have run and are written into our document, we then use file.close() to close the file. You will get an output that looks something like this:
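Schematically, with placeholders standing in for the scraped text (the angle-bracketed values are illustrative, not real output), the rows of innovators.csv look like this:

Query,Snippet
How do you get health insurance in Vermont,No snippet : <remaining result text>
How do you get health insurance in West Virginia,<featured snippet text> <source URL>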
Our first query “How do you get health insurance in Vermont” returned no snippet at the time of the search. Meanwhile, “How do you get health insurance in West Virginia” did, and we can see the result along with the URL at the very end.
Now you know how to scrape featured snippets from Google! You can likely make small tweaks to scrape for other features such as People Also Ask fields, but this is a good starting point for your snippet scraping needs.