Data_Snapshot

Gazpacho Scraping

This is a rework of my original web scraping post. This time I will be endeavouring to use gazpacho instead of requests and BeautifulSoup.

Why gazpacho? It's lighter weight, more pythonic, cooler and spicier. What's not to like?

Somewhere in your data science journey you will need to scrape a webpage. This post will try to get you up and running as fast as possible.

To start we need to import gazpacho.

import gazpacho as gz

Now we are going to identify a page to scrape. Our target is going to be the band-member list from a Discogs artist page. In this case it's going to be Larkin Poe, because they are awesome and you should be listening to them. We will stuff the URL into a variable and then use gazpacho's get function to retrieve the page.

url = 'https://www.discogs.com/artist/2487986-Larkin-Poe'

resp = gz.get(url)
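As an aside, gz.get also accepts optional params and headers arguments (an assumption worth checking against your installed gazpacho version), which is handy when a site wants a real User-Agent before it will answer. A minimal sketch, with the actual network call left commented out:

```python
# Assumed signature: gz.get(url, params=None, headers=None) -> str.
# Verify against your installed gazpacho version.
url = 'https://www.discogs.com/artist/2487986-Larkin-Poe'
params = {'page': 1}                        # encoded into the query string
headers = {'User-Agent': 'my-scraper/0.1'}  # some sites reject blank agents

# Uncomment to actually hit the network:
# resp = gz.get(url, params=params, headers=headers)
```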

Now the page's HTML is stuffed into resp as a single string. You may want to look through the full page source to familiarize yourself with some HTML, but reproducing it here would make this a very long blog post. Instead I have printed the first five hundred characters below to give you a taste.

print(resp[:500])
<!DOCTYPE html>
<html
    class="is_not_mobile needs_reduced_ui "
    lang="en"
    xmlns:og="http://opengraphprotocol.org/schema/"
    xmlns:fb="http://www.facebook.com/2008/fbml"
>
    <head>
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta http-equiv="content-language" content="en">
        <meta http-equiv="pragma" content="no-cache" />
        <meta http-equiv="expires" content="-1" />

                <meta id="viewport" name="viewp

This is the spot where BeautifulSoup would come in, but instead we just continue using gazpacho. The syntax differs from BeautifulSoup but is just as accessible. Constructing a Soup object parses the HTML like BeautifulSoup would have, and then we call the find method with a tag name and an attributes dictionary to strip out what we need.

soup = gz.Soup(resp)

found = soup.find('div', {'class': 'readmore'})
found
[<div class="readmore" id="profile">
             Rebecca & Megan Lovell
                     </div>, <div class="readmore">

                     <a href="/artist/4443794-Megan-Lovell">Megan Lovell</a>, 
                     <a href="/artist/3126405-Rebecca-Lovell">Rebecca Lovell</a>            </div>]

If you have just come from my other web scraping post you will notice this has all happened in a few fewer lines of Python. It feels like 20 percent less, which in this relatively straightforward case translates to one fewer line of code. That saving will scale as your web scraping becomes more complex, and having all of the functionality inside gazpacho means fewer lines for you to write, and fewer lines for you to fix.

The following code is the regex and related functions that I used to strip the info down to a list. It lives slightly beyond the BeautifulSoup/requests/gazpacho world, and has therefore been left largely untouched.

import re
regular = re.findall(r'\b[A-Z].*', (str(found)))
print(regular)
['Rebecca & Megan Lovell', 'Megan-Lovell">Megan Lovell</a>, ', 'Rebecca-Lovell">Rebecca Lovell</a>            </div>]']
def cleaner(in_list):
    a = []
    b = []
    for i in in_list:  # iterate the argument, not the global
        if '>' in i:
            a.append(i.split('>', 1)[0])
    for i in a:
        c = i.replace('"', '')
        b.append(c.replace('-', ' '))
    return b
done = cleaner(regular)
done
['Megan Lovell', 'Rebecca Lovell']
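If you would rather lean on the regex alone, the whole cleanup can be collapsed into one pattern that captures the text between each closing > and the </a> that follows it. A sketch on a trimmed copy of the anchor tags from the found output above:

```python
import re

# A trimmed copy of the anchor tags from the found output above.
html = ('<a href="/artist/4443794-Megan-Lovell">Megan Lovell</a>, '
        '<a href="/artist/3126405-Rebecca-Lovell">Rebecca Lovell</a>')

# Capture everything between a closing > and the </a> that follows it.
names = re.findall(r'>([^<]+)</a>', html)
print(names)  # ['Megan Lovell', 'Rebecca Lovell']
```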

Gazpacho brings everything together in an efficient way, similar to how you've done it before, but with lower overhead. For me it's a drop-in replacement, and at this point gazpacho is my first choice when it comes to scraping.