Colly provides a clean interface to write any kind of crawler/scraper/spider.
With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.
Recently there has been a buzz around Go (or Golang, as some call it), which made me look into it. To practice my knowledge and understanding of Go, I decided to write a program that scrapes a website requiring a login. Whenever possible, I start learning a language by writing a scraper, because scraping interests me so much.
Features
- Clean API
- Fast (>1k request/sec on a single core)
- Manages request delays and maximum concurrency per domain
- Automatic cookie and session handling
- Sync/async/parallel scraping
- Distributed scraping
- Caching
- Automatic encoding of non-unicode responses
- Robots.txt support
- Google App Engine support
Batteries included
Colly comes with all the tools you need for scraping.
Open Source
Development of Colly is community driven and public.
Web scraping holds a dear place in my heart. During my undergrad, I had the opportunity to present some research at the Canadian Undergraduate Math Conference on numerical solutions to biological aggregation differential equations [1]. In other words, I simulated animal and insect swarming behaviour and presented some visualizations [2]. Going to that conference changed my life as it exposed me to Python and web scraping, which eventually led me down the path of data science and machine learning.
The math conference was held at the same time as the computer science conference, so I had the chance to sit in on some neat machine learning talks. One talk in particular which amazed me was on sports analytics. It wasn’t the analytics that amazed me, more so it was the student who scraped the data from the web himself. I didn’t know you could do such a thing [3], so after the talk I asked him how he did it. He dropped some names that would eventually send me down the gopher hole:
Wanting to learn more, I googled how to web scrape and I came across Greg Reda’s post on using Python for web scraping. His posts helped me create my own web scraper using Python and it was my very first experience with the language itself. I owe it to both the helpful individual at the conference and Greg Reda, as they were in my life at the right time.
On my journey towards learning a new language this year, I’m keeping up with tradition by creating a web scraper using Go (or Golang). Like those individuals who have helped me before, I want to do the same by writing a post about making it. In this post, I will show you how to make a simple web scraper using Go. We will be scraping the Pitchfork 500, a list of alternative, punk, electronic and hip-hop songs from 1977 to 2006, and saving those tracks to a CSV file. Our web scraper will involve the following steps:
- Get the Go app running (a.k.a. the “Hello, world!”)
- Get the HTML from the page
- Parse the tracks from the HTML using Goquery
- Save the data to a CSV
One thing to keep in mind when creating a web scraper is to research the site’s policy on data gathering and to follow those guidelines. This tutorial will be aimed towards those who are familiar with Python but who want to learn Go. The source code used in this tutorial can be found on GitHub.
Install Go
Let’s begin by first installing Go! You can check if you have it installed on your system by running which go in the terminal. If a path is returned you have it installed already.
If you don’t have Go installed, you can grab it from the official Go site. A simpler way to install it, if you are using a *nix-based system, is to use your OS’s package manager (e.g. apt for Debian/Ubuntu or Homebrew for macOS). Using a package manager keeps the installation to one line and configures everything automatically. For example, on macOS with Homebrew you can install Go by running brew install go.
Once Go has been installed, it’s time to make the app.
Hello, World!
Create a directory for your scraper called (not surprisingly) scraper. Inside the folder, create two files:
- main.go will hold our scraper code.
- Makefile will hold command shortcuts which will make running and debugging your code easier. To learn more, see my post on Makefiles.
Inside main.go, add the following:
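A minimal version of main.go along these lines might look like the sketch below. The greeting() helper is hypothetical, split out only so the message is easy to check; the usual form simply passes the string to fmt.Println directly.

```go
package main

import "fmt"

// greeting returns the text to print. It is a hypothetical helper,
// kept separate from main only so the message can be checked easily.
func greeting() string {
	return "Hello, world!"
}

// main is the entry point of every Go program.
func main() {
	fmt.Println(greeting())
}
```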
What’s shown here is typical boilerplate for a Go app. In our main() function, we are calling the Println function from the fmt package and printing the string Hello, world! to the terminal. To run this code in the terminal, use the command go run main.go.
If all is well, you should see 'Hello, world!' printed in the terminal!
Type less, code more!
This will make more sense later on in the tutorial, but to avoid typing this long command into the terminal I would suggest using the Makefile and adding that command as the default instruction. In the Makefile, add a default target whose recipe is go run main.go.
Running make in the terminal will now run the code.
Get HTML from the page
One thing that makes Go stand out compared to Python is the net/http package in its standard library. This package lets you make HTTP requests, build REST APIs and other fun internet stuff. Compare this to Python, where you would typically reach for external libraries like requests or Flask to build these things.
Create a function called MakeRequest():
MakeRequest() takes in a URL, grabs the HTML from the URL using a GET request, and returns the HTML as a string. Some things that stand out if you’re only familiar with Python are the _’s and the handling of the response resp. The Get() function returns two values, the response and an error. When we call it, we need to assign both values to variables, but here we’re not interested in the error [4]. In Go, declaring a local variable without using it is a compile error, so _ (the blank identifier) tells the compiler to discard that value. As for the response, we need to make sure its body gets closed, read the body into a byte slice, then convert that byte slice to a string.
Adding this method to our code, importing the libraries and adding the URL, we have the following:
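As a sketch of what that could look like, the program below assumes the Wikipedia address stored in pageURL (an assumption, swap in the page you are scraping) and uses io.ReadAll, which needs Go 1.16 or newer. Unlike the version described above, it returns the error from http.Get instead of discarding it with _, in the spirit of footnote [4].

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// pageURL is an assumed address for the Pitchfork 500 wiki page.
const pageURL = "https://en.wikipedia.org/wiki/The_Pitchfork_500"

// MakeRequest fetches url with a GET request and returns the body as a string.
func MakeRequest(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	// Make sure the response body is closed once we are done reading it.
	defer resp.Body.Close()

	// Read the whole body into a byte slice, then convert it to a string.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	html, err := MakeRequest(pageURL)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(html)
}
```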
If you run this code now (remember to use make instead of go run main.go), you should see the HTML dumped to the terminal.
Parse the HTML using Goquery
Web scraping is a R E L A X I N G activity. A typical web scraping sesh involves going through the web page’s source, identifying the HTML tags which are wrapped around the data you want, coding your parser to find those tags, and running the code. If it works, great! In practice however, you will be repeating this process over and over again until you get what you want (this is why I mentioned the Makefile earlier in this post). Good thing for you that I’ll be telling you where to look!
On the Pitchfork 500 page, right click and select “View Page Source”. On the wiki page, we want to grab the listed tracks. In the source, we should try to find where this content lies. A picture is given below showing the tracks on the page (left) and where they are in the source (right).
In the wiki page, we see that the tracks are listed by year. In the HTML, each year’s section is a <div> element with the div-col class. Within these sections, the tracks are wrapped in <li> tags. Our parser should first find all the sections marked with the div-col class, then iterate through every <li> element, extracting the track from each.
We will modify our MakeRequest() function to parse the HTML using Goquery and return the tracks in a special type of array called a slice. The modified function looks like this:
Now that our function parses HTML and returns the tracks, it’s best practice to rename it to ParseWiki(). The first few lines are what we had in the original function, but now we pass the response body to Goquery’s NewDocumentFromReader() function. The doc object is something we can now manipulate to find and extract the data we want.
In doc, we first find all .div-col elements, which are the sections that hold our tracks. For each section, we iterate through all li tags, which contain our track names. We grab the text from those tags and parse the artist and track name using the strings package.
In the code snippet above, some lines might seem strange. Some heuristics and assumptions about the text are baked in, in particular splitting each track on " –" (a space followed by an en dash) instead of a bare dash, and trimming surrounding spaces (e.g. strings.Split(t.Text(), " –") and strings.Trim(text[1], " ") respectively). This is the nature of scraping: we repeatedly change our code until we get the data we want in the format we want.
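To make the splitting heuristic concrete, here is a small standard-library sketch. It assumes each entry has the form Artist – "Title", with an en dash as the separator; splitTrack() is a hypothetical name, and trimming the double quotes around the title is an extra assumption on top of the space trimming mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTrack splits a list entry of the assumed form `Artist – "Title"`
// into its two parts. ok is false when the separator is missing.
func splitTrack(text string) (artist, title string, ok bool) {
	// Split on a space followed by an en dash, at most once.
	parts := strings.SplitN(text, " –", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	artist = strings.TrimSpace(parts[0])
	// Strip surrounding spaces and straight double quotes from the title.
	title = strings.Trim(parts[1], " \"")
	return artist, title, true
}

func main() {
	// Placeholder entry, not taken from the real list.
	artist, title, ok := splitTrack(`Artist A – "Song One"`)
	fmt.Println(artist, "|", title, "|", ok)
}
```

Returning ok instead of indexing parts[1] blindly keeps the scraper from panicking on entries that do not follow the expected pattern.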
Save the data in CSV format
Now that we have the tracks, it’s time to save them to a CSV file. Let’s modify our main() function to create the CSV file, iterate through the track slice and write each track to the file.
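One way this step could look, using the standard encoding/csv package, is sketched below. The Track type, the artist and title column names, the tracks.csv file name and the tracksToCSV() helper are all assumptions made for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"strings"
)

// Track is a hypothetical record type for one parsed entry.
type Track struct {
	Artist string
	Title  string
}

// writeTracks writes a header row and one row per track to w.
func writeTracks(w io.Writer, tracks []Track) error {
	cw := csv.NewWriter(w)
	if err := cw.Write([]string{"artist", "title"}); err != nil {
		return err
	}
	for _, t := range tracks {
		if err := cw.Write([]string{t.Artist, t.Title}); err != nil {
			return err
		}
	}
	// Flush buffered rows and report any error hit while writing.
	cw.Flush()
	return cw.Error()
}

// tracksToCSV renders the same output into a string for a quick check.
func tracksToCSV(tracks []Track) (string, error) {
	var sb strings.Builder
	err := writeTracks(&sb, tracks)
	return sb.String(), err
}

func main() {
	// Placeholder data standing in for the parsed tracks.
	tracks := []Track{
		{Artist: "Artist A", Title: "Song One"},
		{Artist: "Artist B", Title: "Song, Two"},
	}
	file, err := os.Create("tracks.csv") // assumed output file name
	if err != nil {
		fmt.Println("could not create file:", err)
		return
	}
	defer file.Close()

	if err := writeTracks(file, tracks); err != nil {
		fmt.Println("could not write CSV:", err)
	}
}
```

Writing through csv.Writer rather than joining strings by hand means titles that contain commas or quotes are escaped correctly.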
Put it all together
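Putting everything together, the program fetches the page, parses the tracks with Goquery, and writes them to a CSV file. The complete source is in the GitHub repository mentioned earlier.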
Final thoughts
Go is a cool, fun and powerful language. As someone who doesn’t use low-level programming languages a whole lot, Go seems to be a good entry point as it has abstractions when you just want to get something up and running, but allows for more control when you want it. Fork this code and modify it to scrape other sites! In the future I hope to create more apps using Go as an alternative to Python. Until then, let me know what you think and if this was helpful!
[1] That was a mouthful.
[2] That sounds like a dope post… Stay tuned.
[3] After all, the only programming I knew of was MATLAB.
[4] In practice, you should be!