Whenever we come across a web page, we read its content, and sometimes even save it for later reading and analysis. Some web pages offer a lot of good information that may be useful later, and sometimes that information is not easy to copy. It takes a lot of patience, especially when the information you want to extract from a web page is badly formatted. Luckily, there’s what we call scraping, and to make things faster, programming kicks in.
Say, for example, we need to harvest some email addresses from a web page. Let’s use this page as an example. It’s a web page with several email addresses, and our goal is to copy them all. Of course, it would be easy enough to just use the mouse and copy-paste the email addresses. But what if there’s another page to be scraped, with thousands of unique email addresses? That would be a pain, wouldn’t it?
We will extract the data with regular expressions. They’re very popular and overused, but I believe they’re the best tool for this job. For the language, we will use Python. To be honest, I haven’t learned Python very well. I’m more of a Perl guy, but yeah, it’s time to learn something new. My version of Python is 2.5; you can use the latest 2.7, but not 3.x. Actually you can, but I don’t recommend it: the script below uses urllib2 and the print statement, so it would need changes to run on 3.x. Most Linux distributions have Python installed, just like Perl. For Windows, you can download ActivePython.
The first thing we need is to find or write a regular expression for the harvest. You can just search Google and find the one you need. Here’s an example:
([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})
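Let me explain this regular expression before we continue. The [A-Z0-9._%+-] part means match any one of the following: A-Z, 0-9, a dot (.), an underscore (_), a percent sign (%), a plus sign (+) or a minus sign (-). The + that follows means match one or more characters from the set inside the brackets []. So basically, [A-Z0-9._%+-]+ matches the local part of the email address. Next is the @ character. It’s literal: it matches a single @ right after the local part. Finally, [A-Z0-9.-]+\.[A-Z]{2,4} matches the domain: one or more letters, digits, dots or hyphens, then a literal dot (\.), then a top-level domain of 2 to 4 letters, so longer ones such as .museum will not match. Note that the pattern only lists uppercase letters, so it has to be used with a case-insensitive flag (re.I in Python) to match lowercase addresses too.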
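To see the pattern at work before touching any web page, here is a minimal sketch that runs it against a made-up string. The sample text and the names EMAIL_RE and sample are my own for illustration, not taken from the example page. It should run on Python 2.7 and 3.x alike.

```python
import re

# Same pattern as above; re.I makes the A-Z ranges match a-z as well.
PATTERN = r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})'
EMAIL_RE = re.compile(PATTERN, re.I)

# Made-up sample text (hypothetical addresses, for illustration only).
sample = 'Contact john.doe@example.com or SALES+eu@mail.example.org today.'

# findall() returns every match on the string, not just the first one.
emails = EMAIL_RE.findall(sample)
print(emails)

# Without re.I the uppercase-only ranges cannot match a lowercase address.
no_flag = re.search(PATTERN, 'john.doe@example.com')
print(no_flag)
```

The second search comes back empty, which is why the case-insensitive flag matters for this particular pattern.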
Here’s the code in Python. It prints out the email addresses found on our example page (the first one on each line of the source).
#!/usr/bin/python
import urllib2
import re
# Build the request and identify our scraper to the server
request = urllib2.Request('http://www.daniweb.com/forums/thread218079.html')
request.add_header('User-Agent', 'Ruel.ME Sample Scraper')
response = urllib2.urlopen(request)
# Go through the HTML source line by line
for line in response.read().split('\n'):
    # re.I makes the A-Z ranges match lowercase letters too
    match = re.search(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', line, re.I)
    if match:
        print match.group(1)
I will explain the code line by line:
import urllib2 - imports the urllib2 module for our connection
import re - imports the regular expression module
request = urllib2.Request('http://www.daniweb.com/forums/thread218079.html') - builds the request for the page
request.add_header('User-Agent', 'Ruel.ME Sample Scraper') - adds a User-Agent header (note the hyphen in the header name)
response = urllib2.urlopen(request) - performs the request
for line in response.read().split('\n'): - reads the HTML source and loops over it line by line
match = re.search(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', line, re.I) - searches the line with the regular expression, ignoring case (re.I)
if match: - checks whether the line contained a match
print match.group(1) - displays the matched email address
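The script above reports at most one address per line and may print the same address more than once. Here is a rough sketch of how I might handle both, written for Python 3 (urllib.request instead of urllib2). The function names fetch and extract_emails, the UTF-8 decoding and the case-insensitive de-duplication are all my own assumptions, not part of the original script.

```python
import re
import urllib.request

# Same pattern as before, minus the capturing group, so findall()
# returns the whole address. re.I makes A-Z match lowercase too.
EMAIL_RE = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.I)


def extract_emails(html):
    """Return the unique addresses in html, in order of first appearance."""
    seen = set()
    unique = []
    for address in EMAIL_RE.findall(html):
        # Assumption: treat addresses differing only by case as duplicates.
        key = address.lower()
        if key not in seen:
            seen.add(key)
            unique.append(address)
    return unique


def fetch(url, user_agent='Ruel.ME Sample Scraper'):
    """Download url and return its body as text.

    Assumption: the page is UTF-8; undecodable bytes are replaced.
    """
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8', errors='replace')
```

To use it on the example page, you would loop over extract_emails(fetch('http://www.daniweb.com/forums/thread218079.html')) and print each address. Keeping the download and the extraction in separate functions also lets you try the regular expression on any string without going online.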
It was actually pretty easy. The output of the script is the email addresses from that page. Web page scraping is not a hard task; it’s easy and exciting. You can use other languages too, like Perl, PHP, etc. With these methods, data extraction is easier and faster.