Tag Archive > scrape

Scrape Your Facebook Friends’ Emails with Python

» 03 March 2011 » In Internet, Open-Source, Programming, Python » 34 Comments

This is an update of an earlier post about Facebook contact info scraping.

DISCLAIMER: This is against Facebook TOS. Use at your own risk.

It’s been a long time since I posted anything here, and I apologize if you missed me. This is an update of the Facebook contact info scraper in Python. The old one stopped working when Facebook updated their user interface, and that is the biggest drawback of a scraper that relies on regular expressions: when the markup changes, the patterns stop matching.

Yes, using regular expressions in scrapers is generally a bad idea, but for a tool like this I think an exception can be made. Most languages’ standard libraries don’t ship a good enough HTML parser, and while there are libraries for it, like Beautiful Soup in Python, that would be overkill here. Beautiful Soup is a powerful module, but this script doesn’t need heavy parsing, so regular expressions are just right, in my opinion. Of course, fellow coders may disagree with this paragraph. You’re very welcome to, so let me hear you in the comments. :)
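
To make the trade-off concrete, here is a minimal sketch (Python 3, standard library only, unlike the Python 2 script in this post) that pulls a mailto address out of a snippet both ways. The sample markup and the names `SAMPLE_HTML`, `email_by_regex`, `MailtoParser`, and `email_by_parser` are made up for illustration; they are not taken from Facebook's pages.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup, invented for this example.
SAMPLE_HTML = '<div><a href="mailto:jane&#64;example.com">Mail Jane</a></div>'


def email_by_regex(html):
    """Return the first mailto address found with a regular expression, or None."""
    match = re.search(r'mailto:(.*?)"', html)
    if not match:
        return None
    # The regex sees raw markup, so the entity has to be decoded by hand.
    return match.group(1).replace('&#64;', '@')


class MailtoParser(HTMLParser):
    """Collect mailto addresses from <a href="mailto:..."> tags."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        if href.startswith('mailto:'):
            # HTMLParser has already decoded entities in attribute values.
            self.emails.append(href[len('mailto:'):])


def email_by_parser(html):
    """Return the first mailto address found with HTMLParser, or None."""
    parser = MailtoParser()
    parser.feed(html)
    return parser.emails[0] if parser.emails else None
```

The regex version is shorter, which is the argument made above, but it is tied to the exact quoting and entity encoding of the page. The parser version survives small markup changes at the cost of a class.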

What changes this time? The script now uses a bit of the Graph API. Too bad it doesn’t hand out your friends’ email addresses; actually it can, but only with a special permission. So we will use the Graph API just to get the complete list of our Facebook friends. Compared to the previous scraper, this one is a lot faster and more accurate at gathering friend information.
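
The friend list comes back as JSON with a top-level `data` array whose entries carry an `id` and a `name`, which is the shape the script relies on. A small Python 3 sketch of reading such a payload follows; the sample payload and the function name `friend_pairs` are invented for illustration, and no request is made.

```python
import json

# Hypothetical payload in the {"data": [{"id": ..., "name": ...}]} shape.
SAMPLE_RESPONSE = (
    '{"data": [{"id": "1001", "name": "Jane Doe"},'
    ' {"id": "1002", "name": "Juan Cruz"}]}'
)


def friend_pairs(raw_json):
    """Return a list of (id, name) tuples from a friends-list JSON string."""
    payload = json.loads(raw_json)
    # A missing "data" key is treated as an empty list instead of an error.
    return [(entry['id'], entry['name']) for entry in payload.get('data', [])]
```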

And of course, as you can see from the title, we will only be scraping email addresses. As for the output, we will no longer use the elegant HTML/CSS report. Instead, we will generate a CSV file containing each friend’s name and email.
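
For reference, this is roughly what writing name and email pairs with the `csv` module looks like in Python 3. It is a sketch that writes to an in-memory buffer instead of a file, and `rows_to_csv` is a hypothetical helper name.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (name, email) pairs to CSV text."""
    # newline='' leaves line endings to the csv module.
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer, delimiter=',', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    for name, email in rows:
        writer.writerow([name, email])
    return buffer.getvalue()
```

With `QUOTE_MINIMAL`, a field is quoted only when it needs to be, for example a name that contains a comma. When writing to a real file in Python 3, opening it with `newline=''` and an explicit encoding such as UTF-8 is assumed to be the equivalent of the `'ab'` mode used in the Python 2 script.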

Here’s the Python code:

#!/usr/bin/python

'''
	InFB - Information Facebook
	Usage: infb.py user@domain.tld password

http://ruel.me

	Copyright (c) 2011, Ruel Pagayon
	All rights reserved.

	Redistribution and use in source and binary forms, with or without
	modification, are permitted provided that the following conditions are met:
		* Redistributions of source code must retain the above copyright
		  notice, this list of conditions and the following disclaimer.
		* Redistributions in binary form must reproduce the above copyright
		  notice, this list of conditions and the following disclaimer in the
		  documentation and/or other materials provided with the distribution.
		* Neither the name of the author nor the names of its contributors 
		  may be used to endorse or promote products derived from this software 
		  without specific prior written permission.

	THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS "AS IS" AND ANY 
	EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 
	WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
	DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, 
	INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
	LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, 
	OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 
	LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE 
	OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 
	ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
'''

import sys
import re
import urllib
import urllib2
import cookielib
import csv
import json

def main():
	# Check the arguments
	if len(sys.argv) != 3:
		usage()
	user = sys.argv[1]
	passw = sys.argv[2]
	
	# Initialize the needed modules
	CHandler = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
	browser = urllib2.build_opener(CHandler)
	browser.addheaders = [('User-agent', 'InFB - ruel@ruel.me - http://ruel.me')]
	urllib2.install_opener(browser)
	
	
	# Initialize the cookies and get the post_form_data
	print 'Initializing..'
	res = browser.open('http://m.facebook.com/index.php')
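	mxt = re.search(r'name="post_form_id" value="(\w+)"', res.read())
	if not mxt:
		# Without this value the login POST cannot be built
		print 'Could not find post_form_id'
		sys.exit(2)
	pfi = mxt.group(1)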
	print 'Using PFI: %s' % pfi
	res.close()
	
	# Initialize the POST data
	data = urllib.urlencode({
		'lsd'				: '',
		'post_form_id'		: pfi,
		'charset_test' 		: urllib.unquote_plus('%E2%82%AC%2C%C2%B4%2C%E2%82%AC%2C%C2%B4%2C%E6%B0%B4%2C%D0%94%2C%D0%84'),
		'email'				: user,
		'pass'				: passw,
		'login'				: 'Login'
	})
	
	# Login to Facebook
	print 'Logging in to account ' + user
	res = browser.open('https://www.facebook.com/login.php?m=m&refsrc=http%3A%2F%2Fm.facebook.com%2Findex.php&refid=8', data)
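	rcode = res.code
	# Read the body once; a second res.read() would return an empty string
	body = res.read()
	if not re.search('Logout', body):
		print 'Login Failed'
		
		# For debugging (when login fails), save the page we got back
		fh = open('debug.html', 'w')
		fh.write(body)
		fh.close()
		
		# Exit the execution :(
		sys.exit(2)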
	res.close()
	
	# Get Access Token
	res = browser.open('http://developers.facebook.com/docs/reference/api')
	conft = res.read()
	mat = re.search('access_token=(.*?)"', conft)
	acct = mat.group(1)
	print 'Using access token: %s' % acct
	
	# Get friend's ID
	res = browser.open('https://graph.facebook.com/me/friends?access_token=%s' % acct)
	fres = res.read()
	jdata = json.loads(fres)
	
	# Initialize the CSV writer
	fbwriter = csv.writer(open('%s.csv' % user, 'ab'), delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
	
	# Go through each ID in the JSON response
	for acc in jdata['data']:
		fid = acc['id']
		fname = acc['name']
		
		# Go to ID's profile
		res = browser.open('http://m.facebook.com/profile.php?id=%s&v=info&refid=17' % fid)
		xma = re.search('mailto:(.*?)"', res.read())
		if xma:
			
			# Replace the html entity from the scraped information
			email = xma.group(1).replace('&#64;', '@')
			
			# In case there will be weird characters, repr() will help us.
			try:
				print fname, email
			except:
				print repr(fname), repr(email)
				
			# Write to CSV, again with repr() if something weird prints out.
			try:
				fbwriter.writerow([fname, email])
			except:
				fbwriter.writerow([repr(fname), repr(email)])
	
	
def usage():
	'''
		Usage: infb.py user@domain.tld password
	'''
	print 'Usage: ' + sys.argv[0] + ' user@domain.tld password'
	sys.exit(1)
	
if __name__ == '__main__':
	main()
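
The script decodes the scraped address by replacing a single HTML entity. If a page encodes other characters as well, Python 3's `html.unescape` handles all of them in one call. A sketch, with a hypothetical helper name `clean_email`:

```python
import html


def clean_email(raw):
    """Decode HTML entities in a scraped address and trim whitespace."""
    return html.unescape(raw).strip()
```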

I’ve also updated the gist, so you can fork it anytime here.

Yes, it’s tested and working (for now). Sooner or later it will stop working again when Facebook changes something, and I’ll update it when that happens, so don’t worry. And again, use this at your own risk. If you have questions or comments about this script, don’t hesitate to comment below.

Best Regards.


Scrape Your Facebook Friends’ Contact Info with Python

» 26 November 2010 » In Internet, Open-Source, Python » 54 Comments

UPDATE: This post has been updated as the code here is no longer working. You can find the updated post here.

I coded this script in Perl almost a month ago. Since then I’ve been thinking of learning Python, so I re-coded it in Python. Basically, this script demonstrates scraping, web crawling, and cookie handling. And of course, it is free and forkable on Gist. I named the script InFB, my shorthand for Information and Facebook.

The output of this script is the profile ID, profile name, profile URL, e-mail address, and mobile/phone number (if your friend provides one). One thing to remember, though: don’t expect this script to scrape addresses or numbers that your friend has hidden. It only extracts what you can actually see on his/her profile.

For easier page access and scraping, I used the mobile version of Facebook. It’s much lighter, and clearer. Besides, I can’t find a way to generate the friend list on the full site. Heh.

Usage

Using this script is easy: all you need to do is open a terminal or the Windows command prompt and pass your e-mail address and password as arguments.

infb.py user@domain.tld password

You can also put this in a batch file or shell script (if you have multiple accounts).
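
A password passed as an argument can end up in your shell history, so prompting for it when it is missing is a reasonable fallback (the updated code below does this with `getpass`). Here is a Python 3 sketch of that argument handling; `read_credentials` and its `prompt` parameter are hypothetical names used only for this example.

```python
import getpass


def read_credentials(argv, prompt=getpass.getpass):
    """Return (user, password) from an argv-style list.

    argv[0] is the script name, argv[1] the e-mail address, and argv[2] an
    optional password. When the password is missing, `prompt` is called.
    """
    if len(argv) < 2:
        raise SystemExit('Usage: %s user@domain.tld [password]' % argv[0])
    user = argv[1]
    password = argv[2] if len(argv) > 2 else prompt('Enter password: ')
    return user, password
```

Making `prompt` a parameter keeps the function testable without a terminal; in normal use you would call `read_credentials(sys.argv)`.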

Code

UPDATE: Here’s the updated code (forked by gelendir), which uses an HTML parser, a better way to get the data. This code can be found here, and the original one can still be found here.

#!/usr/bin/python
#
#	InFB - Information Facebook
#	Usage: infb.py user@domain.tld password
#	http://ruel.me
#
#	Copyright (c) 2010, Ruel Pagayon - ruel@ruel.me
#	All rights reserved.
#
#	Redistribution and use in source and binary forms, with or without
#	modification, are permitted provided that the following conditions are met:
#		* Redistributions of source code must retain the above copyright
#		  notice, this list of conditions and the following disclaimer.
#		* Redistributions in binary form must reproduce the above copyright
#		  notice, this list of conditions and the following disclaimer in the
#		  documentation and/or other materials provided with the distribution.
#		* Neither the name of ruel.me nor the names of its contributors
#		  may be used to endorse or promote products derived from this
#		  script without specific prior written permission.
#
#	THIS SCRIPT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
#	ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
#	WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
#	DISCLAIMED. IN NO EVENT SHALL RUEL PAGAYON BE LIABLE FOR ANY
#	DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
#	(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
#	LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
#	ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
#	(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
#	SCRIPT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


import sys, re, urllib, urllib2, cookielib, HTMLParser, getpass

class FormScraper(HTMLParser.HTMLParser):
    """
    Scrapes the Facebook login page for form values that need to be submitted on login.
    Necessary because the form values change each time the login page is loaded.

    Usage:
    form_scraper = FormScraper()
    form_scraper.feed(html_from_facebook)
    form_values = form_scraper.values
    """

    def __init__(self, *args, **kwargs):
        HTMLParser.HTMLParser.__init__(self, *args, **kwargs)
        self.in_form = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        tag = tag.lower()
        attrs = dict(attrs)

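        # Use .get() so tags without these attributes don't raise KeyError
        if tag == 'form' and attrs.get('id') == 'login_form':
            self.in_form = True
        elif self.in_form and tag == 'input' and attrs.get('type') == 'hidden' and 'name' in attrs:
            self.values.append( (attrs['name'], attrs.get('value') or '') )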

    def handle_endtag(self, tag):
        if tag.lower() == 'form' and self.in_form:
            self.in_form = False

def main():
    if len(sys.argv) < 2:
        usage()
    
    user = sys.argv[1]

    if len(sys.argv) < 3:
        passw = getpass.getpass("Enter password: ")
    else:
        passw = sys.argv[2]

    # Set needed modules
    CHandler = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
    browser = urllib2.build_opener(CHandler)
    browser.addheaders = [('User-agent', 'InFB - ruel@ruel.me - http://ruel.me')]
    urllib2.install_opener(browser)

    #Retrieve login form data and initialize the cookies
    print 'Initializing..'
    res = browser.open('https://www.facebook.com/login.php')

    #Determine string encoding
    content_type = res.info()['Content-Type'].split('; ')
    encoding = 'utf-8'
    if len(content_type) > 1 and content_type[1].startswith('charset'):
        encoding = content_type[1].split('=')[1]
    html = unicode( res.read(), encoding=encoding )
    res.close()

    #scrape form for hidden inputs, add email and password to values
    form_scraper = FormScraper()
    form_scraper.feed(html)
    form_data = form_scraper.values
    form_data.extend( [('email', user), ('pass', passw)] )
    #HACK: urlencode doesn't like strings that aren't encoded with the 'encode' function.
    #Using html.encode(encoding) doesn't help either. why ??
    form_data = [ ( x.encode(encoding), y.encode(encoding) ) for x,y in form_data ]
    data = urllib.urlencode(form_data)

    # Login
    print 'Logging in to account ' + user
    res = browser.open('https://login.facebook.com/login.php?login_attempt=1', data)
    rcode = res.code
    print rcode
    print res.url
    if not re.search('home\.php$', res.url):
        print 'Login Failed'
        sys.exit(2)
    res.close()

    # Get Emails and Phone Numbers
    print "Getting Info..\n"
    flog = open(user + '.html', 'a')
    flog.write("<html>\n\t<head>\n\t\t<title>InFB - " + user + "</title>\n\t\t<link href=\"infb.css\" rel=\"stylesheet\" type=\"text/css\" />\n\t</head>\n\t<body>\n\t\t<div class=\"rby\">\n\t\t\t<table class=\"flist\">\n\t\t\t\t")
    page = 0
    while True:
        res = browser.open('http://m.facebook.com/friends.php?a&f=' + str(page))
        parp = res.read()
        m = re.findall('"\/friends\.php\?id=([0-9]+)&', parp)
        res.close()
        for i in m:
            prof = 'http://m.facebook.com/profile.php?id=' + i + '&v=info'
            res = browser.open(prof)
            cont = res.read()
            res.close()
            prof = prof.replace('m.', 'www.')
            ms = re.search('<div id="body"><div><div>(.*?)<\/div>', cont)
            if ms:
                name = ms.group(1)
            else:
                continue
            ms = re.search('href="tel:(.*?)"', cont)
            if ms:
                tel = ms.group(1)
            else:
                tel = ''
            ms = re.search('Emails?:<\/div><\/td><td valign="top"><div>(.*?)<\/div>', cont)
            if ms:
                email = re.sub('<br \/>', ', ', ms.group(1)).replace('&#64;', '@')
            else:
                continue
            print name + ' : ' + email + ' ' + tel
            flog.write("<tr class=\"lbreak\">\n\t\t\t\t\t<td class=\"num\">" + i + "</td><td class=\"fname\"><a href=\"" + prof + "\" title=\"" + name + "\">" + name + "</a></td><td class=\"fmail\">" + email + "</td><td class=\"cnum\">" + tel + "</td>\n\t\t\t\t\t</tr>\n\t\t\t\t")
        if re.search('Next', parp):
            page += 10
        else:
            break
    flog.write("\n\t\t\t</table>\n\t\t</div>\n\t</body>\n</html>")
    flog.close()

def usage():
    print 'Usage: ' + sys.argv[0] + ' user@domain.tld [password]'
    sys.exit(1)

if __name__ == '__main__':
    main()
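
The script finds the page encoding by splitting the `Content-Type` header on `'; '`, which misses a header written without the space. In Python 3 the standard library can parse the header instead. A sketch under that assumption; `charset_from_content_type` is a hypothetical helper name.

```python
from email.message import Message


def charset_from_content_type(header_value, default='utf-8'):
    """Return the lowercased charset named in a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default
```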

CSS

This generates an HTML log file, styled with CSS so you can customize the output. Save the stylesheet as infb.css:

/*
	Ruel Pagayon (c) 2010 - ruel@ruel.me
	
	Cascading Style Sheet for InFB Log Output.
*/
body {
	background-color: #3C3C3C;
	color: #FFF;
	margin-top: 50px;
	margin-left: 25px;
	font-size: xx-small;
	font-family: Calibri, Arial, sans-serif;
}
.rby {
	text-align: center;
	font-size: xx-small;
}

table  {
	text-align: center;
}

td {
	padding-top: 0.5em;
	padding-bottom: 0.5em;
	padding-left: 1em;
	padding-right: 1em;
	text-align: left;
	font-size: small;
}

td.num {
	color: #CCC;
}

td.cnum {
	color: #AFAFAF;
}

a:active, a:visited, a:link  {
	color: #FFF;
	font-weight: bold;
	text-decoration: none;
}

a:hover {
	color: #FFF;
	font-weight: bold;
	text-decoration: underline;
}
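
The script builds each log row by concatenating scraped values straight into the markup. Since names and addresses come from other people's profiles, escaping them first avoids broken HTML in the report. A Python 3 sketch that produces the same row structure and class names (`lbreak`, `num`, `fname`, `fmail`, `cnum`); `table_row` is a hypothetical helper, not part of the script.

```python
from html import escape


def table_row(fid, name, url, email, tel=''):
    """Build one <tr> of the InFB log with every value HTML-escaped."""
    return (
        '<tr class="lbreak">'
        '<td class="num">{fid}</td>'
        '<td class="fname"><a href="{url}" title="{name}">{name}</a></td>'
        '<td class="fmail">{email}</td>'
        '<td class="cnum">{tel}</td>'
        '</tr>'
    ).format(
        # escape() also converts quotes, so values are safe inside attributes.
        fid=escape(fid),
        name=escape(name),
        url=escape(url),
        email=escape(email),
        tel=escape(tel),
    )
```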

Again, if you want to suggest changes, you can apply them directly by forking the script on Gist (or fork gelendir’s version), or simply drop a comment below. Thank you.

Disclaimer

This is against Facebook TOS. Use at your own risk.
