Scrape Your Facebook Friends’ Emails with Python

» 03 March 2011 » In Internet, Open-Source, Programming, Python »

This is an update of an earlier post about Facebook contact info scraping.

DISCLAIMER: This is against the Facebook TOS. Use at your own risk.

It’s been so, so long since I posted something here, and if you missed me, I apologize for that. Well, never mind the previous statement. This is an update of the Facebook contact info scraper in Python. The old one stopped working when Facebook updated their user interface. And I must tell you, this is the greatest drawback of writing a scraper that relies on regular expressions: any change to the page markup can break it.

Yes, using regular expressions in scrapers is generally a bad idea, but for tools like this an exception can be made. Most programming languages nowadays do not ship with good enough HTML parsers. There are libraries/modules available, of course, like Beautiful Soup in Python. It’s a powerful module, but for this script it is more than we need. Regular expressions are just right, in my opinion, as this script doesn’t require heavy parsing. Of course, fellow coders may disagree with this paragraph; you’re very much welcome to, so let me hear you in the comments. :)
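
To make the trade-off concrete, here is a minimal sketch that pulls a mailto: address out of a snippet of markup in both ways. It is written for Python 3 (the script below is Python 2), the sample markup is made up, and the names emails_by_regex, MailtoParser and emails_by_parser are hypothetical helpers, not part of the script.

```python
import re
from html import unescape
from html.parser import HTMLParser

# Made-up sample markup, not copied from any real page.
SAMPLE = '<div><a href="mailto:jane&#64;example.com">Email</a></div>'


def emails_by_regex(html):
    """Grab whatever follows mailto: up to the closing quote, then decode entities."""
    return [unescape(found) for found in re.findall(r'mailto:(.*?)"', html)]


class MailtoParser(HTMLParser):
    """Collect mailto: targets from the href attribute of <a> tags."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        if href.startswith('mailto:'):
            # The parser has already decoded entities in attribute values.
            self.emails.append(href[len('mailto:'):])


def emails_by_parser(html):
    parser = MailtoParser()
    parser.feed(html)
    return parser.emails
```

The regex version is shorter, which is the appeal for a small script; the parser version does not depend on the exact quoting or attribute order of the markup.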

What changes in this one? The script now uses a bit of the Graph API. Too bad it doesn’t readily provide email information about your friends; it can, but only with a special permission. So we will use the Graph API only to get the complete list of our Facebook friends. Compared with the previous scraper, this one is a lot faster and more accurate at gathering friend information.
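
As a rough illustration of the friends-list step, the sketch below parses a response body shaped like the one the script reads: a "data" list of objects with "name" and "id". It is Python 3, the sample bodies are invented, and friend_pairs is a hypothetical helper. The error branch assumes the API replies with an "error" object when the token is missing or rejected.

```python
import json

# Invented response bodies. The "data" list mirrors what the script reads;
# the "error" object is the assumed shape of a failed request.
SAMPLE_OK = '{"data": [{"name": "Jane Doe", "id": "1001"}, {"name": "John Roe", "id": "1002"}]}'
SAMPLE_ERROR = '{"error": {"type": "OAuthException", "message": "An access token is required to request this resource."}}'


def friend_pairs(body):
    """Return (id, name) tuples from a friends-list response body."""
    payload = json.loads(body)
    if 'error' in payload:
        # Surface the API's own message instead of failing later with a KeyError.
        raise RuntimeError(payload['error'].get('message', 'unknown error'))
    return [(entry['id'], entry['name']) for entry in payload.get('data', [])]
```

Checking for the error object first means a bad token produces a readable message rather than a crash on the missing "data" key.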

And of course, as you can see from the title, we will only be scraping the email addresses. As for the output, we will no longer use the elegant HTML/CSS report. Instead, we will generate a CSV file containing each friend’s name and email.
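
For the output format, a small Python 3 sketch of writing name/email rows with the csv module follows. The row is invented and write_contacts is a hypothetical helper; the writer options are the same ones the script passes to csv.writer. It writes to an in-memory buffer so it can be tried without touching the disk.

```python
import csv
import io


def write_contacts(rows, handle):
    """Write (name, email) pairs as CSV rows to an open text handle."""
    writer = csv.writer(handle, delimiter=',', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    for name, email in rows:
        writer.writerow([name, email])


# Invented example row; the comma in the name forces quoting.
buffer = io.StringIO()
write_contacts([('Doe, Jane', 'jane@example.com')], buffer)
```

With QUOTE_MINIMAL, only fields that contain the delimiter or the quote character get quoted, so ordinary names stay unquoted.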

Here’s the Python code:

#!/usr/bin/python

'''
	InFB - Information Facebook
	Usage: infb.py user@domain.tld password

http://ruel.me

	Copyright (c) 2011, Ruel Pagayon
	All rights reserved.

	Redistribution and use in source and binary forms, with or without
	modification, are permitted provided that the following conditions are met:
		* Redistributions of source code must retain the above copyright
		  notice, this list of conditions and the following disclaimer.
		* Redistributions in binary form must reproduce the above copyright
		  notice, this list of conditions and the following disclaimer in the
		  documentation and/or other materials provided with the distribution.
		* Neither the name of the author nor the names of its contributors 
		  may be used to endorse or promote products derived from this software 
		  without specific prior written permission.

	THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS "AS IS" AND ANY 
	EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 
	WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
	DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, 
	INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
	LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, 
	OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 
	LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE 
	OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 
	ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
'''

import sys
import re
import urllib
import urllib2
import cookielib
import csv
import json

def main():
	# Check the arguments
	if len(sys.argv) != 3:
		usage()
	user = sys.argv[1]
	passw = sys.argv[2]
	
	# Initialize the needed modules
	CHandler = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
	browser = urllib2.build_opener(CHandler)
	browser.addheaders = [('User-agent', 'InFB - ruel@ruel.me - http://ruel.me')]
	urllib2.install_opener(browser)
	
	
	# Initialize the cookies and get the post_form_data
	print 'Initializing..'
	res = browser.open('http://m.facebook.com/index.php')
	mxt = re.search('name="post_form_id" value="(\w+)"', res.read())
	pfi = mxt.group(1)
	print 'Using PFI: %s' % pfi
	res.close()
	
	# Initialize the POST data
	data = urllib.urlencode({
		'lsd'				: '',
		'post_form_id'		: pfi,
		'charset_test' 		: urllib.unquote_plus('%E2%82%AC%2C%C2%B4%2C%E2%82%AC%2C%C2%B4%2C%E6%B0%B4%2C%D0%94%2C%D0%84'),
		'email'				: user,
		'pass'				: passw,
		'login'				: 'Login'
	})
	
	# Login to Facebook
	print 'Logging in to account ' + user
	res = browser.open('https://www.facebook.com/login.php?m=m&refsrc=http%3A%2F%2Fm.facebook.com%2Findex.php&refid=8', data)
	rcode = res.code
	# Read the body once; a second res.read() would return an empty string
	page = res.read()
	if not re.search('Logout', page):
		print 'Login Failed'
		
		# For Debugging (when failed login)
		fh = open('debug.html', 'w')
		fh.write(page)
		fh.close()
		
		# Exit the execution :(
		sys.exit(2)
	res.close()
	
	# Get Access Token
	res = browser.open('http://developers.facebook.com/docs/reference/api')
	conft = res.read()
	mat = re.search('access_token=(.*?)"', conft)
	acct = mat.group(1)
	print 'Using access token: %s' % acct
	
	# Get the friends' IDs
	res = browser.open('https://graph.facebook.com/me/friends?access_token=%s' % acct)
	fres = res.read()
	jdata = json.loads(fres)
	
	# Initialize the CSV writer
	fbwriter = csv.writer(open('%s.csv' % user, 'ab'), delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
	
	# Go through each ID in the JSON response
	for acc in jdata['data']:
		fid = acc['id']
		fname = acc['name']
		
		# Go to ID's profile
		res = browser.open('http://m.facebook.com/profile.php?id=%s&v=info&refid=17' % fid)
		xma = re.search('mailto:(.*?)"', res.read())
		if xma:
			
			# Replace the html entity from the scraped information
			email = xma.group(1).replace('&#64;', '@')
			
			# In case there will be weird characters, repr() will help us.
			try:
				print fname, email
			except:
				print repr(fname), repr(email)
				
			# Write to CSV, again with repr() if something weird prints out.
			try:
				fbwriter.writerow([fname, email])
			except:
				fbwriter.writerow([repr(fname), repr(email)])
	
	
def usage():
	'''
		Usage: infb.py user@domain.tld password
	'''
	print 'Usage: ' + sys.argv[0] + ' user@domain.tld password'
	sys.exit(1)
	
if __name__ == '__main__':
	main()

I’ve also updated the gist, so you can fork it anytime here.

Yes, it’s tested and working (for now). Facebook will most likely break it again in the future, and I’ll update it when that happens, so do not worry. And again, use this at your own risk. If you have questions/comments regarding this script, don’t hesitate to comment below.

Best Regards.

  • http://www.facebook.com/profile.php?id=28118280 Alfred Inacio

    This is good — good thinking Ruel; you made good use of the Facebook API. Here’s another solution based on recursion:

    #!/usr/bin/python
    #
    # InFB – Information Facebook
    # Usage: infb.py user@domain.tld password
    # http://ruel.me
    #
    # Copyright (c) 2010, Ruel Pagayon – ruel@ruel.me
    # All rights reserved.
    #
    # Redistribution and use in source and binary forms, with or without
    # * Redistributions of source code must retain the above copyright
    # notice, this list of conditions and the following disclaimer.
    # * Redistributions in binary form must reproduce the above copyright
    # notice, this list of conditions and the following disclaimer in the
    # documentation and/or other materials provided with the distribution.
    # * Neither the name of ruel.me nor the names of its contributors
    # may be used to endorse or promote products derived from this
    # script without specific prior written permission.
    #
    # THIS SCRIPT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND
    # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
    # WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
    # DISCLAIMED. IN NO EVENT SHALL RUEL PAGAYON BE LIABLE FOR ANY
    # DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
    # (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
    # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
    # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
    # SCRIPT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

    import sys, time, re, urllib, urllib2, cookielib, HTMLParser, getpass

    class FormScraper(HTMLParser.HTMLParser):
        """
        Scrapes the Facebook login page for form values that need to be submitted on login.
        Necessary because the form values change each time the login page is loaded.

        Usage:
        form_scraper = FormScraper()
        form_scraper.feed(html_from_facebook)
        form_values = form_scraper.values
        """

        def __init__(self, *args, **kwargs):
            HTMLParser.HTMLParser.__init__(self, *args, **kwargs)
            self.in_form = False
            self.values = []

        def handle_starttag(self, tag, attrs):
            tag = tag.lower()
            attrs = dict(attrs)

            #if tag == 'form' and attrs['id'] == 'login_form':
            if tag == 'form':
                self.in_form = True
            elif self.in_form and tag == 'input' and attrs['type'] == 'hidden':
                try:
                    self.values.append( (attrs['name'], attrs['value']) )
                    print(attrs)
                except:
                    self.values.append( (attrs['name'], attrs['autocomplete']) )
            else:
                pass
                #print('else case...')
                #print(self.in_form)
                #print(tag)
                #print(attrs)

        def handle_endtag(self, tag):
            if tag.lower() == 'form' and self.in_form:
                self.in_form = False

    browser = 0

    def main():
        global browser
        if len(sys.argv) < 2:
            usage()

        user = sys.argv[1]

        if len(sys.argv) 1 and content_type[1].startswith('charset'):
            encoding = content_type[1].split('=')[1]
        #print(res.read())
        #exit(0)
        html = unicode( res.read(), encoding=encoding )
        res.close()

        #scrape form for hidden inputs, add email and password to values
        form_scraper = FormScraper()
        form_scraper.feed(html)
        form_data = form_scraper.values
        form_data.extend( [('email', user), ('pass', passw)] )
        #HACK: urlencode doesn't like strings that aren't encoded with the 'encode' function.
        #Using html.encode(encoding) doesn't help either. why ??
        form_data = [ ( x.encode(encoding), y.encode(encoding) ) for x,y in form_data ]
        data = urllib.urlencode(form_data)

        # Login
        print 'Logging in to account ' + user
        res = browser.open('https://login.facebook.com/login.php?login_attempt=1', data)
        rcode = res.code
        print rcode
        print res.url
        if not re.search('home.php$', res.url):
            print 'Login Failed'
            exit(2)
        res.close()

        recursPageSearch('http://m.facebook.com/friends.php')

    def recursPageSearch(address):
        print 'checking page: ' + address
        time.sleep(4)
        res = browser.open(address)
        parp = res.read()
        #print parp
        m = re.findall(r'<a class="sec" href="(/friends.php?.+?start=.+?&amp;end=.+?;refid=\d+)"', parp)
        if len(m) == 0:
            mProfiles = re.findall(r'href="(/.+?refid=\d+?)"', parp)
            for iProfile in mProfiles:
                if re.search('home.php', iProfile) == None and re.search('tel:', iProfile) == None and re.search('logout.php', iProfile) == None and re.search('profile.php?refid', iProfile) == None and re.search('inbox', iProfile) == None and re.search('survey.php', iProfile) == None and re.search('help.php', iProfile) == None and re.search('preferences.php', iProfile) == None and re.search('findfriends.php', iProfile) == None and re.search('friends.php', iProfile) == None :
                    iProfile = iProfile.replace("&amp;", "&")

                    newPage = 'http://m.facebook.com' + iProfile
                    print('checking profile: ' + newPage)
                    time.sleep(8)
                    res = browser.open(newPage)
                    parp = res.read()
                    #mInfo = re.find(r'href="(/.+?info.+?)"', parp)
                    #mInfo = mInfo.replace("&amp;", "&")
                    mInfo = re.findall(r'Info', parp)
                    for theInfo in mInfo:
                        theInfo = theInfo.replace("&amp;", "&")
                        newPage = 'http://m.facebook.com' + theInfo
                        print('checking info: ' + newPage)
                        time.sleep(9)
                        res = browser.open(newPage)
                        parp = res.read()
                        emails = re.findall(r'"mailto:(.+?)"', parp)
                        for email in emails:
                            email = email.replace('&#64', '@')
                            print('EMAIL--: ' + email)
        for i in m:
            time.sleep(11)
            i = i.replace("&amp;", "&")
            recursPageSearch('http://m.facebook.com' + i)

    def usage():
        print 'Usage: ' + sys.argv[0] + ' user@domain.tld [password]'
        sys.exit(1)

    if __name__ == '__main__':
        main()

    • http://ruel.me Ruel

      Thanks, you’re welcome to fork: https://gist.github.com/716622 Just for code readability. I suggest you get rid of this long comment block. :P

  • paul

    what are some practical applications of this code in action?
    for example, graphing friend account activity of any type of new posting. Basically to look at activity.

  • Anonymous

    Thanks for posting this script! I’m just getting started with facebook OAuth and tried changing /me/ to a friends ID to scrape their friend list (ie. https://graph.facebook.com/btaylor/friends). The script returned:
    {
      "error": {
        "type": "OAuthException",
        "message": "An access token is required to request this resource."
      }
    }

    I have no problems viewing their friend list on the fb as we are friends so I figured it would be possible to do it under the API as well. Any ideas? Thanks again.

  • http://www.facebook.com/people/Pherson-Ngai/802644355 Pherson Ngai

    1. I run the script in the python shell 2.7.1, but it showed

    Usage: C:\Python27\Lib\infb.py user@domain.tld password

    Traceback (most recent call last):
      File "C:\Python27\Lib\infb.py", line 139, in <module>
        main()
      File "C:\Python27\Lib\infb.py", line 45, in main
        usage()
      File "C:\Python27\Lib\infb.py", line 136, in usage
        sys.exit(1)
    SystemExit: 1

    How to fix it?

    2. If you using Access Token, is it means we need to use facebook app?

    3. Can you show the output to me? On the IE or create a csv file?

  • Beza1e1

    This script does not work for localized accounts. In german there is no “Logout” on the page but “Abmelden”. Check for “logout.php” after login.

    Also, i had to remove the user-agent header, otherwise Facebook rejected the graph API access.
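
A minimal sketch of the language-independent check suggested here, assuming the logged-in page contains a link whose href points at logout.php. It is Python 3, looks_logged_in is a hypothetical helper, and the sample markup is made up.

```python
import re


def looks_logged_in(html):
    """Return True if the page links to logout.php, whatever the link text says."""
    return re.search(r'href="[^"]*logout\.php', html) is not None


# Made-up snippets: a German logged-in page and a login form.
LOGGED_IN = '<a href="/logout.php?h=abc123">Abmelden</a>'
LOGGED_OUT = '<form id="login_form" action="/login.php"></form>'
```

Matching on the link target rather than the visible word means the check does not need changing per language.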

  • Benjamin

    Hi there, i get “login failed” :(

    • http://ruel.me Ruel

      Weird, I tried with my own account and it’s working fine. Is your facebook on different language? Because it checks the word “Logout” (in english). So you might need to change that in line 78.

      • Benjamin

        My account’s set to English, but i live in the Netherlands, could it be that the default language is the local one? I can log in via my browser no problem.

        • http://ruel.me Ruel

          Yeah, probably because the script is using its own cookies. And facebook detects the language based on your IP. Try changing the “Logout” in line 78 to the equivalent word in your language. :)

          • Benjamin

            “Afmelden” indeed was the word to use. Great! I’ve got to study your code to understand what it’s doing exactly. I made some Facebook applet in Python for Cairo-Dock but i depend on FBCMD. I’d like to use python only, but i’m new to urllib2 and authentication etc. :) Thanks (I may come back with some questions ;) )

          • http://ruel.me Ruel

            Alright, no problem. :)

          • Benjamin

            Hi there Ruel, i don’t understand lines 66, 67 and 68. Could you explain?
            Also, how do you know what to pass in the url opener? (for instance line 76) Is there any documentation for that? (what i found on the Facebook site is rather confusing and does not resemble what you are doing). Thanks :) B.

          • Benjamin

            PS: a nice way to solve the local language issue is to ask the user for its first or last name and look for it in the page, it can’t miss out and it’s not language dependent (although i don’t know what would happen for special characters).

          • http://ruel.me Ruel

            The secret to automating login pages is by tracing how it actually logs in. For this one, I used Firefox + LiveHTTPHeaders. That extension sniffs the HTTP headers passed. You can see the POST data, and the POST action url. I’ll post a separate post on how to use LiveHTTPHeaders one of these days.

  • http://pulse.yahoo.com/_7XYSLNRN4R2VB4R2G4VLNLDNFY Joe

    The script fails to login for me. Message is just “Login failed”. And debug.html is empty. Does this mean the FB HTML changed and the script no longer works?

  • Durgesh1624

    hw can we scrape our frnd’s frnds name & id from facebook

  • Guest

    code doesn’t work anymore: “Login Failed”

  • Quilibet

    Hey, thats indeed nice work!
    I am trying to re-build a program like netvizz for scintific purpose. Have you any comment / tips how to realize that?
    Thanks in advance

  • http://profiles.google.com/tudzag Mike Webster

    I suspect Facebook broke this somehow.  I used the code to pull up JSON for my newsfeed and it was working fine until 12/16.  I’m now getting http 400 type errors.

  • devtha

    Hi

    how this work? pls explian me in search.gayathri@gmail.com pls

  • wkgeorge

    Hi

    I received this error when I tried to run your code, I used the updated code from this website https://gist.github.com/1511968

    AttributeError: 'NoneType' object has no attribute 'group'

    • Awood314

      I was getting this same problem, you have to change the search parameter for re

      change:
      mxt = re.search('name="post_form_id" value="(\w+)"', res.read())
      to
      mxt = re.search('id="login_form"', res.read()) should work.
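
The AttributeError itself comes from calling .group() on None when re.search finds nothing. A small guard makes that failure easier to read. This is a Python 3 sketch; require_match is a hypothetical helper and the sample strings are made up.

```python
import re


def require_match(pattern, text, label):
    """Return the first capture group, or raise a readable error if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        # re.search returns None on no match, so .group() would raise AttributeError.
        raise ValueError('could not find %s; the page layout may have changed' % label)
    return match.group(1)


# Made-up page fragment containing the field the pattern expects.
PAGE = '<input type="hidden" name="post_form_id" value="abc123" />'
PATTERN = r'name="post_form_id" value="(\w+)"'
```

Note that a pattern without a capture group, such as the id="login_form" one, has no group(1); only group(0) exists for it.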

  • tudza

    Looks like m.facebook.com/index.php no longer contains the string the regular expression is looking for?  Is there another page to try or a change to the res. line that would fix this?

  • tudza

    Okay, I tried grabbing just any number from the page this references.  The  value that comes up for the item with name=”m_ts” works for me when I assign that value to pfi.

    I don’t know if this is the correct thing to use really, but I’m getting the expected results from my script. Suppose I’ll see about altering the regular expression or do the task in some less elegant way until the author weighs in on the subject.

    We were warned Facebook would break the original script eventually. No surprises.

  • tudza

    So yeah, try this instead:

    mxt = re.search('name="m_ts" value="(\w+)"', res.read())

    • Guest

      Thank you! This works now!

  • Anirudh Goyal

     Initializing..
    Traceback (most recent call last):
      File "infb.py", line 173, in <module>
        main()
      File "infb.py", line 16, in main
        pull(user, passw)
      File "infb.py", line 49, in pull
        pfi = mxt.group(1)
    AttributeError: 'NoneType' object has no attribute 'group'
    HELP???

  • Rob

    OK, had to make some changes.  First of all:

    60:     mxt = re.search('id="login_form"', res.read())
    61:    pfi = mxt.group(0)

    Script is slow.  I get 4 results every 60 seconds or so.  So don’t think it just hangs.  I think FB has a limit on queries per second.  I’ll tweak it some more.  Muchas gracias Ruel.

  • Rob

    Goodness gracious it’s slow.  Can I set the timeout to be longer, or is it through when they close my session?

  • atylus

    Hi ,

    I am getting this error on running the script

    Initializing..
    Traceback (most recent call last):
      File "xxxxx.py", line 140, in <module>
        main()
      File "xxxxx.py", line 61, in main
        pfi = mxt.group(1)
    AttributeError: 'NoneType' object has no attribute 'group'

  • Ted

    Ruel please explain how someone can view the contents of the stored cookie within this session. Thanks!

  • nour

    I have this problem , where is the solution
    res = browser.open('http://m.facebook.com/index.php')
    NameError: name 'browser' is not defined
    Thanks