Tag Archive > data

Committing Sensitive Data

» 18 July 2012 » In C#, Git, Open-Source, Programming, Security » 2 Comments

I spent the whole night trying to figure out how to revert commits and remove sensitive information from my public repositories. It made me realize how important it is to learn and practice git.

I mostly do open source projects in C#, and I always put data that needs changing, like API keys, in app.config. At first I thought I was on the safe road, since I had told git to ignore changes to my app.config file.

git update-index --assume-unchanged app.config

But I was wrong. I was only partially on the safe road.

Several hours ago, while working on a project, I noticed that Settings.settings and Settings.Designer.cs update whenever I change something in my settings variables. When I looked at the contents of those files, panic filled my soul: I realized that app.config is not the only file I should be untracking.

I furiously searched for a way to resolve this. Thinking I would have to go over one commit at a time, I began to feel I had made a huge mistake: I had a database connection string and my personal phone number in two separate projects on GitHub.

Luckily I found this article. Without further ado, I applied the commands to every repository in which I had used app.config to store sensitive data.

Afterwards, I ran the command I had used for app.config on Settings.settings and Settings.Designer.cs as well. Now I am both ashamed to be in such a situation and happy to have learned something new.
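As a rough sketch of that last step, the same command could be generated for all three files from a small Python script. The helper names here are hypothetical and not part of any tool. Note also that --assume-unchanged only tells git to stop noticing local edits, so it does nothing about data that is already in the history.

```python
import subprocess

# File names taken from the post; adjust them to the project at hand.
SENSITIVE_FILES = ["app.config", "Settings.settings", "Settings.Designer.cs"]


def assume_unchanged_command(path):
    """Build the git command from the post for a single file (hypothetical helper)."""
    return ["git", "update-index", "--assume-unchanged", path]


def mark_assume_unchanged(paths, repo_dir="."):
    """Run the command for each path; repo_dir must be a git working tree."""
    for path in paths:
        subprocess.check_call(assume_unchanged_command(path), cwd=repo_dir)


# Only print the commands here, so nothing is changed by running the sketch.
for path in SENSITIVE_FILES:
    print(" ".join(assume_unchanged_command(path)))
```

Calling mark_assume_unchanged(SENSITIVE_FILES) inside a repository would actually run the commands.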


Web Page Scraping with Python

» 06 November 2010 » In Guides, Python » 6 Comments

Whenever we come across a web page, we read its content, and sometimes save it for later reading and analysis. Some web pages offer a lot of good information that may be useful later, and sometimes that information is not easy to copy. It takes a lot of patience, especially when the information you want to extract is in a bad format. Luckily, there is what we call scraping, and programming makes it faster.

Say, for example, we need to harvest some email addresses from a web page. Let’s use this page as an example. It’s a web page with several email addresses, and our goal is to copy them all. Of course, it would be easy enough to copy and paste the addresses with the mouse. But what if there’s another page to be scraped, with thousands of unique email addresses? That would be a pain, wouldn’t it?

We will extract the data with regular expressions. They are very popular and overused, but I believe they are the best tool for this job. For the language, we will use Python. To be honest, I haven’t learned Python very well. I’m more of a Perl guy, but it’s time to learn something new. My version of Python is 2.5; you can use the latest 2.7. You can use 3.x too, but I don’t recommend it. Most Linux distributions come with Python installed, just like Perl. On Windows, you can download ActivePython.

The first thing we need is to find or write a regular expression for the harvest. You can search Google and find the one you need. Here’s an example:

([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})

Let me explain this regular expression before we continue. The [A-Z0-9._%+-] part means match any one of the following characters: A-Z, 0-9, a dot (.), underscore (_), percent sign (%), plus sign (+) and minus sign (-). The + after it means match one or more characters from the set inside the brackets []. So [A-Z0-9._%+-]+ matches the local part of the email address. Next is the @ character, which is literal: match a single @ after the local part. Finally, [A-Z0-9.-]+\.[A-Z]{2,4} matches the domain: one or more letters, digits, dots or hyphens, then a literal dot, then a top-level domain of two to four letters. Note that A-Z covers only uppercase letters, so the pattern has to be applied case-insensitively to match typical addresses.
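To see the pattern at work before touching any web page, here is a small sketch that runs it against made-up sample text. It is written for Python 3, which is an assumption on my side (the post uses Python 2), and the addresses are invented for illustration.

```python
import re

# Same pattern as above; re.IGNORECASE lets A-Z match lowercase letters too.
EMAIL_RE = re.compile(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)

# Made-up sample text, not taken from the example page.
sample = "Contact alice.smith+news@example.com or BOB_99@mail.example.org today."

found = EMAIL_RE.findall(sample)
print(found)  # ['alice.smith+news@example.com', 'BOB_99@mail.example.org']

# The {2,4} limit cuts off top-level domains longer than four letters.
truncated = EMAIL_RE.search("user@example.museum").group(1)
print(truncated)  # user@example.muse
```

The second check shows one limitation of this particular pattern: a longer top-level domain is silently shortened rather than rejected.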

Here’s the code in Python. It prints out all the email addresses found on our example page.

#!/usr/bin/python

import urllib2
import re

request = urllib2.Request('http://www.daniweb.com/forums/thread218079.html')
# The standard header name is hyphenated: User-Agent
request.add_header('User-Agent', 'Ruel.ME Sample Scraper')
response = urllib2.urlopen(request)
for line in response.read().split('\n'):
	# finditer reports every address on the line, not just the first one
	for match in re.finditer(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', line, re.I):
		print match.group(1)

I will explain the code line by line:

  • import urllib2 imports the urllib2 module for our connection
  • import re imports the regular expression module
  • request = urllib2.Request('http://www.daniweb.com/forums/thread218079.html') builds the request for the page
  • request.add_header('User-Agent', 'Ruel.ME Sample Scraper') adds a User-Agent header
  • response = urllib2.urlopen(request) performs the request
  • for line in response.read().split('\n'): reads the HTML source line by line
  • for match in re.finditer(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', line, re.I): searches the line with the regular expression, ignoring case (re.I), and loops over every match
  • print match.group(1) displays the match

It was actually pretty easy. The output of the script is the list of email addresses from that page. Web page scraping is not a hard task; it’s easy and exciting. You can use other languages too, like Perl or PHP. With these methods, data extraction is easier and faster.
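For readers on Python 3, where urllib2 no longer exists, the same steps might look like the sketch below. This is an assumption-laden adaptation, not the script from this post: the function names extract_emails and fetch_page are made up for illustration, the page is assumed to decode as UTF-8, and duplicate addresses are dropped.

```python
import re
import urllib.request

EMAIL_RE = re.compile(r'([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)


def extract_emails(html):
    """Return the unique addresses found in html, in order of first appearance."""
    seen = []
    for address in EMAIL_RE.findall(html):
        if address not in seen:
            seen.append(address)
    return seen


def fetch_page(url, user_agent='Ruel.ME Sample Scraper'):
    """Download url and return its body as text (assumes UTF-8, replaces bad bytes)."""
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8', errors='replace')


# Offline demonstration on a made-up snippet, so no network access is needed.
sample_html = '<p>Mail <a href="mailto:jane@example.com">jane@example.com</a> or joe@example.net</p>'
print(extract_emails(sample_html))  # ['jane@example.com', 'joe@example.net']
```

Calling extract_emails(fetch_page('http://www.daniweb.com/forums/thread218079.html')) would repeat the request from the script above, provided the page is still available.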
