Scraping Websites for Fun and Profit Part 2
Posted on December 22nd, 2007 in Automation, Coding
If you have not read Part 1, please take a moment to do so:
Scraping Websites for Fun and Profit Part 1
A few weeks ago, Nickycakes wrote about getting your feet wet with website scraping. If you’re interested in learning how to use php to grab content from other sites automatically, you should check out this book. It basically has everything you need to get started.
Anyway, the author of said book has published a set of library files for php that make scraping and parsing anything on the web fairly painless. You can download the entire set of library files here:
http://www.nickycakes.com/files/LIB_http.zip
There are a bunch of files inside, but you will probably only be using a couple of them for most tasks: LIB_http.php and LIB_parse.php. You can include the functions from these libraries by putting the files in the same directory as your php script and adding the line include("LIB_http.php"); to your script. Inside each of the files is a description of the functions it includes. If you’re writing a scraper, you will have to edit LIB_http.php a little so that your script looks like a browser and not a php script.
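As a rough illustration of what "looking like a browser" involves, here is a minimal sketch that uses PHP's curl functions directly. The function names, the user agent string, and the cookie file name are made up for this example; they are not taken from LIB_http.php, so check the comments in that file for the settings it actually uses.

```php
<?php
// Hypothetical helper: the values a scraper typically sends to resemble a browser.
// The user agent string is only an example; copy one from a real browser.
function browser_like_settings($ref)
{
    return array(
        'user_agent'  => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
        'referer'     => $ref,
        'cookie_file' => 'cookies.txt', // assumed file name
    );
}

// Hypothetical fetch function showing where those values go in curl.
function fetch_as_browser($target, $ref)
{
    $settings = browser_like_settings($ref);

    $ch = curl_init($target);
    curl_setopt($ch, CURLOPT_USERAGENT, $settings['user_agent']);
    curl_setopt($ch, CURLOPT_REFERER, $settings['referer']);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $settings['cookie_file']);  // save cookies here
    curl_setopt($ch, CURLOPT_COOKIEFILE, $settings['cookie_file']); // and send them back
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);                 // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);                 // return the page instead of printing it

    return curl_exec($ch); // page contents, or false on failure
}
```

The user agent and referer are the two things a site is most likely to look at, which is why they get their own helper here.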
Here’s a description of the most useful functions in these files for scraping websites:
LIB_http.php
- http_get($target, $ref)
You give it $target (the URL you want to grab) and $ref (where you want the website to think you came from), and it returns an array with 3 elements: FILE, STATUS, and ERROR. $return_array['FILE'] holds the contents of the webpage, $return_array['STATUS'] holds the curl status of the transfer, and $return_array['ERROR'] holds the curl error status. Example:
$target = "http://www.google.com";
$ref = "http://www.yahoo.com";
$google_frontpage = http_get($target, $ref);
echo($google_frontpage['FILE']);
Displays google frontpage.
- http_post_form($target, $ref, $data_array)
Submit a form with POST method. Same $target and $ref information as above. $data_array should include the information you’re submitting. Example:
$target = "https://login.facebook.com/login.php";
$ref = "http://www.facebook.com";
$data_array['email'] = "your@email.com";
$data_array['pass'] = "password";
$results = http_post_form($target, $ref, $data_array);
echo($results['FILE']);
Congrats…now you’re logged into facebook. (You may have to run the script twice initially as curl sets up your cookie file.)
- http_get_form($target, $ref, $data_array)
Works the same way as post, but does it with GET method.
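Since these functions appear to hand back the same kind of array, it is worth checking the transfer before trusting FILE. The sketch below assumes that STATUS holds the array PHP's curl_getinfo() returns (so it has an http_code entry) and that ERROR is empty when nothing went wrong; verify both against the comments in LIB_http.php. The function name fetch_succeeded is made up for this example.

```php
<?php
// Hypothetical helper: decide whether a FILE/STATUS/ERROR array looks usable.
function fetch_succeeded($result)
{
    // A non-empty ERROR means curl itself failed (DNS, timeout, and so on).
    if (!empty($result['ERROR'])) {
        return false;
    }
    // Assumed: STATUS['http_code'] is the HTTP response code.
    $code = isset($result['STATUS']['http_code']) ? (int) $result['STATUS']['http_code'] : 0;
    return $code >= 200 && $code < 300;
}

// Usage with the library (not run here):
// $page = http_get("http://www.google.com", "http://www.yahoo.com");
// if (fetch_succeeded($page)) { echo($page['FILE']); }
```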
LIB_parse.php
Ok, so it’s probably better for you to just open this one up and read the comments. There are a few simple functions in here that should let you easily parse any website without knowing how to use Regular Expressions. Regular Expression functions in php, in addition to being hard to learn for a newbie, are stupidly inefficient and will slow your programs down, so you don’t want to use them for parsing websites anyway.
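To give a feel for parsing without regular expressions, here is a minimal sketch built on PHP's plain string functions. text_between is a hypothetical helper written for this example, not a function from LIB_parse.php; the library's own functions are described in its comments.

```php
<?php
// Return the text found between $start and $end in $string,
// or false if either marker is missing. Matching is case-insensitive.
function text_between($string, $start, $end)
{
    $from = stripos($string, $start);
    if ($from === false) {
        return false;
    }
    $from += strlen($start);             // skip past the opening marker
    $to = stripos($string, $end, $from); // look for the closing marker after it
    if ($to === false) {
        return false;
    }
    return substr($string, $from, $to - $from);
}

$html = '<html><head><TITLE>Example Page</TITLE></head></html>';
echo text_between($html, '<title>', '</title>'); // prints "Example Page"
```

This only grabs the first match, which is usually enough for things like a page title or a single price on a page.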
If you are going to be scraping websites that require you submit form information, you will want to download and install Web Developer Toolbar for Firefox to help you figure out the form field information in a hurry without viewing the page source.
Hope this helps.

December 22nd, 2007 at 11:36 pm
Alright Nicky…the fact that you posted this shit NOW, instead of yesterday when I was coding up the promo tool, really pisses me off.
Ah well…your code’s gonna be sharper than mine…so be it.
December 23rd, 2007 at 5:16 am
Very useful stuff Nicky, thanks a lot for sharing!
December 24th, 2007 at 1:13 am
Dude, I’ve never seen that firefox add-on (the web dev toolbar) before, but it kicks ass. Thanks for the link.
December 26th, 2007 at 1:51 am
Yeah, it really rocks. I love it.
January 28th, 2008 at 9:02 am
Nice article, but I don’t get your comment above, since regex is exactly what this library is using. I don’t see any other way to parse. What am I missing here?
“Regular Expression functions in php, in addition to being hard to learn for a newbie, are stupidly inefficient and will slow your programs down, so you don’t want to use them for parsing websites anyway.”
January 28th, 2008 at 12:12 pm
Yeah, I looked closer a little later and found out that it does use regex. However, you can use the str functions in php, which are much quicker.