Scraping Websites for Fun and Profit Part 2

If you have not read Part 1, please take a moment to do so:
Scraping Websites for Fun and Profit Part 1

A few weeks ago, Nickycakes wrote about getting your feet wet with website scraping. If you’re interested in learning how to use PHP to grab content from other sites automatically, you should check out this book. It basically has everything you need to get started.

Anyway, the author of said book has published a set of library files for PHP that make scraping and parsing anything on the web fairly painless. You can download the entire set of library files here:
http://www.nickycakes.com/files/LIB_http.zip

There are a bunch of files inside, but you will probably only be using a couple of them for most tasks: LIB_http.php and LIB_parse.php. You can use the functions from these libraries by putting the files in the same directory as your PHP script and adding the line include("LIB_http.php"); to your script. Inside each of the files is a description of the functions it includes. If you’re writing a scraper, you will have to edit LIB_http.php a little so that your script identifies itself as a browser and not a PHP script.
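
LIB_http.php relies on PHP’s cURL extension (it calls curl_init()), so it can be worth checking for it before including the library. Here is a minimal sketch of such a check. The helper name missing_extensions() is made up for this example and is not part of the library, and the include only runs if LIB_http.php actually sits next to your script.

```php
<?php
// Hypothetical helper (not part of LIB_http.php): returns the names of the
// required extensions that are NOT loaded in the current PHP interpreter.
function missing_extensions(array $required): array
{
    $missing = array_filter($required, function ($name) {
        return !extension_loaded($name);
    });
    return array_values($missing); // reindex from 0
}

$missing = missing_extensions(array('curl'));

if ($missing) {
    // The command line and the web server can load different php.ini files,
    // so report which one this interpreter is using.
    echo "Missing extensions: " . implode(', ', $missing) . "\n";
    echo "php.ini in use: " . (php_ini_loaded_file() ?: '(none)') . "\n";
} elseif (is_file(__DIR__ . '/LIB_http.php')) {
    include __DIR__ . '/LIB_http.php';
}
```

If you run scripts both from the command line and through a web server, run the check in both places, since each may have its own configuration.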

Here’s a description of the most useful functions in these files for scraping websites:

LIB_http.php

  • http_get($target, $ref)
    You give it $target (the URL you want to grab) and $ref (where you want the website to think you came from) and it will return an array with three elements: FILE, STATUS, and ERROR. $return_array['FILE'] will have the contents of the webpage, $return_array['STATUS'] will have the curl status of the transfer, and $return_array['ERROR'] will have the curl error status. Example:
    $target = "http://www.google.com";
    $ref = "http://www.yahoo.com";
    $google_frontpage = http_get($target, $ref);
    echo($google_frontpage['FILE']);
    Displays the Google front page.
  • http_post_form($target, $ref, $data_array)
    Submit a form with POST method.  Same $target and $ref information as above.  $data_array should include the information you’re submitting.  Example:
    $target = "https://login.facebook.com/login.php";
    $ref = "http://www.facebook.com";
    $data_array['email'] = "your@email.com";
    $data_array['pass'] = "password";
    $results = http_post_form($target,$ref,$data_array);
    echo($results['FILE']);

    Congrats… now you’re logged into Facebook. (You may have to run the script twice initially while curl sets up your cookie file.)
  • http_get_form($target,$ref,$data_array)
    Works the same way as http_post_form, but submits the data with the GET method.
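
To make the GET case concrete: submitting a form with GET boils down to URL-encoding the fields and appending them to the target as a query string. The sketch below shows that idea using PHP’s built-in http_build_query(). The helper build_get_url() and the example.com address are assumptions for illustration, not the library’s actual code.

```php
<?php
// Hypothetical helper (not part of LIB_http.php): builds the URL that a
// GET form submission would request for the given fields.
function build_get_url(string $target, array $data_array): string
{
    // Encode the fields, forcing "&" as the separator regardless of php.ini.
    $query = http_build_query($data_array, '', '&');
    if ($query === '') {
        return $target; // nothing to submit
    }
    // Use "&" if the target already carries a query string, "?" otherwise.
    $separator = (strpos($target, '?') === false) ? '?' : '&';
    return $target . $separator . $query;
}

$data_array = array('q' => 'web scraping', 'num' => 10);
echo build_get_url('http://www.example.com/search', $data_array), "\n";
// prints http://www.example.com/search?q=web+scraping&num=10
```

With the library itself you would simply pass the same $data_array to http_get_form($target, $ref, $data_array).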

LIB_parse.php

Ok, so it’s probably better for you to just open this one up and read the comments.  There are a few simple functions in here that should let you easily parse any website without knowing how to use Regular Expressions.  Regular Expression functions in php, in addition to being hard to learn for a newbie, are stupidly inefficient and will slow your programs down, so you don’t want to use them for parsing websites anyway.
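
For simple extractions, PHP’s plain string functions are often enough. Here is a minimal sketch that pulls out the text between two markers using only strpos() and substr(). The name text_between() is made up for this example and is not a function from LIB_parse.php.

```php
<?php
// Hypothetical helper (not from LIB_parse.php): returns the text found
// between the first $start marker and the next $end marker, or null if
// either marker is missing.
function text_between(string $haystack, string $start, string $end): ?string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start); // skip past the start marker itself

    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($haystack, $from, $to - $from);
}

$html = '<html><head><title>Example Domain</title></head><body></body></html>';
echo text_between($html, '<title>', '</title>'), "\n";
// prints Example Domain
```

The match is case-sensitive and only finds the first occurrence, so real pages with messy markup may need something more forgiving.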

If you are going to be scraping websites that require you to submit form information, you will want to download and install the Web Developer Toolbar for Firefox. It helps you figure out the form field information in a hurry without viewing the page source.
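
If you would rather pull the field names out from a script, something like the following could work. It is only a sketch: it assumes PHP’s DOM extension is enabled, the function name list_form_fields() is invented for this example, and the HTML is a made-up login form rather than any real site’s markup.

```php
<?php
// Hypothetical helper: returns name => value for every named <input>,
// <select> and <textarea> found in the given HTML. For <select> and
// <textarea> the value is usually empty, since they rarely carry a
// "value" attribute.
function list_form_fields(string $html): array
{
    if (trim($html) === '') {
        return array();
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    foreach (array('input', 'select', 'textarea') as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $name = $node->getAttribute('name');
            if ($name !== '') {
                $fields[$name] = $node->getAttribute('value');
            }
        }
    }
    return $fields;
}

// Made-up form for illustration only.
$html = '<form method="post" action="/login.php">'
      . '<input type="text" name="email" value="">'
      . '<input type="password" name="pass" value="">'
      . '<input type="hidden" name="token" value="abc123">'
      . '</form>';

print_r(list_form_fields($html));
```

Hidden fields like the token above are exactly the kind of extra data a login form may expect in your $data_array.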

Hope this helps.

Peanut Gallery

  • Alright Nicky…the fact that you posted this shit NOW, instead of yesterday when I was coding up the promo tool, really pisses me off ;) Ah well…your code’s gonna be sharper than mine…so be it.

  • Very useful stuff Nicky, thanks a lot for sharing! :)

  • Dude, I’ve never seen that firefox add-on (the web dev toolbar) before, but it kicks ass. Thanks for the link.

  • yeah it really rocks, i love it

  • Nice article, but i don’t get your comment above, since regex is exactly what this library is using. I don’t see any other way to parse. What am i missing here?

    “Regular Expression functions in php, in addition to being hard to learn for a newbie, are stupidly inefficient and will slow your programs down, so you don’t want to use them for parsing websites anyway.”

  • Yeah, I looked closer a little later and found out that it does use regex. However, you can use the str functions in php, which are much quicker.

  • Nicky,

    Just wondering how you made it work. I bought the book and tried to run the first couple of examples, but I got this:

    Fatal error: Call to undefined function curl_init() in /home/admin/php/LIB_http.php on line 210

    I checked the LIB_http.php file; there is a line calling curl_init(), but nowhere in the code is it defined.

    Am I missing anything here?

    Rgds,
    Chip

  • Sorry Dude, got it fixed. The php.ini under php5-cli was not set up properly. I have two php.ini files, one under apache2 and one under php5-cli. The apache2 one already had the line “extension=curl.so”, which misled me into thinking I had it set up correctly. I appended the line to the php5-cli php.ini file and it is now working.

  • yeah gotta have curl enabled =)

  • hi…I have tried your code and i get this error:
    Parse error: parse error, unexpected ‘[‘ in E:\Program Files\EasyPHP 2.0b1\www\http_get.php on line 6

    here is my code:

  • Took a long time to find your site, but this did the trick. The site I was trying to log into required additional data items (form ‘names’) for the data array. I was able to discover the additional fields using the Web Developer toolbar and comparing the results to facebook.
