Scraping Websites for Fun and Profit Part 2

Posted on December 22nd, 2007 in Automation, Coding

If you have not read Part 1, please take a moment to do so:
Scraping Websites for Fun and Profit Part 1

A few weeks ago, Nickycakes wrote about getting your feet wet with website scraping.  If you’re interested in learning how to use PHP to grab content from other sites automatically, you should check out this book.  It has basically everything you need to get started.

Anyway, the author of said book has published a set of library files for PHP that make scraping and parsing anything on the web fairly painless.  You can download the entire set of library files here:
http://www.nickycakes.com/files/LIB_http.zip

There are a bunch of files inside, but for most tasks you will probably only use two of them: LIB_http.php and LIB_parse.php.  To use their functions, put the files in the same directory as your PHP script and add the line include("LIB_http.php"); to the script.  Each file contains a description of the functions it provides.  If you’re writing a scraper, you will have to edit LIB_http.php a little so that your script looks like a browser and not a PHP script.

Here’s a description of the most useful functions in these files for scraping websites:

LIB_http.php

  • http_get($target, $ref)
    You give it $target (the URL you want to grab) and $ref (where you want the website to think you came from), and it returns an array with three elements: FILE, STATUS, and ERROR.  $return_array['FILE'] holds the contents of the web page, $return_array['STATUS'] holds the curl status of the transfer, and $return_array['ERROR'] holds the curl error status.  Example:
    $target = "http://www.google.com";
    $ref = "http://www.yahoo.com";
    $google_frontpage = http_get($target, $ref);
    echo($google_frontpage['FILE']);
    This displays the Google front page.
  • http_post_form($target, $ref, $data_array)
    Submits a form with the POST method.  $target and $ref work the same way as above.  $data_array should hold the form fields you’re submitting.  Example:
    $target = "https://login.facebook.com/login.php";
    $ref = "http://www.facebook.com";
    $data_array = array();
    $data_array['email'] = "your@email.com";
    $data_array['pass'] = "password";
    $results = http_post_form($target, $ref, $data_array);
    echo($results['FILE']);

    Congrats… now you’re logged into Facebook. (You may have to run the script twice initially while curl sets up your cookie file.)
  • http_get_form($target,$ref,$data_array)
    Works the same way as http_post_form, but submits the data with the GET method.
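
To give a rough idea of what goes on underneath functions like these, here is a minimal sketch that uses PHP’s curl extension directly.  It is not the code from LIB_http.php: the function names build_request_options and fetch, the user agent string, and the cookies.txt path are all made up for this example.  It only mirrors the behaviour described above (a referer, a browser-like agent, a cookie file, and a FILE/STATUS/ERROR return array).

```php
<?php
// Hypothetical helper, NOT part of LIB_http.php: builds the curl options
// for a GET or POST request that sends a referer and a browser-like agent.
function build_request_options($target, $ref, array $data = array(), $method = 'GET')
{
    $options = array(
        CURLOPT_URL            => $target,
        CURLOPT_REFERER        => $ref,
        // Assumed user agent string; pick whatever browser you want to imitate.
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleBrowser/1.0)',
        CURLOPT_RETURNTRANSFER => true,           // return the page instead of printing it
        CURLOPT_FOLLOWLOCATION => true,           // follow redirects
        CURLOPT_COOKIEJAR      => 'cookies.txt',  // assumed cookie file: written after the request
        CURLOPT_COOKIEFILE     => 'cookies.txt',  // and read back on the next request
    );

    $query = http_build_query($data);             // e.g. email=a%40b.c&pass=secret
    if ($method === 'POST') {
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = $query;    // form fields go in the request body
    } elseif ($query !== '') {
        // GET form: the fields are appended to the URL instead.
        $separator = (strpos($target, '?') === false) ? '?' : '&';
        $options[CURLOPT_URL] = $target . $separator . $query;
    }
    return $options;
}

// Hypothetical wrapper that performs the request and returns an array
// shaped like the one described above (FILE, STATUS, ERROR).
function fetch($target, $ref, array $data = array(), $method = 'GET')
{
    $ch = curl_init();
    curl_setopt_array($ch, build_request_options($target, $ref, $data, $method));

    $result = array();
    $result['FILE']   = curl_exec($ch);     // page contents, or false on failure
    $result['STATUS'] = curl_getinfo($ch);  // transfer details (HTTP code, timings, ...)
    $result['ERROR']  = curl_error($ch);    // empty string when nothing went wrong
    return $result;
}
```

With this sketch, fetch("http://www.google.com", "http://www.yahoo.com") would play the role of the http_get example above, and passing a data array with 'POST' as the last argument would play the role of http_post_form.  The cookie file only exists after the first request has finished, which is probably why a login script can need a second run.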

LIB_parse.php

Ok, so it’s probably better for you to just open this one up and read the comments.  There are a few simple functions in here that should let you easily parse any website without knowing how to use Regular Expressions.  Regular Expression functions in php, in addition to being hard to learn for a newbie, are stupidly inefficient and will slow your programs down, so you don’t want to use them for parsing websites anyway.
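
For a feel of what simple delimiter-based parsing looks like, here is a minimal sketch that uses only PHP’s built-in string functions (strpos and substr).  The helper name text_between is made up for this example and is not one of the functions in LIB_parse.php.

```php
<?php
// Hypothetical helper (not from LIB_parse.php): returns the text found
// between $start and $stop, or null when either delimiter is missing.
function text_between($haystack, $start, $stop)
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;                        // opening delimiter not found
    }
    $from += strlen($start);                // skip past the opening delimiter

    $to = strpos($haystack, $stop, $from);  // look for the closer after that point
    if ($to === false) {
        return null;                        // closing delimiter not found
    }
    return substr($haystack, $from, $to - $from);
}

// Made-up page source for illustration.
$html = '<html><head><title>Example Page</title></head><body></body></html>';
echo text_between($html, '<title>', '</title>'), "\n";  // prints "Example Page"
```

Note that strpos is case-sensitive, so this sketch would miss a <TITLE> tag; swapping in stripos is one way around that.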

If you are going to be scraping websites that require you to submit form information, you will want to download and install the Web Developer Toolbar for Firefox.  It helps you figure out the form field information in a hurry without having to read the page source.
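
If you would rather have a script list the form fields for you, something along these lines might work.  It is only a sketch: it assumes PHP’s DOM extension is available, the helper name list_input_fields is made up, and it only looks at <input> tags (not <select> or <textarea>).

```php
<?php
// Hypothetical helper: lists the name => value pairs of the <input>
// fields found in a chunk of HTML, using PHP's DOM extension.
function list_input_fields($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep the parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');   // '' when the attribute is missing
        if ($name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Made-up login form for illustration.
$html = '<form action="/login" method="post">'
      . '<input type="text" name="email">'
      . '<input type="password" name="pass">'
      . '<input type="hidden" name="token" value="abc123">'
      . '</form>';
print_r(list_input_fields($html));
```

The keys of the returned array are the names you would use in $data_array; hidden fields such as the made-up token above usually need to be submitted along with the visible ones.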

Hope this helps.

Published by nickycakes

6 Responses to “Scraping Websites for Fun and Profit Part 2”

  1. Nick Says:

    Alright Nicky… the fact that you posted this shit NOW, instead of yesterday when i was coding up the promo tool really pisses me off ;) Ah well… your code’s gonna be sharper than mine… so be it.

  2. Adam Says:

    Very useful stuff Nicky, thanks a lot for sharing! :)

  3. Robert Norton Says:

    Dude, I’ve never seen that firefox add-on (the web dev toolbar) before, but it kicks ass. Thanks for the link.

  4. nickycakes Says:

    yeah it really rocks, i love it

  5. nohatter Says:

    Nice article, but i don’t get your comment above, since regex is exactly what this library is using. I don’t see any other way to parse. What am i missing here?

    “Regular Expression functions in php, in addition to being hard to learn for a newbie, are stupidly inefficient and will slow your programs down, so you don’t want to use them for parsing websites anyway.”

  6. nickycakes Says:

    Yeah, I looked closer a little later and found out that it does use regex. However, you can use the str functions in php, which are much quicker.
