Wikipedia Scraper

Posted on January 6th, 2008 in Automation, Coding, Wikipedia

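For a recent project, Nickycakes had to code a Wikipedia scraper. Here’s a simplified version of the function, free to use. It fetches the article page for a topic and returns the paragraph text with HTML tags and bracketed markers (such as footnote numbers) stripped out. The code requires a few library files, which are included in LIB_http.zip.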

Enjoy:

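include('LIB_http.php');
include('LIB_parse.php');

function wikiscrape($topic){
   // Wikipedia titles use underscores in place of spaces
   $target = "http://en.wikipedia.org/wiki/".rawurlencode(str_replace(" ","_",$topic));
   $results = http_get($target,"");
   $paragraphs = parse_array($results['FILE'],"<p>","</p>",EXCL);
   $final = ""; // initialize so the first append does not hit an undefined variable
   foreach($paragraphs as $paragraph){
     $paragraph = strip_tags($paragraph);
     // strip each bracketed marker on its own, e.g. [1] or [citation needed]
     $paragraph = trim(preg_replace("/\[[^\]]*\]/","",$paragraph));
     if ($paragraph !== ""){
       $final .= $paragraph . "\n\n";
     }
   }
   return $final;
}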
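
To try the cleanup step without the library files or a network connection, here is a minimal sketch that uses only PHP built-ins. The name wiki_clean_paragraphs and the sample HTML are made up for illustration, and the paragraph matching is done with preg_match_all rather than parse_array, so treat it as an approximation of the function above rather than a drop-in replacement.

```php
<?php
// Hypothetical helper (not from the original project): pulls paragraph text
// out of an HTML string and removes bracketed markers.
function wiki_clean_paragraphs($html) {
    $final = '';
    // Capture the contents of each <p>...</p>, allowing attributes on <p>.
    preg_match_all('/<p\b[^>]*>(.*?)<\/p>/is', $html, $matches);
    foreach ($matches[1] as $paragraph) {
        $paragraph = strip_tags($paragraph);
        // Remove markers such as [1] or [citation needed], one at a time.
        $paragraph = trim(preg_replace('/\[[^\]]*\]/', '', $paragraph));
        if ($paragraph !== '') {
            $final .= $paragraph . "\n\n";
        }
    }
    return $final;
}

// Made-up sample input standing in for a fetched article page.
$sample = '<p>PHP is a <b>scripting</b> language.<sup>[1]</sup></p>'
        . '<p class="note">It runs on servers.[citation needed]</p>'
        . '<p>   </p>';
echo wiki_clean_paragraphs($sample);
```

The sample has two bracketed markers in separate places, which is the case where a greedy pattern like \[.*\] would delete everything between the first "[" and the last "]" in a paragraph.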
Published by nickycakes

5 Responses to “Wikipedia Scraper”

  1. rgh Says:

    Well now.
    Combine that with the dialectizer: http://www.rinkworks.com/dialect/
    and you are one step closer to enormous cosmic power!

  2. freshinc Says:

    What type of info would you scrape from wikipedia?

  3. nickycakes Says:

    articles, generally. say you have a database site with a bunch of different topics and you want to quickly add a bunch of content. you could scrape a wikipedia article about each one, etc.

  4. Rob Says:

    Why scrape wikipedia when you can download the entire database dump for free? -> http://en.wikipedia.org/wiki/Wikipedia:Database_download

  5. nickycakes Says:

    yeah, downloading the db is good sometimes, but it's hard to automatically grab a relevant article without google. the best way to do it is to search google with a query like:

    site:wikipedia.org yourtopic

    and then scrape the first result
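
The query from the comment above can be assembled in PHP roughly as follows. This is only a sketch: the function name wiki_search_url is hypothetical, the https://www.google.com/search?q= address is an assumption about Google's public search URL, and fetching or parsing the result page is not shown.

```php
<?php
// Hypothetical helper: builds a site-restricted search URL for a topic.
function wiki_search_url($topic) {
    $query = 'site:wikipedia.org ' . $topic;
    // urlencode() turns spaces into "+" and escapes ":" as %3A.
    return 'https://www.google.com/search?q=' . urlencode($query);
}

echo wiki_search_url('yourtopic'), "\n";
```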
