Wikipedia Scraper
Posted on January 6th, 2008 in Automation, Coding, Wikipedia
For a recent project, Nickycakes had to code a Wikipedia scraper. Here’s a simplified version of the function for you to use if you want. This code requires two library files, LIB_http.php and LIB_parse.php, which are included in LIB_http.zip.
Enjoy:
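// Helper libraries from LIB_http.zip (they provide http_get and parse_array)
include('LIB_http.php');
include('LIB_parse.php');

// Fetch the English Wikipedia article for $topic and return its paragraphs as plain text
function wikiscrape($topic){
    // Wikipedia titles use underscores for spaces; urlencode() would turn them into "+"
    $target = "http://en.wikipedia.org/wiki/" . rawurlencode(str_replace(" ", "_", $topic));
    $results = http_get($target, "");
    $paragraphs = parse_array($results['FILE'], "<p>", "</p>", EXCL);
    $final = "";  // initialize so the first append does not raise a notice
    foreach($paragraphs as $paragraph){
        $paragraph = strip_tags($paragraph);
        // Remove bracketed markers such as [1]; [^\]]* stops at the first closing bracket
        $paragraph = preg_replace('/\[[^\]]*\]/', "", $paragraph);
        if (trim($paragraph) !== ""){
            $final = $final . $paragraph . "\n\n";
        }
    }
    return $final;
}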
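If you want to poke at the cleanup step from inside the loop without downloading the LIB files, here is a rough standalone sketch. The helper name clean_paragraph and the sample HTML are made up for illustration and are not part of LIB_http or LIB_parse. It uses a pattern that stops at the first closing bracket, so each marker like [1] is removed on its own. A greedy `.*` would instead delete everything between the first `[` and the last `]` in the paragraph.

```php
<?php
// Hypothetical helper (not from LIB_http / LIB_parse): clean one paragraph of HTML.
function clean_paragraph($html){
    // Drop all HTML tags, keeping only the text between them
    $text = strip_tags($html);
    // Remove each bracketed marker separately, e.g. [1] or [citation needed]
    $text = preg_replace('/\[[^\]]*\]/', '', $text);
    // Trim leftover whitespace so empty paragraphs come back as ""
    return trim($text);
}

// Made-up sample paragraph with two bracketed markers
$sample = '<p><b>PHP</b> is a scripting language.[1] It runs on servers.[citation needed]</p>';

echo clean_paragraph($sample), "\n";
// prints: PHP is a scripting language. It runs on servers.

// For comparison, the greedy pattern swallows the text between the markers too
echo preg_replace('/\[.*\]/', '', strip_tags($sample)), "\n";
// prints: PHP is a scripting language.
```

Note that strip_tags leaves HTML entities like &amp;amp; alone, so you may want to run html_entity_decode on the result depending on where the text ends up.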
Published by nickycakes






January 6th, 2008 at 1:53 pm
Well now.
Combine that with the dialectizer: http://www.rinkworks.com/dialect/
and you are one step closer to enormous cosmic power!
January 6th, 2008 at 10:49 pm
What type of info would you scrape from wikipedia?
January 6th, 2008 at 10:51 pm
articles, generally. say you have a database site with a bunch of different topics and you want to quickly add a bunch of content. you could scrape a wikipedia article about each one, etc.
April 26th, 2008 at 12:55 pm
Why scrape wikipedia when you can download the entire database dump for free? -> http://en.wikipedia.org/wiki/Wikipedia:Database_download
April 26th, 2008 at 2:25 pm
yeah, downloading the db is good sometimes, but it's hard to automatically grab a relevant article without google. the best way to do it is to search google with a query like:
site:wikipedia.org yourtopic
and then scrape the first result
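A rough sketch of that search-then-scrape idea, with a few assumptions: the function names wiki_search_url and first_wikipedia_link are made up, the https://www.google.com/search?q= endpoint is assumed rather than taken from the post, and the sample results markup is invented. Real search result pages may look quite different, and fetching them is left out here.

```php
<?php
// Hypothetical helper: build the search URL for "site:wikipedia.org <topic>".
// The google.com/search?q= endpoint is an assumption, not something from the post.
function wiki_search_url($topic){
    return 'https://www.google.com/search?q=' . urlencode('site:wikipedia.org ' . $topic);
}

// Hypothetical helper: return the first en.wikipedia.org article URL in $html, or null.
function first_wikipedia_link($html){
    // Match up to the first quote, ampersand, angle bracket, or whitespace
    if (preg_match('#https?://en\.wikipedia\.org/wiki/[^"\'&<>\s]+#', $html, $m)){
        return $m[0];
    }
    return null;
}

echo wiki_search_url('black hole'), "\n";
// prints: https://www.google.com/search?q=site%3Awikipedia.org+black+hole

// Invented results markup, only to show the link extraction
$html = '<a href="/url?q=https://en.wikipedia.org/wiki/Black_hole&sa=U">Black hole</a>';
echo first_wikipedia_link($html), "\n";
// prints: https://en.wikipedia.org/wiki/Black_hole
```

From there you could take the last path segment of the returned URL as the article title and hand the page to the scraper.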