Wikipedia Scraper
For a recent project, Nickycakes had to code a Wikipedia scraper. Here’s a simplified version of the function for you to use if you want. This code requires a few library files, which are included in LIB_http.zip.
Enjoy:
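include('LIB_http.php');
include('LIB_parse.php');

function wikiscrape($topic){
    // Build the article URL and fetch the page
    $target = "http://en.wikipedia.org/wiki/".urlencode($topic);
    $results = http_get($target,"");
    // Pull out everything between <p> and </p>
    $paragraphs = parse_array($results['FILE'],"<p>","</p>",EXCL);
    $final = "";   // start empty so the concatenation below never reads an undefined variable
    foreach($paragraphs as $paragraph){
        $paragraph = strip_tags($paragraph);
        // Drop bracketed markers such as [1], one pair of brackets at a time
        $paragraph = preg_replace('/\[[^\]]*\]/',"",$paragraph);
        if ($paragraph){
            $final = $final . $paragraph . "\n\n";
        }
    }
    return $final;
}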

Well now.
Combine that with the dialectizer: http://www.rinkworks.com/dialect/
and you are one step closer to enormous cosmic power!
What type of info would you scrape from Wikipedia?
Articles, generally. Say you have a database site with a bunch of different topics and you want to quickly add a bunch of content. You could scrape a Wikipedia article about each one, etc.
Why scrape wikipedia when you can download the entire database dump for free? -> http://en.wikipedia.org/wiki/Wikipedia:Database_download
Yeah, downloading the DB is good sometimes, but it's hard to automatically grab a relevant article without Google. The best way to do it is to search Google with something like:
site:wikipedia.org yourtopic
and then scrape the first result
sample please…
It's code overkill… why use the WebbotsSpidersScreenScraper_Libraries when a line of cURL with about 3 lines of curl_setopt will do the same thing?
Because I use the wrappers for everything, so for me writing it with cURL manually is the code overkill.
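For comparison, the plain-cURL route mentioned above might look roughly like this. It is only a sketch: fetch_page() and the user agent string are made-up names for illustration and are not part of LIB_http.

```php
<?php
// Hypothetical helper: fetch a page with plain cURL instead of the LIB_http wrapper.
function fetch_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);             // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);             // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'example-scraper/0.1'); // placeholder user agent
    $body = curl_exec($ch);
    // curl_exec() returns false on failure; hand back an empty string in that case
    return $body === false ? '' : $body;
}
```

The returned string could then be passed to parse_array() in place of $results['FILE'].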
Seems like pretty much all the results have a bunch of special chars that don’t want to render. Any ideas?
That’s a matter of encoding. If you’re in Firefox, go to View -> Character Encoding and select UTF-8, and it should work fine.
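Rather than changing the browser setting by hand, the script can declare the encoding itself. A minimal sketch, assuming the scraped text is UTF-8 (which is what Wikipedia serves); print_utf8() is a hypothetical name:

```php
<?php
// Hypothetical helper: declare UTF-8 before printing so the browser does not have to guess.
function print_utf8($text)
{
    // Headers can only be sent before any output has gone out
    if (!headers_sent()) {
        header('Content-Type: text/plain; charset=utf-8');
    }
    echo $text;
}
```

You would call it as print_utf8(wikiscrape("PHP")); before anything else is echoed.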
sweet. worked. thanks, script works charmingly.
=))
u the man yet again ..thnx nickycakes.
Thx for the script – it works great! But how can you limit the amount of data returned? I.e. stop after about 3 paragraphs, or limit the number of characters returned?
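One way to do that is to trim the string after wikiscrape() returns it. This is a sketch, not part of the original script: limit_text() and its parameters are hypothetical, and it relies on paragraphs being separated by a blank line, as in the function above.

```php
<?php
// Hypothetical helper: keep the first $maxParagraphs paragraphs,
// then optionally cap the result at $maxChars (0 means no cap).
function limit_text($text, $maxParagraphs = 3, $maxChars = 0)
{
    // wikiscrape() joins paragraphs with "\n\n", so split on that
    $paragraphs = explode("\n\n", trim($text));
    $text = implode("\n\n", array_slice($paragraphs, 0, $maxParagraphs));
    // Note: strlen()/substr() count bytes, so a multi-byte UTF-8 character can be cut in half
    if ($maxChars > 0 && strlen($text) > $maxChars) {
        $text = substr($text, 0, $maxChars);
    }
    return $text;
}
```

For example, limit_text(wikiscrape("PHP"), 3) would keep the first three paragraphs, and limit_text(wikiscrape("PHP"), 3, 500) would also cut the result at 500 bytes.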
Hello!
I am using your wonderful script, but it only works offline; when I run it on my live webserver nothing is displayed!
Are there any settings or modules I need to install on the webserver to get it working?
Please help, I must get this Wikipedia scraping working.
Kind regards and thanks for a wonderful article!
/Daniel
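One possible cause, assuming LIB_http relies on the cURL extension (as the cURL comment above suggests), is that the live server does not have it enabled, or that errors are hidden there. A small diagnostic sketch; scraper_diagnostics() is a hypothetical name:

```php
<?php
// Hypothetical diagnostic: report settings that commonly differ between a local box and a host.
function scraper_diagnostics()
{
    return array(
        'curl_loaded'    => extension_loaded('curl'),  // false means cURL is not available
        'display_errors' => ini_get('display_errors'), // off/0/empty means failures are silent
    );
}

print_r(scraper_diagnostics());
```

If curl_loaded comes back false, the fix is on the server side (enable the extension or ask the host to).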
Thanks for the script – very useful. The script can be a little picky with regard to capital letters – by constructing the exact wikipedia url from the topic you provide, the capitalisation has to be exact or the search will fail. I solved this by pointing my script at the search function instead. This also gets around the problem of topics with more than one word – the wikipedia urls use underscores ( _ ) rather than spaces, so wouldn’t return a result.
It’s easy and just requires one line to be changed:
change:
$target = "http://en.wikipedia.org/wiki/".urlencode($topic);
to:
$target = "http://en.wikipedia.org/w/index.php?title=Special:Search&search=".urlencode($topic)."&ns0=1&redirs=0";
I also added a minimum length check to stop very short paragraphs being returned:
change:
if ($paragraph){
to:
if ($paragraph AND strlen($paragraph)>80){
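Pulled out into small functions, the two changes look something like this. It is a sketch for illustration: wiki_search_url() and is_substantial() are hypothetical names, and the URL and the 80-character threshold are the ones given above.

```php
<?php
// Hypothetical helper: build the Special:Search URL instead of guessing the article URL.
function wiki_search_url($topic)
{
    return "http://en.wikipedia.org/w/index.php?title=Special:Search&search="
        . urlencode($topic) . "&ns0=1&redirs=0";
}

// Hypothetical helper: true only for paragraphs longer than $minLength bytes.
function is_substantial($paragraph, $minLength = 80)
{
    return strlen($paragraph) > $minLength;
}
```

Inside wikiscrape() you would use $target = wiki_search_url($topic); and if (is_substantial($paragraph)){ in place of the original lines.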
to be fair, i posted that several years ago
Oh yeah! Oh well, still a handy little script – ones I have used in the past stopped working a while ago.