By James Bruce, MakeUseOf – December 17, 2010 at 12:31PM
This is part 2 in a series I started last time about how to build a web crawler in PHP. Previously I introduced the Simple HTML DOM helper file, as well as showing you how incredibly simple it was to grab all the links from a webpage, a common task for search engines like Google.
If you read part 1 and followed along, you’ll know I set some homework to adjust the script to grab images instead of links.
I dropped some pretty big hints, but if you didn’t get it or if you couldn’t get your code to run right, then here is the solution. I added an additional line to output the actual images themselves as well, rather than just the source address of the image.
<?php
// Load the Simple HTML DOM helper file from part 1.
include_once('simple_html_dom.php');
$target_url = "https://www.tokyobit.com";
$html = new simple_html_dom();
$html->load_file($target_url);
// Loop over every <img> tag found on the page.
foreach($html->find('img') as $img)
{
echo $img->src."<br />"; // the source address of the image
echo $img."<br />"; // the <img> tag itself, so the browser renders the image
}
?>
This should output a list of each image's source address, followed by the rendered image itself.
Of course, the results are far from elegant, but it does work. Note that the script can only grab images that appear in the page content as <img> tags – many of the design elements are referenced from the CSS instead, so our script can't see those. Again, you can run this through my server at this URL if you wish, but to enter your own target site you'll have to edit the code and run it on your own server, as I explained in part 1. Bear in mind that downloading images puts significantly more stress on the server than simply grabbing text links, so only try the script on your own blog or mine, and try not to refresh it lots of times.
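If you wanted to keep local copies of the images rather than just display them, a few extra lines would do it. This is a minimal sketch of my own, not part of the original script – the saved/ folder is an assumption (it must already exist and be writable), and it assumes the image addresses are full URLs:
<?php
include_once('simple_html_dom.php');
$target_url = "https://www.tokyobit.com";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach($html->find('img') as $img)
{
// Fetch the image data; relative addresses would need to be
// resolved against $target_url first.
$data = @file_get_contents($img->src);
if($data !== false)
{
// 'saved/' is an assumed, pre-existing writable folder.
file_put_contents('saved/'.basename($img->src), $data);
echo "Saved ".$img->src."<br />";
}
}
?>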
Let's move on and be a little more adventurous. We're going to build upon our original file, and instead of just grabbing all the links, we're going to make it do something more useful by getting the post content instead. We can do this quite easily because standard WordPress wraps the post content within a <div class="post"> tag, so all we need to do is grab every div with that class and output it – effectively stripping everything except the main content out of the original site. Here is our initial code:
<?php
include_once('simple_html_dom.php');
$target_url = "https://www.tokyobit.com";
$html = new simple_html_dom();
$html->load_file($target_url);
// Grab every div with the class "post" - where standard WordPress keeps the content.
foreach($html->find('div[class=post]') as $post)
{
echo $post."<br />";
}
?>
You can see the output by running the script from here (forgive the slowness, my site is hosted at GoDaddy and they don’t scale very well at all), but it doesn’t contain any of the original design – it is literally just the content.
Let me show you another cool feature now – the ability to delete elements of the page that we don’t like. For instance, I find the meta data quite annoying – like the date and author name – so I’ve added some more code that finds those bits (identified by various classes of div such as post-date, post-info, and meta). I’ve also added a simple CSS style-sheet to format the output a little. Daniel covered a number of great places to learn CSS online if you’re not familiar with it.
As I mentioned in part 1, even though the file contains PHP code, we can still add standard HTML or CSS to the page and the browser will understand it just fine – the PHP code runs on the server, and everything is then sent to your browser as standard HTML. Anyway, here's the whole final code:
<head>
<style type="text/css">
div.post{background-color: gray;border-radius: 10px;-moz-border-radius: 10px;padding:20px;}
img{float:left;border:0px;padding-right: 10px;padding-bottom: 10px;}
body{width:60%;font-family: verdana,tahoma,sans-serif;margin-left:20%;}
a{text-decoration:none;color:lime;}
</style>
</head>
<?php
include_once('simple_html_dom.php');
$target_url = "https://www.tokyobit.com";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach($html->find('div[class=post]') as $post)
{
// Blank out the metadata divs we don't want; this assumes each
// post actually contains them, as the posts on my site do.
$post->find('div[class=post-date]',0)->outertext = '';
$post->find('div[class=post-info]',0)->outertext = '';
$post->find('div[class=meta]',0)->outertext = '';
echo $post."<br />";
}
?>
You can check out the results here. Pretty impressive, huh? We've taken the content of the original page, got rid of a few bits we didn't want, and completely reformatted it in the style we like! And more than that, the process is now automated, so if new content is published on the original site, it will automatically appear in our script's output too.
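One practical refinement, since the script fetches the target site afresh on every page view: cache the downloaded HTML for a while, which also eases the server-stress problem mentioned earlier. This is a sketch of my own, not from the original tutorial – the cache filename and the ten-minute lifetime are arbitrary assumptions:
<?php
include_once('simple_html_dom.php');
$target_url = "https://www.tokyobit.com";
$cache_file = 'cache.html'; // assumed writable location
$cache_life = 600; // seconds - re-fetch after ten minutes
// Only hit the remote server if our cached copy is missing or stale.
if(!file_exists($cache_file) || time() - filemtime($cache_file) > $cache_life)
{
file_put_contents($cache_file, file_get_contents($target_url));
}
$html = new simple_html_dom();
$html->load_file($cache_file); // parse the local copy instead
foreach($html->find('div[class=post]') as $post)
{
echo $post."<br />";
}
?>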
That's only a fraction of the power available to you, though – you can read the full manual online here if you'd like to explore the PHP Simple HTML DOM helper a little more and see how much it simplifies the web crawling process. It's a great way to take your knowledge of basic HTML up to the next, dynamic level.
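To give a taste of what the manual covers, here are a few more selector tricks the helper supports. A quick sketch – the element names are assumptions about whatever page you point it at:
<?php
include_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('https://www.tokyobit.com');
// Find by id or class, just like CSS (assumed names).
$header = $html->find('#header', 0);
$posts = $html->find('.post');
// Attribute selectors: all links whose href starts with "http".
foreach($html->find('a[href^=http]') as $link)
{
echo $link->href."<br />";
}
// Read an element as plain text, stripped of all tags.
$first_post = $html->find('div[class=post]', 0);
if($first_post)
{
echo $first_post->plaintext;
}
?>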
What could you use this for though? Well, let’s say you own lots of websites and wanted to gather all the contents onto a single site. You could copy and paste the contents every time you update each site, or you could just do it all automatically with this script. Personally, even though I may never use it, I found the script to be a useful exercise in understanding the underlying structure of modern internet documents. It also exposes how simple it is to re-use content when everything is published on a similar system using the same semantics.
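To make that aggregation idea concrete, it could be as simple as looping the same code over a list of addresses. A rough sketch of my own – the site addresses beyond my own are made-up examples:
<?php
include_once('simple_html_dom.php');
// Hypothetical list of your own WordPress sites.
$sites = array(
'https://www.tokyobit.com',
'https://www.example-blog-one.com',
'https://www.example-blog-two.com',
);
foreach($sites as $site)
{
$html = new simple_html_dom();
$html->load_file($site);
// Pull the post content from each site in turn.
foreach($html->find('div[class=post]') as $post)
{
echo $post."<br />";
}
$html->clear(); // free the parser's memory before the next site
unset($html);
}
?>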
What do you think? Again, do let me know in the comments if you’d like to learn some more basic web programming, as I feel like I’ve started you off on level 5 and skipped the first 4! Did you follow along and try yourself, or did you find it a little too confusing? Would you like to learn more about some of the other technologies behind the modern internet browsing experience?
If you’d prefer learning to program on the desktop side of things, Bakari covered some great beginner resources for learning Cocoa Mac OSX desktop programming at the start of the year, and our featured directory app CodeFetch is useful for any programming language. Remember, skills you develop programming in any language can be used across the board.