Thursday, February 14, 2013

PHP Tutorial: Making a webcrawler!

Hey!

Don't know what a webcrawler is? Search engines like Google, Yahoo and Bing use webcrawlers: bots that run 24/7 searching for new websites. How do these bots work? Easy.
Keep reading to find out how to make one yourself!


Let me show you in steps:

Step 1: The bot starts and gets a URL.
Step 2: The bot opens the URL and searches for all links.
Step 3: The bot deletes the links that don't work.
Step 4: The bot adds the links to a database.
Step 5: The bot goes back to Step 1 with a found link.
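
Here's a rough sketch of those steps as a loop, just to show the shape of it. It's not the finished crawler: Steps 3 and 4 are left out, and "crawl", "$fetchLinks" and "$maxPages" are names I made up for this example. "$fetchLinks" stands in for "open the URL and search for all links", so the loop can be tried without a real website.

```php
<?php
/**
 * Minimal crawl loop: take a URL from the queue, collect its links,
 * queue the ones we have not seen yet, repeat.
 *
 * $fetchLinks is a hypothetical callback: it receives a URL and
 * returns an array of the links found on that page.
 * $maxPages stops the loop so it cannot run forever.
 */
function crawl(string $startUrl, callable $fetchLinks, int $maxPages = 10): array
{
    $queue   = [$startUrl];   // URLs waiting to be visited
    $visited = [];            // URL => true for every page already opened

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);            // Step 1: get a URL
        if (isset($visited[$url])) {
            continue;                          // already crawled, skip it
        }
        $visited[$url] = true;

        foreach ($fetchLinks($url) as $link) { // Step 2: find all links
            if (!isset($visited[$link])) {
                $queue[] = $link;              // Step 5: crawl it later
            }
        }
    }

    return array_keys($visited);
}

// A fake three-page "web", so no network is needed to try the loop.
$fakeWeb = [
    'http://a.example/' => ['http://b.example/', 'http://c.example/'],
    'http://b.example/' => ['http://a.example/'],
    'http://c.example/' => [],
];

$pages = crawl('http://a.example/', function ($url) use ($fakeWeb) {
    return $fakeWeb[$url] ?? [];
});

print_r($pages);
```

The "$visited" array is what keeps the bot from crawling the same page twice when two pages link to each other.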



Let's get started making one, shall we?

Requirements:
  • Brains
  • Some PHP knowledge (variables, functions, etc.)
  • Some HTML knowledge (how to make a link)
  • A webserver (see below)
  • A MySQL database (comes with the webserver below)
  • "simple_html_dom.php" (Download: HERE)


 Index: 
  • Setting up a webserver
  • Concept
  • Making the base PHP file.
  • Variations.

Setting up a webserver:
Want to know how to set up a webserver?
It's super easy! But do you want it on your PC, or on a USB stick?

For the PC download: WAMP WebServer.
Setting it up won't be hard. Just follow the install instructions and start it.
Now go to: http://localhost/
(No .com, .org or .net at the end.)
Put your ".php" files in: {wamp install directory}/www/

For USB download: EasyPHP - The portable webserver!
The setup is the same as WAMP, but make sure your install location is on your USB stick.

NOTE: Both webservers have MySQL preinstalled!
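To connect use:


$db = mysqli_connect("localhost", "root", "");

(The older mysql_connect() function takes the same three arguments, but it was removed in PHP 7, so use mysqli_connect() on a current install.)

Server: localhost
Username: root
Password: (none)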


Concept:
The concept is very easy. Like I said before, the bot runs in a loop and adds the links it finds to the database.
Now how are we going to get the HTML source code?
"simple_html_dom.php" has all the functions we need!
So let's include it in our PHP file first:


<?php
  include_once('simple_html_dom.php');
?>


OK! We've included the extension, so now we can use it!
If we want the source code from a URL, we need to define the URL and load it first:



  <?php
  //---
  include_once('simple_html_dom.php');
  //---
  $url = "http://timvanosch.blogspot.com/";
  $html = new simple_html_dom();
  $html->load_file($url);
  //--
  ?>
 
OK, let me explain it line by line:
  1. Opening the PHP block: "<?php"
  2. Comment (not code)
  3. Including the extension
  4. Comment (not code)
  5. Defining the variable "$url". This is the page we will grab the source from!
  6. Defining the variable "$html". This is an object of the extension's "simple_html_dom" class. (read on)
  7. Calling the "load_file" method on the "$html" object. This loads the source of "$url"!
  8. Comment (not code)
  9. Closing the PHP block: "?>"
That's it. We've successfully loaded the page into the "$html" variable!


Making the base PHP file:
We've already made a good base, but we want to extend it so it echoes out the links.
We've got the source in "$html". Now we need to find all the "<a href="blabla"></a>" tags and cut out the href link.
We're lucky, because "simple_html_dom.php" already has a function for that. It looks like this:


$html->find('a');

OK, this function returns an array with all the "<a>" tags in the source!
To get through all the "<a>" tags quickly, we're going to use a "foreach(){}" loop.
I'm going to build on the code from Concept:


<?php
  //---
  include_once('simple_html_dom.php');
  //---
  $url = "http://timvanosch.blogspot.com/";
  $html = new simple_html_dom();
  $html->load_file($url);
  //--
  foreach($html->find("a") as $link)
  {
    echo $link->href."<br />";
  }
?>
 
Now let me explain the new lines:

    9. The foreach loops over the array and assigns one entry at a time to "$link" until there are none left.
         So it goes through entries 0, 1, 2, 3, ... of the array in order.
   11. This echoes out the href of the "<a>" tag and adds a line break.
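
If you're curious what "find the <a> tags and take their href" looks like without the extension, here's a small sketch using PHP's built-in DOMDocument class. This is a different technique from the simple_html_dom code above, not what the extension does internally, and "extractLinks" and "$sample" are names I made up for the example. It works on an HTML string, so you can try it without loading a real page.

```php
<?php
/**
 * Return the href of every <a> tag in an HTML string,
 * using PHP's built-in DOM extension.
 */
function extractLinks(string $htmlSource): array
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml's
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($htmlSource);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {        // skip <a> tags that have no href
            $links[] = $href;
        }
    }
    return $links;
}

// A made-up page with two real links and one <a> without href.
$sample = '<html><body>'
        . '<a href="http://example.com/one">One</a>'
        . '<a name="no-href">Skipped</a>'
        . '<a href="/two">Two</a>'
        . '</body></html>';

foreach (extractLinks($sample) as $link) {
    echo $link . "\n";
}
```

The foreach at the bottom does the same job as the one in the crawler: one entry per loop, echo it, move on.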

Now change "$url" to a site and watch the magic happen.
This is my output: (I have changed the URLs a bit for protection!)

http://timvanosch.blogspot.nl/2013/02/earn-money-onlin...
http://bit.ly/Wluavs
http://timvanosch.blogspot.nl/2013/02/earn-money-onlin...
https://plus.google.com/11489535515489321
http://timvanosch.blogspot.nl/2013/02/earn-money-onlin...
http://timvanosch.blogspot.nl/2013/02/earn-money-onlin...
http://www.blogger.com/post-edit.g?blogID=743878789022...
http://www.blogger.com/share-post.g?blogID=74387878902...
http://www.blogger.com/share-post.g?blogID=74387878902...
http://www.blogger.com/share-post.g?blogID=74387878902...
http://www.blogger.com/share-post.g?blogID=743878785902...
http://timvanosch.blogspot.nl/
http://timvanosch.blogspot.com/feeds/posts/default
//www.blogger.com/rearrange?blogID=743878789022871384...
//www.blogger.com/rearrange?blogID=743878789022871384...
//www.blogger.com/rearrange?blogID=743878789022871384...
http://www.blogger.com
//www.blogger.com/rearrange?blogID=743878789022871384...

Nice! We've got results.
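
One thing you can see in that output: some links start with "//" instead of "http://". Those are protocol-relative links, and you can't open them directly with the crawler. Below is a small sketch of how you could turn them into full URLs before crawling them. "absoluteUrl" is a helper I made up for this example, the sample links are shortened stand-ins, and it is deliberately simple: it ignores ports and throws away anything that isn't an http(s), "//" or "/" link.

```php
<?php
/**
 * Turn a link found on a page into a full URL, or return null
 * when this simple sketch cannot handle it.
 */
function absoluteUrl(string $link, string $baseUrl): ?string
{
    $link = trim($link);
    $base = parse_url($baseUrl);
    if (!isset($base['scheme'], $base['host'])) {
        return null;                              // base URL is unusable
    }
    if (preg_match('#^https?://#i', $link)) {
        return $link;                             // already a full URL
    }
    if (strpos($link, '//') === 0) {
        return $base['scheme'] . ':' . $link;     // protocol-relative link
    }
    if (strpos($link, '/') === 0) {
        return $base['scheme'] . '://' . $base['host'] . $link; // root-relative
    }
    return null;                                  // mailto:, #anchors, etc.
}

// Sample links in the same style as the output above.
$base  = 'http://timvanosch.blogspot.com/';
$found = [
    '//www.blogger.com/rearrange',
    '/feeds/posts/default',
    'http://www.blogger.com',
    '#top',
];

$clean = [];
foreach ($found as $link) {
    $url = absoluteUrl($link, $base);
    if ($url !== null) {
        $clean[$url] = true;   // using the URL as key drops duplicates
    }
}

print_r(array_keys($clean));
```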
So what have you learned?
  • How to include extensions.
  • How to use extensions.
  • How to get source code.
  • How to use "foreach(){}"
  • How to crawl the web!
Now if you want to make an 'infinite crawler', just apply your basic PHP skills and make a loop.
Go back to Concept to see what you have to do for an 'infinite crawler'.
You can put all the URLs found on a website in an array or directly into a database.
Then use those URLs and crawl them.
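
For the "directly into a database" part, here's one possible sketch. Some assumptions to be clear about: it uses PDO (a database layer built into PHP) instead of the connection call from the webserver section, and the "saveLinks" function and the "links" table with a single "url" column are made up for this example. In MySQL you could create such a table with something like: CREATE TABLE links (id INT AUTO_INCREMENT PRIMARY KEY, url VARCHAR(255) NOT NULL).

```php
<?php
/**
 * Store each URL once in a "links" table and return how many were new.
 * The table name and its "url" column are assumptions for this sketch.
 */
function saveLinks(PDO $db, array $links): int
{
    // Prepared statements keep odd characters in URLs from breaking the SQL.
    $exists = $db->prepare('SELECT COUNT(*) FROM links WHERE url = ?');
    $insert = $db->prepare('INSERT INTO links (url) VALUES (?)');

    $added = 0;
    foreach ($links as $url) {
        $exists->execute([$url]);
        $alreadyStored = (int) $exists->fetchColumn() > 0;
        $exists->closeCursor();

        if (!$alreadyStored) {
            $insert->execute([$url]);
            $added++;
        }
    }
    return $added;
}
```

With the webserver defaults from earlier, the connection would look something like new PDO('mysql:host=localhost;dbname=crawler', 'root', ''), where "crawler" is a made-up database name. After that, call saveLinks($db, $links) with the array of URLs you found on a page.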


Variations:
Of course there are many variations on a crawler. I made one that shows you the links found on a site. You can click one of these links and it will crawl the clicked link.
It's like an 'infinite crawler', but with human pauses in between.

Download: DropBox link to: Crawler_source_code.rar

For the ones that don't trust me:
Jotti - Online virus scanner

Jotti is an online virus scanner. It scans a file with 21 different virus scanners.
I've already uploaded the file to Jotti, so you can view the results above.

Thank you!
Thanks for reading this post. If you wish to get more tutorials like this one, subscribe to this blog on the right side. Just enter your e-mail and you will get all the posts right in your mail!

Greets, Tim.

37 comments:

  3. Hi Tim. Wonderful tutorial. I added to my list of PHP-based web crawler tutorials. Thanks for the great resource!

  5. Hi, can simple_html_dom or PHPCrawl crawl dynamic AJAX or JavaScript content? If they can, can you show me how to do that? I tried combining the two methods and it works for several websites, but on two dynamic websites I can't load the value I want.
    The value exists when I inspect the element, but when I view the page source, the value is not there.

  8. Hello could you make a video tutorial please
