Thursday, February 14, 2013

PHP Tutorial: Making a webcrawler!

Hey!

Don't know what a webcrawler is? A webcrawler is a bot used by search engines like Google, Yahoo and Bing.
These search engines have bots running 24/7, searching for new websites. How do these bots work? Easy.
Keep reading to find out how to make one yourself!


Let me show you in steps:

Step 1: The bot starts and gets a URL.
Step 2: The bot opens the URL and searches for all links.
Step 3: The bot deletes the links that don't work.
Step 4: The bot adds the links to a database.
Step 5: The bot goes back to Step 1 with a link it found.
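
The steps above can be sketched as a small loop. This is only a minimal sketch, not the finished crawler: the function name "crawl", the "$getLinks" callback and the "$maxPages" limit are made up for illustration, Step 3 (dropping dead links) is left out, and the returned array stands in for the database of Step 4.

```php
<?php
// Sketch of the crawl loop. $getLinks is a hypothetical callback that
// returns the links found on a page; later it could wrap the HTML parser.
function crawl(string $startUrl, callable $getLinks, int $maxPages = 100): array
{
    $queue = [$startUrl];   // URLs still to visit
    $seen  = [];            // URLs already visited, used as a set

    while ($queue && count($seen) < $maxPages) {
        $url = array_shift($queue);          // Step 1: take the next URL
        if (isset($seen[$url])) {
            continue;                        // skip pages we already crawled
        }
        $seen[$url] = true;

        foreach ($getLinks($url) as $link) { // Step 2: collect its links
            if (!isset($seen[$link])) {
                $queue[] = $link;            // Step 5: crawl found links later
            }
        }
    }
    return array_keys($seen);                // stand-in for Step 4 (the database)
}

// Usage with a fake three-page "site" instead of real HTTP requests.
$site  = ['a' => ['b', 'c'], 'b' => ['a', 'c'], 'c' => []];
$pages = crawl('a', function ($url) use ($site) {
    return $site[$url] ?? [];
});
```

The "seen" list is what keeps the bot from crawling the same page forever when two pages link to each other, and the page limit keeps a test run from wandering off across the whole web.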



Let's get started making one, shall we?

Requirements:
  • Brains
  • Some PHP knowledge (Variables, Functions etc.)
  • Some HTML knowledge (How to make a link.)
  • A webserver (See below)
  • A MySQL Database (comes with the webserver below)
  • "Simple_HTML_Dom.PHP" (Download: HERE)


 Index: 
  • Setting up a webserver
  • Concept
  • Making the PHP crawler.
  • Variations.

Setting up a webserver:
Want to know how to set up a webserver?
It's super easy! But do you want it on your PC or on a USB stick?

For the PC, download: WAMP WebServer.
Setting it up won't be hard. Just follow the install instructions and start it.
Now go to: http://localhost/
(No .com, .org or .net)
Put your ".php" files in: {wamp install directory}/www/

For USB, download: EasyPHP - The portable webserver!
The setup is the same as for WAMP, but make sure your install location is on your USB stick.

NOTE: Both webservers come with MySQL preinstalled!
To connect use:


mysqli_connect("localhost", "root", "");

Server: localhost
Username: root
Password: (none)


Concept:
The concept is very easy. As said before, the bot runs in a loop
and adds the links it finds to the database.
Now how are we going to get the HTML source code?
"Simple_HTML_Dom.php" has all the functions we need!
So let's include it in our PHP file first:


<?php
  include_once('simple_html_dom.php');
?>


Ok! We've included the library, so we can now use it!
If we want the source code of a URL, we need to define the URL and load it first:



<?php
  //---
  include_once('simple_html_dom.php');
  //---
  $url = "http://timvanosch.blogspot.com/";
  $html = new simple_html_dom();
  $html->load_file($url);
  //--
?>
 
Ok, let me explain it line by line:
  1. Opening the PHP file: "<?php"
  2. Comment (not code)
  3. Including the library
  4. Comment (not code)
  5. Defining the variable "$url". This is the page that we will grab the source from!
  6. Defining the variable "$html". This is an object of the library's "simple_html_dom" class. (read on)
  7. Calling the function "load_file" on the "$html" object. This loads the source of "$url"!
  8. Comment (not code)
  9. Closing the PHP file: "?>"
That's it. We've successfully loaded the page into the "$html" variable!


Making the base PHP file:
We've already made a good base, but we want to extend it so it will echo out the links.
We've got the source in "$html". We now need to find all the "<a href="blabla"></a>" tags and cut out the href link.
We're lucky, because "Simple_html_dom.php" already has a function for that. It looks like this:


$html->find('a');

Ok, this function will return an array with all the "<a>" tags in the source!
To get through all the "<a>" tags quickly we're gonna use a "foreach(){}" loop.
And I'm gonna reuse the code from Concept:


<?php
  //---
  include_once('simple_html_dom.php');
  //---
  $url = "http://timvanosch.blogspot.com/";
  $html = new simple_html_dom();
  $html->load_file($url);
  //--
  foreach($html->find("a") as $link)
  {
    echo $link->href."<br />";
  }
?>
 
Now let me explain it:

    9. The foreach loops and assigns one array entry at a time to "$link" until there are none left.
         So it starts at entry 0 and moves on through 1, 2, 3 and so on.
   11. This echoes out the href of the "<a>" tag and adds a line break.

Now change "$url" to a site and watch the magic happen.
This is my output: (I have changed the URLs a bit for protection!)

http://timvanosch.blogspot.nl/2013/02/earn-money-onlin...
http://bit.ly/Wluavs
http://timvanosch.blogspot.nl/2013/02/earn-money-onlin...
https://plus.google.com/11489535515489321
http://timvanosch.blogspot.nl/2013/02/earn-money-onlin...
http://timvanosch.blogspot.nl/2013/02/earn-money-onlin...
http://www.blogger.com/post-edit.g?blogID=743878789022...
http://www.blogger.com/share-post.g?blogID=74387878902...
http://www.blogger.com/share-post.g?blogID=74387878902...
http://www.blogger.com/share-post.g?blogID=74387878902...
http://www.blogger.com/share-post.g?blogID=743878785902...
http://timvanosch.blogspot.nl/
http://timvanosch.blogspot.com/feeds/posts/default
//www.blogger.com/rearrange?blogID=743878789022871384...
//www.blogger.com/rearrange?blogID=743878789022871384...
//www.blogger.com/rearrange?blogID=743878789022871384...
http://www.blogger.com
//www.blogger.com/rearrange?blogID=743878789022871384...

Nice! We've got results.
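
Some of the links in the output above start with "//" instead of "http://", and other sites will also give you links like "/about" or "#top". Before a crawler can open those, they have to be turned into full URLs. Here is a rough sketch of how that could look. The helper name "absoluteUrl" is made up, it only covers the common cases (it ignores ports and does not resolve "../"), and it is not part of "Simple_HTML_Dom.php".

```php
<?php
// Hypothetical helper: turn an href into an absolute URL, or return null
// for links a crawler cannot follow. Handles only the common cases.
function absoluteUrl(string $href, string $baseUrl): ?string
{
    $href = trim($href);
    // Skip empty links, page anchors and non-HTTP schemes.
    if ($href === '' || $href[0] === '#'
        || preg_match('/^(javascript|mailto|tel):/i', $href)) {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href;                         // already absolute
    }
    $base = parse_url($baseUrl);
    if (!isset($base['scheme'], $base['host'])) {
        return null;                          // base URL is unusable
    }
    if (substr($href, 0, 2) === '//') {
        return $base['scheme'] . ':' . $href; // protocol-relative link
    }
    $root = $base['scheme'] . '://' . $base['host'];
    if ($href[0] === '/') {
        return $root . $href;                 // relative to the site root
    }
    // Relative to the directory of the current page.
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}
```

Inside the foreach you could then call absoluteUrl($link->href, $url) and skip the link whenever the result is null.
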
So what have you learned?
  • How to include libraries.
  • How to use libraries.
  • How to get source code.
  • How to use "foreach(){}"
  • How to crawl the web!
Now if you want to make an 'infinite crawler', just apply your basic PHP skills and you'll be able to make a loop.
Again, go to Concept to see what you have to do for an 'infinite crawler'.
You can put all the URLs found on a website in an array or directly into a database.
Then use those URLs and crawl them.
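
If you go for the database, storing the URLs could look roughly like this. It's a sketch under a few assumptions: it uses PHP's PDO instead of the connect call shown earlier, and the function name "saveLinks", the table "links" with its single "url" column and the database name "crawler" are all made up for the example.

```php
<?php
// Sketch: store found links through PDO with prepared statements.
// The table name "links" and its single "url" column are assumptions.
function saveLinks(PDO $pdo, array $links): int
{
    $pdo->exec('CREATE TABLE IF NOT EXISTS links (url VARCHAR(255) NOT NULL)');
    $exists = $pdo->prepare('SELECT COUNT(*) FROM links WHERE url = ?');
    $insert = $pdo->prepare('INSERT INTO links (url) VALUES (?)');

    $added = 0;
    foreach (array_unique($links) as $link) {
        $exists->execute([$link]);
        $count = (int) $exists->fetchColumn();
        $exists->closeCursor();              // free the result before reuse
        if ($count === 0) {                  // only store URLs we don't have yet
            $insert->execute([$link]);
            $added++;
        }
    }
    return $added;                           // number of new URLs stored
}

// Usage against the local MySQL server from the setup section
// (the database name "crawler" is hypothetical):
// $pdo = new PDO('mysql:host=localhost;dbname=crawler', 'root', '');
// echo saveLinks($pdo, ['http://timvanosch.blogspot.com/']);
```

Prepared statements matter here because the URLs come from other people's pages, so they should never be pasted straight into a query string.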


Variations:
Of course, crawlers can vary a lot. I made one which shows you the links found on a site. You can click these links and it will crawl the clicked link.
It's like an 'infinite crawler', but with human pauses in between.

Download: DropBox link to: Crawler_source_code.rar

For the ones that don't trust me:
Jotti - Online virus scanner

Jotti is an online virus scanner. It will scan a file with 21 different virus scanners.
I've already uploaded the file to Jotti, so you can view the results via the link above.

Thank you!
Thanks for reading this post. If you wish to get more tutorials like this one, subscribe to this blog on the right side. Just enter your e-mail and you will get all the posts right in your mailbox!

Greets, Tim.

86 comments:

  3. Hi Tim. Wonderful tutorial. I added to my list of PHP-based web crawler tutorials. Thanks for the great resource!

  5. Hi, can simple_html_dom or PHPCrawl crawl dynamic AJAX or JavaScript content? If it can, can you show me how to do that? I tried to combine these two methods and it works for several websites, but when I tried it on two dynamic websites I couldn't load the value I wanted.
    The value exists when I inspect the element, but when I view the page source, the value is not in there.

  8. Hello could you make a video tutorial please

  39. Timmothy,

    Show us a tutorial building a web crawler in Php 7 with Php's cURL.

    1. Add feature to crawl all kinds of links it encounters.
    1A). File types example:
    /
    //
    ./
    #
    ../
    javascipt:
    https

    1B).
    Feature to understand REGEX so we can feed regex expressions for it to
    crawl only pages that match the given REGEX expressions on urls &
    links.

    2. Add depth following feature.
    2A). How much deep to follow links from same domain.
    2B). And much deep to follow links from each external domain.
    2C). And how many external domains to follow found on starting page.

    3. Add feature to recognise file types and deal appropriately (according to settings) with each type of file types.
    3A). Add feature to ignore certain types of files (img, video, javascript, xml, css).
    3B). And feature to only crawl certain types of files.

    4. Add page content size calculation feature. So it only crawls those pages that are within our page size ranges.

    5.
    Add feature so it can identify to what kind of links the crawled page
    is linking to. Is it linking to img files, video files, xml files, etc.
    That way we can get it to ignore pages that are linking to video files
    or img files or linking to css files.

    6. Add feature to extract content from crawled page to add the extracted content (snippet) in Index as page
    description should the page contain no meta tags (meta keywords, meta
    description) or title.

    7. Add Page Content & Link Anchor Text & Url keywords Filter.
    7A). Add feature for it to crawl only those pages who's content contain
    certain keywords (eg. check for mentions of our brand keywords, mentions
    of our links, etc.) or do not contain certain keywords (eg. check for
    mentions of bannded words such as 'porn', 'sex' and check for mentions
    of links to our competitions, check for mentions of our competitions
    brand keywords, etc.).

    7B). Add feature for it to crawl only those pages who's link anchor texts contain certain keywords or do
    not contain certain keywords. Eg. Only crawl links who's anchor texts do
    not contain "porn", "sex", etc. We should be able to give unlimited
    number of banned words.

    8. Add threads so it can simultaneously crawl different domains. Simultaneously crawl more than one page.

    I just gave you some basic feature suggestions. Nothing too much. If you
    build such a tutorial then you should get lots of subscribers. And
    remember not to GPL or Creative Commons the license. Do not release your
    tutorial codes under any licenses. I do not like licenses such as GPL
    because if we modify the web crawler then we will have to release our
    modified versions to gnu site. I do not like such forcing to disclose.
    Hence, any tutorials that release codes under any kinds of licenses, I
    ignore such tutorials.

    Drop me a line when you start writing the tutorial incase I thinkup more basic good features.

    Anyone reading this comment, who likes my suggestions, LIKE or VOTE this comment. And do not forget to mention to the author that you want him to build a Web crawler based on my suggestive features. Ok ?

    Thanks
