Web spider
From BC$ MobileTV Wiki
A web spider (also commonly called a web crawler) is a program that visits web pages by following the links between them, typically running continuously, in order to gather information about the pages and the underlying link structure of the web. [1]
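The link-following behaviour described above can be illustrated with a small breadth-first crawler. This is only a minimal sketch using the Python standard library, not the method used by any of the tools listed on this page. The names `LinkExtractor`, `fetch_html`, `crawl` and the `max_pages` limit are assumptions made for the example, and the page limit replaces the "runs continuously" behaviour so that the sketch terminates.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl; returns {page URL: [outgoing link URLs]}."""
    graph = {}                    # the link structure discovered so far
    queue = deque([start_url])    # pages waiting to be visited
    seen = {start_url}            # pages already queued, to avoid revisits
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            graph[url] = []       # unreachable page: record it with no links
            continue
        parser = LinkExtractor()
        parser.feed(html)
        links = []
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            absolute = urldefrag(urljoin(url, href)).url
            if absolute.startswith(("http://", "https://")) and absolute not in links:
                links.append(absolute)
        graph[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return graph
```

Calling `crawl("http://example.com/", max_pages=10)` would return a dictionary mapping each visited page to the pages it links to. The `fetch` parameter is injectable so the crawler can be tried against an in-memory set of pages. A crawler meant for real use would also need to respect robots.txt (for example via the standard `urllib.robotparser` module) and limit its request rate, both of which this sketch leaves out.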
Tools
- Apache Nutch: https://nutch.apache.org/ (open source large-scale web crawler with distributed computing support)
CommonCrawl
CommonCrawl is a project that builds and maintains an open, freely accessible crawl of the web for education, research and other non-commercial innovation.
- CommonCrawl: http://commoncrawl.org/
- CommonCrawl - Web Crawler Engine code: https://github.com/commoncrawl/
- CommonCrawl - Database (snapshot): http://commoncrawl.org/data/accessing-the-data/
Sphider
Sphider is a lightweight web spider and search engine written in PHP, using MySQL as its back-end database. It is intended for adding search functionality to a website or for building a custom search engine. The project describes itself as small, easy to set up and modify, and in use on thousands of websites across the world.
- Sphider - PHP Search Engine Spider: http://www.sphider.eu/
Resources
- Mini Bots PHP Class: http://www.barattalo.it/mini-bots-php-class/
Tutorials
- Crawling the web with Apache Nutch and Cassandra (NoSQL Web Crawler): http://java.dzone.com/articles/crawling-web-cassandra-and
- Codd's Relational DataBase revolution - Has NoSQL come full-circle?: http://java.dzone.com/articles/codds-relational-vision-has
- How to make a simple web crawler in Java (using jSoup HTML parser): http://www.netinstructions.com/how-to-make-a-simple-web-crawler-in-java/
- How to make a web crawler in JavaScript / Node.js: http://www.netinstructions.com/how-to-make-a-simple-web-crawler-in-javascript-and-node-js/
- How to make a web crawler in under 50 lines of Python code: http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/
- Multi-Threaded Geo Web Crawler In Java: https://dzone.com/articles/efficient-multi-threaded-geo-web-crawler-using-jav
External Links
- wikipedia: Web Crawler
- My Web Spider: http://php4fun.blogspot.com/2007/11/my-web-spider.html
- Finding What People Want -- Experiences with the WebCrawler (PhD thesis that produced WebCrawler.com): http://www.thinkpink.com/bp/WebCrawler/WWW94.html
References
- [1] wikipedia: Web spider
- [2] Multithreaded Webcrawler: https://sites.google.com/site/javagamescorner/standalone-tutorials/concurrency-1/multithreaded-webcrawler
- [3] Implementing Threads Into Java Web Crawler: https://stackoverflow.com/questions/26363585/implementing-threads-into-java-web-crawler
- [4] How to write a Multi-threaded WebCrawler: http://www.andreas-hess.info/programming/webcrawler/index.html