Scraper
A Scraper is a script or program which runs for a given period of time (or indefinitely) in order to obtain content from another site.
Specifications
- Robots.txt: exclusion instructions for "robots" (automated crawlers and scripts): http://www.robotstxt.org/robotstxt.html [1]
- ACAP (Automated Content Access Protocol): http://www.the-acap.org/Files/e1/e17c480c-69c3-45fe-ade4-d711773511dc.pdf [2]
- ACAP-enabled: http://the-acap.org/acap_enabled.aspx [3]
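A scraper can check a site's robots.txt rules before requesting a page. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The rules in `ROBOTS_TXT`, the user agent name `ExampleScraper`, the example.com URLs, and the helper names `build_parser` and `is_allowed` are all assumptions made for illustration; a real scraper would load the file from the target site instead of an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "ExampleScraper", "http://example.com/index.html"))
print(is_allowed(parser, "ExampleScraper", "http://example.com/private/data.html"))
print(parser.crawl_delay("ExampleScraper"))
```

With these sample rules, the first URL is permitted, the second is disallowed, and the requested delay between requests is 5 seconds.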
Contact Grabber
- Contact Grabber: http://www.sajithmr.com/contact-grabber/
Screen Scraping
In computer science, screen scraping is a technique whereby the contents of a page (typically HTML accessed over the web) are parsed into a more easily processed format.
- wikipedia: Screen scraping
- Screen Scraping a la PHP: http://openconcept.ca/screenscraping_a_la_php
- PHP Screen Scraping: http://www.bradino.com/php/php-screen-scraping/
- The Java Web Scraping Handbook: https://www.scrapingbee.com/java-webscraping-book/
- Several tips how to bypass website anti-scraping protections: https://kb.apify.com/tips-and-tricks/several-tips-how-to-bypass-website-anti-scraping-protections
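As a rough illustration of parsing page contents into a more usable format, the sketch below pulls every link and its anchor text out of an HTML string using Python's standard `html.parser` module. The class name `LinkExtractor`, the helper `extract_links`, and the sample markup are hypothetical and exist only for this example; dedicated libraries handle malformed HTML more robustly.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> list:
    """Return a list of (href, anchor text) tuples found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page used only for demonstration.
SAMPLE_HTML = """<html><body>
<p>See <a href="http://example.com/a">Page <b>A</b></a> and <a href="/b">Page B</a>.</p>
<a name="top">no href</a>
</body></html>"""

print(extract_links(SAMPLE_HTML))
```

The anchor without an `href` attribute is skipped, and nested markup inside a link (such as `<b>`) contributes only its text.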
Examples
- BCmoney Scraper (alpha): http://bcmoney-mobiletv.com/scrap/index.php
- BCmoney Grabber (alpha): http://bcmoney-mobiletv.com/grabber/index.php
- BCmoney Contact Grabber (alpha): http://bcmoney-mobiletv.com/import/
- Grab Video Now: http://www.grabvideonow.com/
- Multisite Multimedia Grabber: http://demdaysao.net/grabber
Tools
- Convert a robots.txt file to include the equivalent ACAP: http://the-acap.org/getdoc/6bcda1e7-651c-46f5-923a-aa2a23d9a868/Convert_Robots_Txt_To_ACAP.aspx
Java
In Java, you can use the built-in java.net.URL class to open a connection and read a page, or one of the following libraries:
Web-Harvest
Web-Harvest is an open source web data extraction tool written in Java. It offers a way to collect desired web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of web content.
- WebHarvest: http://web-harvest.sourceforge.net/
- WebHarvest -- User manual: http://web-harvest.sourceforge.net/manual.php
ScreenScraper
- ScreenScraper -- Web Data Extraction: http://www.screen-scraper.com/
- Running screen-scraper as a Server: http://community.screen-scraper.com/running_screen-scraper_as_a_server
- Invoking screen-scraper from Java: http://community.screen-scraper.com/invoking_screen-scraper_from_java
- Java SOAP Example: http://community.screen-scraper.com/Java_SOAP
- Web Scraping with Java & HtmlUnit: http://scraping.pro/web-scraping-java-htmlunit/
PHP
In PHP, you can request a web page, local data, or any other resource via a URI by using file_get_contents(), fopen(), fsockopen(), cURL, or passthru("wget ...").
Simple-HTML-DOM
- simplehtmldom: http://simplehtmldom.sourceforge.net/
Python
Scrapy
- Scrapy and Scrapyrt — how to create your own API from (almost) any website: https://medium.com/@mottet.dev/scrapy-and-scrapyrt-how-to-create-your-own-api-from-almost-any-website-ecfb0058ad64
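Before reaching for a framework such as Scrapy, the basic step of requesting a resource by URI can be sketched with Python's standard library alone. The function name `fetch`, the `ExampleScraper/0.1` user agent string, and the default timeout are assumptions for illustration. The demo call uses a `data:` URL as an offline stand-in for a real `http://` address.

```python
from urllib.request import Request, urlopen


def fetch(url: str, user_agent: str = "ExampleScraper/0.1", timeout: float = 10.0) -> str:
    """Request url and return the response body decoded to text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the response, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL keeps the demo offline; replace it with an http(s) URL in practice.
print(fetch("data:text/plain,hello%20world"))
```

Sending an identifying User-Agent header is a design choice here: it lets site operators recognise the scraper and apply their robots.txt rules to it.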
Resources
- Scraping website content with PHP using Curl: http://seopher.com/articles/scraping_website_content_with_php_using_curl
- Use HtmlUnit for Web Scraping: http://twit88.com/blog/2008/04/21/use-htmlunit-for-web-scraping/
- PHP -- Write a Web Page Scraper: http://twit88.com/blog/2008/02/04/php-write-a-web-page-scraper/
Tutorials
- Create a PHP web crawler or scraper in 5 minutes: http://vision-media.ca/resources/php/create-a-php-web-crawler-or-scraper-5-minutes [4]
- Scraping Links With PHP: http://www.merchantos.com/makebeta/php/scraping-links-with-php/
- Coding for Journalists 101 -- A four-part series: http://danwin.com/2010/04/coding-for-journalists-101-a-four-part-series/
- How to Write a Simple Recursive Web and Image Crawler in Perl: http://www.tidytutorials.com/2009/11/how-to-write-simple-recursive-web-and.html
- Web Crawler or WebRobot or Web Spider Working: http://codeglobe.blogspot.com/2009/02/web-crawler-or-webrobot-or-web-spider.html
- PHP Web Crawler basic example: http://www.pelaphptutorials.com/article/php-web-crawler.html
- Crawling Web Pages and Creating Sitemaps: http://www.kevinmusselman.com/2009/11/crawling-web-pages-for-sitemaps/
- Web Scraping Tutorial with Python - Tips and Tricks: https://hackernoon.com/web-scraping-tutorial-with-python-tips-and-tricks-db070e70e071
- Web Scraping, Regular Expressions, and Data Visualization -- Doing it all in Python: https://towardsdatascience.com/web-scraping-regular-expressions-and-data-visualization-doing-it-all-in-python-37a1aade7924 (an exercise in working out what 5 minutes of a university president's time is worth across several universities, based on a publicly available annual salary list)
- JavaScript RegEx to extract the anchor text and corresponding "href" URLs from all or specific <a> anchor tags: https://stackoverflow.com/questions/369147/javascript-regex-to-extract-anchor-text-and-url-from-anchor-tags
- Quickly extract all links from a web page using the browser console (in JS): https://towardsdatascience.com/quickly-extract-all-links-from-a-web-page-using-javascript-and-the-browser-console-49bb6f48127b
External Links
- wikipedia: Automated Content Access Protocol
- Video Search Engine and Video Grabber: http://phppod.com/Video-Search-Download.html
- The Best Scraper on the Web: http://xquery.typepad.com/xquery/2006/10/the_best_scrape.html
- Another Web 2.0 Scraper Company: http://incredibill.blogspot.com/2006/05/another-web-20-scraper-company_18.html
- PHP - Making a search engine: http://syntax.cwarn23.net/PHP/Making_a_search_engine
- Half of all internet traffic comes from bots: https://www.axios.com/half-internet-traffic-from-bots-2504285553.html
- Web scraping is legal, US appeals court reaffirms: https://techcrunch.com/2022/04/18/web-scraping-legal-court/
References
- ↑ Robots.txt: http://www.robotstxt.org/
- ↑ A skeptical look at the Automated Content Access Protocol: http://arstechnica.com/business/news/2008/01/skeptical-look-at-acap.ars
- ↑ ACAP logo: http://the-acap.org/getdoc/8744fb58-c63c-428d-bc24-3c47a54ef5ad/Add_ACAP_Enabled.aspx
- ↑ Create a PHP web crawler or scraper in 5 minutes: http://rockmanx.wordpress.com/2009/06/04/create-a-php-web-crawler-or-scraper-in-5-minutes/
See Also
Crawler | URI | Search Engine | AI | Robots.txt