Scraper

From BC$ MobileTV Wiki

A scraper is a script or program that runs for a set period of time (or indefinitely) to obtain content from another site.
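As a rough illustration of the "set period or indefinitely" behaviour, the following Python sketch polls a single URL until a deadline passes, or forever if no duration is given. The names fetch and run_scraper, and the duration, interval and fetcher parameters, are hypothetical and chosen only for this example; it uses nothing beyond the Python standard library.

```python
import time
from typing import Callable, Iterator, Optional
from urllib.request import urlopen


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download a URL and return its body decoded as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def run_scraper(
    url: str,
    duration: Optional[float] = None,
    interval: float = 60.0,
    fetcher: Callable[[str], str] = fetch,
) -> Iterator[str]:
    """Yield the content of `url` every `interval` seconds.

    With duration=None the generator never stops on its own (the
    "indefinitely" case); otherwise it stops once the next fetch
    would fall on or after the deadline.
    """
    deadline = None if duration is None else time.monotonic() + duration
    while True:
        yield fetcher(url)
        if deadline is not None and time.monotonic() + interval >= deadline:
            return
        time.sleep(interval)
```

A caller could iterate with something like "for page in run_scraper(url, duration=3600): ..." to scrape for roughly an hour. Passing a different fetcher makes the loop usable without network access, for instance in tests.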


Specifications


Contact Grabber


Screen Scraping

Screen scraping, in computer science, is a technique whereby the contents of a page (typically HTML accessed over the web) are parsed into a more easily understood format.
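To make "parsed into a more easily understood format" concrete, here is a minimal Python sketch that reduces an HTML page to a dictionary holding its title and the targets of its links. It relies only on the standard library's html.parser module; the class name TitleAndLinkParser, the function scrape and the sample markup are made up for this illustration, and a real scraper would likely need to cope with far messier pages.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collect the <title> text and every <a href="..."> value."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(html: str) -> dict:
    """Turn raw HTML into a small, structured summary."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


SAMPLE_HTML = (
    "<html><head><title> Example page </title></head>"
    "<body><a href='/about'>About</a> <a>no target</a> "
    '<a href="http://example.com/news">News</a></body></html>'
)

if __name__ == "__main__":
    print(scrape(SAMPLE_HTML))
```

The dictionary is just one possible target format; the same handler methods could equally feed a CSV row or a database record.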

Examples


Tools


Java

In Java, you could use the standard java.net.URL class to open a connection and fetch a page, or one of the following libraries:

Web-Harvest

Web-Harvest is an open-source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of Web content.

ScreenScraper

PHP

In PHP you can easily request a web page, local data or indeed any resource identified by a URI, using file_get_contents(), fopen(), fsockopen(), cURL or passthru("wget ...").

Simple-HTML-DOM

Python

Scrapy


Resources


Tutorials


External Links

References

  1. Robots.txt: http://www.robotstxt.org/
  2. A skeptical look at the Automated Content Access Protocol: http://arstechnica.com/business/news/2008/01/skeptical-look-at-acap.ars
  3. ACAP logo: http://the-acap.org/getdoc/8744fb58-c63c-428d-bc24-3c47a54ef5ad/Add_ACAP_Enabled.aspx
  4. Create a PHP web crawler or scraper in 5 minutes: http://rockmanx.wordpress.com/2009/06/04/create-a-php-web-crawler-or-scraper-in-5-minutes/

See Also

Crawler | URI | Search Engine | AI | Robots.txt