Configurable objects that can access web content, extract metadata from it, or filter it.
HTMLScraper - An XML-configurable HTML 'scraping' engine that extracts structured data from consistently formatted HTML pages, and from entire web sites through its 'deep' scraping (or 'site scraping') capability. The HTMLScraper is used by HTML search sources and by content acquisition, formatting, and filtering elements such as the HTMLScraperGateway, HTMLScraperFilter, and HTMLScraperPageImportRenderer. The WebSiteTreeBuilder also uses it to extract hierarchies from web site maps.
HTMLFilter - A configurable HTML filter built on the JDK's javax.swing.text.html.parser package.
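
To make the filtering idea concrete, here is a minimal sketch of a parser-driven HTML filter. It assumes the parser referred to above is the JDK's javax.swing.text.html.parser package. The class name TextOnlyFilterSketch, the filter method, and the skipTags parameter are hypothetical names chosen for illustration; they are not the HTMLFilter's actual API or configuration, which this section does not show.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical illustration only: not the real HTMLFilter class.
final class TextOnlyFilterSketch {

    /**
     * Returns the text content of the given HTML, omitting any text that
     * appears inside an element whose tag is listed in skipTags.
     */
    static String filter(String html, Set<HTML.Tag> skipTags) throws IOException {
        StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Number of currently open elements that are being skipped.
            private int skipDepth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (skipTags.contains(tag)) {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (skipTags.contains(tag) && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Keep text only when no skipped element is open.
                if (skipDepth == 0) {
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(data);
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(reader, callback, true);
        }
        return out.toString();
    }
}
```

A caller might invoke it as `TextOnlyFilterSketch.filter(page, Set.of(HTML.Tag.DIV))` to drop the contents of every div element. In this sketch the set of skipped tags plays the role of the filter's configuration; a real configurable filter would presumably read such settings from its own configuration source.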