contract between content-provider and content-consumer. Scrapers must design their tools around
a model of the source content and hope that the provider consistently adheres to this model of
presentation. Web sites have a tendency to overhaul their look-and-feel periodically to remain fresh
and stylish, which imposes severe maintenance headaches on the scrapers because their
tools are likely to fail.
The second issue is the lack of sophisticated, re-usable screen-scraping toolkit software, colloquially
known as scrAPIs. The dearth of such APIs and toolkits is largely due to the extremely application-specific
needs of each individual scraping tool. This leads to large development overheads as designers
are forced to reverse-engineer content, develop data models, parse, and aggregate raw data from
the provider’s site.
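To make this fragility concrete, here is a minimal, purely illustrative sketch in Python; the markup, class name, and scraper are invented for this example and do not come from any real provider. The scraper extracts prices by assuming the provider marks them up with a particular attribute, so a cosmetic redesign that renames that attribute breaks the tool without warning.

# Minimal illustration of screen-scraping fragility (hypothetical markup).
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Collects the text of any element whose class attribute is 'price'."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Hard-coded assumption about the provider's presentation model.
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

html = '<ul><li class="price">$19.99</li><li class="price">$5.00</li></ul>'
scraper = PriceScraper()
scraper.feed(html)
print(scraper.prices)  # ['$19.99', '$5.00'] -- until the site renames the class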
Semantic Web and RDF
The inelegant aspects of screen scraping are directly traceable to the fact that content created for
human consumption does not make good content for automated machine consumption. Enter the
Semantic Web, which is the vision that the existing Web can be augmented to supplement the
content designed for humans with equivalent machine-readable information. In the context of the
Semantic Web, the term information is different from data; data becomes information when it
conveys meaning (that is, it is understandable). The Semantic Web has the goal of creating Web
infrastructure that augments data with metadata to give it meaning, thus making it suitable for
automation, integration, reasoning, and re-use.
The W3C family of specifications collectively known as the Resource Description Framework (RDF)
serves this purpose of providing methodologies to establish syntactic structures that describe
data. XML in itself is not sufficient; it is too arbitrary in that you can code it in many ways to
describe the same piece of data. RDF-Schema adds to RDF’s ability to encode concepts in a
machine-readable way. Once data objects can be described in a data model, RDF provides for the
construction of relationships between data objects through subject-predicate-object triples (“subject
S has relationship R with object O”). The combination of data model and graph of relationships
allows for the creation of ontologies, which are hierarchical structures of knowledge that can be
searched and formally reasoned about. For example, you might define a model in which "carnivore-type"
is a subclass of "animal-type" with the constraint that it "eats" other "animal-type" instances, and
create two instances of it: one populated with data concerning cheetahs and polar bears and their
habitats, another concerning gazelles and penguins and their respective habitats. Inference engines
might then “mash” these separate model instances and reason that cheetahs might prey on gazelles
but not penguins.
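As a rough sketch of how such a model might be expressed, the following Python fragment uses the rdflib library to encode the animal example as subject-predicate-object triples. The namespace URI, class names, and properties are invented purely for illustration; no concrete schema is prescribed here.

# Sketch of the animal/carnivore model as RDF triples (hypothetical schema).
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/animals#")  # invented namespace
g = Graph()
g.bind("ex", EX)

# Schema: carnivore-type is a subclass of animal-type and has an "eats" relation.
g.add((EX.AnimalType, RDF.type, RDFS.Class))
g.add((EX.CarnivoreType, RDFS.subClassOf, EX.AnimalType))
g.add((EX.eats, RDF.type, RDF.Property))

# Instance data: "subject S has relationship R with object O".
g.add((EX.Cheetah, RDF.type, EX.CarnivoreType))
g.add((EX.Cheetah, EX.habitat, Literal("savanna")))
g.add((EX.Gazelle, RDF.type, EX.AnimalType))
g.add((EX.Gazelle, EX.habitat, Literal("savanna")))

print(g.serialize(format="turtle"))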
RDF data is quickly finding adoption in a variety of domains, including social networking applications
(such as FOAF — Friend of a Friend) and syndication (such as RSS, which I describe next). In
addition, RDF software technology and components are beginning to reach a level of maturity,
especially in the areas of RDF query languages (such as RDQL and SPARQL) and programmatic
frameworks and inference engines (such as Jena and Redland).
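As an illustration of what querying RDF looks like, the sketch below runs a small SPARQL query with rdflib's built-in engine over the same invented animal data; a production system would more likely sit on a framework such as Jena or Redland, as noted above.

# Sketch: querying invented animal data with SPARQL via rdflib.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/animals#")
g = Graph()
g.add((EX.CarnivoreType, RDFS.subClassOf, EX.AnimalType))
g.add((EX.Cheetah, RDF.type, EX.CarnivoreType))
g.add((EX.Cheetah, EX.habitat, Literal("savanna")))
g.add((EX.Gazelle, RDF.type, EX.AnimalType))
g.add((EX.Gazelle, EX.habitat, Literal("savanna")))

# Find every resource whose habitat is the savanna.
query = """
    PREFIX ex: <http://example.org/animals#>
    SELECT ?animal WHERE { ?animal ex:habitat "savanna" . }
"""
for row in g.query(query):
    print(row.animal)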
RSS and ATOM
RSS is a family of XML-based syndication formats. In this context, syndication implies that a Web
site that wants to distribute content creates an RSS document and registers the document with an
RSS publisher. An RSS-enabled client can then check the publisher’s feed for new content and react
to it in an appropriate manner. RSS has been adopted to syndicate a wide variety of content,
ranging from news articles and headlines to changelogs for CVS check-ins or wiki pages, project
updates, and even audiovisual data such as radio programs. Version 1.0 is RDF-based, but the