contract between content-provider and content-consumer. Scrapers must design their tools around a model of the source content and hope that the provider consistently adheres to this model of presentation. Web sites have a tendency to overhaul their look-and-feel periodically to remain fresh and stylish, which creates severe maintenance headaches for scrapers because their tools are likely to fail.
The second issue is the lack of sophisticated, re-usable screen-scraping toolkit software, colloquially known as scrAPIs. The dearth of such APIs and toolkits is largely due to the extremely application-specific needs of each individual scraping tool. This leads to large development overheads, as designers are forced to reverse-engineer content, develop data models, and parse and aggregate raw data from the provider’s site.
Semantic Web and RDF
The inelegant aspects of screen scraping are directly traceable to the fact that content created for human consumption does not make good content for automated machine consumption. Enter the Semantic Web, which is the vision that the existing Web can be augmented to supplement the content designed for humans with equivalent machine-readable information. In the context of the Semantic Web, the term information is different from data; data becomes information when it conveys meaning (that is, it is understandable). The Semantic Web has the goal of creating Web infrastructure that augments data with metadata to give it meaning, thus making it suitable for automation, integration, reasoning, and re-use.
The W3C family of specifications collectively known as the Resource Description Framework (RDF) serves this purpose by providing methodologies to establish syntactic structures that describe data. XML in itself is not sufficient; it is too arbitrary in that you can code it in many ways to describe the same piece of data. RDF Schema adds to RDF’s ability to encode concepts in a machine-readable way. Once data objects can be described in a data model, RDF provides for the construction of relationships between data objects through subject-predicate-object triples (“subject S has relationship R with object O”). The combination of data model and graph of relationships allows for the creation of ontologies, which are hierarchical structures of knowledge that can be searched and formally reasoned about. For example, you might define a model in which “carnivore-type” is a subclass of “animal-type” with the constraint that it “eats” other “animal-types”, and create two instances of it: one populated with data concerning cheetahs and polar bears and their habitats, another concerning gazelles and penguins and their respective habitats. Inference engines might then “mash” these separate model instances and reason that cheetahs might prey on gazelles but not penguins.
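The cheetah-and-penguin reasoning above can be sketched in a few lines of plain Python. This is only an illustration of the triple model, not an RDF toolkit; the predicate names (“subClassOf”, “type”, “habitat”) and the shared-habitat rule are illustrative assumptions, not drawn from any particular RDF vocabulary.

```python
# Triples as plain (subject, predicate, object) tuples -- an illustrative
# stand-in for an RDF graph, with made-up predicate names.
triples = {
    ("carnivore-type", "subClassOf", "animal-type"),
    ("cheetah", "type", "carnivore-type"),
    ("polar-bear", "type", "carnivore-type"),
    ("gazelle", "type", "animal-type"),
    ("penguin", "type", "animal-type"),
    ("cheetah", "habitat", "savanna"),
    ("polar-bear", "habitat", "arctic"),
    ("gazelle", "habitat", "savanna"),
    ("penguin", "habitat", "antarctic"),
}

def habitats(x):
    # All objects reachable from x via the "habitat" predicate.
    return {o for (s, p, o) in triples if s == x and p == "habitat"}

def is_carnivore(x):
    # Here we simply check the type edge; a real reasoner would also
    # follow subClassOf chains.
    return (x, "type", "carnivore-type") in triples

def may_prey_on(predator, prey):
    # Assumed rule: a carnivore may prey on an animal it shares a
    # habitat with.
    return is_carnivore(predator) and bool(habitats(predator) & habitats(prey))

print(may_prey_on("cheetah", "gazelle"))  # True: both live on the savanna
print(may_prey_on("cheetah", "penguin"))  # False: no habitat in common
```

Merging the two model instances amounts to nothing more than set union over their triples, which is exactly what makes RDF data easy to “mash” compared to two arbitrary XML documents.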
RDF data is quickly finding adoption in a variety of domains, including social networking applications (such as FOAF — Friend of a Friend) and syndication (such as RSS, which I describe next). In addition, RDF software technology and components are beginning to reach a level of maturity, especially in the areas of RDF query languages (such as RDQL and SPARQL) and programmatic frameworks and inference engines (such as Jena and Redland).
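At their core, query languages such as SPARQL match a graph pattern containing variables against the triples in a store. The matching step can be sketched in plain Python; the data and the “?”-prefixed variable convention below are illustrative assumptions, not real SPARQL syntax or any toolkit’s API.

```python
# A sketch of the basic pattern matching that underlies triple-store
# queries: terms beginning with "?" are variables, everything else
# must match the data exactly.
triples = [
    ("cheetah", "type", "carnivore-type"),
    ("gazelle", "type", "animal-type"),
    ("cheetah", "eats", "gazelle"),
]

def match(pattern):
    """Yield one {variable: value} binding per triple matching the pattern."""
    for triple in triples:
        bindings, ok = {}, True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                bindings[term] = value   # bind the variable
            elif term != value:
                ok = False               # constant failed to match
                break
        if ok:
            yield bindings

# Rough analogue of:  SELECT ?prey WHERE { cheetah eats ?prey }
prey = [b["?prey"] for b in match(("cheetah", "eats", "?prey"))]
print(prey)  # ['gazelle']
```

A full query engine additionally joins the bindings from several such patterns and follows inferred edges, but the variable-binding idea is the same.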
RSS and Atom
RSS is a family of XML-based syndication formats. In this context, syndication implies that a Web site that wants to distribute content creates an RSS document and registers the document with an RSS publisher. An RSS-enabled client can then check the publisher’s feed for new content and react to it in an appropriate manner. RSS has been adopted to syndicate a wide variety of content, ranging from news articles and headlines to changelogs for CVS checkins or wiki pages, project updates, and even audiovisual data such as radio programs. Version 1.0 is RDF-based, but the