Deriving Temporal Aspects from Web Pages
In this chapter, we review major approaches to change detection in Web pages and to the extraction of temporal properties (especially timestamps) of Web pages. We focus on techniques and systems proposed in the last ten years, and analyze them to gain insight into the practical solutions and best practices available. We aim to provide an analytical view of the range of methods that can be used, distinguishing them along several dimensions: in particular, their static or dynamic nature, the way Web pages are modeled, and, for the more dynamic methods that rely on comparing successive versions of a page, the similarity metrics used. We study in more detail the method that uses Web feeds for change detection, and finally highlight the need for detecting changes at the level of Web objects.
The content of this chapter aggregates a survey paper [Oita and Senellart, 2011] on deriving temporal properties from Web pages and another publication [Oita and Senellart, 2010b], in which we study the value of Web feeds in the change detection process and present further statistics on feeds.
Web Dynamics
The World Wide Web challenges our capacity to develop tools that can keep track of the huge amount of information being modified at a rapid rate. The ability to capture the temporal dimension of textual information on the Web is essential to many natural language processing (NLP) applications, such as question answering, automatic summarization, and information retrieval (IR). Inferring temporal aspects is of interest in various applications and domains, such as large-scale information monitoring and delivery systems [Douglis et al., 1998; Flesca and Masciari, 2007; Jacob et al., 2004; Lim and Ng, 2001; Liu et al., 2000] or services, Web cache improvement [Cho and Garcia-Molina, 2000], version configuration and management of Web archives [Pehlivan et al., 2009], active databases [Jacob et al., 2004], the servicing of continuous queries [Abiteboul, 2002], etc. These applications use change detection techniques with the same aim as temporal processing at the textual level, but at the level of semi-structured data instead. The temporal dimension of Web content can be computed in a static manner, by deriving it using means inherent to the Web page itself (see § 2.1), in a more dynamic manner, which implies comparing versions of the same Web page or data object (see § 2.2), or through estimations (see § 2.3).
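To illustrate the static manner of deriving a timestamp, a page's last modification date can sometimes be read directly from HTTP metadata. The following minimal sketch (an illustration of ours, not a system from the surveyed literature) parses the `Last-Modified` header when a server supplies one:

```python
from email.utils import parsedate_to_datetime

def parse_last_modified(headers):
    """Extract and parse an HTTP Last-Modified header, if present.

    headers: a mapping of HTTP response header names to values.
    Returns a timezone-aware datetime, or None when the server
    did not provide the header (a frequent case for dynamic pages,
    which is one limitation of purely static timestamping).
    """
    value = headers.get("Last-Modified")
    return parsedate_to_datetime(value) if value else None

# Example: headers as returned by an HTTP client for a static page
headers = {"Last-Modified": "Wed, 01 Jun 2011 08:30:00 GMT"}
ts = parse_last_modified(headers)
```

In practice, as the survey discusses, such server-provided timestamps are often missing or unreliable, which motivates the comparative and estimative methods below.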
The bridge between these methods is most visible in Web crawling applications. In this field, which is closely related to Web archiving, substantial effort is devoted to maximizing the freshness of a Web collection or index.
Statistical: Change Estimation
Estimative models. In crawling-related applications, the interest lies rather in whether a Web page has changed or not, in order to decide whether a new version of the page should be downloaded.
From this perspective, a simple estimation of the change frequency is as effective as explicitly computing it, as we show in § 2.2. If Web crawlers were more aware of the semantics of the data they process, they could clearly benefit from a broader, richer insight into the different facets of a Web page, and could develop different strategies for storage and processing.
An estimation of the change rate, although it describes neither where a change appeared nor its type, is still useful, especially if we imagine a strategy that combines estimative and comparative methods of deriving dynamics. For this reason, we briefly present some of the existing statistical approaches.
Table of contents:
List of Tables
List of Figures
List of Algorithms
I. Introduction
§ 1. Research Context
§ 2. Outlining Contributions
§ 3. Global Vision
II. Contributions
1. Deriving Temporal Aspects from Web Pages
§ 1. Web Dynamics
§ 2. Survey
§ 2.1. Static: Timestamping
§ 2.2. Comparative Methods
§ 2.2.1. Document Models
§ 2.2.2. Similarity Metrics
§ 2.2.3. Types of Changes
§ 2.2.4. Change Representation
§ 2.3. Statistical: Change Estimation
§ 2.4. Towards an Object-Centric Model for Change Detection
§ 3. Web Feeds
§ 3.1. Ephemeral Content
§ 3.2. Feed Files Structure
§ 3.3. Characteristics
§ 3.4. Web Page Change Detection
§ 4. Perspectives
2. Web Objects Extraction
§ 1. Context
§ 2. Related Work
§ 3. Signifier-based Methods
§ 3.1. Signifiers
§ 3.2. Acquisition
§ 4. Preliminary Notions
§ 5. SIGFEED
§ 5.1. Intuitions and Implementation
§ 5.2. Experiments
§ 6. FOREST
§ 6.1. Introduction
§ 6.2. Methodology
§ 6.2.1. Structural Patterns
§ 6.2.2. Informativeness Measure
§ 6.2.3. Combining Structure and Relevance
§ 6.3. Experiments
3. Discovering the Semantics of Objects
§ 1. The Deep Web
§ 2. Related Work
§ 3. Envisioned Approach
§ 3.1. Form Analysis and Probing
§ 3.2. Record Identification
§ 3.3. Output Schema Construction
§ 3.4. Input and Output Schema Mapping
§ 3.5. Labeled Graph Generation
§ 3.6. Ontology Alignment using PARIS
§ 3.7. Form Understanding and Ontology Enrichment
§ 4. Preliminary Experiments
III. Discussion
§ 1. Conclusions
§ 2. Further Research
IV. Résumé en français
§ 1. Contexte de recherche
§ 2. Description des contributions
§ 3. Extraction du contenu pertinent
§ 4. FOREST
§ 5. Autres directions de recherche étudiées
Bibliography