Automatic Information Extraction
Information extraction is the task of extracting structured data from unstructured or semi-structured documents. Extraction from Web documents, in particular, generally leverages the HTML structure of Web pages to identify and extract data. The underlying assumption is that the formatting of a page, and more specifically its HTML markup, reflects the structure of the information that the page contains and presents. The job of a wrapper is to extract the data contained in the Web pages. Researchers have proposed many solutions for information extraction. These solutions can be classified according to the data source and the degree of automation involved in the process.
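To make the notion of a wrapper concrete, the following is a minimal sketch in Python, assuming a page whose records are laid out as table rows. The class `RowWrapper`, the helper `extract_rows`, and the sample page `PAGE` are hypothetical names introduced only for illustration and do not come from any of the cited systems. The sketch relies on the assumption stated above: the HTML markup (here `<tr>` and `<td>`) mirrors the structure of the data.

```python
from html.parser import HTMLParser


class RowWrapper(HTMLParser):
    """Toy wrapper: collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # extracted records, one tuple of cells per row
        self._row = None   # cells of the row being parsed, or None outside a row
        self._cell = None  # text fragments of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text that appears inside a <td> is part of a record.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, e.g. header rows
                self.rows.append(tuple(self._row))
            self._row = None


def extract_rows(html):
    """Apply the wrapper to one page and return the extracted records."""
    parser = RowWrapper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample page used only for this illustration.
PAGE = """
<html><body>
<h1>Countries</h1>
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Indonesia</td><td>Jakarta</td></tr>
</table>
</body></html>
"""

print(extract_rows(PAGE))
```

A hand-written wrapper of this kind is tied to one page layout, which is why the systems discussed in this section try to induce wrappers automatically.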
Taxonomy Based on Data Source
Information extraction (IE) has been applied to different sources of data, including emails [41, 42, 43], Web documents [44, 45, 46, 47, 48], and para-relational data [49, 27, 50].
Zhang et al. [41] present an approach to extract structured data from emails. The repetitive structures of emails are used to induce templates. The content of an email that is not part of a template is then annotated with its semantic type. Common types such as dates, addresses, and prices are pre-annotated using existing annotators. Text with no pre-annotated type is then classified as either a product name or not. The classification is performed by building a weak classifier using Conditional Random Fields and then applying the Expectation-Maximization algorithm.
In [42], Sheng et al. propose Juicer, a framework for extracting information from a large-scale email service without compromising the privacy of the users. Juicer learns from a representative sample of emails by clustering them according to the templates from which the emails are instantiated (purchase receipts, bills, hotel reservations, etc.). Juicer then applies a set of classifiers and field extractors to each cluster. The result of this learning process is aggregated into a set of static rules, which is then used online to extract the needed information from newly arrived emails.
Mahlawi et al. [43] propose an approach to extract structured data from emails about a specific domain. The data extraction process consists of several sub-processes: keyword extraction, sentiment analysis, regular expression extraction, entity extraction, and summary extraction. Keywords are extracted by word frequency analysis with reference to the individual email as well as the whole corpus. The sentiment analysis process leverages a dictionary corpus to evaluate the sentiment value of each email. Predefined regular expressions are used to extract important information such as the email addresses of the sender and receiver, dates, URLs, and phone numbers. Entities are extracted using part-of-speech (POS) tagging: common nouns are removed from the tagged output, and the remaining nouns are identified as entities. To summarise the content of the email, the authors build a graph with sentences as the vertices. If two sentences overlap in content or are semantically similar, an edge is drawn between the corresponding vertices. The number of inbound edges is used to rank the vertices. The summary is extracted from the graph as the most important vertex and its corresponding edges.
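The graph-based summarisation step can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of [43]: sentence overlap is approximated by the Jaccard similarity of content words, the threshold of 0.1 and the stopword list are arbitrary, and the graph is undirected, so the number of inbound edges of a vertex equals its degree. The names `content_words`, `build_graph`, `summarise`, and the sample `EMAIL` are hypothetical.

```python
import re

# Small, arbitrary stopword list (an assumption made for this sketch).
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "on",
             "for", "will", "be", "your", "in", "at"}


def content_words(sentence):
    """Lower-case the sentence and keep the set of its non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}


def build_graph(sentences, threshold=0.1):
    """Link two sentences when the Jaccard overlap of their words reaches the threshold."""
    words = [content_words(s) for s in sentences]
    edges = {i: set() for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            union = words[i] | words[j]
            if union and len(words[i] & words[j]) / len(union) >= threshold:
                edges[i].add(j)
                edges[j].add(i)
    return edges


def summarise(sentences, threshold=0.1):
    """Return the best-connected sentence together with its neighbours, in original order."""
    if not sentences:
        return []
    edges = build_graph(sentences, threshold)
    # Rank vertices by number of edges; ties go to the earliest sentence.
    top = max(edges, key=lambda i: (len(edges[i]), -i))
    keep = sorted({top} | edges[top])
    return [sentences[i] for i in keep]


# Hypothetical email body, already split into sentences.
EMAIL = [
    "Your order of the blue kettle has shipped.",
    "The blue kettle will arrive on Monday.",
    "Track the order on our website.",
    "Thank you for shopping with us.",
]

for sentence in summarise(EMAIL):
    print(sentence)
```

A semantic similarity measure could replace the word-overlap test inside `build_graph` without changing the ranking step.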
Taxonomy Based on Degree of Automation of the Process
Early information extraction systems required the user to write the extraction rules, or wrappers, by hand. This manual method has since been superseded by machine learning techniques that generate the rules automatically. From the point of view of how the wrappers are generated, information extraction systems can be classified as supervised, semi-supervised, and unsupervised systems. In supervised IE systems, a set of Web pages annotated with the data to be extracted is used as input. The IE system then infers the wrapper from the labelled Web pages.
RAPIER [53] generates extraction rules in a bottom-up manner. It starts with the most specific rules and then iteratively replaces them with more general ones. The generated rules can only extract single-slot data. Each rule is composed of three patterns: a pattern that matches the text preceding the filler, a pattern that matches the filler itself, and a pattern that matches the text following the filler. In this learning process, RAPIER combines syntactic and semantic information about the input Web pages. This information comes from a part-of-speech tagger and a lexicon such as WordNet [54].
In contrast to RAPIER, WHISK [55] can extract multi-slot information. The rules are generated using a covering learning algorithm and can be applied to various kinds of documents, including structured and free text. When applied to free text, the input data needs to be labelled by a syntactic analyser and a semantic tagger. WHISK generates its rules, which take the form of regular expressions, in a top-down manner. The initial rules are the most general ones, which cover all instances; WHISK then repeatedly adds a new term to specialise them.
In NoDoSe [56], a user can interactively decompose a semi-structured document hierarchically, which allows NoDoSe to handle nested objects. It separates the text from the HTML code and applies heuristic-based mining components to each group. The goal of a mining component is to find the common prefixes and suffixes that identify the different attributes. The output of this mining task is a tree describing the document structure.
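The difference between a single-slot rule built from three patterns and a multi-slot rule expressed as one regular expression can be sketched as follows. This is a simplification under stated assumptions: the rules are written by hand rather than learned, and they operate on raw characters, whereas the systems above also use syntactic and semantic tags. The helper `single_slot_rule`, the rules `PRICE_RULE` and `RENTAL_RULE`, and the sample text `AD` are hypothetical.

```python
import re


def single_slot_rule(pre_filler, filler, post_filler):
    """Compile a three-part rule: the filler is captured only between both contexts."""
    return re.compile(f"(?:{pre_filler})({filler})(?:{post_filler})")


# Hypothetical single-slot rule for the "price" slot of a rental advertisement.
PRICE_RULE = single_slot_rule(r"rent\s+is\s+\$", r"\d+", r"\s+per\s+month")

# Hypothetical multi-slot rule: one regular expression fills the
# "bedrooms" and "price" slots of the same record together.
RENTAL_RULE = re.compile(
    r"(?P<bedrooms>\d+)\s*br\b.*?\$(?P<price>\d+)", re.IGNORECASE
)


def extract_single(rule, text):
    """Return every filler matched by a single-slot rule."""
    return [m.group(1) for m in rule.finditer(text)]


def extract_multi(rule, text):
    """Return one dictionary of slot values per record matched by a multi-slot rule."""
    return [m.groupdict() for m in rule.finditer(text)]


# Hypothetical advertisement text.
AD = "Capitol Hill, 1 br apartment, rent is $675 per month. Also a 3 BR house for $995."

print(extract_single(PRICE_RULE, AD))
print(extract_multi(RENTAL_RULE, AD))
```

The single-slot rule returns isolated values, so the link between a price and the number of bedrooms is lost. The multi-slot rule keeps the slots of each record together, which is the property attributed to WHISK above.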
Table of contents:
Contents
List of Figures
List of Tables
Acknowledgements
1 Introduction
1.1 Motivations
1.2 Contributions
1.3 Outline of the Thesis
2 Related Work
2.1 Automatic Information Extraction
2.1.1 Taxonomy Based on Data Source
2.1.2 Taxonomy Based on Degree of Automation of the Process
2.2 Set Expansion
2.2.1 Taxonomy Based on Data Source
2.2.2 Taxonomy Based on Target Relations
2.2.3 Taxonomy Based on Pattern Construction
2.2.4 Taxonomy Based on Ranking Mechanism
2.3 Truth Finding
2.3.1 Taxonomy of Truth Finding Approach
2.3.2 Ensemble Approach
2.4 Topic Labelling
2.5 Multi-label Classification
I Finding the Truth in Set of Tuples Expansion
3 Set of Tuples Expansion
3.1 Problem Definition
3.2 Proposed Approach
3.2.1 Crawling
3.2.2 Wrapper Generation and Candidates Extraction
3.2.3 Ranking
3.3 Performance Evaluation
3.3.1 Experiment Setting
3.3.2 Result and Discussion
3.4 Comparison with state-of-the-art
3.5 Conclusion
4 Truthfulness in Set of Tuples Expansion
4.1 Problem Definition
4.2 Proposed Approach
4.3 Performance Evaluation
4.3.1 Experiment Setting
4.3.2 Result and Discussion
4.4 Conclusion
5 Tuple Reconstruction
5.1 Problem Definition
5.2 Proposed Approach
5.3 Performance Evaluation
5.3.1 Experiment Setting
5.3.2 Result and Discussion
5.4 Conclusion
II Expanding the Set of Examples Semantically
6 Topic Labelling with Truth Finding
6.1 Introduction
6.2 Problem Definition
6.3 Proposed Approach
6.3.1 Generating Candidate Labels
6.3.2 Ranking the Labels
6.4 Performance Evaluation
6.4.1 Experiment Setting
6.4.2 Experiments
6.4.3 Result and Discussion
6.5 Conclusion
7 Set Labelling using Multi-label Classification
7.1 Introduction
7.2 Problem Definition
7.3 Proposed Approach
7.3.1 Training Process
7.3.2 User Interaction
7.4 Performance Evaluation
7.4.1 Experiment Setting
7.4.2 Result and Discussion
7.5 Conclusion
8 Conclusions and Perspectives
8.1 Conclusive Summary
8.2 Perspectives
A Ranking of Sources
B Summary in French
B.1 Set of Tuples Expansion
B.1.1 Crawling
B.1.2 Wrapper Generation and Candidates Extraction
B.1.3 Ranking
B.1.4 Comparison with state-of-the-art
B.2 Truthfulness in Set of Tuples Expansion
B.2.1 Problem Definition
B.2.2 Proposed Approach
B.2.3 Performance Evaluation
B.3 Tuple Reconstruction
B.3.1 Problem Definition
B.3.2 Proposed Approach
B.3.3 Performance Evaluation
B.4 Topic Labelling with Truth Finding
B.4.1 Proposed Approach
B.4.2 Performance Evaluation
B.5 Set Labelling
B.5.1 Problem Definition
B.5.2 Proposed Approach
B.5.3 Performance Evaluation
B.6 Conclusion