

Sequence Graphs and Ambiguity in Language Models: a Combinatorial Study

The intent of several natural language models is to extract the semantic information contained in a sentence or a document. A popular way to achieve this objective is to consider the co-occurrences of words within a context of size w (Mikolov et al., 2013a; Pennington et al., 2014; Arora et al., 2016b), which can be represented as a weighted graph (which we refer to as a sequence graph, or graph of words). However, these representations induce a level of ambiguity, as displayed in Figure 1.3.

Figure 1.3 – The two sequence graphs built for the sentence "To be or not to be" using window sizes 3 (a) and 2 (b) respectively. In the second case, the information is insufficient to reconstruct the sentence, hence creating ambiguity, since any circular permutation of the words admits the same representation. Chapter 3 explores this question in generality.

In Chapter 3, we are interested in questions related to the level of ambiguity generated by these models. In particular, we define the notion of a sequence graph, a combinatorial object encoding their information. Two main theoretical combinatorial properties are investigated: the existence of a sequence realizing an arbitrary given graph, and the number of possible sequences (i.e. sentences or documents) from which a given graph can originate. We also show an experimental comparison between these context-based models and recent sequential models derived from recurrent neural networks, which by construction do not share this ambiguity.
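To make the construction concrete, the following minimal sketch builds the weighted graph of words of a sentence for a given window size w. The counting convention (undirected edges, co-occurrences at distance at most w − 1, self-loops ignored) is one natural reading of the description above and may differ in details from the formal definition given in Chapter 3.

```python
from collections import Counter

def sequence_graph(tokens, w):
    """Weighted graph of words: edge {u, v} counts how many times u and v
    co-occur within a window of w consecutive tokens (distance <= w - 1)."""
    edges = Counter()
    for i, u in enumerate(tokens):
        for j in range(i + 1, min(i + w, len(tokens))):
            v = tokens[j]
            if u != v:                      # self-loops ignored in this sketch
                edges[tuple(sorted((u, v)))] += 1
    return edges

sentence = "to be or not to be".split()
print(sequence_graph(sentence, 3))  # window size 3: edges at distance 1 and 2
print(sequence_graph(sentence, 2))  # window size 2: adjacent pairs only
```

With w = 2 the resulting graph is a single cycle on {to, be, or, not}, so any circular permutation of the sentence yields the same graph: this is the ambiguity illustrated in Figure 1.3.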

Entity Identification and Analysis of a Knowledge Graph

Despite the form of ambiguity induced by co-occurrence based models discussed in Chapter 3, we will show in Chapter 4 that they can be very efficient for information retrieval problems. The structure of a document can be analysed through the prism of the role of central entities in the text, motivating the question of identifying them automatically. In the context of natural language processing, the definition of an entity is not universal, but it will be properly defined in Chapter 2 and Chapter 4. In this section, let us consider that an entity is a real-world object, usually with a physical existence, and usually denoted by a proper name. The problem of entity discovery consists in detecting these entities and identifying them within a database. For instance, in the sentence "John Kennedy was the president of the United States.", a named entity recognition program should be able to label John Kennedy and United States as entities (and the remaining words as non-entities), and to classify these entities within categories such as Person, Place, or Organisation. The process of identification, which we will be interested in in Chapter 4, consists in finding the pertinent corresponding entry in a database. These two problems are interesting in the sense that they make it possible to create structure from a natural language document, thereby contributing to its automatic processing.

More precisely, in the frame of Chapter 4, this "database" is represented by a knowledge base, or knowledge graph. A knowledge graph is a form of relational database providing supplementary descriptive and semantic information about entities. The semantic information is contained in a graph structure, where a node represents an entity and an edge represents a semantic relation. Each node of the graph potentially carries metadata, such as a text description or a classification via an ontology. Therefore, in Chapter 4, we suppose that the identification of entities within text documents always relies on the existence of background knowledge. We also provide in Section 8.2 of the Appendix an analysis of a knowledge graph constructed from economic and financial data.
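As an illustration of the setting (and not of the method of Chapter 4), the following toy sketch stores two entities with metadata and one semantic relation, and resolves mentions by naive label matching; the identifiers, aliases, and matching rule are made up for this example.

```python
# Toy knowledge graph: nodes are entities carrying metadata, edges are semantic relations.
# Identifiers and the matching rule are illustrative only; Chapter 4 relies on
# graph-based identification, not on this naive string match.
nodes = {
    "e1": {"label": "John F. Kennedy", "type": "Person",
           "aliases": {"John Kennedy", "JFK"},
           "description": "35th president of the United States"},
    "e2": {"label": "United States", "type": "Place",
           "aliases": {"USA", "United States of America"},
           "description": "country in North America"},
}
edges = [("e1", "head_of_state_of", "e2")]     # one semantic relation

def identify(mention):
    """Return the entry whose label or alias matches the mention, if any."""
    for node_id, meta in nodes.items():
        if mention == meta["label"] or mention in meta["aliases"]:
            return node_id, meta["type"]
    return None                                 # unresolved mention

for mention in ("John Kennedy", "United States", "president"):
    print(mention, "->", identify(mention))
```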

Study of a Geometrical Conjecture and Partition Function Property for Word Vectors

Co-occurrence based models are also used to generate word or sentence representations in vectorial form. Indeed, the majority of machine learning algorithms take vectorial input, whereas the native structure of text is a discrete sequence. Several procedures exist to generate word embeddings, which we present in Chapter 2. In Chapter 5, we discuss several properties of these representations. First, we discuss a geometrical conjecture relating semantic analogies and word vectors, illustrated in Figure 1.4.

Figure 1.4 – Example of a geometric property of word vectors for analogies (the x and y axes represent the coordinates of the word vectors, and each colour corresponds to a quadruplet of words involved in an analogy), discussed in Chapter 5.
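To make the conjecture concrete: for an analogy "a is to b as c is to d", the property states that the offsets $v_b - v_a$ and $v_d - v_c$ are approximately equal, so the four word vectors form an approximate parallelogram. The sketch below checks this on made-up two-dimensional vectors; real embeddings live in much higher dimension, and the precise statement studied in Chapter 5 may differ.

```python
import numpy as np

# Made-up 2-D vectors for the analogy "man is to king as woman is to queen".
v = {
    "man":   np.array([1.0, 0.2]),
    "king":  np.array([1.8, 1.1]),
    "woman": np.array([0.4, 0.3]),
    "queen": np.array([1.2, 1.2]),
}

# Geometric property: the two offsets are approximately equal, i.e. the four
# points (man, king, woman, queen) form an approximate parallelogram.
offset_ab = v["king"] - v["man"]
offset_cd = v["queen"] - v["woman"]
print(offset_ab, offset_cd, np.linalg.norm(offset_ab - offset_cd))
```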
Besides, we discuss a probabilistic property of word vectors, namely the concentration of the partition function. This property was presented in previous work introducing a generative model for word vectors (Arora et al., 2016a), where the text generation process is driven by a random walk $(c_t)_{1 \leq t \leq T}$ representing a latent discourse vector. If $w_t$ is the word at step $t$, $c_t$ the discourse vector, and $v_w$ the vector of word $w$, then this model supposes the following conditional distribution:
$$P(w_t = w \mid c_t) \propto \exp(\langle c_t, v_w \rangle).$$
The partition function $Z_c$, similarly to statistical physics, is computed as a sum over states, $Z_c = \sum_{w} \exp(\langle v_w, c \rangle)$. Under some assumptions, it can be proven that this partition function concentrates around its mean with high probability. It was suggested that this concentration plays an important role in co-occurrence based models, since it allows formal relations to be derived between pointwise mutual information (PMI), which is purely statistical information based on co-occurrences of words, and the scalar product of word vectors, which is a geometric quantity. By weakening the assumptions of the generative model, in particular that the word vectors are generated uniformly and independently, and that the discourse vectors lie close to a sphere of fixed radius $R$, we show that a similar concentration phenomenon occurs, suggesting this property is not specific to these assumptions.
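The phenomenon can be observed numerically. The sketch below assumes i.i.d. Gaussian word vectors and discourse vectors on the unit sphere, which is only one convenient setting close to the original model (Arora et al., 2016a) and not the weakened assumptions of Chapter 5; it merely illustrates that $Z_c$ varies little across discourse vectors $c$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 100                 # vocabulary size, embedding dimension

# Word vectors drawn i.i.d. from a Gaussian with variance 1/d (illustrative choice).
V = rng.normal(scale=1.0 / np.sqrt(d), size=(n, d))

# Discourse vectors sampled uniformly on the unit sphere.
C = rng.normal(size=(200, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)

# Partition function Z_c = sum_w exp(<v_w, c>) for each discourse vector c.
Z = np.exp(V @ C.T).sum(axis=0)

print("mean of Z_c      :", Z.mean())
print("relative std dev :", Z.std() / Z.mean())   # small value => concentration
```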


Efficient Representations with Distance Geometry

Traditional methods to construct word representations are the output of optimization schemes solved by Stochastic Gradient Descent (for a detailed explanation of this notion, cf. Chapter 2). In Chapter 6, we propose a new word embedding method based on the Distance Geometry Problem (DGP), whose aim is to find object positions based on a subset of their pairwise distances. Considering the empirical Pointwise Mutual Information (PMI) as an inner product approximation, we discuss two algorithms to obtain approximate solutions of the underlying Euclidean DGP on large instances. The resulting algorithms are considerably faster than state-of-the-art algorithms, with similar performance on classification tasks. The main advantage of our approach for practical use is its significantly lower computational complexity, which allows us to train representations much faster with negligible quality loss, a useful property for domain-specific corpora.
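One way to read "the empirical PMI as an inner product approximation": if $\mathrm{PMI}(i,j) \approx \langle x_i, x_j \rangle$, then squared distances follow from the Gram identity $d(i,j)^2 = \mathrm{PMI}(i,i) + \mathrm{PMI}(j,j) - 2\,\mathrm{PMI}(i,j)$, and positions can be recovered up to isometry. The sketch below uses a toy Gram matrix standing in for the empirical PMI matrix and a plain eigendecomposition (classical multidimensional scaling); it is not the Geometric Build-Up or Divide-and-Conquer algorithm discussed in Chapter 6.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10

# Toy stand-in for an empirical PMI matrix: built from hidden ground-truth
# positions so that PMI(i, j) = <x_i, x_j> holds exactly (illustrative assumption).
X_true = rng.normal(size=(n, d))
PMI = X_true @ X_true.T

# Gram identity: d(i, j)^2 = PMI(i, i) + PMI(j, j) - 2 PMI(i, j).
diag = np.diag(PMI)
D2 = diag[:, None] + diag[None, :] - 2.0 * PMI

# Recover positions up to isometry from the top-d eigenpairs (classical
# multidimensional scaling); Chapter 6 instead uses scalable build-up and
# divide-and-conquer algorithms on large, noisy instances.
eigval, eigvec = np.linalg.eigh(PMI)
top = np.argsort(eigval)[::-1][:d]
X_hat = eigvec[:, top] * np.sqrt(np.clip(eigval[top], 0.0, None))

# Sanity check: the recovered embedding reproduces the pairwise squared distances.
D2_hat = ((X_hat[:, None, :] - X_hat[None, :, :]) ** 2).sum(axis=-1)
print("max squared-distance error:", np.abs(D2 - D2_hat).max())
```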

Table of contents:

Résumé
Abstract
Publications
Remerciements
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction 
1.1 A bit of History: Computer science and Language
1.1.1 Formal languages
1.1.2 Formal Language, Natural Language and Artificial Intelligence
1.1.3 “Learning” Machines
1.2 Examples of Related Problems
1.3 Overview of the Contributions
2 Background Notions and Terminology 
2.1 Graph Theory and Combinatorics
2.2 Complexity Theory
2.3 Distance Geometry
2.4 Analysis and Linear Algebra
2.5 Empirical Evaluation
2.6 Other Definitions
2.6.1 Linguistics
2.6.2 Natural language processing
2.7 Datasets
3 Sequence Graphs and Ambiguity in Language Models: a Combinatorial Study
3.1 Introduction
3.2 Definitions and Problem Statement
3.3 Motivations, Summary of the Theoretical Results and Relation with Language Models
3.4 Complexity Results over 2-Sequence Graphs
3.5 General Sequence Graphs
3.5.1 Direct Approach
3.5.2 REALIZABLE_w: Linear Integer Programming Formulation
3.5.3 NUMREALIZATIONS_w: Dynamic Programming Formulation
3.6 Complexity Study
3.6.1 Preliminaries
3.6.2 Main Complexity Results
3.7 Application to Sequential Models
3.7.1 Number of Equivalent Sequences for Weighted Sequence Digraphs
3.7.2 Comparison with a Recurrent Neural Network
3.8 Conclusion
4 Entity Identification 
4.1 Introduction
4.1.1 Basic Concepts and Definitions
4.1.2 Contributions
4.1.3 Another Example of Knowledge Base: Ownership network
4.2 Related Work
4.2.1 Graphs for NEL
4.2.2 Probabilistic Graphical Models
4.2.3 Embeddings and Deep Architectures
4.3 Methodology
4.3.1 Entity Filtering
4.3.2 Graph-based Identification
4.4 Experimental Setup and Evaluation
4.4.1 Datasets, Entity Types and Ontology
4.4.2 Results
4.5 Conclusion
5 Geometry, Words and Representations 
5.1 Introduction
5.1.1 Context and Motivations
5.1.2 Contributions
5.2 Related Work
5.2.1 Word Embeddings
5.2.2 Relation Embeddings for Named Entities
5.2.3 Word Embeddings, Linear Structures and PMI
5.3 Discussion about the Concentration of the Partition Function
5.3.1 Preliminaries
5.3.2 Main Inequality
5.4 Experiments with Existing Representations
5.4.1 Sanity Check
5.4.2 Analogies Protocol
5.4.3 Turning the Protocol into an Algorithm
5.4.4 Supervised Classification
5.5 Parallelism for Analogies with Graph Propagation
5.6 Experiments with New Embeddings
5.6.1 Classification of Analogies
5.6.2 Text classification: comparison using k-NN
5.7 Conclusion
6 Distance Geometry and Representations 
6.1 Introduction
6.2 Distance geometry
6.3 Methodology
6.3.1 Co-occurrences and PMI estimator
6.3.2 Noisy distances and the positive definite assumption
6.3.3 Geometric Build-Up Algorithm
6.3.4 Vertex Ordering Problem
6.3.5 Divide and Conquer Algorithm
6.3.6 Relationship with Other PMI-based Methods
6.4 Complexity Analysis
6.4.1 PMI Eigendecomposition: Arnoldi-Lanczos Algorithm
6.4.2 Complexity Comparison of GB and DC with Other co-occurrence
6.5 Experiments
6.5.1 Description of The Experiments
6.5.2 Discussion
6.6 Conclusion
7 Concluding Remarks 
7.1 Summary of Contributions
7.2 Perspectives and Future Work
8 Appendix 
8.1 Chapter 3: Supplementary Figures and Experiments
8.1.1 p = 10, w = 3
8.1.2 p = 10, w = 4
8.1.3 p = 20, w = 3
8.1.4 p = 20, w = 4
8.1.5 p = 20, w = 5
8.1.6 p = 20, w = 10
8.1.7 p = 40, w = 3
8.1.8 p = 40, w = 10
8.2 Chapter 4: Analysis of a Knowledge Graph
8.2.1 Introduction
8.2.2 Ownership Network
Motivation and Justification of the Methods Used in the Analysis
Preliminary definitions
Influence Analysis
8.2.3 Discussion
8.2.4 Conclusion
8.3 Chapter 6: Other distances?
8.3.1 Distances on Random Variables
8.3.2 A distance candidate: variation of information
Bibliography
