The CS-CC database
Dataset creation. For our experiments, we needed a dataset of scientific papers dense enough that links between results from different papers are likely to exist, yet not so big that our scripts take too much time. It is possible to download bulk archives of papers from arXiv (and other websites), but these are sorted by month, and papers published in the same month are less likely to cite each other. Moreover, arXiv allows downloading both the PDF and the source of the papers it hosts (when available).
Instead, I crawled the export version of the arXiv open archives, and downloaded all PDFs and sources of papers published between 2010 and 2020 under the cs-cc category. cs-cc stands for "Computer Science – Computational Complexity". We chose this category because computational complexity is a domain with a lot of mathematical results, and one close-knit enough that papers will very probably cite each other.
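As an illustration, here is a minimal sketch of such a crawl, assuming the public arXiv export API and its documented URL patterns; the page size, output paths and rate limit are illustrative choices, not the exact script I used.

```python
import os
import time
import urllib.request

import feedparser  # parses the Atom feed returned by the arXiv API

API = "http://export.arxiv.org/api/query?search_query=cat:cs.CC&start={}&max_results=100"

def crawl(pages):
    os.makedirs("pdf", exist_ok=True)
    os.makedirs("src", exist_ok=True)
    for page in range(pages):
        feed = feedparser.parse(API.format(page * 100))
        for entry in feed.entries:
            arxiv_id = entry.id.rsplit("/abs/", 1)[1]  # e.g. "1904.01234v2"
            name = arxiv_id.replace("/", "_")          # old-style ids contain "/"
            urllib.request.urlretrieve(
                f"https://export.arxiv.org/pdf/{arxiv_id}", f"pdf/{name}.pdf")
            urllib.request.urlretrieve(
                f"https://export.arxiv.org/e-print/{arxiv_id}", f"src/{name}")
            time.sleep(3)  # the export mirror asks crawlers to rate-limit
```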
The dataset I gathered contains 6,000 scientific papers. The second step was to obtain the links between the different articles of the dataset. The first methods I used were not very efficient, but they allowed me to familiarize myself with some useful tools.
Extracting citations. I first tried to use the dissem.in API. Given the title of an article, its publication year and the names of its authors, the API returns the DOI of the paper. The idea was to find the DOI of each paper in the dataset, then the DOI of every article in the bibliography of each paper, and finally to match the DOIs. I tried to extract the information required by the API from the LaTeX source code but, as shown in Table 1, more than one fifth of the papers in the dataset have no LaTeX file in their sources (or no sources at all). Only 62% have a file for the bibliography, often a .bbl file, which is not ideal because its format can vary from one paper to another, and only 3% of them have a .bib file.
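A minimal sketch of such a lookup is below, assuming dissemin's public /api/query endpoint; the payload shape and the response field names follow the dissemin documentation as I understand it, and both should be verified before relying on this.

```python
import json
import urllib.request

# Hedged sketch: look up a paper's DOI on dissem.in from its metadata.
# Endpoint and field names are assumptions taken from the dissemin docs.
def lookup_doi(title, year, authors):
    payload = {
        "title": title,
        "date": f"{year}-01-01",
        "authors": [{"first": first, "last": last} for first, last in authors],
    }
    request = urllib.request.Request(
        "https://dissem.in/api/query",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        paper = json.load(response).get("paper", {})
    # DOIs live in the per-record metadata of the returned paper object.
    for record in paper.get("records", []):
        if record.get("doi"):
            return record["doi"]
    return None
```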
Instead of using the source files directly, I decided to extract information from the PDF using grobid [2, 15], a tool based on machine learning that can convert the PDF of a scientific paper into an XML file. The returned XML file contains structured information about the paper, such as header data (title, author names, universities, etc.), body data (sections, figures, equations, references, etc.) and, most importantly, bibliographic data (title, author names and publication year of each cited article). However, the results were still not convincing, since grobid and dissem.in each have their own bugs. Fortunately, the SemanticScholar digital library for scientific literature offers an API which only needs the arXiv id of an article and returns a detailed JSON file containing the arXiv id of every arXiv paper referenced in the bibliography. Thanks to that, I was able to obtain a dense graph of articles and inter-article dependencies.
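A minimal sketch of that step, assuming Semantic Scholar's v1 REST endpoint keyed by arXiv id; the references and arxivId field names are taken from that API's responses and are worth double-checking.

```python
import json
import urllib.request

# Hedged sketch: ask Semantic Scholar for the references of an arXiv paper
# and keep only those that are themselves arXiv papers.
def arxiv_references(arxiv_id):
    url = f"https://api.semanticscholar.org/v1/paper/arXiv:{arxiv_id}"
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    return [ref["arxivId"] for ref in data.get("references", [])
            if ref.get("arxivId")]

# Edges of the citation graph: (citing, cited) pairs inside the corpus.
# corpus = {"1905.00001", ...}
# edges = [(p, q) for p in corpus for q in arxiv_references(p) if q in corpus]
```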
Graph analysis. 2,666 of the 6,000 papers are cited by at least one other paper from the corpus, and 3,190 of them cite another paper of the corpus. Figure 3 shows the distribution of the number of intra-corpus citations per article. This highlights the fact that, despite its small size, this dataset is dense enough for our experiments.
Figure 3: Distributions of the number of citations from (left panel) and to (right panel) other papers of the corpus.
The longest path in this graph has length 19, and the biggest connected component contains 3,167 papers and 8,106 edges between papers. Figure 4 shows a representation of this component, obtained using the NetworkX library in Python.
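A minimal sketch of how these statistics can be computed with NetworkX, assuming the (citing, cited) edge list from above and that the citation graph is acyclic (papers cite earlier papers), which the longest-path computation requires.

```python
import networkx as nx

# Hedged sketch: build the citation graph and compute the statistics above.
# `edges` is assumed to be a list of (citing, cited) arXiv-id pairs.
G = nx.DiGraph(edges)

longest = nx.dag_longest_path(G)  # raises an error if the graph has a cycle
print("longest path length:", len(longest) - 1)

# Largest connected component, ignoring edge direction.
biggest = max(nx.weakly_connected_components(G), key=len)
component = G.subgraph(biggest)
print("papers:", component.number_of_nodes(),
      "edges:", component.number_of_edges())
```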
Figure 4: Representation of the graph of dependencies between papers of the corpus. The x-coordinate of a point represents its publication date (between 2009 and 2020) and the y-coordinate represents its number of citations from papers in the corpus.
Theorem Matching
This section is dedicated to my first attempt at linking results from different papers. The idea was to detect pairs of theorems that are similar. For example, it would detect that a theorem in some paper is a restatement of a theorem in one of the papers referenced in the bibliography.
The theorem dataset
For each paper in the cs-cc dataset, I extracted the set of mathematical results and their statements from the LaTeX source files. To do so, I simply searched for each \begin{theorem} ... \end{theorem} pair using the TexSoup library in Python, as sketched below. For each result, a regexp detects whether some paper is cited inside the statement, and we check if the cited paper is in the cs-cc dataset. If a citation to another paper is detected, we use a classifier to find which particular result of the cited paper is associated with this one. The different classifiers we tried are described in the next sections. Note that we only consider citations inside the statement, not those in the proofs. Indeed, in the first case it is more likely that the result is a restatement of a result from the cited paper, whereas in the second case the result probably only relies on a result of the cited paper.
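A minimal sketch of this extraction step, assuming TexSoup's find_all interface and a simple \cite regexp; the environment name theorem is the common case, but real papers also use variants such as thm, which would need extra passes.

```python
import re

from TexSoup import TexSoup

CITE_RE = re.compile(r"\\cite[tp]?\{([^}]*)\}")  # matches \cite, \citet, \citep

# Hedged sketch: pull every theorem environment out of a LaTeX source
# and record the citation keys appearing inside each statement.
def extract_theorems(tex_source):
    soup = TexSoup(tex_source)
    results = []
    for env in soup.find_all("theorem"):
        statement = str(env)
        keys = [key.strip() for match in CITE_RE.findall(statement)
                for key in match.split(",")]
        results.append((statement, keys))
    return results
```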
A total of 41,167 results were successfully extracted from the LaTeX files, and 308 theorems reference another paper from the corpus (there might be duplicates). To assess the efficiency of the different methods, I selected a subset of 39 theorems in which the precise result of the cited paper we want to find is already mentioned in the header (e.g. "Theorem 3 from [4]").
The remainder of this section is dedicated to the different methods I tried for detecting similarities between theorems. The accuracy obtained with each method is reported in Table 2.
TF-IDF Vectorizer
The first model I used is based on vectorizing sentences using the frequencies of the words they contain. I tried several vectorizers, but the most efficient one was the TF-IDF vectorizer.
TF-IDF stands for Term Frequency–Inverse Document Frequency. Given a set of documents $D$, we associate to each pair of a document and a term $(d_i, t_j)$ a weight $\mathrm{tfidf}_{i,j}$. This weight is the product of two numbers:
1. Term frequency: $\mathrm{tf}_{i,j}$ is the number of occurrences of the term $t_j$ in the document $d_i$.
2. Inverse Document Frequency: the $\mathrm{idf}_j$ of the term $t_j$ is a function of the inverse of the proportion of documents containing this precise term. For instance, it can be its logarithm:
\[
\mathrm{idf}_j = \log \frac{|D|}{\bigl|\{\, d_i \mid t_j \in d_i \,\}\bigr|}
\]
Finally, we can compute the $\mathrm{tfidf}$ weight for the pair $(d_i, t_j)$:
\[
\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_j
\]
In our case, the documents are theorems taken from a set of papers containing the citing paper and the cited paper. We denote this set of theorems $T$. The terms are simply the words appearing in these theorems. We can associate to each result a vector $v(t_i) = (\mathrm{tfidf}_{i,1}, \ldots, \mathrm{tfidf}_{i,m})$. I used the TfidfVectorizer class from the sklearn library, with English stopwords, which directly returns the vector of every theorem given the set of theorems.
If we want to find the result in a paper $P_2 \subseteq T$ which is the most similar to a result $t_1$ in a paper $P_1 \subseteq T$, we simply take
\[
t_{\max} = \operatorname*{arg\,max}_{t_2 \in P_2} \, v(t_1)^{\top} v(t_2)
\]
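A minimal sketch of this matching with scikit-learn, assuming the theorem statements of both papers are available as strings; the helper name and variables are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hedged sketch: vectorize the theorems of the citing and cited papers
# together, then pick the theorem of P2 closest to the query theorem t1.
def best_match(query_theorem, paper2_theorems):
    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform([query_theorem] + paper2_theorems)
    # Dot products between v(t1) and each v(t2), as in the argmax above.
    scores = linear_kernel(vectors[0], vectors[1:]).ravel()
    return paper2_theorems[scores.argmax()], scores.max()
```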
This method works well when the two theorems are almost identical. However, it becomes complicated when there are many theorems in the paper $P_2$, since theorems of the same paper are very likely to contain almost the same terms. I also tried to use n-grams instead of single words, but the results were worse.
Autoencoder
I also tried to create my own theorem vectorizer using a variational autoencoder, inspired by [24, 6] and more generally by the autoencoders seen in class and in [20].
First, I needed to choose embeddings for the words. Given that all the theorems share a scientific vocabulary, I tried to train embeddings using the CBOW method [17]. However, this did not work well, maybe because the dataset of theorems was too small. Instead, I decided to use pretrained GloVe embeddings [18] with 50 dimensions, trained on 6 billion tokens.
In our autoencoder, the decoder receives a fraction of the statement together with the output of the encoder, and it must guess the masked words of the sentence. The fraction of words that is masked is called the word dropout. For instance, with a word dropout of 0.3 we obtain the loss shown in Figure 5. Both the encoder and the decoder use a GRU recurrent neural network. The best results I obtained using this method are not really convincing (see Table 2), and much worse than those obtained using TF-IDF.
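A minimal sketch of such a sequence autoencoder with word dropout, assuming PyTorch and a pretrained GloVe matrix glove of shape (vocabulary size, 50); the variational part (sampling a latent code) is omitted for brevity, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TheoremAutoencoder(nn.Module):
    # Hedged sketch: GRU encoder/decoder over GloVe embeddings. `glove` is
    # assumed to be a (vocab_size, 50) float tensor of pretrained vectors.
    def __init__(self, glove, hidden=256, word_dropout=0.3, unk_idx=0):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove, freeze=True)
        self.encoder = nn.GRU(50, hidden, batch_first=True)
        self.decoder = nn.GRU(50, hidden, batch_first=True)
        self.out = nn.Linear(hidden, glove.size(0))
        self.word_dropout = word_dropout
        self.unk_idx = unk_idx

    def forward(self, tokens):                       # tokens: (batch, seq)
        _, state = self.encoder(self.embed(tokens))  # state = theorem vector
        corrupted = tokens.clone()
        if self.training:                            # mask a fraction of words
            mask = torch.rand_like(tokens, dtype=torch.float) < self.word_dropout
            corrupted[mask] = self.unk_idx
        output, _ = self.decoder(self.embed(corrupted), state)
        return self.out(output)                      # logits over the vocabulary
```

Training would minimize the cross-entropy between these logits and the original tokens; the encoder state then serves as the theorem embedding used for matching.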
I did not investigate this approach further, because the dataset was too small and it was hard to train a classifier in an unsupervised fashion.
Table of contents
1 Introduction
2 Related work
2.1 Information extraction from scientific publications
2.2 Scientific digital libraries and open archives
2.3 References extraction and analysis
2.4 Knowledge bases of theorems
3 The CS-CC database
4 Theorem Matching
4.1 The theorem dataset
4.2 TF-IDF Vectorizer
4.3 Autoencoder
5 The graph extraction algorithm
5.1 From PDF to XML
5.2 Extracting results
5.3 Associate tags and papers
5.4 Results on the CS-CC dataset
6 The arXiv Database
6.1 The dataset
6.2 Quantitative Analysis
6.3 Qualitative Analysis
7 Future work
7.1 Improve the extraction algorithm
7.2 Crowdsourcing