Cross-lingual Similarity Measures
A document similarity measure is a function that quantifies the similarity between two text documents [Manning and Raghavan, 2008]. This function assigns a numeric score to the pair of documents. In data mining and machine learning, the Euclidean or Manhattan metrics are usually used to measure the similarity of two objects, while in text mining and information retrieval, cosine similarity is commonly used to measure the similarity of two text documents. Cosine similarity is computed as follows:

cosine(d_s, d_t) = \frac{d_s \cdot d_t}{\lVert d_s \rVert \, \lVert d_t \rVert} = \frac{\sum_{i=1}^{n} d_{s_i} d_{t_i}}{\sqrt{\sum_{i=1}^{n} d_{s_i}^2} \, \sqrt{\sum_{i=1}^{n} d_{t_i}^2}}    (2.1)

where d_s and d_t are document vectors. To generate a document vector, the text document is transformed into a Bag Of Words (BOW), i.e., the document is treated as an unstructured set of words. Representing the documents of a collection as BOWs is called the Vector Space Model (VSM) or term-document matrix. Usually, the most important words are kept and less important words can be ignored. For instance, words from a pre-defined list of stop words (the, to, from, . . . ) are usually removed since they carry no meaning in the unstructured BOWs. In addition, stemming or lemmatization is usually applied to reduce words to their base form (root or stem). This addresses the problem of word variability when estimating the similarity between documents. The number of dimensions in the VSM is equal to the number of unique terms in the document collection. For a large document collection, the term-document matrix becomes sparse.
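As an illustration, the sketch below computes Equation (2.1) on two toy sentences after stop-word removal. The stop-word list, the tokenizer, and the example texts are illustrative assumptions, not material from any corpus used in this work; in practice, term weights such as TF-IDF are often used instead of raw counts, but the formula is the same.

```python
import math
from collections import Counter

# Illustrative stop-word list (assumption, not the list used in the thesis).
STOP_WORDS = {"the", "to", "from", "a", "an", "of", "is", "on"}

def bow_vector(text):
    """Tokenize, drop stop words, and count term frequencies (bag of words)."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(tokens)

def cosine(ds, dt):
    """Cosine similarity between two sparse term-frequency vectors, Eq. (2.1)."""
    shared = set(ds) & set(dt)
    dot = sum(ds[t] * dt[t] for t in shared)
    norm_s = math.sqrt(sum(v * v for v in ds.values()))
    norm_t = math.sqrt(sum(v * v for v in dt.values()))
    if norm_s == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_s * norm_t)

ds = bow_vector("the cat sat on the mat")
dt = bow_vector("a cat lay on a mat")
print(cosine(ds, dt))  # a score in [0, 1]; higher means more similar
```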
CL-LSI Methods
As introduced earlier, document similarity can be estimated at the term level or on semantically related terms. Generally, semantic similarity is a metric that quantifies the likeness (similarity) of documents or terms based on the meaning or semantics of their contents. Semantic similarity can be measured based on a pre-defined ontology, which specifies the distance between concepts, or it can be measured using statistical methods, which correlate terms and contexts from a text corpus. One of the methods in the latter category is Latent Semantic Indexing (LSI).
LSI is also referred to in the literature as Latent Semantic Analysis (LSA). LSI is designed to solve the problems of an ordinary IR system, which depends on lexical matching [Dumais, 2007]. For instance, irrelevant information can be retrieved because the same literal words may have different meanings; conversely, relevant information can be missed because there are different ways to describe the same object. The VSM of an ordinary IR system is sparse (most of its elements are zero), while the LSI space is dense. In the sparse VSM, terms and documents are loosely coupled, while terms and documents in the LSI space are coupled (correlated) with certain weights. The VSM is usually a high-dimensional space for a large document collection, while LSI is a reduced space: the number of dimensions in LSI is lower than the number of unique terms.
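As a minimal sketch of this reduction, the snippet below builds a sparse VSM with TF-IDF weights and projects it into a low-dimensional LSI space with truncated SVD using scikit-learn. The toy documents, the TF-IDF weighting, and the choice of two dimensions are illustrative assumptions, not the settings used in this work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy document collection; in practice this would be the full corpus.
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a bird sings in the tree",
]

# Sparse term-document representation (VSM).
vsm = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Project into a dense, low-dimensional LSI space via truncated SVD.
# The number of LSI dimensions (here 2) is far below the vocabulary size.
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(vsm)

# In the LSI space, the two vehicle documents end up close together
# even though they share few literal terms.
print(cosine_similarity(lsi[:1], lsi[1:]))
```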
Emotion Identification
Emotion identification is the automatic detection of emotions that are expressed in a text. It is useful for many applications such as market analysis, text-to-speech synthesis, and human-computer interaction [Pang and Lee, 2008].
The six basic human emotions, reported in a psychological study by [Ekman, 1992], are widely adopted in emotion identification [Pang and Lee, 2008]. These emotions are anger, disgust, fear, joy, sadness, and surprise. Various studies have addressed emotion identification in texts. For example, the work of [Zhe and Boucouvalas, 2002] introduces a text-to-emotions engine that generates expressive images of a user's emotions in chat conversations. The authors report that the system can be useful for real-time gaming and real-time communication systems where transmitting video is restricted due to the low bandwidth of the connection.
Wikipedia Comparable Corpus
Wikipedia is an open encyclopedia written by contributors in several languages. Anyone can edit and write Wikipedia articles; therefore, articles are usually written by different authors. Some Wikipedia articles in some languages are translations of the corresponding English versions, while others are written independently. Wikipedia provides a free copy of all available contents of the encyclopedia (articles, revisions, discussions of contributors). These copies are called dumps. Because Wikipedia contents change over time, the dumps are provided regularly every month. Wikipedia dumps can be downloaded in XML format. Our Wikipedia corpus is extracted by parsing the Wikipedia dumps of December 2011, which are composed of 4M English, 1M French, and 200K Arabic articles.
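As a rough sketch of how such a dump can be parsed, the snippet below streams article titles and wikitext from a compressed XML dump using the Python standard library. The file name is hypothetical, and the XML namespace prefix varies between dump versions, so tags are matched on their local names only; this is an assumption-laden illustration, not the extraction pipeline used for our corpus.

```python
import bz2
import xml.etree.ElementTree as ET

DUMP_PATH = "arwiki-latest-pages-articles.xml.bz2"  # hypothetical file name

def local(tag):
    """Strip the XML namespace, keeping only the local tag name."""
    return tag.rsplit("}", 1)[-1]

with bz2.open(DUMP_PATH, "rb") as f:
    title, text = None, None
    for event, elem in ET.iterparse(f, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text
        elif name == "page":
            if title and text:
                print(title, len(text))  # here: store the article instead
            title, text = None, None
            elem.clear()  # free memory; the dumps are several gigabytes
```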
The English Wikipedia started in 2001 with 2.7K articles, the French Wikipedia started in the same year with 895 articles, and the Arabic Wikipedia started in 2003 with 655 articles. Table 3.4 shows the rank of the English, French, and Arabic Wikipedias according to the number of articles [Wikimedia, 2014]. To put these ranks in perspective, we need to compare the number of speakers and articles in each language. By August 2014, 335M English speakers had added 4.7M articles, 456M French speakers had added 1.5M articles, and 422M Arabic speakers had added 315K articles. Thus, Arabic speakers have added very few articles compared to speakers of the other two languages. Despite that, the growth rate of Arabic articles is the highest compared to English and French [Wikimedia, 2014].
Table of contents :
1 Introduction
1.1 Motivation
1.2 Overview
2 Related Work
2.1 Comparable Corpora
2.2 Cross-lingual Similarity Measures
2.2.1 Dictionary-based Methods
2.2.2 CL-IR Methods
2.2.3 CL-LSI Methods
2.3 Sentiment Analysis
2.4 Emotion Identification
3 Collected and Used Corpora
3.1 Introduction
3.2 Arabic Language
3.3 Comparable Corpora
3.3.1 Wikipedia Comparable Corpus
3.3.2 Euronews Comparable Corpus
3.4 Parallel Corpora
4 Cross-lingual Similarity Measures
4.1 Introduction
4.2 Cross-lingual Similarity Using Bilingual Dictionary
4.2.1 Results
4.2.2 Conclusion
4.3 Cross-lingual Similarity Using CL-LSI
4.3.1 Experiment Procedure
4.3.2 Results
4.3.3 Conclusion
4.4 Aligning Comparable Documents Collected From Different Sources
4.4.1 The Proposed Method
4.4.2 Results
4.5 Conclusion
5 Cross-lingual Sentiment Annotation
5.1 Introduction
5.2 Cross-lingual Sentiment Annotation Method
5.3 Experimental Setup
5.4 Statistical Language Models of Opinionated Texts
5.4.1 Opinionated Language Models
5.4.2 Testing Language Models across domains
5.4.3 Testing Language Models on Comparable Corpora
5.5 Conclusion
6 Comparing Sentiments and Emotions in Comparable Documents
6.1 Introduction
6.2 Agreement Measures
6.3 Comparing Sentiments in Comparable Documents
6.4 Comparing Emotions in Comparable Documents
6.5 Conclusion
7 Conclusion and Future Work
A Samples of Comparable Documents
A.1 Samples from Wikipedia Corpus
A.1.1 Wikipedia English Article
A.1.2 Wikipedia French Article
A.1.3 Wikipedia Arabic Article
A.2 Samples from Euro-news Corpus
A.2.1 Euro-news English Article
A.2.2 Euro-news French Article
A.2.3 Euro-news Arabic Article
A.2.4 English Translation of Euro-news Arabic Article
A.3 Samples from BBC-JSC corpus
A.3.1 BBC-JSC English Article
A.3.2 BBC-JSC Arabic Article
A.3.3 English Translation of BBC-JSC Arabic Article