Identity Management Services
Identity management services share the common goal of helping users or applications identify IRIs that refer to the same real-world entity, and distinguish similar labels that refer to different real-world entities. For instance, to avoid using a resource that refers to the Niger river when intending to use one that refers to the country Niger, one could rely on such services to re-use an existing universal identifier that unambiguously refers to a certain real-world entity (e.g. the Niger river). This type of service takes a more centralized view of identity management in the Web of Data, in which each real-world entity is referenced by a single centralized IRI. On the other hand, one can make use of other types of identity management services to find all identifiers referring to the Niger river, and discover additional descriptions.
Such services can play an important role in enabling large-scale identity analysis in the Web, implementing and optimising linked data queries in the presence of co-reference [Schlegel et al., 2014], and detecting erroneous identity assertions [de Melo, 2013, Cuzzola et al., 2015, Valdestilhas et al., 2017].
In the early days of the Web, resource identifiers were originally conceived as falling into two classes: locators (URLs), which identify resources by their locations in the context of a particular access protocol such as HTTP or FTP, and names (URNs). URNs [Mealling and Daniel, 1999] were supposed to be the standard for assigning location-independent, globally unique, and persistent identifiers to arbitrary subjects. Each identifier has a defined namespace that is registered with the Internet Assigned Numbers Authority (IANA). For instance, ‘ISBN’ is a registered namespace that unambiguously identifies any edition of a text-based monographic publication that is available to the public, and urn:isbn:0451450523 is a URN that identifies the book “The Last Unicorn” using the ISBN namespace. Because of the lack of a well-defined resolution mechanism, and the organizational hurdle of requiring registration with IANA, URNs are hardly used (a total of 47K URNs in the 2015 copy of the LOD, with only 73 URN namespaces registered with IANA at the time of writing). Since 2005, the use of the terms URN and URL has been deprecated in technical standards in favour of the term Uniform Resource Identifier (URI), which encompasses both, and the term Internationalized Resource Identifier (IRI), which extends the URI character set, itself limited to ASCII.
A more recent centrally managed naming service was proposed by [Bouquet et al., 2007]. This public entity name service (ENS), named Okkam, intends to establish a global digital space for publishing and managing information about entities. Every entity is uniquely identified with an unambiguous universal URI known as an OKKAM ID, with the idea of encouraging people to reuse these identifiers instead of creating new ones. Each OKKAM ID is matched to a set of existing identifiers (e.g. DBpedia and Wikidata IRIs), using several data linking algorithms that are available in the public entity name service hosted at http://okkam.org. For instance, the company ‘Apple’ has a profile with an OKKAM ID, which is linked to other non-centrally managed IDs (e.g. dbpedia/resource/Apple Inc). For each OKKAM entity, a set of attributes is collected and stored in the service for the purpose of finding entities and distinguishing them from one another. However, the public entity name service is no longer maintained, and no information is available on the number of existing entities and links, or on the covered datasets.
Finally, [Glaser et al., 2009] introduced the Consistent Reference Service (CRS), which finds, for a given IRI, the list of identifiers that belong to the same identity bundle. These identity bundles are the result of the transitive closure of a mix of identity and similarity relationships (such as owl:sameAs, umbel:isLike, skos:closeMatch, and vocab:similarTo). This service is based on 346M triples harvested from multiple RDF dumps and SPARQL endpoints, and is hosted at http://sameas.org. This large collection of triples, which links over 203M IRIs and results in 62.6M identity bundles, has been the basis for many subsequent approaches that aim to detect erroneous identity links (e.g. [de Melo, 2013, Cuzzola et al., 2015, Valdestilhas et al., 2017]).
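The construction of identity bundles as a transitive closure over mixed predicates can be sketched with a union-find structure. This is only a minimal illustration under stated assumptions, not the sameAs.org implementation: the function name `identity_bundles`, the `ex:` IRIs, and the representation of triples as Python tuples with prefixed predicate strings are all hypothetical.

```python
from collections import defaultdict

# Predicates treated as bundle-forming links (the ones listed in the text above).
BUNDLE_PREDICATES = {"owl:sameAs", "umbel:isLike", "skos:closeMatch", "vocab:similarTo"}


def identity_bundles(triples):
    """Group IRIs into bundles, i.e. connected components over bundle-forming links."""
    parent = {}

    def find(x):
        # Return the representative of x, registering x on first sight.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for subject, predicate, obj in triples:
        if predicate in BUNDLE_PREDICATES:
            parent[find(subject)] = find(obj)  # merge the two components

    bundles = defaultdict(set)
    for iri in list(parent):
        bundles[find(iri)].add(iri)
    return list(bundles.values())


# Hypothetical example data.
triples = [
    ("ex:a", "owl:sameAs", "ex:b"),
    ("ex:b", "skos:closeMatch", "ex:c"),
    ("ex:d", "owl:sameAs", "ex:e"),
    ("ex:a", "rdfs:seeAlso", "ex:d"),  # not a bundle-forming predicate, ignored
]
bundles = identity_bundles(triples)
```

In this sketch, ex:a and ex:c end up in the same bundle although they are only connected through a skos:closeMatch link, and the predicate that justified each merge is not retained.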
Discussion
Identity management services play an important role in facilitating the understanding and re-use of IRIs. However, we believe that centralized naming authorities such as OKKAM, although they might be adopted within some dedicated domains and applications, will be of limited use in the context of the Web. As acknowledged by its authors [Bouquet et al., 2007], encouraging people to adopt and accept such Entity Naming Systems would be challenging, as the idea of having to go through an authority in order to use a new name somewhat goes against the philosophy of the ad-hoc and scale-free nature of the Web, where “anybody is able to say anything about anything”. In addition, such systems can only be truly successful once they provide sufficient added value over the use of non-centrally managed identifiers, specifically by providing efficient and high-quality search results and by offering high coverage of real-world entities. Finally, centralizing all names into one system would raise many privacy and security concerns, at a time when the paradigm is shifting towards more decentralization of the Web [Verborgh et al., 2017].
The Consistent Reference Service proposed by [Glaser et al., 2009] is more widely adopted in Linked Data applications [de Melo, 2013, Cuzzola et al., 2015, Valdestilhas et al., 2017]. However, in its current architecture and status, it faces some limitations. Firstly, identity bundles in the sameAs.org service are the result of the transitive closure of a mix of identity and similarity relationships (such as umbel:isLike and skos:exactMatch). The system does not keep the original predicates, meaning that a user cannot tell whether two terms in the same bundle are actually the same, similar, or just closely related (e.g. skos:closeMatch). The presence of several identity and similarity relations with different semantics means that the overall closure is not semantically interpretable (e.g. it cannot be used by a DL reasoner for inferring new facts). In addition, since no service can guarantee the coverage of all the triples in the Web of Data, one way of ensuring better transparency would be to list the exploited data sources. This would allow users to evaluate the pertinence of this data in their applications and contexts. The Consistent Reference Service does not provide such information.
Detection of Erroneous Identity Links
An important aspect of managing identity in the Web of Data is the detection of incorrectly asserted identity links. In order to detect such erroneous links, different kinds of information may be exploited: RDF triples related to the linked resources, domain knowledge that is described in the ontology or obtained from experts, or owl:sameAs network metrics. In this section, we present existing approaches that detect erroneous identity links, grouped into three (possibly overlapping) categories: inconsistency-based (2.3.2), content-based (2.3.3), and network-based approaches (2.3.4). Table 2.1 provides a summary of these approaches, stating their characteristics, requirements, and the data on which the experiments were conducted.
Evaluation Measures
An approach for detecting erroneous links can be evaluated using the classic evaluation measures of precision, recall, and accuracy. In Table 2.1 we present these measures as reported in each paper, when available. For the problem of erroneous link detection, these evaluation measures can be defined as follows:
Precision. The number of links classified by the approach as incorrect that are indeed incorrect identity links (True Positives), over the total number of links classified as incorrect by the approach (True Positives + False Positives).
Recall. The number of links classified by the approach as incorrect that are indeed incorrect identity links (True Positives), over the total number of incorrect identity links existing in the dataset (True Positives + False Negatives).
Accuracy. The number of links correctly classified by the approach, i.e. the links classified as incorrect that are indeed incorrect (True Positives) plus the links validated as correct that are indeed correct (True Negatives), over the total number of classified links, i.e. the links classified as incorrect (True Positives + False Positives) plus the links validated as correct (True Negatives + False Negatives).
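The three measures can be computed directly from the four confusion counts, where a "positive" is a link classified as incorrect. The following is a minimal sketch; the function name `link_detection_metrics` and the counts in the example are hypothetical and do not come from any of the surveyed papers.

```python
def link_detection_metrics(tp, fp, tn, fn):
    """Return (precision, recall, accuracy) for erroneous-link detection.

    tp: links flagged as incorrect that are indeed incorrect
    fp: links flagged as incorrect that are actually correct
    tn: links validated as correct that are indeed correct
    fn: links validated as correct that are actually incorrect
    """
    flagged = tp + fp
    actually_incorrect = tp + fn
    total = tp + fp + tn + fn
    # Guard against empty denominators (e.g. an approach that flags nothing).
    precision = tp / flagged if flagged else 0.0
    recall = tp / actually_incorrect if actually_incorrect else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical counts, for illustration only.
precision, recall, accuracy = link_detection_metrics(tp=40, fp=10, tn=45, fn=5)
```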
Inconsistency-based Detection Approaches
These approaches hypothesize that owl:sameAs links that lead to logical inconsistencies are more likely to be erroneous than logically consistent owl:sameAs links.
Conflicting owl:sameAs and owl:differentFrom
The first approach for detecting erroneous identity assertions in the Web of Data was introduced by [Cudré-Mauroux et al., 2009], who presented idMesh: a probabilistic and decentralized framework for entity disambiguation. This approach hypothesizes that owl:sameAs and owl:differentFrom links published by trusted sources are more likely to be correct than links published by untrustworthy ones. For initialising the sources’ trust values, the approach relies on reputation-based trust mechanisms from P2P networks, on trust metrics from online communities, or on the used domains (e.g. closed domains such as [...]). When no such information is available, the source is initialized with a default trust value of 0.5. The approach detects conflicting owl:sameAs and owl:differentFrom statements based on a graph-based constraint satisfaction problem that exploits the symmetry and transitivity of owl:sameAs. The detected conflicts are resolved based on the iteratively refined trustworthiness of the sources declaring the statements (i.e. creating an autocatalytic process where constraint satisfaction helps discover untrustworthy sources, and where trust management delivers in return more reasonable prior values for the links). The approach shows high accuracy (75 to 90%) in discovering the equivalence and non-equivalence relations between entities, even when 90% of the sources are actually spammers feeding erroneous information. However, this type of approach requires the presence of a large number of owl:differentFrom statements, which is not the case in the Web of Data. In addition, the scalability evaluation, conducted only on synthetic data, reaches a maximum scale of 8,000 entities and 24,000 links over 400 machines, and focuses solely on network traffic and message exchange rather than on runtime. The precision and recall are not reported.
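The kind of conflict that such an approach looks for can be illustrated with a deliberately simplified sketch: an owl:differentFrom statement conflicts with the owl:sameAs statements whenever its two resources are connected through the symmetric and transitive closure of owl:sameAs. This is not idMesh itself, which formulates the problem probabilistically and weights statements by source trust; the function name `find_conflicts` and the `ex:` IRIs are hypothetical.

```python
def find_conflicts(same_as, different_from):
    """Return the differentFrom pairs whose members are also linked via owl:sameAs.

    same_as and different_from are iterables of (iri, iri) pairs.
    """
    parent = {}

    def find(x):
        # Representative of x in the sameAs closure; registers unseen IRIs.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Symmetry and transitivity of owl:sameAs: merge connected resources.
    for a, b in same_as:
        parent[find(a)] = find(b)

    # A conflict is a pair asserted different but found in the same component.
    return [(a, b) for a, b in different_from if find(a) == find(b)]


# Hypothetical statements, for illustration only.
same_as = [
    ("ex:NigerRiver", "ex:NigerRiver2"),
    ("ex:NigerRiver2", "ex:NigerCountry"),  # a presumably erroneous link
]
different_from = [
    ("ex:NigerRiver", "ex:NigerCountry"),
    ("ex:NigerRiver", "ex:Nile"),
]
conflicts = find_conflicts(same_as, different_from)
```

The sketch only reports that a conflict exists; deciding which of the conflicting statements to discard is where the trust values of the sources come into play.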
Ontology Axioms Violation
[Hogan et al., 2012] introduced a scalable entity disambiguation approach based on detecting inconsistencies in the equality sets that result from the owl:sameAs equivalence closure. This approach detects inconsistent equality sets by exploiting ten OWL 2 RL/RDF rules expressing the semantics of axioms such as differentFrom, AsymmetricProperty, and complementOf. When resources causing inconsistencies are detected, they are separated into different seed equivalence classes. The approach then assigns each remaining resource to one of the seed equivalence classes based on its minimum distance in the non-transitive equivalence class or, in case of a tie, on a concurrence score that is based on the pairs’ shared inter- and intra-links. The authors evaluated their approach on a set of 3.7M unique owl:sameAs triples derived from a corpus of 947M unique triples, crawled from 3.9M RDF/XML web documents in 2010. From the resulting 2.8M equivalence classes, the approach detects only three types of inconsistencies in a total of 280 classes: 185 inconsistencies through disjoint classes, 94 through distinct literal values for inverse-functional properties, and one through owl:differentFrom assertions. On average, repairing an equivalence class requires splitting it into 3.23 consistent partitions. After manually evaluating 503 pairs randomly chosen from the 280 inconsistent classes, the results show that 85% of the pairs that were separated from the same equivalence class are indeed different (i.e. precision), leading to the separation of 40% of the pairs evaluated as wrong by the judges (i.e. recall). This result shows that consistency does not imply correctness, since 60% of the pairs evaluated as different still belong to the same (now consistent) equivalence classes. It also suggests that the actual recall could be much lower than 40%, as the approach cannot detect different pairs in the other 2.8M consistent equivalence classes. The total runtime of this approach is 2.35 hours.
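As an illustration of the most frequent inconsistency type reported above (disjoint classes), the following sketch flags equivalence classes whose members are typed with classes that are declared disjoint. It covers only this detection step, under simplifying assumptions: it does not implement the OWL 2 RL/RDF rules or the repair procedure of [Hogan et al., 2012], and the function name `inconsistent_classes`, the input data structures, and the `ex:` IRIs are hypothetical.

```python
from itertools import combinations


def inconsistent_classes(equivalence_classes, types, disjoint):
    """Flag equivalence classes that mix instances of disjoint classes.

    equivalence_classes: list of sets of IRIs (the owl:sameAs closure)
    types: dict mapping an IRI to the set of classes it is an instance of
    disjoint: iterable of (class, class) pairs declared disjoint
    """
    disjoint_pairs = {frozenset(pair) for pair in disjoint}
    flagged = []
    for eq_class in equivalence_classes:
        # All (resource, class) memberships inside this equivalence class.
        typed = [(iri, cls) for iri in sorted(eq_class)
                 for cls in sorted(types.get(iri, ()))]
        clashes = [
            (iri1, cls1, iri2, cls2)
            for (iri1, cls1), (iri2, cls2) in combinations(typed, 2)
            if frozenset((cls1, cls2)) in disjoint_pairs
        ]
        if clashes:
            flagged.append((eq_class, clashes))
    return flagged


# Hypothetical data, for illustration only.
classes = [
    {"ex:niger_river", "ex:niger_country"},
    {"ex:apple_inc", "ex:apple_company"},
]
types = {
    "ex:niger_river": {"ex:River"},
    "ex:niger_country": {"ex:Country"},
    "ex:apple_inc": {"ex:Company"},
    "ex:apple_company": {"ex:Company"},
}
disjoint = [("ex:River", "ex:Country")]
flagged = inconsistent_classes(classes, types, disjoint)
```

An equivalence class that passes this check is merely consistent, which, as the evaluation above shows, does not mean that all of its identity links are correct.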
Table of contents:
1 Introduction
1.1 Objectives & Contributions
1.2 Thesis Outline
2 State of the Art
2.1 Identity Analysis
2.2 Identity Management Services
2.3 Detection of Erroneous Identity Links
2.3.1 Evaluation Measures
2.3.2 Inconsistency-based Detection Approaches
2.3.3 Content-based Approaches
2.3.4 Network-based Approaches
2.4 Alternative Identity Links
2.4.1 Weak-Identity and Similarity Predicates
2.4.2 Contextual Identity
2.5 Conclusion
3 Identity Analysis and Management Service
3.1 Approach
3.1.1 Explicit Identity Network: Extraction
3.1.2 Explicit Identity Network: Compaction
3.1.3 Implicit Identity Network: Closure
3.2 Implementation & Experiments
3.2.1 Data Graph
3.2.2 Explicit Identity Network: Extraction
3.2.3 Explicit Identity Network: Compaction
3.2.4 Implicit Identity Network: Closure
3.3 Data analytics
3.3.1 Explicit Identity Network Analysis
3.3.2 Implicit Identity Network Analysis
3.3.3 Schema Assertions About Identity
3.4 Dataset & Web Service
3.4.1 Dataset
3.4.2 Web Service
3.5 Conclusion
4 Erroneous Identity Link Detection
4.1 Community Structure
4.1.1 Overview
4.1.2 Graph Partitioning Algorithms
4.1.3 Louvain Algorithm
4.2 Approach
4.2.1 Identity Network Construction
4.2.2 Links Ranking
4.3 Implementation & Experiments
4.3.1 Data Graph
4.3.2 Explicit Identity Network Extraction
4.3.3 Identity Network Construction
4.3.4 Graph Partitioning
4.3.5 Links Ranking
4.4 Analysis & Evaluation
4.4.1 Community Structure Analysis
4.4.2 Links Ranking Evaluation
4.5 Conclusion
5 Contextual Identity Relation
5.1 Contextual Identity Definition
5.1.1 RDF Knowledge Graph
5.1.2 Problem statement
5.1.3 Identity Contexts
5.1.4 Contextual Identity
5.2 Detection of Contextual Identity Links
5.2.1 Experts Knowledge
5.2.2 Contextual Identity in RDF
5.2.3 DECIDE – Algorithm for Detecting Contextual Identity
5.2.4 Contextual Identity Links Examples
5.3 Conclusion
6 Contextual Identity for Life Sciences Knowledge Graphs
6.1 Five Star Knowledge Graph for Life Sciences
6.1.1 Application Domain
6.1.2 Conceptual Model
6.1.3 Knowledge Graph Construction
6.2 Detection of Contextual Identity in Scientific Experiments
6.2.1 DECIDE Results
6.2.2 Use of Experts Constraints
6.3 Contextual Identity Links for Rule Detection
6.4 Results Summary
6.5 Conclusion
7 Conclusion & Perspectives
7.1 Summary of Results
7.2 Discussion and Future Work
A Résumé en Français
A.1 Introduction
A.2 Etat de l’art
A.3 Service de gestion et d’analyse d’identité
A.4 Méthode de détection des liens d’identité erronés
A.5 Relation d’identité contextuelle
A.6 Graphes de connaissance pour les sciences de la vie