Get Complete Project Material File(s) Now! »
Peer-to-Peer (P2P) Networks
The continuous and widespread growth of Internet-based applications in terms of number of users and computational resources, has challenged the centralised nature of the client- server paradigm. One of the most relevant requirements of such applications is scalability, which is dened in [Neu94] as follows.
A system is said to be scalable if it can handle the addition of users and resources without suering a noticeable loss of performance or increase in administrative complexity. Services are maintained on a central entity composed of one or more servers. It is widely accepted that this paradigm cannot accomplish the denition of scalability without an enormous amount of eort in terms of resources. Besides, its centralised organisation is prone to resource bottlenecks.
The Peer-to-Peer (P2P) network and computing paradigm aim to decentralise services. It proposes a dierent model composed of autonomous entities called peers or nodes that share computational resources (CPU, storage, and bandwidth) in order to achieve a common task. Figure 2.1b presents its general model. Nodes are connected to a set of neighbours through internet links in a logical network called peer-to-peer overlay network or simply peer-to-peer network. The term Peer-to-Peer has be dened in [Ora01] and rened in [SW04] as follows.
A self-organising system of equal, autonomous entities (peers) which aim for the shared usage of distributed resources in a network environment avoiding central services. This basic denition gives peer-to-peer networks the following desirable properties.
1. Self-organisation. Nodes are able to interact with each other without any central control. This is the main property of peer-to-peer systems which shifts the paradigm away from the client-server paradigm.
2. Decentralised resource usage. The available resources of nodes (CPU, storage, and bandwidth) are distributed and shared with the best eort regarding its load distribution.
3. Scalability. It achieves the denition of scalability stated above (see denition 1).
4. Fault tolerance. The failure of a single node must not compromise the correct operation of the system.
Usually, a resource r is represented by a resource identier id(r) which belongs to a resource identier space I(r). The basic functions supported by most of peer-to-peer networks are the following.
Put(n; r): Add the resource r on the node n.
Get(n; id(r)): Acquire the resource represented by the identier id(r) from the node n.
Query(q; nq): Get all the node identiers of nq which hold resources that accomplish the query predicate q.
Many solution towards achieving these features have been proposed in the literature. They can be classied in two main groups: unstructured overlay networks and structured overlay networks
Unstructured P2P overlay networks
In an unstructured overlay network, a node can freely join the network independently of its state. These networks typically present a graph structure or a hierarchical structure. The main dierence between these two structures is that in the former all nodes are seen as equal entities, whereas the second adds a layer composed of nodes called super-peers. A super-peer represents a set of peers which interact with the larger system through a central node. Figure 2.2 presents a schema of these two models. A popular example of systems that use this kind of overlay is the rst generation of peer-to-peer le sharing applications such as Gnutella [Rip01]. It is possible to characterise this kind of P2P network through the followings properties.
Network Knowledge. A node knows a set of neighbours (local knowledge). That is, no node maintains the entire state of the system.
Resource Knowledge. There is no implicit knowledge about the location of resources. That is, a node only knows about its own resources.
Search algorithm. Since there is no resource oversight, the search must be propagated through neighbour nodes using ooding techniques such as the breadth-rst search algorithm (BFS) [Cor09]. A search message is forwarded to all neighbours which, in turn, forward it until a time-to-life (TTL) number of steps have been completed. In order to avoid cyclic forwarding, every node can remember recent messages and drop duplicate messages.
Limitations of unstructured overlays
The main advantage of unstructured overlay networks is their low cost of maintenance. However, the lack of resource oversight induces a high message trac for performing searches. In the worst case, all nodes may be visited in order to know if a resource exists in the system. Several improvements have been proposed to reduce the number of nodes visited in the BFS algorithm. They include selective forwarding [SBR04], iterative deepening [YGM02], and random walk [LCC+02]. These algorithms signicantly reduce the cost of the search, but they still may require to visit all the nodes in order to nd a data item.
Structured P2P overlay networks
Structured P2P overlay networks [SMK+01, RD01, RFH+01] aim to address the lack of resource knowledge observed on unstructured overlays in order to reduce the high lookup message trac cost. They achieve this goal through a strong network topology which indexes resources to nodes using hash functions. Concretely, a hash function H : id ! S maps node identiers and resource identiers to the same domain space S. This class of solutions are called Distributed Hash Tables (DHTs).
A DHT is a distributed data structure which provides scalability, fault tolerance, and load balancing for internet-scale applications. Most DHTs provide support for the followings three operations.
Get(key).
Delete(key).
Put(key; value).
When a node joins a DHT, it receives an identier called nodeId, in the domain S = [0; 2s 1]. The value of s depends on the number of bits used in the hash function. Every node maintains a state composed of a routing table and a neighbour set. The routing table stores the nodeIds and IP addresses of a small fraction of nodes compared to the total number of nodes in the network. The neighbour set stores the nodeIds and IP addresses of close nodes in the identier space S. This state supports a routing interface that can reach any key k in O(log(N)) routing hops, where N is the number of nodes in the network.
Fault tolerance is achieved through replication on a set of nodes called replica set. Several implementations of this distributed data structure have been proposed in the literature. Pastry [RD01], Chord [SMK+01], and CAN [RFH+01] are prominent examples DHT implementations and they will be brie y reviewed in the next sections.
Comparison of DHTs architectures
Pastry [RD01] and Chord [SMK+01] achieve similar properties. They provide a lookup interface that can route messages in O(log(N)) routing hops with a space cost of O(log(N)) links per node. However, the former presents advantages in terms of network latency and fault tolerance. This is due to two strategies used in the design of the routing table. (i) the distance metric used to choose an entry of the table, and (ii) the diversity of routes of the Pastry routing table compared to the nger table. On the other hand, CAN [RFH+01] proposes a dierent approach which provides a lookup interface that can route messages in O(d N1=n) with a cost of O(d) links. The main advantage of this overlay is that the number of links maintained per node does not depend of the total amount of nodes N. Furthermore, choosing d = log(N) the lookup latency and cost can be comparable to Pastry [RD01] and Chord [SMK+01]. However, the architecture proposed in CAN [RFH+01] assumes a xed value of d (dened at startup) and therefore d cannot vary as N does. Therefore, a variation of the number of nodes can degrade the latency of lookups.
Limitations of Distributed Hash Tables
DHTs provide a simple but powerful building block that supports a highly scalable get/put interface for internet-based applications. This interface has allowed the implementationof diverse services such as publish/suscribe systems [CDKR02], le systems [SAZ+02], indirection systems [SAZ+02], among many others. However, more complex queries such as location-temporal range queries are dicult to implement because the hash function destroys data locality for the sake of load balancing. Building data locality in a multi dimensional space while maintaining the base properties of scalability, fault tolerance, random access, and self-organisation is a dicult task. In the next sections, dierent approaches towards supporting these queries will be reviewed. These solutions can be classied into three groups: (i) Over-DHT solutions, (ii) Overlay-dependent solutions, and (iii) Non-DHT solutions.
Over-DHT solutions
Over-DHT solutions aim to build indexing structures using the DHT as a building block in order to inherit all its desirable properties. They can be classied into mainly two groups: Distributed prex trees and distributed binary trees. Although most of these solutions do not target multidimensional data, they can use Space Filling Curves (SFC) [Sam06] in order to process a n-dimensional range query as an one-dimensional range query. Space Filling curves(SFC) [Sam06] allow to map an n-dimensional space to a one- dimensional space while loosely preserving data locality. That is, close points in the multidimensional space are relatively close in the one-dimensional space. Examples of these curves are z-ordering curves [Mor66] and Hilbert curves [Hil91]. The main drawback of SFCs is that the locality properties of the curve degrade with the number of dimensions [Sam06]. The consequence of this fact is that a variable amount of false positives can be introduced.
Figure 2.7 presents two examples of query processing using these curves. The query Q1 is performed without false positives because it matches the trajectory of the query in both cases, the query Q2 introduces more false positives using Z-ordering, and the query Q3 introduces more false positives using a Hilbert curve.
DHT-dependent solutions
DHT-dependent solutions dier from over-DHT solutions in the sense that they aim to adapt the DHT instead of using it as a building block in order to support range queries. This section presents a brief summary of these solutions as well as their main strengths and drawbacks.
Table of contents :
1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Outline
1.4 Publications
2 State of the Art: Location-temporal Range Queries on P2P Networks
2.1 Peer-to-Peer (P2P) Networks
2.1.1 Unstructured P2P overlay networks
2.1.1.1 Limitations of unstructured overlays
2.1.2 Structured P2P overlay networks
2.1.2.1 Chord
2.1.2.2 Pastry
2.1.2.3 CAN
2.1.2.4 Comparison of DHTs architectures
2.2 Location-temporal range queries over DHTs
2.2.1 Limitations of Distributed Hash Tables
2.3 Over-DHT solutions
2.3.1 Distributed Prex Trees
2.3.1.1 Prex Hash Tree
2.3.1.2 Place Lab
2.3.1.3 LIGHT
2.3.1.4 DRING
2.3.1.5 ECHO
2.3.1.6 Limitations of Distributed Prex Trees
2.3.2 Binary Trees
2.3.2.1 Distributed Segment Tree
2.3.2.2 Range Search Tree
2.4 DHT-dependent solutions
2.4.1 MAAN
2.4.2 Mercury
2.4.3 Squid
2.4.4 Saturn
2.5 Non-DHT solutions
2.5.1 Skip-List based solutions
2.5.2 Tree-based solutions
2.5.3 Voroni-Based Solutions
2.6 Discussion
2.6.1 Over-DHT solutions
2.6.2 DHT-dependent solutions
2.6.3 Non-DHT solutions
2.7 Summary
3 Big-LHT: An index for n-recent geolocalized queries
3.1 Introduction
3.2 Design
3.2.1 Query denition
3.2.2 Indexing structure
3.2.3 Mapping
3.2.4 LHT maintenance
3.2.5 Managers maintenance
3.2.6 Query processing
3.3 Evaluation
3.3.1 Theoretical evaluation
3.3.2 Experimental evaluation
3.4 Discussion
3.4.1 Advantages
3.4.2 Limitations
3.5 Conclusion
4 GeoTrie: An index for location-temporal range queries
4.1 Introduction
4.2 Design
4.2.1 Query denition
4.2.2 Indexing structure
4.2.3 Mapping
4.2.4 Index maintenance
4.2.5 Query Processing
4.2.6 Caching optimisation
4.3 Evaluation
4.3.1 Theoretical evaluation
4.3.2 Experimental evaluation
4.4 Discussion
4.4.1 Advantages
4.4.2 Limitations
4.5 Conclusion
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
Bibliography