Get Complete Project Material File(s) Now! »
Clustering Quality Evaluation
To evaluate the quality of a clustering method, several quality assessment measures have been proposed [28{30]. The commonly used clustering quality indices are the Silhouette Coe cient, the Silhouette Coe cient with Noise, and the Density-Based Clustering Validation which are de ned in this section. Silhouette Coe cient As clustering is an unsupervised machine learning technique, a ground truth is usually not available. The most frequently used methodology to evaluate the quality of the clustering result is the silhouette analysis [28]. This methodology analyses the resulting clusters by evaluating the similarity between the elements within the cluster and the elements of other clusters. The analytical output is called the Silhouette Coe cient (SC) which is close to 1 for a good clustering.
The silhouette coe cient of each data item i in a resulting cluster cj of the clustering technique, 1 j l, is calculated as follows: SC(i) = b(i) a(i) : (2.18) max(a(i); b(i)).
where a(i) is the mean distance between element i and each of other elements within the same cluster, and b(i) is the minimum mean distance between i and any of other elements in di erent clusters.
Some elements may remain unclustered after the execution of the clustering algorithm. Such elements are considered as noise. Although SC allows powerful insights about how tightly elements are clustered, it does not take outliers that remain noise into account. Thus, an additional penalty measure is needed to evaluate the SC with noise as described in [29], namely SCnoise. The SC with noise penalty is calculated as in Equation 2.19: SCnoise(i) = SC(i): p n : (2.19).
Summary of Machine Learning techniques
In this chapter, three particular machine learning techniques have been presented, namely neural network, TRACLUS, and data aggregation. A neural network is composed of input, hidden, and output layers, and these layers consist of many neurons, which are simply some functions. These functions can contain some non-linear (or complex) operations such as sigmoid, division, etc. Moreover, TRACLUS also involves complex operations such as sine, division, etc. In the next chapters, we introduce some solutions to obtain e cient and accurate/quali ed evaluation results when designing privacy-preserving variants of these machine learning techniques based on cryptographic techniques.
Ideal/Real Simulation paradigm.
In the simulation paradigm, the ideal security setting is outsourcing inputs of both parties who run the protocol to a trusted third party who can perform the computations and return the output. In the real-world setting, the security goal is to show that if an adversary A can attack the protocol in the real world, then the attack can be also performed by an adversary S in the ideal world. Since the attacks of S are not successful in the ideal setting, the attacks in the real world also fail and the protocol is proved to be secure in the real world. In De nitions 3.1.1 and 3.1.2, we provide the formal de nitions for security and indistinguishability from [31].
De nition 3.1.1 (Computational Indistinguishability). Let X(a; ) and Y (a; ) be two probability ensembles where a 2 f0; 1g is the input of the parties and is the security parameter. X(a; ) and Y (a; ) are computationally indistinguishable (i.e. X(a; ) Y (a; )) if there exists a negligible function ( ) for every nonuniform polynomial time algorithm D, and for every a 2 f0; 1g such that jPr [D(X(a; )) = 1] Pr [D(Y (a; )) = 1]j ( ): (3.1).
De nition 3.1.2 (De nition of Security). P1 and P2 are two parties who want to run a protocol over their inputs x and y, respectively to compute a functionality f(x; y) which outputs f1(x; y) and f2(x; y) for P1 and P2, respectively. In the execution of , the view of parties are
view1 (x; y; ) = (x; r1; m1; m2; ; mt); (3.2).
Secure Multi-party Computation
Secure multi-party computation is introduced in early 1980s by Yao [34, 35] with the de nition of secure two-party computation (2PC) and de nes Yao’s Millionnaire problem: Two millionnaires want to learn who is richest one without disclosing their actual wealth. They solve this problem by comparing their wealth using secure computation to ensure that they learn only the richest one and nothing else is revealed. Then, the problem was generalised to multiple parties by Goldreich et al. in [36].
Secure multi-party computation (MPC) is de ned as a system in which a group of data owners can jointly compute a function of their private inputs without disclosing the underlying inputs, but the output of the function. Formally, let P1; : : : ; Pn be n parties and each of them having input x1; : : : ; xn, respectively. As illustrated in Figure 3.1, these parties want to jointly compute function f over all inputs fx1; : : : ; xng and learn the output without revealing their input.
MPC should ensure the following two properties [37], at least: (i) input privacy, i.e., parties’ inputs should remain private and only the output of the function is learned; and (ii) correctness, i.e., even if some parties maliciously act, the correct output is obtained.
Existing MPCs can leverage Oblivious Transfer (OT) [38], Yao’s Garbled circuits [34, 35], or secret sharing (additive or Boolean) [39] which are de ned in the next section.
Yao’s Garbled Circuits
Yao’s sharing (a.k.a. Garbled Circuits (GC)) is a secure two-party computation that allows two parties, Alice and Bob, each of them holding one input x1 or x2, to jointly evaluate a function f(x1; x2) in the presence of a semi-honest adversary. Yao’s GC protocol contains three steps: (i) Converting the function f into a Boolean circuit composed of addition and/or multiplication gates (i.e., a gate is a circuit that realises a Boolean operation such as AND); (ii) garbling the Boolean circuit; and (iii) evaluating the garbled circuit. Let Alice be the Garbler and Bob be the Evaluator. While Steps (i) and (ii) are performed by Alice, Bob computes the last step, namely Step (iii). In more details, Alice builds a garbled version of a circuit for the function f by obfuscating all possible outputs: Alice, the garbler, assigns two keys that correspond to bit values 0 and 1 for each wire of the circuit. Then, Alice computes four ciphertexts for each binary gate with the input and output wires. After obtaining ciphertexts, Alice randomly orders (or permutes) these four outcoming values. The resulting garbled circuit and Alice’s garbled input GI(x1) are sent to Bob. Alice also provides a map from the garbled-circuit outputs to the actual bit values. Once receiving this circuit, Bob, the evaluator, uses 1-out-of-2 OT [38] with Alice to obliviously obtain his garbled circuit value GI(x2) without revealing it to Alice. Bob further evaluates the function f(x1; x2) using GI(x1) and GI(x2).
Homomorphic Encryption
Homomorphic encryption (HE) is a breakthrough cryptographic technique that allows an untrusted third party such as a cloud server to perform the addition and/or multiplication operations over ciphertexts without the need for decrypting or having access to the underlying or resulting data [45{47]. HE is initially proposed by Rivest, Adleman, and Dertouzos in 1978 [48] and further developed by Gentry in 2009 [49, 50]. We denote [:]i to represent an HE ciphertext encrypted using the key of party i.
A typical scenario can be described as follows: The party Alice can delegate some of her computations over sensitive data to an untrusted third party, Bob. Alice encrypts her data with her public key (in green in Figure 3.2) and sends it to Bob. When receiving the data, Bob can evaluate a function over Alice’s encrypted inputs and obtain the encrypted result. Bob sends the result back to Alice who is the only party able to decrypt it with her private (or secret) key (in red in Figure 3.2). Note that a HE scheme can be either symmetric or asymmetric; in this example, we illustrate the asymmetric one.
Table of contents :
Abstract
Resume [Francais]
Acknowledgements
Contents
List of Figures
List of Tables
Acronyms
Publications
1 Introduction
1.1 Machine Learning as a Service
1.2 Data Privacy vs Machine Learning Techniques
1.3 Privacy-preserving protocols for Machine Learning Techniques
1.4 Contributions
1.5 Organisation
2 Machine Learning Techniques
2.1 Neural Networks
2.1.1 Convolutional Layer
2.1.2 Fully Connected Layer
2.1.3 Activation Layer
2.1.4 Pooling Layer
2.1.5 Complementary Functions in Neural Networks
2.1.6 Neural Network Model Structure
2.1.7 Neural Networks Accuracy Evaluation
2.2 Clustering
2.2.1 k-means
2.2.2 DBSCAN
2.2.3 TRACLUS
2.2.4 Clustering Quality Evaluation
2.3 Data Aggregation
2.4 Summary of Machine Learning techniques
3 Cryptographic Techniques
3.1 Security Notions
3.1.1 Notations
3.1.2 Ideal/Real Simulation paradigm.
3.1.3 Adversarial Models and Attacks
3.1.4 Chosen Plaintext Attack
3.2 Secure Multi-party Computation
3.2.1 Oblivious Transfer
3.2.2 Yao’s Garbled Circuits
3.2.3 Secret Sharing
3.2.4 Available Libraries
3.3 Homomorphic Encryption
3.3.1 Partially Homomorphic Encryption
3.3.2 Somewhat Homomorphic Encryption
3.3.3 Fully Homomorphic Encryption
3.3.4 Available Libraries
3.4 Proxy Re-encryption
3.4.1 Homomorphic Proxy Re-encryption
3.5 Multi-key Fully Homomorphic Encryption
3.5.1 Asymmetric Multi-key Fully Homomorphic Encryption
3.5.2 Symmetric Multi-key Fully Homomorphic Encryption
3.6 Threshold Fully Homomorphic Encryption
3.7 Hybrid Protocol
3.8 Summary of Cryptographic techniques
I Privacy-preserving single-server machine learning techniques
4 Privacy-preserving Neural Network Classication
4.1 Introduction
4.2 Privacy vs. Neural Network
4.3 Prior Work
4.3.1 2PC-based solutions
4.3.2 HE-based solutions
4.3.3 Hybrid solutions
4.3.4 Solutions based on other cryptographic techniques
4.4 PAC: Privacy-preserving Arrhythmia Classication with neural networks .
4.4.1 Problem Statement
4.4.2 PAC: Description
4.4.3 Discussion on Principle Component Analysis
4.4.4 SIMD circuits
4.4.5 Implementation
4.4.6 Security Evaluation
4.4.7 Performance Evaluation
4.4.8 Summary
4.5 SwaNN: Switching among Cryptographic Tools for Privacy-Preserving Neural Network Predictions
4.5.1 Problem Statement
4.5.2 SwaNN: Description
4.5.3 Security Evaluation
4.5.4 Performance Evaluation
4.5.5 Summary
4.6 ProteiNN: Privacy-preserving one-to-many Neural Network classication .
4.6.1 Problem Statement
4.6.2 Threat Model
4.6.3 ProteiNN: Description
4.6.4 Security Evaluation
4.6.5 Performance Evaluation
4.6.6 Summary
4.7 Conclusion of privacy-preserving neural network classication
5 Privacy-preserving Clustering
5.1 Introduction
5.2 Privacy vs. Clustering
5.3 Prior Work
5.3.1 k-means
5.3.2 DBSCAN
5.3.3 Trajectory Analysis
5.4 pp-TRACLUS: Privacy-preserving TRAjectory CLUStering
5.4.1 pp-TRACLUS: Description
5.4.2 Security Evaluation
5.4.3 Performance Evaluation
5.4.4 Summary
5.5 Conclusion of privacy-preserving clustering
II Privacy-preserving two-server machine learning techniques
6 Privacy-preserving Neural Network Classication
6.1 Introduction
6.2 Prior Work
6.3 Two-server SwaNN
6.3.1 SwaNN: Description
6.3.2 Security Evaluation
6.3.3 Performance Evaluation
6.3.4 Summary
6.4 Conclusion of privacy-preserving neural network classication
7 Privacy-preserving Clustering
7.1 Introduction
7.2 Privacy vs. Clustering
7.3 Prior Work
7.4 Two-server pp-TRACLUS
7.4.1 Problem Statement
7.4.2 PHE-based pp-TRACLUS: Description
7.4.3 2PC-based pp-TRACLUS: Description
7.5 Conclusion of privacy-preserving clustering
8 Privacy-preserving Data Aggregation
8.1 Introduction
8.2 Privacy vs. Data Aggregation
8.3 Prior Work
8.3.1 DP (and HE)-based solutions
8.3.2 HE-based solutions
8.3.3 MPC/2PC-based solutions
8.3.4 Hybrid solutions
8.4 PRIDA: PRIvacy-preserving data aggregation with multiple Data Analysers
8.4.1 Problem Statement
8.4.2 PRIDA: Detailed description
8.4.3 Security Evaluation
8.4.4 Performance Evaluation
8.5 Conclusion of privacy-preserving data aggregation
9 Conclusion Remarks and Future Research
9.1 Summary
9.2 Future Work
Appendices
A Resume Francais
A.1 Apprentissage automatique en tant que service
A.2 Condentialite des donnees vs techniques d’apprentissage automatique .
A.3 Protocoles preservant la condentialite pour les techniques d’apprentissage automatique
A.4 Contributions