Privacy-preserving frequency mining protocol based on elliptic curve ElGamal cryptosystem

Abstract. Privacy-preserving frequency mining is a simple technique, but it is very useful in privacy-preserving machine learning and data mining. In this paper, we construct a protocol for privacy-preserving frequency mining in the fully distributed setting, based on the elliptic curve analog of the ElGamal system. In comparison to the original protocol of Yang et al., our solution has much lower communication overhead. Moreover, the experiments show that the execution time of our proposed solution is also lower than that of the original one.

HNUE JOURNAL OF SCIENCE
Natural Sciences 2018, Volume 63, Issue 11, pp. 87-94
DOI: 10.18173/2354-1059.2018-0075

Vu Duy Hien 1, Luong The Dung 2, Ho Tu Bao 3 and Nguyen Chung Tien 2
1 Faculty of Management Information Systems, Banking Academy of Vietnam
2 Faculty of Information Security, Academy of Cryptography Techniques
3 School of Knowledge Science, Japan Advanced Institute of Science and Technology

Keywords: Privacy-preserving data mining, Secure multi-party computation, Elliptic curve cryptosystem.

1. Introduction

The term data mining has been used in the database community since the 1990s. It refers to discovering knowledge from large datasets. However, for data that contains sensitive and private information (e.g., patients' disease information, customers' income), the traditional data mining process is unsuitable. The issue of privacy preservation in data mining has therefore attracted a lot of attention from the research community. This field is called privacy-preserving data mining (PPDM for short). A privacy-preserving data mining solution has three basic properties:
Accuracy: the accuracy of the output result is not lost.
Privacy: the sensitive and private information is not disclosed.
Efficiency: the performance of the PPDM solution is high enough for developing practical applications.

Of these, accuracy and privacy are strictly required. There are two approaches to constructing a PPDM solution: the perturbation-based approach and the cryptography-based approach. Solutions based on perturbation are very efficient, but they trade off privacy against accuracy. For PPDM solutions based on cryptography, the privacy of the data holders is safely preserved and the accuracy of the output is guaranteed, but the performance is quite poor [1].

Received October 23, 2018. Revised November 5, 2018. Accepted November 12, 2018.
Contact Vu Duy Hien, e-mail: hienvd@bav.edu.vn.

In this work, we focus on the cryptography-based privacy-preserving frequency mining (PPFM for short) protocol, a simple technique that is very useful in privacy-preserving machine learning and data mining [2]. Furthermore, we consider the PPFM solution for the fully distributed setting, where the data set is distributed across a large number of users and each record is held by only one party.

In the literature, many cryptographic solutions have been proposed for PPFM in the fully distributed setting. They are used to construct practical applications such as ID3 tree and association rule mining [2], the Naive Bayes classifier [2], and electronic voting systems [3-5]. To the best of our knowledge, the first cryptographic protocol for PPFM in the fully distributed scenario was introduced by Yang et al. [2]. This solution does not need communication channels between different users, nor does it require multi-round interaction between any party and the miner. In addition, this protocol provides strong privacy for each user without loss of accuracy. However, because the solution of Yang et al. [2] is based on the ElGamal cryptosystem, its performance is quite poor. Later, Hao et al.
proposed a series of voting systems [3, 4] based on a privacy-preserving frequency counting protocol called the 2-round anonymous veto [6]. These protocols can safely protect the information of each voter's ballot. Moreover, they also guarantee that the voting result is counted correctly. However, the computational complexity and communication cost of each voter in [3] are very high. Inspired by the works [6] and [3], the authors developed the voting scheme [4], which uses the DRE-i system to compute the restructured public key for each voter. The voters' costs are thereby greatly reduced, but the total computational complexity of the voting system increases, and the performance of [4] is even poorer than that of [2]. Based on Boneh-Franklin identity-based encryption, Wu et al. constructed a privacy-preserving protocol [7] for mining support counts in the fully distributed scenario. The authors show that this protocol is very efficient and practical, but its privacy is not guaranteed since the secret master key s is known by all parties. Several other protocols [8, 11] based on the same idea as PPFM have been proposed. However, these solutions offer a low level of privacy, since they need a trusted third party. Recently, Hao et al. proposed the verifiable classroom voting system [5], which is also based on the elliptic curve analog of the ElGamal system. Although the computational complexity and communication cost of each voter are optimized, the total computational complexity of the voting system is equal to that of the protocol [4].

In brief, most existing solutions for PPFM in the fully distributed setting trade off privacy against efficiency. Therefore, it is important to develop efficient PPFM solutions for the fully distributed setting in which accuracy stays intact and privacy is still safely protected.

In this paper, our main goal is to develop an efficient solution for PPFM in the fully distributed setting.
To achieve this goal, we first redesign the original PPFM protocol of Yang et al. [2]. Next, we optimize the redesigned PPFM protocol based on the elliptic curve analog of the ElGamal system. As a result, the performance of our solution is better than that of [2]. To illustrate the efficiency of our solution, we implement it to compute the frequency value for different numbers of users, from 2000 to 10000.

2. Content

2.1. Preliminaries

* Problem definition

In the fully distributed setting, there are $n$ users $U_1, \ldots, U_n$, in which each user $U_i$ holds a private boolean value $u_i$, and a miner who needs to find the sum of all users' private values $s = \sum_{i=1}^{n} u_i$. Inspired by the work of Yang et al. [2], we design a PPFM protocol based on the elliptic curve analog of the ElGamal system that allows the miner to compute the value $s$ without learning the private values $u_i$.

* Definition of privacy

In this study, our protocol is based on the semi-honest model, in which each user must follow the rules of the protocol, but any user may be corrupted. Thus, we have the following definition of privacy for frequency mining in the fully distributed setting [2, 12]:

Definition 1. Assume that each user $U_i$ has private keys $p_i, q_i$ and public keys $P_i, Q_i$. A frequency mining protocol protects each user's privacy against the miner and $t$ corrupted users in the semi-honest model if, for all $I \subseteq \{1, \ldots, n\}$ such that $|I| = t$, there exists a probabilistic polynomial-time algorithm $M$ such that:

$$\{M(s, [u_i, p_i, q_i]_{i \in I}, [P_j, Q_j]_{j \notin I})\} \equiv_c \{\mathrm{view}_{\mathrm{miner}, \{U_i\}_{i \in I}}([u_i, p_i, q_i]_{i=1}^{n})\}$$

where $\equiv_c$ denotes computational indistinguishability.

This definition states that the computation is secure and the honest users' privacy is guaranteed if the miner and the corrupted users learn nothing beyond what follows from the output $s$ and the public values of the honest users.

* Elliptic curve analog of the ElGamal system

In this section, we review the elliptic curve analog of the ElGamal system [13], which is the main building block of our solution.
Let $E$ be an elliptic curve over a finite field $\mathbb{F}_p$ with a point at infinity $\mathcal{O}$ and $q$ be a large prime, such that the discrete logarithm problem on the elliptic curve is hard. In addition, $G$ is a base point of the elliptic curve $E$ with order $q$ (i.e., $qG = \mathcal{O}$). The private key is a random number $x \in [1, q-1]$ and the corresponding public key is the curve point $Y = xG$. To encrypt the plaintext $m$ with the public key $Y$, the sender randomly chooses $k$ from $[1, q-1]$ and computes the ciphertext $(C_1, C_2)$, where $M$ is a point of $E$ representing $m$, $C_1 = M + kY$ and $C_2 = kG$. To decrypt the ciphertext $(C_1, C_2)$ using the private key $x$, the receiver computes $M = C_1 - xC_2$, in which $xC_2 = xkG = kY$. Under the decisional Diffie-Hellman assumption for the curve $E$, the elliptic curve analog of the ElGamal system is semantically secure.

2.2. Privacy-preserving frequency mining protocol in fully distributed setting

2.2.1. Setup

Let $E$ be an elliptic curve with a point at infinity $\mathcal{O}$ and $d$ be a large prime, such that the discrete logarithm problem on the elliptic curve is hard. In addition, $G$ is a base point of the elliptic curve $E$ with order $d$ (i.e., $dG = \mathcal{O}$).

Each user $U_i$ keeps a private value $u_i \in \{0,1\}$. Nobody but the user himself knows this value. Before the PPFM protocol starts, each user $U_i$ chooses two private keys $p_i, q_i \in [1, d-1]$ and then computes the corresponding public keys $P_i = p_iG$, $Q_i = q_iG$. These public keys are sent to the miner before the protocol starts.

2.2.2. Protocol

The PPFM protocol in the fully distributed setting consists of three main phases, described in Figure 1.

PHASE 1: PRE-COMPUTING
• Miner pre-computes the public values: $P = \sum_{i=1}^{n} P_i$, $Q = \sum_{i=1}^{n} Q_i$
• Miner → $U_i$: $P$, $Q$
PHASE 2: COMPUTING THE MESSAGE
• $U_i$ computes: $M_i = u_iG + q_iP - p_iQ$
• $U_i$ → Miner: $M_i$
PHASE 3: SECURE FREQUENCY COMPUTATION
• Miner computes: $M = \sum_{i=1}^{n} M_i$, and finds $s$ such that $sG = M$.

Figure 1. A privacy-preserving frequency mining protocol for fully distributed setting

2.2.3. Proof of correctness

In this section, we show that the final output of the PPFM protocol in the fully distributed setting based on the elliptic curve analog of the ElGamal system is the sum of all parties' private values. To do this, we prove the following theorem.

Theorem 1. The protocol for privacy-preserving frequency mining presented in Figure 1 exactly counts the number of 1's among all users' inputs.

Proof. We show that, in this protocol, if the miner finds a value $s$ such that $sG = M$, then $s$ is the sum of all parties' private values. Suppose that $sG = M$. Then:

$$
\begin{aligned}
sG &= \sum_{i=1}^{n} M_i \\
sG &= \sum_{i=1}^{n} \left(u_iG + q_iP - p_iQ\right) \\
sG &= \sum_{i=1}^{n} u_iG + \sum_{i=1}^{n} \Big(q_i \sum_{j=1}^{n} P_j - p_i \sum_{j=1}^{n} Q_j\Big) \\
sG &= \sum_{i=1}^{n} u_iG + \sum_{i=1}^{n} q_i \sum_{j=1}^{n} p_jG - \sum_{i=1}^{n} p_i \sum_{j=1}^{n} q_jG \\
sG &= \sum_{i=1}^{n} u_iG
\end{aligned}
$$

Thus, $sG = \left(\sum_{i=1}^{n} u_i\right)G$, and therefore $s = \sum_{i=1}^{n} u_i$. Note that the value of $s$ is not too large (it is at most $n$), so it can be found by brute force.

2.2.4. Privacy analysis

In this section, we first prove that the PPFM protocol in the fully distributed setting protects each honest user's privacy in the semi-honest model under the necessary assumptions. Then, we show that this protocol still preserves each honest user's privacy when some parties collude with the miner.

Recall that each user $U_i$ sends only a point $M_i$, which is the ciphertext of his private value $u_i$. This point is given by the following equation:

$$M_i = (u_iG - p_iQ) + q_i \sum_{j=1}^{n} P_j$$

It is easy to see that the ciphertext $M_i$ is equivalent to the first part of an elliptic curve ElGamal ciphertext in which the message is $u_iG - p_iQ$, the private key is $\sum_{j=1}^{n} p_j$ and $q_i$ is uniformly chosen at random from $[1, d-1]$. Under the decisional Diffie-Hellman assumption on the elliptic curve, the elliptic curve analog of the ElGamal cryptosystem is semantically secure. Thus, our protocol preserves each honest user's privacy in the semi-honest model.

Next, we prove that the new privacy-preserving sum protocol protects each user's privacy (even if up to $n-2$ users collude with the miner) as long as the elliptic curve analog of the ElGamal encryption scheme is secure.
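The setup, the three phases of Figure 1 and the correctness argument of Theorem 1 can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not the authors' C# implementation. The names (`u` for a private bit, `p` and `q` for private keys, `P_sum` and `Q_sum` for the miner's pre-computed points) and the message form $M_i = u_iG + q_iP - p_iQ$ are assumed by analogy with the protocol of Yang et al. [2]. The curve is secp256k1, chosen only because its parameters are widely published; it is not the safe curve [14] used in the experiments. The point arithmetic is neither constant-time nor optimized.

```python
import secrets

# secp256k1 domain parameters (an assumption, for illustration only).
FIELD = 2**256 - 2**32 - 977            # prime modulus of the base field
A, B = 0, 7                             # curve: y^2 = x^3 + A*x + B
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
ORDER = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

def on_curve(pt):
    """Check the curve equation; None stands for the point at infinity."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % FIELD == 0

def add(p1, p2):
    """Add two curve points in affine coordinates."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % FIELD == 0:
        return None                     # p2 is the negative of p1
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % FIELD, -1, FIELD) % FIELD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % FIELD, -1, FIELD) % FIELD
    x3 = (lam * lam - x1 - x2) % FIELD
    return (x3, (lam * (x1 - x3) - y1) % FIELD)

def neg(pt):
    """Negate a curve point."""
    return None if pt is None else (pt[0], -pt[1] % FIELD)

def mul(k, pt):
    """Scalar multiplication k*pt by double-and-add."""
    k %= ORDER
    result = None
    while k:
        if k & 1:
            result = add(result, pt)
        pt = add(pt, pt)
        k >>= 1
    return result

def keygen():
    """Setup: private keys p, q in [1, ORDER-1]; public keys p*G, q*G."""
    p = secrets.randbelow(ORDER - 1) + 1
    q = secrets.randbelow(ORDER - 1) + 1
    return (p, q), (mul(p, G), mul(q, G))

def user_message(u, p, q, P_sum, Q_sum):
    """Phase 2 (assumed form): M_i = u*G + q*P_sum - p*Q_sum."""
    return add(add(mul(u, G), mul(q, P_sum)), neg(mul(p, Q_sum)))

def recover_sum(total, n):
    """Phase 3: brute-force the s in [0, n] with s*G == total."""
    acc = None
    for s in range(n + 1):
        if acc == total:
            return s
        acc = add(acc, G)
    raise ValueError("no sum in range; some message was malformed")

def run_protocol(values):
    """Run all phases in one process for a list of private bits."""
    keys = [keygen() for _ in values]
    P_sum = Q_sum = None                # Phase 1: miner sums the public keys
    for _, (P_i, Q_i) in keys:
        P_sum, Q_sum = add(P_sum, P_i), add(Q_sum, Q_i)
    total = None                        # miner adds up the received points
    for u, ((p, q), _) in zip(values, keys):
        total = add(total, user_message(u, p, q, P_sum, Q_sum))
    return recover_sum(total, len(values))

if __name__ == "__main__":
    print(run_protocol([1, 0, 1, 1, 0]))
```

In this sketch the masking terms cancel only in the sum over all users, so the miner needs every user's point to recover $s$, and the brute-force search in `recover_sum` is bounded by the number of users.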
We have the following theorem:

Theorem 2. The protocol for privacy-preserving frequency mining in the fully distributed setting presented in Figure 1 protects each honest user's privacy against the miner and up to $n-2$ corrupted users.

Proof. We construct a simulator $M$, a polynomial-time algorithm that simulates the joint view of the miner and the corrupted users. In particular, we give an algorithm that computes the view of the miner and the corrupted users in polynomial time using only the final output $s$, the corrupted users' knowledge, the public keys, and some encryptions under the elliptic curve analog of the ElGamal system. Therefore, combining our algorithm with a simulator for the ciphertexts, we obtain a complete proof.

Without loss of generality, we assume that $U_1$ and $U_2$ do not collude and $I = \{3, \ldots, n\}$. In the protocol presented in Figure 1, each user $U_i$ sends only a point $M_i$ to the miner. So our algorithm only simulates the computation of $M_1$ and $M_2$. Below are the computations of simulator $M$, based on the view of the miner and the corrupted users and using some encryptions as its input. Simulator $M$ computes as follows: ∑ ∑ ∑ ∑

Thus, by Definition 1, our PPFM protocol for the fully distributed scenario is semantically secure.

2.2.5. Performance evaluation

In this section, we implement our solution and the original protocol [2] in the C# language in the Visual Studio environment, using the System.Numerics namespace, to compare their performance (i.e., communication overhead and time complexity). Note that all public key operations in our protocol are defined over a safe curve [14], and the protocol [2] uses private keys and public keys that have the same security level as this curve. Moreover, our experiments run on a laptop with an Intel Core processor.

For the communication overhead comparison, we consider the number of communication messages and their length (in bits) in all phases of our solution and the protocol [2].
For the time complexity comparison, we measure the total execution time of each protocol for different numbers of users, from 2000 to 10000. This time consists of the time for each user to perform the necessary computations and the time required by the miner. We assume that all users perform their tasks at the same time, and network latency is not included in the total execution time.

* Communication overhead

Consider the protocol of Yang et al. [2]. Before this protocol starts, each user needs to send two public keys to the miner. After the miner computes two public keys, he sends these keys to all users. In the first phase of [2], each user also needs to send two values to the miner. Because each public key is 3072 bits long, the protocol [2] exchanges 6n messages using 18432n bits, where n is the number of users.

For our solution, before it starts, each user needs to send two public keys (i.e., two points) to the miner. Next, in the first phase, the miner computes two public keys and then sends them to all users. In the second phase, each user only needs to send one point to the miner. Because each point of the curve consists of two elements, each of which is 256 bits long, our solution exchanges 10n messages using 2560n bits, where n is the number of users.

Table 1 presents the communication overhead comparison between our solution and Yang et al.'s protocol [2]. We can see that our solution exchanges more messages than the protocol of Yang et al. However, the proposed solution transfers far fewer bits than the protocol [2] (2560n versus 18432n, a factor of 7.2).

Table 1. The communication overhead comparison between our solution and Yang et al.'s protocol

Protocols            The number of messages    The number of bits
The protocol [2]     6n                        18432n
Our solution         10n                       2560n

* Time complexity of the protocol

As presented before, the new protocol is an improvement of the solution [2].
In particular, in Yang et al.'s protocol, each user must compute two values and send them to the miner. Based on these pairs of values, the miner computes the product of the values. Hence, the computational complexity of the miner is high. Unlike the protocol [2], in our solution each user computes only a single point and the miner only computes the sum of these points. This increases each user's computational complexity only negligibly, while the computational complexity of the miner is greatly reduced. Thus, the total execution time of our protocol is much lower than that of the original protocol of Yang et al., as shown in Figure 2.

Figure 2. Comparison of the time to compute the frequency value in the fully distributed setting between our solution and Yang et al.'s protocol (horizontal axis: the number of users, from 2000 to 10000; vertical axis: time in milliseconds)

According to the comparison results, we can state that our solution is more efficient than the protocol of Yang et al. [2].

3. Conclusions

In this paper, we proposed a protocol based on the elliptic curve analog of ElGamal encryption for privacy-preserving frequency mining in the fully distributed scenario. Our solution has a lower communication cost than the original protocol of Yang et al. We also ran several experiments to evaluate the performance of the new solution. The experimental results show that our protocol is much more efficient than the original one. As is well known, privacy-preserving frequency mining is quite simple but very useful in data mining applications. This work therefore helps data miners construct PPDM solutions that require strong privacy and high efficiency without loss of accuracy.

REFERENCES

[1] R. Mendes and J. P. Vilela, 2017. Privacy-preserving data mining: methods, metrics, and applications. IEEE Access, Vol. 5, pp. 10562-10582.
[2] Z. Yang, S. Zhong, and R. N. Wright, 2005. Privacy-preserving classification of customer data without loss of accuracy.
Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 92-102.
[3] F. Hao, P. Y. Ryan, and P. Zieliński, 2010. Anonymous voting by two-round public discussion. IET Information Security, Vol. 4, No. 2, pp. 62-67.
[4] F. Hao, M. N. Kreeger, B. Randell, D. Clarke, S. F. Shahandashti, and P. H.-J. Lee, 2014. Every vote counts: Ensuring integrity in large-scale DRE-based electronic voting. The USENIX Journal of Election Technology and Systems, Vol. 2, No. 3, pp. 1-25.
[5] F. Hao, D. Clarke, B. Randell, and S. F. Shahandashti, 2018. Verifiable classroom voting in practice. IEEE Security & Privacy, Vol. 16, No. 1, pp. 72-81.
[6] F. Hao and P. Zieliński, 2006. A 2-round anonymous veto protocol. International Workshop on Security Protocols, Springer, pp. 202-211.
[7] F. Wu, J. Liu, and S. Zhong, 2009. An efficient protocol for private and accurate mining of support counts. Pattern Recognition Letters, Vol. 30, No. 1, pp. 80-86.
[8] E. Shi, H. Chan, E. Rieffel, R. Chow, and D. Song, 2011. Privacy-preserving aggregation of time-series data. Annual Network and Distributed System Security Symposium, Internet Society, pp. 1-17.
[9] Q. Li, G. Cao, and T. F. La Porta, 2014. Efficient and privacy-aware data aggregation in mobile sensing. IEEE Transactions on Dependable and Secure Computing, Vol. 11, No. 2, pp. 115-129.
[10] F. Benhamouda, M. Joye, and B. Libert, 2016. A new framework for privacy preserving aggregation of time-series data. ACM Transactions on Information and System Security, Vol. 18, No. 3, pp. 10:1-10:21.
[11] T. Jung, J. Han, and X.-Y. Li, 2016. PDA: Semantically secure time-series data analytics with dynamic subgr