Abstract. Privacy-preserving frequency mining is a quite simple technique, but it is very
useful in privacy-preserving machine learning and data mining. In this paper, we construct an
elliptic curve analog of the ElGamal system-based protocol for privacy-preserving frequency
mining in fully distributed setting. In comparison to the original protocol of Yang et al., our
solution has much lower communication overhead. Moreover, the experiments show that the
executing time of our proposed solution is also lower than that of the original one.

8 trang |

Chia sẻ: thanhle95 | Lượt xem: 30 | Lượt tải: 0
Bạn đang xem nội dung tài liệu **Privacy-preserving frequency mining protocol based on elliptic curve elgamal cryptosystem**, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên

87
HNUE JOURNAL OF SCIENCE DOI: 10.18173/2354-1059.2018-0075
Natural Sciences 2018, Volume 63, Issue 11, pp. 87-94
This paper is available online at
PRIVACY-PRESERVING FREQUENCY MINING PROTOCOL BASED
ON ELLIPTIC CURVE ELGAMAL CRYPTOSYSTEM
Vu Duy Hien
1
, Luong The Dung
2
, Ho Tu Bao
3
and Nguyen Chung Tien
2
1
Faculty of Management Information Systems, Banking Academy of Vietnam
2
Faculty of Information Security, Academy of Cryptography Techniques
3
School of Knowledge Science, Japan Advanced Institute of Science and Technology
Abstract. Privacy-preserving frequency mining is a quite simple technique, but it is very
useful in privacy-preserving machine learning and data mining. In this paper, we construct an
elliptic curve analog of the ElGamal system-based protocol for privacy-preserving frequency
mining in fully distributed setting. In comparison to the original protocol of Yang et al., our
solution has much lower communication overhead. Moreover, the experiments show that the
executing time of our proposed solution is also lower than that of the original one.
Keywords: Privacy-preserving data mining, Secure multi-party computation, Elliptic curve
cryptosystem.
1. Introduction
The term data mining has appeared in the database community since 1990s. This term aims to
discover knowledge from large datasets. However, for the data that contains the sensitive and private
information (e.g., the patients' disease information, the customers' income), traditional data mining
process is incompatible. So, the issues of privacy preservation in data mining has attracted a lot of
attention from the research community. This called privacy-preserving data mining (PPDM for short).
Basically, a privacy-preserving data mining solution has three basic properties as follows:
Accuracy: the accuracy of output result is not lost.
Privacy: the sensitive and private information is not disclosed.
Efficiency: the PPDM solution’s performance is high enough to be used to develop the practical
applications.
Where the accuracy and privacy characteristics are strictly required.
There are two approaches to construct a PPDM solution: perturbation-based and cryptographic-
based approaches. The solutions based on the perturbation approach are very efficient, but have a
trade-off between privacy and accuracy. For the PPDM solutions based on cryptography, the privacy
of data holders is safely preserved and the output result is accurately guaranteed, but the performance
is quite poor [1].
Received October 23, 2018. Revised November 5, 2018. Accepted November 12, 2018.
Contact Vu Duy Hien, e-mail: hienvd@bav.edu.vn.
Vu Duy Hien, Luong The Dung, Ho Tu Bao and Nguyen Chung Tien
88
In this work, we focus cryptography-based privacy-preserving frequency mining (PPFM for short)
protocol that is a quite simple technique, but it is very useful in privacy-preserving machine
learning and data mining [2]. Furthermore, we consider the PPFM solution for fully distributed
setting where the data set is distributed across a large number of users, and each record is only held by
one party.
In the literature, many cryptographic solutions have proposed for PPFM in fully distributed
setting. They are used to construct the practical applications such as ID3 tree and association rules
mining [2], Naive Bayes classifier [2], electronic voting system [3-5].
To the best of our knowledge, the first cryptographic protocol for PPFM in fully distributed
scenario was introduced in [2] by Yang et al. This solution does not need communication
channels between different users. It also does not require multi-round interaction between any
party and the miner. In addition, this protocol provides strong privacy for each user without loss
of accuracy. However, because the solution of Yang et al. [2] is based on ElGamal cryptosystem,
so the performance of [2] is quite poor.
Lately, Hao et al. proposed a series of election voting systems [3, 4] based on a privacy-
preserving frequency counting protocol that called 2-round anonymous veto [6]. These protocols
can safely protect the information of each voter’s ballot. Moreover, they also guarantee that the
voting result is counted correctly. However, the computational complexity and communication
cost of each voter in [3] are very expensive. Inspiring from the works [6] and [3], the authors
developed the voting scheme [4] using the DRE-i system to compute the restructured public key
for each voter. So the voters’ costs reduce greatly, but the total computational complexity of
voting system increases, even the performance of [4] is poorer than that of [2].
Based on Boneh-Franklin identity-based encryption, Wu et al. constructed a privacy-
preservation protocol [7] for mining of support counts in fully distributed scenario. The authors
show that this protocol is very efficient and practical, but its privacy is not guaranteed since the
secret master key s is known by all parties. Several other protocols [8, 11] that have the same ideal
with PPFM have proposed. However, these solutions have the low privacy level, since they need
to use a trusted third party.
Recently, Hao et al. proposed the verifiable classroom voting system [5] that is also based on
elliptic curve analog of the ElGamal system. Although the computational complexity and
communication cost of each voter is optimized, the total computational complexity of the voting
system is equal to that of the protocol [4].
In briefly, most of existing solutions for PPFM in fully distributed setting have a trade-off
between privacy and efficiency. Therefore, it is very significant to develop the efficient PPFM
solutions for fully distributed setting while the accuracy is intact and the privacy is still protected
safely.
In this paper, our main goal is to develop the efficient solution for PPFM in fully distributed
setting. To obtain this goal, we first redesign the original PPFM protocol mentioned in Yang et al.’s
protocol [2]. Next, we optimize this redesigned PPFM protocol based on elliptic curve analog of
the ElGamal system. And therefore, our solution’s performance is better than that of [2]. To
illustrate the efficiency of our solution, we implement it to compute the frequency value for
different numbers of users from 2000 to 10000.
Privacy-preserving frequency mining protocol based on elliptic curve ElGamal cryptosystem
89
2. Content
2.1. Preliminaries
* Problem definition
In the fully distributed setting, there are users , in which each user holds a
private boolean value , and the miner who needs to find out the sum of all users’ private
values ∑
.
Inspiring from the work of Yang et al. [2], we design elliptic curve analog of the ElGamal
system-based PPFM protocol that allows the miner to compute the value s without knowing the
private values.
* Definition of privacy
In this study, our protocol is based on the semi-honest model that each user must follow the
rules of the protocol, but anyone may be corrupted. Thus, we have the definition of privacy for
frequency mining in fully distributed setting [2, 12] as follows:
Definition 1. Assume that each user has private keys and public keys . A frequency
mining protocol protects each user’s privacy against the miner and corrupted users in the semi-
honest model if, such that , there exists a probabilistic polynomial-time
algorithm M such that:
{ ( [ ] [ ] )}
([ ]
)
Where
is computational indistinguishability.
This definition states that the computation is secure and the honest users’ privacy is
guaranteed, if the miner and the corrupted users learn nothing from the output s and the public
values of the honest users.
* Elliptic curve analog of the ElGamal system
In this section, we review elliptic curve analog of the ElGamal system [13] that is the main
facility to construct our solution.
Let be an elliptic curve over a finite field with a point at infinity and q be a large
prime, in which the discrete logarithm problem on the elliptic curve is hard. In addition, G is a
base point of the elliptic curve E with order q (i.e., ). The private key is the random
number [ ] and the corresponding public key curve point is .
To encrypt the plaintext m, the sender uses the public key to compute the ciphertext
from the plaintext m as follows: he randomly chooses k from [ ] and computes the
ciphertext where is a point of and . To decrypt
the ciphertext using the private key , the receiver may compute , in which
Under the decisional Diffie-Hellman assumption for the curve E, elliptic curve analog of the
ElGamal system is semantically secure.
2.2. Privacy-preserving frequency mining protocol in fully distributed setting
2.2.1. Setup
Let be an elliptic curve with a point at infinity and d be a large prime, in which the
discrete logarithm problem on the elliptic curve is hard. In addition, is a base point of the
elliptic curve E with order d (i.e., ).
Vu Duy Hien, Luong The Dung, Ho Tu Bao and Nguyen Chung Tien
90
Each user keeps a private value {0,1}. Nobody knows this value, beyond him. Before
the PPFM protocol starts, each user chooses two private keys , [ ], after that he
computes the corresponding public keys = , = . These public keys sent to the miner
before the protocol starts.
2.2.2. Protocol
The PPFM protocol in fully distributed setting consists of three main phases described in
Figure 1.
PHASE 1: PRE-COMPUTING
Miner pre-computes the public values:
∑
∑
Miner :
PHASE 2: COMPUTING THE MESSAGE
computes:
Miner:
PHASE 3: SECURE FREQUENCY COMPUTATION
Miner computes:
∑
.
Figure 1. A privacy-preserving frequency mining protocol for fully distributed setting
2.2.3. Proof of correctness
In this section, we show that the final output of the PPFM protocol in fully distributed setting
based on elliptic curve analog of the ElGamal system is the sum of all parties’ private values. To
do this, we prove the following theorem.
Theorem 1. The protocol for privacy-preserving frequency mining presented in Figure 1 exactly
counts the number of 1’s values of all users’ inputs.
Proof. We show that, in this protocol, if the miner finds out a value s, then s is the secure sum
of all parties’ private values.
Suppose that s.G = M. Then:
s.G = ∑
s.G = ∑
s.G = ∑ ∑ ∑
∑
s.G = ∑ ∑ ∑
∑
∑
s.G = ∑
Thus, ∑
, and therefore ∑
. Note that the value of s is not too large,
so it can be computed by the brute-force method.
2.2.4. Privacy analysis
In this section, we first prove that the PPFM protocol in fully distributed setting protects each
honest user’s privacy in the semi-honest model under the necessary assumptions. Then, we show
Privacy-preserving frequency mining protocol based on elliptic curve ElGamal cryptosystem
91
that this protocol still preserves each honest user’s privacy in the case of parties
colluding with the miner.
We recall that, each user only sends a point that is the ciphertext of his private value.
This point is represented as the following equation:
∑
We easily decide that the ciphertext is equivalent to the first part of an elliptic curve
analog of the ElGamal respectively , the private key is
∑ and is uniformly chosen at random from [ ]. Under the decisional Diffie-
Hellman assumption on the elliptic curve, the elliptic curve analog of the ElGamal cryptosystem is
semantically secure. Thus, our protocol preserves each honest user’s privacy in the semi-honest model.
Continuously, we prove that the new privacy-preserving sum protocol protects each user’s
privacy (even if there are up to users colluding with the miner) as long as the elliptic curve
analog of the ElGamal encryption scheme is secure. We have the following theorem:
Theorem 2. The protocol for privacy-preserving frequency mining in fully distributed setting
presented in Figure 1 protects each honest user’s privacy against the miner and up to
corrupted users.
Proof. We construct a simulator M that simulates computing the joint view of the miner and
the corrupted users by a polynomial time algorithm. In particular, we give an algorithm that
computes the view of the miner and the corrupted users in polynomial time only using the final
output s, corrupted users’ knowledge, public keys, and some elliptic curve analog of the ElGamal
encryption. Therefore, combining our algorithm with a simulator for the ciphertexts, we obtain a
complete proof.
Without loss of generality, we assume that and do not collude and . In
the protocol presented in Figure 1, each user only sends a point to the miner. So our algorithm
only simulates the computation for . Below is the computations of simulator M based on
the view of the miner and the corrupted users using some encryption as its input:
.
Simulator M computes as follows:
∑
∑
∑
∑
Thus, following the definition 1, our PPFM protocol for fully distributed scenario is
semantically secure.
2.2.5. Performance evaluation
In this section, we implement our solution and the original protocol [2] in the C# language of
Visual Studio environment, using the System.Numerics namespace to compare the
performance of them (i.e., communication overhead and time complexity). Note that all public
key operations in our protocol are defined over the safe curve [14], and the protocol [2]
uses private keys and public keys that have the same security level with the
curve . Moreover, our experiments run on the laptop with a Intel core processor
and memory.
Vu Duy Hien, Luong The Dung, Ho Tu Bao and Nguyen Chung Tien
92
For the communication overhead comparison, we consider the number of communication
messages and these length (bits) in all phases of our solution and the protocol [2].
For the time complexity comparison, we measure the total executing time of each protocol
for different numbers of users, from to . This time consists of the time for each user
to perform the necessary computations and the time required for the miner. We assume that all
users perform their tasks at the same time, and the network latency is not included in the total
executing time.
* Communication overhead
Considering the protocol of Yang et al. [2], before this protocol starts, each user needs to
send two public keys to the miner. After the miner computes two public keys, he sends these keys
for all users. In the first phase of [2], each user also needs to send two values ; to the
miner. Because each public key is 3072 bits length, the protocol [2] exchanges 6n messages using
18432n bits where n is the number of users.
For our solution, before it starts, each user needs to send two public keys (i.e., two points) to
the miner. Next, in the first phase, the miner computes two public keys, after that he sends them to
all users. In the second phase, each user needs to only send a point to the miner. Because each
point of the curve consists of two elements in which each element is 256 bits length, so our
solution only exchanges 10n messages using 2560n bits in which n is the number of users.
Table 1 presents the communication overhead comparison between our solution and Yang et al.’s
protocol [2]. We can see that our solution exchanges more number of messages than the protocol
of Yang et al. However, the proposed solution transfers much lower number of bits than the
protocol [2].
Table 1. The communication overhead comparison between our solution
and Yang et al.’s protocol
Protocols The number of messages The number of bits
The protocol [2] 6n 18432n
Our solution 10n 2560n
* Time complexity of the protocol
As presented before, the new protocol is improved from the solution [2]. In particular, in
Yang et al.’s protocol, each user must compute two values and to send to the miner. Based
on the tuples of two values, the miner computes the multiplication of the values
. Hence, the
computational complexity of the miner is high.
Unlike the protocol [2], in our solution, each user only computes a unique point and the
miner only computes the sum of the points . However, this only makes each user’s
computational complexity increase negligibly. Furthermore, the computational complexity of the
miner reduces greatly. Thus, the total executing time of our protocol is much lower than that of
the original protocols of Yang et al. as shown in Figure 2.
Privacy-preserving frequency mining protocol based on elliptic curve ElGamal cryptosystem
93
Figure 2. The computing frequency value time in fully distributed setting comparisons
between our solution and Yang et al.’s protocol
According to the comparison results, we can state that our solution is more efficient than the
protocol of Yang et al. [2].
3. Conclusions
In this paper, we proposed the protocol based on elliptic curve analog of the ElGamal
encryption for privacy-preserving frequency mining in fully distributed scenario. Our solution has
the lower communication cost than the original protocol of Yang et al. We also did several
experiments to evaluate the new solution’ performance. The experimental results show that our
protocol is much more efficient than the original one. As well known, privacy-preserving
frequency mining is quite simple, but is very useful in data mining applications. So this work
helps the data miner to construct PPDM solutions that require the strong privacy and high
efficiency without loss of accuracy.
REFERENCES
[1] R. Mendes and J. P. Vilela, 2017. Privacy-preserving data mining: methods, metrics, and
applications. IEEE Access, Vol. 5, pp. 10562-10582.
[2] Z. Yang, S. Zhong, and R. N. Wright, 2005. Privacy-preserving classification of customer
data without loss of accuracy. Proceedings of the 2005 SIAM International Conference on
Data Mining. pp. 92-102.
[3] F. Hao, P. Y. Ryan, and P. Zieli´nski, 2010. Anonymous voting by two-round public
discussion. IET Information Security, Vol. 4, No. 2, pp. 62-67.
[4] F. Hao, M. N. Kreeger, B. Randell, D. Clarke, S. F Shahandashti, and P. H.-J. Lee, 2014.
Every vote counts: Ensuring integrity in large-scale DRE-based electronic voting. The
USENIX Journal of Election Technology and Systems, Vol. 2, No. 3, pp. 1-25.
[5] F. Hao, D. Clarke, B. Randell, and S. F. Shahandashti, 2018. Verifiable classroom voting in
practice. IEEE Security & Privacy, Vol. 16, No. 1, pp. 72-81.
0
500
1000
1500
2000
2500
3000
3500
4000
2000 4000 6000 8000 10000
Ti
m
e
(
m
ili
se
co
n
d
s)
The number of users
Yang et al.'s protocol Our solution
Vu Duy Hien, Luong The Dung, Ho Tu Bao and Nguyen Chung Tien
94
[6] F. Hao and P. Zieli´nski, 2006. A 2-round anonymous veto protocol. International Workshop
on Security Protocols, Springer, pp. 202-211.
[7] F. Wu, J. Liu, and S. Zhong, 2009. An efficient protocol for private and accurate mining of
support counts. Pattern Recognition Letters, Vol. 30, No. 1, pp. 80-86.
[8] E. Shi, H. Chan, E. Rieffel, R. Chow, and D. Song, 2011. Privacy-preserving aggregation of
time-series data. Annual Network and Distributed System Security Symposium, Internet
Society, pp. 1-17.
[9] Q. Li, G. Cao, and T. F. La Porta, 2014. Efficient and privacy-aware data aggregation in
mobile sensing. IEEE Transactions on Dependable and Secure Computing, Vol. 11, No. 2,
pp. 115-129.
[10] F. Benhamouda, M. Joye, and B. Libert, 2016. A new framework for privacy preserving
aggregation of time-series data. ACM Transactions on Information and System Security, Vol. 18,
No. 3, pp. 10:1-10:21.
[11] T. Jung, J. Han, and X.-Y. Li, 2016. PDA: Semantically secure time-series data analytics
with dynamic subgr