Privacy preserving multivariate classification based on tree-vector product protocol

1. Introduction Data mining has emerged as a significant technology for gaining knowledge from vast quantities of data [1]. Data mining technology allows us to analyze a personal data, or a organizational data. However, that creates threats to privacy, this reason might cause an obstruction to data mining collaboration projects. For example, two companies have a huge data set of records of their customers. These companies want to cooperatively conduct mining on their joint data set for their mutual benefit since this collaboration brings them results of mining which are more accurate than results of local data mining. However, each company does not want to, or does not have permission to disclose private information of other company’s customers. The challenge then is whether the two companies above obtain results of mining while still preserving their data secrecy. Recently, there has been growing focus on finding solutions to this problem [4, 5, 11]. In this paper, we firstly state the secure three-vector product computation problem and develop protocols called tree-vector products to solve this problem. Our protocols are developed based on some basic protocols [3, 4, 7, 8] and their security properties is validated base on the composition theorem for the semi-honest model [9]. We then use the proposed protocols to address a focus problem of privacy preserving multivariate classification.

pdf9 trang | Chia sẻ: thanhle95 | Lượt xem: 386 | Lượt tải: 0download
Bạn đang xem nội dung tài liệu Privacy preserving multivariate classification based on tree-vector product protocol, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
JOURNAL OF SCIENCE OF HNUE FIT., 2011, Vol. 56, pp. 72-80 PRIVACY PRESERVING MULTIVARIATE CLASSIFICATION BASED ON TREE-VECTOR PRODUCT PROTOCOL Luong The Dung(∗) VietNam Information Security Commission Tran Duc Su Academy of Cryptographic Technique (∗)E-mail: thedungluong@gmail.com 1. Introduction Data mining has emerged as a significant technology for gaining knowledge from vast quantities of data [1]. Data mining technology allows us to analyze a personal data, or a organizational data. However, that creates threats to privacy, this reason might cause an obstruction to data mining collaboration projects. For example, two companies have a huge data set of records of their customers. These companies want to cooperatively conduct mining on their joint data set for their mutual benefit since this collaboration brings them results of mining which are more accurate than results of local data mining. However, each company does not want to, or does not have permission to disclose private information of other company’s customers. The challenge then is whether the two companies above obtain results of mining while still preserving their data secrecy. Recently, there has been growing focus on finding solutions to this problem [4, 5, 11]. In this paper, we firstly state the secure three-vector product computation problem and develop protocols called tree-vector products to solve this problem. Our protocols are developed based on some basic protocols [3, 4, 7, 8] and their security properties is validated base on the composition theorem for the semi-honest model [9]. We then use the proposed protocols to address a focus problem of privacy preserving multivariate classification. 2. Content 2.1. Background and problem statement 2.1.1. Problem statement Problem. Consider a data set D that has m observations, n attributes X1, X2, . . . , Xn and a class attribute Y . Where each Xi, (i = 1, . . . , n) takes values as real numbers, Y takes values as category data. The data set D is vertically 72 Privacy preserving multivariate classification based on tree-vector product protocol partitioned over K parties where each party Pi (i = 1, 2, . . . , K) corresponds to a subset Di with some attributes of D. In other words, each Pi can only observe the values of some attributes of D for the same set of m observations, only one or the subset of the parties has attribute Y . We use the notation D = (D1 : D2 : . . . : DK : Y ) to present the union of these K subsets. The target of this work is to find solutions to build a Multivariate Classification model according to the above described circumstance, without disclosing the private data of any parties to the others. Previous works. Privacy preserving Multivariate Classification approach for the vertically partitioned model proposed in [2, 12]. This approach is based on secure matrix product computation techniques, it was showed efficient in cost communica- tion. However, this approach has a limitation that it requires all parties participating in the protocol knowing class attribute Y . We address the problem that only the subset of the parties knows the class attribute and the class attribute is sensitive and it needs be protected. 2.1.2. Cryptography tools Secure scalar product computation protocol. For example there are two parties Alice and Bob, Alice has a vector X = (x1, .., xn) and Bob has a vector Y = (y1, . . . , yn). The scalar product, or inner product, of two vectors Y and Y is defined as 〈X, Y 〉 = n∑ i=1 xiyi. The goal of secure scalar product protocol is to compute the scalar product of two vectors X and Y with no party disclose its private data (vector) to other parties. Secure scalar product protocols are an important primitive for privacy- preserving data mining. Several such protocols have been proposed (e.g. [3], [4], [5], etc.) with varying degrees of security and cost communication. A nice overview of the problem and the security properties of some solutions appears in [6]. Oblivious polynomial evaluation. The problem of “oblivious polynomial evalu- ation” was first considered in [7]. As with oblivious transfer, this problem involves a sender and a receiver. The sender’s input is a polynomial Q of degree k over some finite field F and the receiver’s input is an element z ∈ F (the degree k of Q is public). The protocol is such that the receiver obtains Q(z) without learning anything else about the polynomial Q, and the sender learns nothing. An efficient solution to this problem was presented in [7]. The overhead of that protocol is O(k) exponentiations (using the method suggested in [8] for doing a 1-out-of-N oblivious transfer with O(1) exponentiations). The semi-honest model. In the semi-honest model, each party participating in the protocol have to follow the rules using its correct input, and it can’t use what it sees during execution of the protocol to compromise security. This model is 73 Luong The Dung and Tran Duc Su reasonable for many real situations because parties who want to mine data for their mutual benefit will follow the protocol to get correct results. The definition of secure two-party computation in the semi-honest model is stated in [9]. Basically the definition states that a computation is secure if the view of each party during the execution of the protocol can be effectively simulated by the input and the output of the party. This is not quite the same as saying that private information is protected because it is possible to have privacy breaches by inferring from the final result and local information. In this paper, we also use the Composition Theorem for the semi-honest model. The detail of its discussion and the proof can be seen in [9]. Theorem 2.1. (The Composition Theorem) Suppose that g is privately reducible to f and that there exists a protocol for privately computing f . Then there exists a protocol for privately computing g. 2.2. Tree-vector Product For next works, in this section we state a secure three-vector product compu- tation problem and propose protocols to address this problem. Support that three parties Pa, Pb and Pc have three vectors X = (x1, . . . , xn), Y = (y1, . . . , yn) and Z = (z1, . . . , zn) respectively. We call the three-vectors scalar product of X, Y and Z as 〈X, Y, Z〉 = n∑ i=1 x − iyizi. Our goal is to design a protocol for secure three- vectors product computation with no party disclose its private data to other party, all parties are not able to learn anything other than the final result. 2.2.1. Protocol 1 To address the above problem, we built three-vector product protocols from the two basic protocols: secure scalar product and secure polynomial comes evalu- ation. In actual implementation, these basic protocols have proved be secure and efficient, hence some authors have used these protocols as an important prim- itive to design their protocols. Let R = (r1, . . . , rm) be a random vector. Let (Q1(z), Q2(z), . . . , Qm(z)) = (x1z + x2z + . . . + xmz + rm) be n degree 1 poly- nomials, we call it be a linear polynomial vector. we design the secure three-vector product protocol as below: Input. Three parties Pa, Pb, Pc have three vectors X, Y, Z respectively. Output. 〈X, Y, Z〉. 1. Pa generates a random vector and then defines a linear polynomial vector where be a linear polynomial. Each element of vector R follows a uniform distribution over a user-defined field. 2. Pa and Pb engage in a private evaluation of each Qi(yi), (i = 1, . . . , m)), in which Pb obtains Q = (Q1(y1), Q2(y2), , . . . , Qm(ym)) where Qi(yi) = xiyi + ri 74 Privacy preserving multivariate classification based on tree-vector product protocol 3. Pa and Pc execute the secure product protocol with Pa inputting R and in- putting Z, in which Pc obtains 4. Pb and Pc execute the secure product protocol with Pb inputting and Pc in- putting Z, in which Pb obtains 5. Pb computes 〈X, Y, Z〉 = 〈Q,Z〉 − 〈R,Z〉. 6. Pb sends to Pa and Pc Theorem 2.2. Protocol 1 privately computes the shares of the three-vectors Product value Proof. In this section, we show that Protocol 1 allows us to obtain the three-vectors Scalar Product values and privacy-preserving. We have Q = (Q1(y1), Q2(y2), , . . . , Qm(ym)) = (x1y1 + r1, . . . , xmym + rm), so 〈Q,Z〉 = m∑ i=1 (xiyizi + rizi) and 〈R,Z〉 = ∑ 6mi=1rizi, therefor, 〈X, Y, Z〉 = 〈Q,Z〉 − 〈R,Z〉 = m∑ i=1 xiyizi, this is the three-vectors product value. In order to show that a protocol is secure in the semi-honest model, we simply need to apply the composition theorem, so we need to show that each step in Protocol 1 is secure. We note that Protocol 1 requires the secure scalar product and secure poly- nomial evaluation operation, where there already exists many protocols that are correct and secure ([3,4,5]). During these execution of this protocols, the parties participating in the protocols are not able to learn anything other than the final result. In the protocol 1, the only communication taking place is at steps 2, 3 and 4. In Step 2, parties Pa and Pb engage in a private evaluation of each Qi(yi), (i = 1, . . . , m), neither party share their private, only Pb obtains the vector Q that each Qi(yi) = xiyi + ri is a random number from a uniform distribution over a defined field. In step 3, Pa and Pc execute the secure product protocol, each party only ob- tains 〈R,Z〉 without any other useful information. The results of the scalar product protocol are randomly shared, which can be simulated by both Pa and Pc as shown in (Lemma 4.1 [11]) In step 4, similarly, Pb and Pc only obtain the value 〈Q,Z〉 without knowing other useful information from the other party. Protocol 1 can also be proved by using a simulation method [10]. The basic idea is to show the view of each party during the execution of the protocol that can 75 Luong The Dung and Tran Duc Su be effectively simulated by giving the input and output of that party. Here, we show that both parties are not able to learn anything but the final result. Remark. We should note that using the same random vector R in many iter- ations of Protocol 1 will cause privacy breaches. For example, assume that Pa has a vector X, Pb has two vectors Y and Y ′, and Pc has a vector Z. The parties need to execute two iterations of Protocol 1 to compute 〈X, Y ′, Z〉 and 〈X, Y, Z〉. If we use the same vector R in two above iterations then each xi of Pa, Pb can establish an equation system including two equations: xiyi + ri = qi and xiyi′ + ri = qi′ (qi and qi′ were determined), so Pb can get xi. Thus, to prevent privacy breaches we need to generate a different vector R in each iteration of Protocol 1. Cost Communication. Let pi(m) be an expression for the communication cost of some secure scalar product protocols where is the number of elements of the vectors participating in the protocol. Let pi′(m) be an expression for the communication cost of the private polynomial evaluation protocol in a private evaluation of the linear polynomial vector. In Protocol 1, the communication primarily occurs at steps 2, 3 and 4, so the communication cost of Protocol 1 is 2pi(m) + pi′(m) 2.2.2. Protocol 2 The protocol 1 is secure and it also protects each party’s private information against collusion of other dishonest parties. However, in some determined conditions, we can support that there are no collusions among parties. Thus we propose a secure three- vectors scalar product protocol that can have lower communication cost other than that of Protocol 1. The idea is that, because Pa shares the vector R with Pc, we can ignore the step of computing 〈R,Z〉 to reduce communication costs. However, the drawback of this protocol is that it does not protect private data of Pa against collusion between Pb and Pc. Input. Three parties Pa, Pb and Pc have three vectors X, Y, Z respectively. Output. 〈X, Y, Z〉. 1. Pa and Pc jointly generate a random vector R = (r1, . . . , rm). Each element of vector R follows a uniform distribution over a user-defined field. 2. Pa generates a random vector a linear polynomial vector Q = (Q1(z1), Q2(z2), , . . . , Qm(zm)) where Qi(z) = xiz + ri be a linear polyno- mial. 3. Pa and Pb engage in a private evaluation of each Qi(z), in which Pb obtains Q = (Q1(y1), Q2(y2), , . . . , Qm(ym)) where Qi(z) = xiyi + ri. 4. Pb and Pc execute the secure scalar product protocol with Pb inputting Q and Pc inputting Z, in which they obtain 〈Q,Z〉. 5. Pc computes 〈R,Z〉 and then computes 〈X, Y, Z〉 = 〈Q,Z〉 − 〈R,Z〉. 76 Privacy preserving multivariate classification based on tree-vector product protocol 6. Pc sends 〈X, Y, Z〉 to Pa and Pb Theorem 2.3. Protocol 2 privately computes the shares of the three-vector product value assuming non-collusion. Proof. Similar to the proof of Protocol 1, we can also show that Protocol 2 is privacy-preserving if there is no collusion. Because in Protocol 2, the only com- munication taking place is at steps 3, 4. In step 3, the secure polynomial evaluation protocol executed that only Pb obtains a value random from the defined field. The communication occurs at step 4 with the invocation of the scalar product protocol. The results of the scalar product protocol are random shares, which can be simu- lated as shown in (Lemma 4.1 [11]). Thus, Protocol 1 can be simulated, with the composition theorem being applied to the scalar product protocol and the polyno- mial evaluation protocol. However, Protocol 2 can cause privacy breaches when Pb enters into collusion with Pc, then Pc can share the vector R with Pb, so Pb can get vector X of Pa. Communication cost. In Protocol 2, the communication primarily occurs at steps 3 and 4, so the communication cost of Protocol 1 is pi(m) + pi′(m). Thus, the communication cost of Protocol 2 is lower than that of Protocol 1. The reason for that is because of Pa sharing R with Pc, Pa and Pc do not have to use a scalar product protocol for computation 〈R,Z〉. 2.2.3. Protocol 3 In this section, we consider a case that three vectors X, Y, Z belong to only two parties. For example, X and Y belong to Pa, and Z belongs to Pb. To compute 〈X, Y, Z〉, Pa generates R = (x1y1, . . . , xmym), then Pa and Pb execute the secure product protocol with Pa inputting R and Pb inputting Z, in which they obtain 〈R,Z〉. It should be noted that 〈X, Y, Z〉 = 〈R,Z〉. Security properties of this protocol derive from the secure scalar product protocol. 2.3. Privacy Preserving Multivariate Classification 2.3.1. Multivariate Classification In this section, we introduce the Multivariate Classification method and ana- lyze this method to find the target formula that need be securely computed. Classification is one of the most important technologies of data mining that has wide applications in the various areas. The goal of classification is to build a model which can predict the value of one variable, based on the known values of the other variables. There are various classification methods, such as Statistical methods, Support vector machine, and Decision tree [1]. In this paper we consider a method based on a statistical analysis presented in [2]. An observation is said belonging a class k if the attributes vector of that observation is the closest to the centroid of class k. The Mahalanobis distance of an 77 Luong The Dung and Tran Duc Su observation in a class indicates how far the observation is from the centroid of all observations in that class. Let Sk represents the data set consisting of the vectors of all the observations in class k. Let Sk be the vector of means of observations in class k. Let Sk(j) represents the jth row of matrix Sk, and Ŝk be a matrix where Ŝk(j) = Sk(j) − Sk. The Mahalanobis distance dk between an observation V (its attributes vector is V ) and the centroid of class k can be computed as the following: d2k = (V − Sk)TC−1k (V − Sk) + ln |Ck| where Ck is the covariance matrix in class k: Ck = 1 m− 1〈Sˆk, Sˆk〉 Assume class attribute Y takes values in a discrete domain including *********** values. After computing d1, . . . , dL, we can conclude that the observation V belongs to class i if di = min{d1, d2, ..., dL}. Therefore, in order to determine a class for a new observation, we need to compute Ck and Sk. 2.3.2. Privacy Preserving Multivariate Classification Now we address the privacy preserving classification problem, assume the data set D is vertically partitioned onK parties. We need to design a protocol that allows each party to predict a class label for a new observation based on the joint data set of parties with no party disclosing its private data to other parties. To predict class for a new observation, the parties need to know d1, . . . , dL. So they need to securely compute the covariance matrix Ck and the mean vector Sk for each class k. In this section, we analyze steps to compute Ck and Sk. So finally, we propose a protocol through analysis steps. Xi = (x1, x2, ..., xm) be the column vector ith of the data set D (i = 1, . . . , n). We denote Yk = (y1, y2, ..., ym) be a vector that yi = 1 mk if the row ith of the data set belongs to the class k else yi = 0, where mk be the number of rows of the data set belong to class k. Let Xki be the mean of the vector Xi in the class k, then: X k i = ∑m j=1 xjyj = 〈Xi, Yk〉 It should be noted that Sk = (X k 1, X k 2, .., X k n). Therefor, to compute Sk, we have to address the computation of a scalar product of two vector Xi and Yk. if Xi and Yk belong to the same party, then this work is locally completed without disclosing private data of any parties. If Xi and Yk belong to two different parties, then we use the secure scalar product protocol. To continue, we have to address the computation of Ck. Let Xˆi = (xˆ1, xˆ2, ..., xˆm) be a vector that each element xˆi = xi − Xki . we denote Yk′ = (y1′ , y′2, ..., y′m) be a vector that each yi′ = 1 mk − 1 if the row i th of the data set belongs to class k else yi′ = 0. Each the element (I, j)th of Ck is determined by 78 Privacy preserving multivariate classification based on tree-vector product protocol Ck(i, j) = m∑ t=1 xˆ (i) t xˆ (j) t y ′ k = 〈Xˆi, Xˆj , Yk′〉 We assume without loss of generality that only one party Pc (c ∈ {1, 2, . . . , K}) has the class attribute Y . We can divide this into three possibilities: - Three vectors Xi, Xj and Yk belong to the same party, then C(I, j) is locally computed without disclosing any private information. - Three vectors Xi, Xj and Yk belong to two parties. We can use Protocol 3 to address this case. - Xi, Xj and Yk belong to three different party. We address this problem by Protocol 1 or Protocol 2. The above analysis has showed that we can securely compute the mean vector and the covariance matrix by proposed protocols in section 3. Based on the compo- sition theorem and derivation of proposed protocols, our computations can privately make the shares of the mean vector and the covariance matrix for each class k. 3. Conclusion The major contributions of this paper are a privacy preserving multivariate classification for vertically partitioned data, in which only subsets of parties know class attribute. To address the above problem, three- vector product protocol was proposed in this paper. There is another direction for our future works that devel- oping multiple-vectors products protocol is a non-trivial extension, especially if we consider collusion between parties as well. This goal is to develop other efficient pri- vacy preserving data mining methods such as Association Rules, Network Bayesian, etc. REFERENCES [1] J. Han and M. Kamber. Data Mining Concepts and Techniques. Morgan Kauf- mann Publishers, 2001. [2] W.L. Du, Yunghsiang S. Han and S. Chen. Privacy-preserving multivariate sta-