1. Introduction
Data mining has emerged as a significant technology for gaining knowledge
from vast quantities of data [1]. Data mining technology allows us to analyze a
personal data, or a organizational data. However, that creates threats to privacy,
this reason might cause an obstruction to data mining collaboration projects. For
example, two companies have a huge data set of records of their customers. These
companies want to cooperatively conduct mining on their joint data set for their
mutual benefit since this collaboration brings them results of mining which are more
accurate than results of local data mining. However, each company does not want
to, or does not have permission to disclose private information of other company’s
customers. The challenge then is whether the two companies above obtain results
of mining while still preserving their data secrecy. Recently, there has been growing
focus on finding solutions to this problem [4, 5, 11].
In this paper, we firstly state the secure three-vector product computation
problem and develop protocols called tree-vector products to solve this problem.
Our protocols are developed based on some basic protocols [3, 4, 7, 8] and their
security properties is validated base on the composition theorem for the semi-honest
model [9]. We then use the proposed protocols to address a focus problem of privacy
preserving multivariate classification.
9 trang |
Chia sẻ: thanhle95 | Lượt xem: 425 | Lượt tải: 0
Bạn đang xem nội dung tài liệu Privacy preserving multivariate classification based on tree-vector product protocol, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
JOURNAL OF SCIENCE OF HNUE
FIT., 2011, Vol. 56, pp. 72-80
PRIVACY PRESERVING MULTIVARIATE CLASSIFICATION
BASED ON TREE-VECTOR PRODUCT PROTOCOL
Luong The Dung(∗)
VietNam Information Security Commission
Tran Duc Su
Academy of Cryptographic Technique
(∗)E-mail: thedungluong@gmail.com
1. Introduction
Data mining has emerged as a significant technology for gaining knowledge
from vast quantities of data [1]. Data mining technology allows us to analyze a
personal data, or a organizational data. However, that creates threats to privacy,
this reason might cause an obstruction to data mining collaboration projects. For
example, two companies have a huge data set of records of their customers. These
companies want to cooperatively conduct mining on their joint data set for their
mutual benefit since this collaboration brings them results of mining which are more
accurate than results of local data mining. However, each company does not want
to, or does not have permission to disclose private information of other company’s
customers. The challenge then is whether the two companies above obtain results
of mining while still preserving their data secrecy. Recently, there has been growing
focus on finding solutions to this problem [4, 5, 11].
In this paper, we firstly state the secure three-vector product computation
problem and develop protocols called tree-vector products to solve this problem.
Our protocols are developed based on some basic protocols [3, 4, 7, 8] and their
security properties is validated base on the composition theorem for the semi-honest
model [9]. We then use the proposed protocols to address a focus problem of privacy
preserving multivariate classification.
2. Content
2.1. Background and problem statement
2.1.1. Problem statement
Problem. Consider a data set D that has m observations, n attributes
X1, X2, . . . , Xn and a class attribute Y . Where each Xi, (i = 1, . . . , n) takes values
as real numbers, Y takes values as category data. The data set D is vertically
72
Privacy preserving multivariate classification based on tree-vector product protocol
partitioned over K parties where each party Pi (i = 1, 2, . . . , K) corresponds to a
subset Di with some attributes of D. In other words, each Pi can only observe the
values of some attributes of D for the same set of m observations, only one or the
subset of the parties has attribute Y . We use the notation D = (D1 : D2 : . . . : DK :
Y ) to present the union of these K subsets.
The target of this work is to find solutions to build a Multivariate Classification
model according to the above described circumstance, without disclosing the private
data of any parties to the others.
Previous works. Privacy preserving Multivariate Classification approach for
the vertically partitioned model proposed in [2, 12]. This approach is based on secure
matrix product computation techniques, it was showed efficient in cost communica-
tion. However, this approach has a limitation that it requires all parties participating
in the protocol knowing class attribute Y . We address the problem that only the
subset of the parties knows the class attribute and the class attribute is sensitive
and it needs be protected.
2.1.2. Cryptography tools
Secure scalar product computation protocol. For example there are two parties
Alice and Bob, Alice has a vector X = (x1, .., xn) and Bob has a vector Y =
(y1, . . . , yn). The scalar product, or inner product, of two vectors Y and Y is defined
as
〈X, Y 〉 =
n∑
i=1
xiyi.
The goal of secure scalar product protocol is to compute the scalar product of
two vectors X and Y with no party disclose its private data (vector) to other parties.
Secure scalar product protocols are an important primitive for privacy- preserving
data mining. Several such protocols have been proposed (e.g. [3], [4], [5], etc.) with
varying degrees of security and cost communication. A nice overview of the problem
and the security properties of some solutions appears in [6].
Oblivious polynomial evaluation. The problem of “oblivious polynomial evalu-
ation” was first considered in [7]. As with oblivious transfer, this problem involves
a sender and a receiver. The sender’s input is a polynomial Q of degree k over
some finite field F and the receiver’s input is an element z ∈ F (the degree k of
Q is public). The protocol is such that the receiver obtains Q(z) without learning
anything else about the polynomial Q, and the sender learns nothing. An efficient
solution to this problem was presented in [7]. The overhead of that protocol is O(k)
exponentiations (using the method suggested in [8] for doing a 1-out-of-N oblivious
transfer with O(1) exponentiations).
The semi-honest model. In the semi-honest model, each party participating
in the protocol have to follow the rules using its correct input, and it can’t use
what it sees during execution of the protocol to compromise security. This model is
73
Luong The Dung and Tran Duc Su
reasonable for many real situations because parties who want to mine data for their
mutual benefit will follow the protocol to get correct results.
The definition of secure two-party computation in the semi-honest model is
stated in [9]. Basically the definition states that a computation is secure if the view
of each party during the execution of the protocol can be effectively simulated by
the input and the output of the party. This is not quite the same as saying that
private information is protected because it is possible to have privacy breaches by
inferring from the final result and local information.
In this paper, we also use the Composition Theorem for the semi-honest model.
The detail of its discussion and the proof can be seen in [9].
Theorem 2.1. (The Composition Theorem) Suppose that g is privately reducible
to f and that there exists a protocol for privately computing f . Then there exists a
protocol for privately computing g.
2.2. Tree-vector Product
For next works, in this section we state a secure three-vector product compu-
tation problem and propose protocols to address this problem. Support that three
parties Pa, Pb and Pc have three vectors X = (x1, . . . , xn), Y = (y1, . . . , yn) and
Z = (z1, . . . , zn) respectively. We call the three-vectors scalar product of X, Y and
Z as 〈X, Y, Z〉 =
n∑
i=1
x − iyizi. Our goal is to design a protocol for secure three-
vectors product computation with no party disclose its private data to other party,
all parties are not able to learn anything other than the final result.
2.2.1. Protocol 1
To address the above problem, we built three-vector product protocols from
the two basic protocols: secure scalar product and secure polynomial comes evalu-
ation. In actual implementation, these basic protocols have proved be secure and
efficient, hence some authors have used these protocols as an important prim-
itive to design their protocols. Let R = (r1, . . . , rm) be a random vector. Let
(Q1(z), Q2(z), . . . , Qm(z)) = (x1z + x2z + . . . + xmz + rm) be n degree 1 poly-
nomials, we call it be a linear polynomial vector. we design the secure three-vector
product protocol as below:
Input. Three parties Pa, Pb, Pc have three vectors X, Y, Z respectively.
Output. 〈X, Y, Z〉.
1. Pa generates a random vector and then defines a linear polynomial vector
where be a linear polynomial. Each element of vector R follows a uniform
distribution over a user-defined field.
2. Pa and Pb engage in a private evaluation of each Qi(yi), (i = 1, . . . , m)), in
which Pb obtains Q = (Q1(y1), Q2(y2), , . . . , Qm(ym)) where Qi(yi) = xiyi + ri
74
Privacy preserving multivariate classification based on tree-vector product protocol
3. Pa and Pc execute the secure product protocol with Pa inputting R and in-
putting Z, in which Pc obtains
4. Pb and Pc execute the secure product protocol with Pb inputting and Pc in-
putting Z, in which Pb obtains
5. Pb computes 〈X, Y, Z〉 = 〈Q,Z〉 − 〈R,Z〉.
6. Pb sends to Pa and Pc
Theorem 2.2. Protocol 1 privately computes the shares of the three-vectors Product
value
Proof. In this section, we show that Protocol 1 allows us to obtain the three-vectors
Scalar Product values and privacy-preserving.
We have Q = (Q1(y1), Q2(y2), , . . . , Qm(ym)) = (x1y1 + r1, . . . , xmym + rm),
so 〈Q,Z〉 =
m∑
i=1
(xiyizi + rizi) and 〈R,Z〉 =
∑
6mi=1rizi,
therefor, 〈X, Y, Z〉 = 〈Q,Z〉 − 〈R,Z〉 =
m∑
i=1
xiyizi, this is the three-vectors
product value.
In order to show that a protocol is secure in the semi-honest model, we simply
need to apply the composition theorem, so we need to show that each step in Protocol
1 is secure.
We note that Protocol 1 requires the secure scalar product and secure poly-
nomial evaluation operation, where there already exists many protocols that are
correct and secure ([3,4,5]). During these execution of this protocols, the parties
participating in the protocols are not able to learn anything other than the final
result. In the protocol 1, the only communication taking place is at steps 2, 3 and
4.
In Step 2, parties Pa and Pb engage in a private evaluation of each Qi(yi), (i =
1, . . . , m), neither party share their private, only Pb obtains the vector Q that each
Qi(yi) = xiyi + ri is a random number from a uniform distribution over a defined
field.
In step 3, Pa and Pc execute the secure product protocol, each party only ob-
tains 〈R,Z〉 without any other useful information. The results of the scalar product
protocol are randomly shared, which can be simulated by both Pa and Pc as shown
in (Lemma 4.1 [11])
In step 4, similarly, Pb and Pc only obtain the value 〈Q,Z〉 without knowing
other useful information from the other party.
Protocol 1 can also be proved by using a simulation method [10]. The basic
idea is to show the view of each party during the execution of the protocol that can
75
Luong The Dung and Tran Duc Su
be effectively simulated by giving the input and output of that party. Here, we show
that both parties are not able to learn anything but the final result.
Remark. We should note that using the same random vector R in many iter-
ations of Protocol 1 will cause privacy breaches. For example, assume that Pa has a
vector X, Pb has two vectors Y and Y ′, and Pc has a vector Z. The parties need to
execute two iterations of Protocol 1 to compute 〈X, Y ′, Z〉 and 〈X, Y, Z〉. If we use
the same vector R in two above iterations then each xi of Pa, Pb can establish an
equation system including two equations: xiyi + ri = qi and xiyi′ + ri = qi′ (qi and
qi′ were determined), so Pb can get xi. Thus, to prevent privacy breaches we need to
generate a different vector R in each iteration of Protocol 1.
Cost Communication. Let pi(m) be an expression for the communication cost of
some secure scalar product protocols where is the number of elements of the vectors
participating in the protocol. Let pi′(m) be an expression for the communication cost
of the private polynomial evaluation protocol in a private evaluation of the linear
polynomial vector. In Protocol 1, the communication primarily occurs at steps 2, 3
and 4, so the communication cost of Protocol 1 is 2pi(m) + pi′(m)
2.2.2. Protocol 2
The protocol 1 is secure and it also protects each party’s private information
against collusion of other dishonest parties. However, in some determined conditions,
we can support that there are no collusions among parties. Thus we propose a secure
three- vectors scalar product protocol that can have lower communication cost other
than that of Protocol 1. The idea is that, because Pa shares the vector R with Pc, we
can ignore the step of computing 〈R,Z〉 to reduce communication costs. However,
the drawback of this protocol is that it does not protect private data of Pa against
collusion between Pb and Pc.
Input. Three parties Pa, Pb and Pc have three vectors X, Y, Z respectively.
Output. 〈X, Y, Z〉.
1. Pa and Pc jointly generate a random vector R = (r1, . . . , rm). Each element of
vector R follows a uniform distribution over a user-defined field.
2. Pa generates a random vector a linear polynomial vector
Q = (Q1(z1), Q2(z2), , . . . , Qm(zm)) where Qi(z) = xiz + ri be a linear polyno-
mial.
3. Pa and Pb engage in a private evaluation of each Qi(z), in which Pb obtains
Q = (Q1(y1), Q2(y2), , . . . , Qm(ym)) where Qi(z) = xiyi + ri.
4. Pb and Pc execute the secure scalar product protocol with Pb inputting Q and
Pc inputting Z, in which they obtain 〈Q,Z〉.
5. Pc computes 〈R,Z〉 and then computes 〈X, Y, Z〉 = 〈Q,Z〉 − 〈R,Z〉.
76
Privacy preserving multivariate classification based on tree-vector product protocol
6. Pc sends 〈X, Y, Z〉 to Pa and Pb
Theorem 2.3. Protocol 2 privately computes the shares of the three-vector product
value assuming non-collusion.
Proof. Similar to the proof of Protocol 1, we can also show that Protocol 2
is privacy-preserving if there is no collusion. Because in Protocol 2, the only com-
munication taking place is at steps 3, 4. In step 3, the secure polynomial evaluation
protocol executed that only Pb obtains a value random from the defined field. The
communication occurs at step 4 with the invocation of the scalar product protocol.
The results of the scalar product protocol are random shares, which can be simu-
lated as shown in (Lemma 4.1 [11]). Thus, Protocol 1 can be simulated, with the
composition theorem being applied to the scalar product protocol and the polyno-
mial evaluation protocol. However, Protocol 2 can cause privacy breaches when Pb
enters into collusion with Pc, then Pc can share the vector R with Pb, so Pb can get
vector X of Pa.
Communication cost. In Protocol 2, the communication primarily occurs at
steps 3 and 4, so the communication cost of Protocol 1 is pi(m) + pi′(m). Thus, the
communication cost of Protocol 2 is lower than that of Protocol 1. The reason for
that is because of Pa sharing R with Pc, Pa and Pc do not have to use a scalar
product protocol for computation 〈R,Z〉.
2.2.3. Protocol 3
In this section, we consider a case that three vectors X, Y, Z belong to only
two parties. For example, X and Y belong to Pa, and Z belongs to Pb.
To compute 〈X, Y, Z〉, Pa generates R = (x1y1, . . . , xmym), then Pa and Pb
execute the secure product protocol with Pa inputting R and Pb inputting Z, in
which they obtain 〈R,Z〉. It should be noted that 〈X, Y, Z〉 = 〈R,Z〉. Security
properties of this protocol derive from the secure scalar product protocol.
2.3. Privacy Preserving Multivariate Classification
2.3.1. Multivariate Classification
In this section, we introduce the Multivariate Classification method and ana-
lyze this method to find the target formula that need be securely computed.
Classification is one of the most important technologies of data mining that
has wide applications in the various areas. The goal of classification is to build
a model which can predict the value of one variable, based on the known values
of the other variables. There are various classification methods, such as Statistical
methods, Support vector machine, and Decision tree [1]. In this paper we consider
a method based on a statistical analysis presented in [2].
An observation is said belonging a class k if the attributes vector of that
observation is the closest to the centroid of class k. The Mahalanobis distance of an
77
Luong The Dung and Tran Duc Su
observation in a class indicates how far the observation is from the centroid of all
observations in that class. Let Sk represents the data set consisting of the vectors of
all the observations in class k.
Let Sk be the vector of means of observations in class k. Let Sk(j) represents
the jth row of matrix Sk, and Ŝk be a matrix where Ŝk(j) = Sk(j) − Sk. The
Mahalanobis distance dk between an observation V (its attributes vector is V ) and
the centroid of class k can be computed as the following:
d2k = (V − Sk)TC−1k (V − Sk) + ln |Ck|
where Ck is the covariance matrix in class k: Ck =
1
m− 1〈Sˆk, Sˆk〉
Assume class attribute Y takes values in a discrete domain including ***********
values. After computing d1, . . . , dL, we can conclude that the observation V belongs
to class i if di = min{d1, d2, ..., dL}.
Therefore, in order to determine a class for a new observation, we need to
compute Ck and Sk.
2.3.2. Privacy Preserving Multivariate Classification
Now we address the privacy preserving classification problem, assume the data
set D is vertically partitioned onK parties. We need to design a protocol that allows
each party to predict a class label for a new observation based on the joint data set
of parties with no party disclosing its private data to other parties. To predict class
for a new observation, the parties need to know d1, . . . , dL. So they need to securely
compute the covariance matrix Ck and the mean vector Sk for each class k. In this
section, we analyze steps to compute Ck and Sk. So finally, we propose a protocol
through analysis steps.
Xi = (x1, x2, ..., xm) be the column vector ith of the data set D (i = 1, . . . , n).
We denote Yk = (y1, y2, ..., ym) be a vector that yi =
1
mk
if the row ith of the data
set belongs to the class k else yi = 0, where mk be the number of rows of the data
set belong to class k. Let Xki be the mean of the vector Xi in the class k, then:
X
k
i =
∑m
j=1 xjyj = 〈Xi, Yk〉
It should be noted that Sk = (X
k
1, X
k
2, .., X
k
n). Therefor, to compute Sk, we
have to address the computation of a scalar product of two vector Xi and Yk. if
Xi and Yk belong to the same party, then this work is locally completed without
disclosing private data of any parties. If Xi and Yk belong to two different parties,
then we use the secure scalar product protocol.
To continue, we have to address the computation of Ck. Let Xˆi = (xˆ1, xˆ2, ..., xˆm)
be a vector that each element xˆi = xi − Xki . we denote Yk′ = (y1′ , y′2, ..., y′m) be a
vector that each yi′ =
1
mk − 1 if the row i
th of the data set belongs to class k else
yi′ = 0. Each the element (I, j)th of Ck is determined by
78
Privacy preserving multivariate classification based on tree-vector product protocol
Ck(i, j) =
m∑
t=1
xˆ
(i)
t xˆ
(j)
t y
′
k = 〈Xˆi, Xˆj , Yk′〉
We assume without loss of generality that only one party Pc (c ∈ {1, 2, . . . , K})
has the class attribute Y . We can divide this into three possibilities:
- Three vectors Xi, Xj and Yk belong to the same party, then C(I, j) is locally
computed without disclosing any private information.
- Three vectors Xi, Xj and Yk belong to two parties. We can use Protocol 3 to
address this case.
- Xi, Xj and Yk belong to three different party. We address this problem by
Protocol 1 or Protocol 2.
The above analysis has showed that we can securely compute the mean vector
and the covariance matrix by proposed protocols in section 3. Based on the compo-
sition theorem and derivation of proposed protocols, our computations can privately
make the shares of the mean vector and the covariance matrix for each class k.
3. Conclusion
The major contributions of this paper are a privacy preserving multivariate
classification for vertically partitioned data, in which only subsets of parties know
class attribute. To address the above problem, three- vector product protocol was
proposed in this paper. There is another direction for our future works that devel-
oping multiple-vectors products protocol is a non-trivial extension, especially if we
consider collusion between parties as well. This goal is to develop other efficient pri-
vacy preserving data mining methods such as Association Rules, Network Bayesian,
etc.
REFERENCES
[1] J. Han and M. Kamber. Data Mining Concepts and Techniques. Morgan Kauf-
mann Publishers, 2001.
[2] W.L. Du, Yunghsiang S. Han and S. Chen. Privacy-preserving multivariate sta-