Abstract. Protein is composed of amino acids which are, in turn, made up
of mostly carbon, hydrogen, oxygen, nitrogen. A protein structure consists
of thousands of coordinates of its atoms. Building structure index tables
(often organized by suffix trees or arrays) of proteins is an important phase
for quickly searching or classifying protein structures. Most previous studies
use only structural features to build the index tables, therefore searching and
classifying performances based on these index tables are not good enough.
In this paper, we propose two methods of feature selection to create index
tables that contain not only structural features but also sequantial features.
Experiments on a protein classification dataset (called SCOP) showed that
our proposed feature selection methods considerably improve the searching
and classifying performances when compared with previous feature selection
methods.
12 trang |
Chia sẻ: thanhle95 | Lượt xem: 236 | Lượt tải: 0
Bạn đang xem nội dung tài liệu Feature selection for indexing protein structures, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
JOURNAL OF SCIENCE OF HNUE
Natural Sci., 2011, Vol. 56, No. 7, pp. 32-43
FEATURE SELECTION FOR INDEXING PROTEIN STRUCTURES
Luong Van Hieu
Hanoi Vocational College for Electro-Mechanics
Pham Tho Hoan(∗)
Hanoi National University of Education
(∗)E-mail: hoanpt@hnue.edu.vn
Abstract. Protein is composed of amino acids which are, in turn, made up
of mostly carbon, hydrogen, oxygen, nitrogen. A protein structure consists
of thousands of coordinates of its atoms. Building structure index tables
(often organized by suffix trees or arrays) of proteins is an important phase
for quickly searching or classifying protein structures. Most previous studies
use only structural features to build the index tables, therefore searching and
classifying performances based on these index tables are not good enough.
In this paper, we propose two methods of feature selection to create index
tables that contain not only structural features but also sequantial features.
Experiments on a protein classification dataset (called SCOP) showed that
our proposed feature selection methods considerably improve the searching
and classifying performances when compared with previous feature selection
methods.
Keywords: Protein structure, indexing, feature selection.
1. Introduction
The goal of Life Sciences is to understand the function of biological molecules
such as the protein, DNA, RNA. While biology technologies can now easily deter-
mine the sequence or 3D structure of biological molecules, it is difficult to discover
the functions of the biological molecules. However, the structures of the sequence
proteins, especially 3D structures, may be important information to predict their
functions based on the previously known functions of other proteins with similar
structures.
In general, the problem of structural search and comparison can be solved
through two main phases: firstly, extracting informative feature vectors of 3D
structures; secondly, representing and organizing these feature vectors by some ap-
propriate data structures (example: suffix tree or suffix array) for quickly search-
ing/classifying them [9, 10]. Of the two phases, only the first one (i.e. extracting
informative structural features) influences the accuracy of searching/classifying re-
sults.
32
Feature selection for indexing protein structures
In recent years many different approaches to extract and index feature vec-
tors for searching/classifying protein structures have been developed. For example,
ProGreSS [1] extracts feature vectors from both structure and sequence constituents
of the proteins. These feature vectors are then combined and indexed using a multi-
dimensional indexing method. A better algorithm is PSIST (standing for Protein
Structure Indexing using Suffix Trees). It converts the 3D structure to a sequence
(string) through some small steps: extracting the feature vector, including the dis-
tance between each two residues and the angle between their planes, each component
of vector is then converted to a unique symbol. These structural features sequence
(string) can be input in a suffix tree to speed up the searching process. Another
algorithm, PSISA [10], uses the same approach of extracting feature vectors. How-
ever, instead of using the suffix tree, the PSISA use the suffix array to store the
feature sequences. The experimental results in PSISA showed that indexing using
suffix array improves memory utilization with a factor of more than 35% over the
suffix tree as in PSIST [9].
In this paper, we propose to add other kinds of features to improve the struc-
ture searching/classifying performance. We propose two methods of extracting fea-
ture vectors. The first (called SFAS - Structure Feature Attached Side-Chain) fo-
cuses on protein 3D structures. The second (called SSF - Structure and Sequence
Feature) is a combination of tertiary structure and primary structure. Our initial
experimental results on SCOP classification showed that the proposed feature types
improve considerably the searching/classifying performance.
2. Content
2.1. Methods
2.1.1. The approach of protein structure indexing
Protein is a macro-molecule that consists of many monomers (called amino
acids) linked together in a sequence. Amino acids are a structure of a central carbon
(Cα) linked with three constituents: an amin-group (−NH2), cacboxyl group (-
COOH) and a tail R that characterize properties of each amino acids. There are
only 20 different kinds of amino acids found in the proteins of living organisms [5]
and they are named with 20 alphabet symbols.
If we only consider protein as a sequence of amino acids, we have a primary
structure. Properties (functions) of protein is often determined from its 3D struc-
ture, which means the relative position of atoms (C, H, O, N) constitutes the amino
acid sequence in a any reference system. The 3D structure of protein is usually
called the tertiary structure. In a 3D structure of protein, there are usually three
sub-structure types: Alpha-Helices (H), Beta-Strains (B), and Loops (C). If we view
the structure of protein as a sequence of three these sub-structure types (H, E, C),
we have the secondary structure of protein. For studying protein, using the 3D
33
Luong Van Hieu and Pham Tho Hoan
structure (tertiary) to analyse is always the best. However, data of 3D structure is
usually very large.
Figure 1. Structure of amino acid
Currently there are about 73,800 proteins were determined with 3D struc-
ture which can be downloaded from the internet network that their 3-D structure
have been determined. They have been published in Protein Database Bank -
(The number of protein sequences that have been identified
is much larger since the sequence of a protein is more easily determined). There-
fore, PDB is a very valuable structure library in identifying new protein functions.
Whenever we discover a new protein and its structure. We can predict functions
of a new protein if we know its 3-D structure by searching proteins in PDB with
similar structures.
Searching or classifying homologous protein structures can be performed by
pairwise structure alignment [6, 8]. However, this method is not feasible. They
are not suitable for searching similar structures due to their time complexity [7].
Thus, current structure matching approaches tend to be heuristic-based. Database
indexing and scalable searching approaches satisfy this requirement [1, 4]. That’s
why we first sought to represent the structure by feature vectors and later using the
suffix tree or suffix array structure to represent and search similar structures.
The problem of finding/classifying protein structures is especially challenging
due to a number of reasons:
Feature representation : Each atom is a point in the 3D space, and its
coordinates depend on the origin and axis. The relationship between atoms in 3D
space includes both the translational and rotational degree. Thus, we need a precise
feature representation.
Similarity measurement : The similarity of proteins is usually defined by
the score accumulated from their residues. For a given query, one of the most com-
mon similarity score schemes is the number of votes accumulated from the matching
residues [1, 4]. Another scoring scheme is the p-value of the protein based on the
number of votes, where smaller p-values imply better similarity [1, 4]. However, cur-
rent methods assume that the individual feature vectors are independent from each
other. Thus, these scoring schemes are point-based and do not take into account
34
Feature selection for indexing protein structures
the local segment similarity.
Complexity : Currently, indexing methods such as R∗-trees and hash index-
ing can only retrieve the match points/features for a single feature vector of the
query. Thus, the matching results might include a lot of random hits which do
not contribute to the alignment between the query and template proteins from the
database. So we need to retrieve only the significant matching feature vectors to
reduce the search complexity.
2.1.2. Feature selection
* Comparing and extracting feature vector methods
The protein structure can often be represented as a set of points (atoms) in
3D space. For example, PDB (Protein Data Bank) uses this method by arranging a
protein on an imaginary Cartesian coordinate frame and assign (x, y, z) coordinates
to each atom. This representation serves as a basis of different simplified high level
representations.
The DALI system [8] uses distance matrix method to compare structural re-
lationships between the proteins. It is a 2D representation of the 3D structure, as
it contains all pairwise distances between atoms - in this case Cα atoms.
The SSAP system [3] uses a distance plot-based method to compare internal
geometry between proteins via the Needleman - Wunsch dynamic programming
algorithm. The structure of a protein is represented by describing a structural
environment for each amino acid as a set of vectors from its Cβ atom to Cβ atoms
of all the other amino acids in the protein. If the structural environments in two
protein structures are similar, the structures are supposed to be similar.
The PSIST algorithm [9] extracts local structural feature vectors using a slid-
ing window along the backbone of the protein. For each pair of residues, the Al-
gorithm calculates the distance between their Cα atoms and the angle between the
planes formed by the Cα, N and C atoms of each residue. The protein structure is
converted to a sequence (called the structure-feature sequence or SF-sequence) of
symbols.
All the above mentioned methods are based on measuring internal Cα-Cα
(residue-residue) or Cβ-Cβ (sidechain-sidechain) distances and the angle between
two norms of two planes formed by the Cα, N and C. This paper will adopt two
different ways to represent a protein’s structure. The method 1 uses the distance d
as vectors of Cα − Cβ atoms and the angle (θ) between two normal vectors formed
by the two planes of two residues. The θ angle is used to give more information
for representing torsion angles of protein structure. Since positions of Cα atoms
are variance from time to time, the sidechain’s Cβ atoms are introduced to give
more information for representing a protein’s structure. Because of description of
the relationship between residue and sidechain, this method is called SFAS method
(Structure Feature Attached Sidechain). The second method is feature combination
35
Luong Van Hieu and Pham Tho Hoan
between structure and chemical characteristics of protein. In the structure part, we
extract the features that are distances between the Cα atoms using a sliding window
along the backbone. There are 20 different amino acids that divided into three
types according to chemical characteristics of them (except Gly). In each window
used to extract feature structure, we put features about chemical properties of the
amino acids of the window by assigning for them a score. This method is called SSF
(Structure and Sequence Feature), because it has the combination of structure and
sequence.
* The proposed methods
• The SFAS method (standing for Structure Feature Attached Sidechain): For
a given query, we need to retrieve all the proteins that have a similar structure. In
this method, we will extract the local features from the protein tertiary structure.
A protein is composed of an ordered sequence of residues linked by peptide
bonds. Each residue has Cα, N and C, which constitutes the backbone of the protein.
Although the backbone is topologically linear, it has a complex geometrically. Bond
lengths, bond angles and torsion angles completely define the conformation and
geometry of the protein (see Figure 2).
Figure 2. The torsion angles φ, ψ, ω, χ
and the bond lengths of any amino acid in a peptid
We define distance d between two residues as the Euclidean distance between
Cα atom of first residue and Cβ atom of second one. The angle θ between the two
planes defined by the N-Cα-C triangles, is calculated between their norms having
Cα as the origin (see Figure 3).
Figure 3. The distance and the angle between two residues
36
Feature selection for indexing protein structures
The Euclidean distance between Cα atom and Cβ atom is calculated by their
3D coordinates directly. Equation (2.1) is used to calculate the distance between
two residues.
d(x, y) = ‖x− y‖ =
√√√√ n∑
i=1
(xi − yi)2 (2.1)
The angle between any two residues is calculated by finding the normal vec-
tor of each residue’s plane. Firstly the plane is defined by three points which are
presented by residue’s three atoms N, Cα and C, equation (2.2) is used to calculate
the normal vector of the residue.
−→n =
−−→
NCα ∗ −−→CαC
‖−−→NCα ∗ −−→CαC‖
(2.2)
Next we calculate the normal vector for each plane. Finally, the angle be-
tween the two normal vectors which present the two residues is considered the angle
between the two residues. Equation (2.3) is used to calculate angle between two
residues.
cosθ =
‖−→n1‖2 + ‖−→n2‖2 − ‖−→n2−−→n1‖2
2 ∗ ‖−→n1‖ ∗ ‖−→n2‖
(2.3)
Each of the windows is combined with a feature vector. To describe the local
features between a set of residues, we slide a window of length w along the backbone
of protein. The distances and the angles between the first residue i and all the other
residues j (with j ∈ [i + 1, i + w - 1]) within the window are computed and added
to a feature vector.
Using this method the protein is described as a vector of sub vectors:
P = {pi|1 ≤ i ≤ n}
where pi is the ith-residue along the backbone. The feature vector of protein is
defined as P v = {pvi |1 ≤ i ≤ n−w + 1}, where w is the sliding window size, and pvi
is a feature vector which include several constituents.
{cosθ(pi, pi+1), d(pi, pi+1), ..., cosθ(pi, pi+w−1), d(pi, pi+w−1)}
where d(pi, pj) is the distance between the residues pi and pj , and θ(pi, pj) gives the
angle between the residues pi and pj . Since the constituents of the extracted feature
vector have a great degree of freedom, we apply normalization as used in PSIST for
both angle and distance to reduce the vector space.
37
Luong Van Hieu and Pham Tho Hoan
• The SSF method (Standing for Structure and Sequence Feature): This
method is proposed to extract feature vector including both structure and features
of a protein. The protein is represented as a list of sequence and structure con-
stituents, where each amino acid has a sequence and a structure component. The
sequence component of amino acid at a residue is its amino acid name indicated by
one or three letter codes in the letter alphabet. The structure component consists
of Secondary Structure Element (SSE) types of that residue (α - helix, β - sheet, or
turn), and a 3D vector which shows the position of its carbon-alpha (Cα) atom.
We extract a number of feature vectors on sequence and structure constituents
of each protein in the database by sliding a window. These vectors normalized by
converting to a sequence of symbols.
Structure feature constituents : For a given protein, we slide a window of a
pre-specified size, w, along the proteins. For each window, the distance between Cα
atom of the first residue and the Cα atoms of the other residues in the window is
calculated. It is Euclidean distance which is calculated as described in equation 2.1
in the SFAS method.
Sequence feature constituents : In each sliding window used to extract the
structure feature, we extract additional sequence features by using a chemical clas-
sification property of amino acid [5] based on their name. Thus, each window will
contain w amino acid. These amino acids are assigned a value according to their
chemical property: 0 with hydrophobic amino acid, 1 with polar property, 2 with
charged one. The chemical property of amino acid in protein is showed in Table 1
[5].
Table 1. Classifying according to chemical property of amino acid
Property Amino acid
Hydrophobic Ala, Val, Phe, Ile, Leu, Pro, Met
Polar Ser, Thr, Cys, Tyr, Asn, Gln, His, Trp
Charged Asp, Glu, Lys, Arg
Using this method the protein is described as a vector of vectors. Each sub
vector Pi contains 2w - 1 constituents , where w refers
to the window size and Pi refers to the following feature vector
{tci, d(pi, pi+1), tci+1, ..., d(pi, pi+w−1, tci+w−1)}.
where d(pi, pj) is distance between pi and pj residues, and tci is scored with classifi-
cation property of ith amino acid. Protein is described as a set of vectors Pi (from
P1 to Pn). Finally, We have a set of vectors Pj , one Pj for each protein in dataset.
Each Pj contains n vector Pi where 1 ≤ j ≤ m and m is the number of proteins in
dataset.
Since the constituents d of feature vector have a great degree of freedom, we
apply the normalization step as used in PSIST to reduce the vector space.
38
Feature selection for indexing protein structures
Figure 4. Feature vectors for structure and sequence of protein
2.1.3. Results and discussion
* The Evaluation method
To evaluate the representative powerfulness of features of the protein struc-
tures, we have used the classification database of protein structure: SCOP [2]. SCOP
classifies proteins with four levels, namely: family, super-family, fold and class.
For our tests, the target database D, has proteins from all four classes of
SCOP: all α, all β, α + β and α/β. Our dataset D also includes a total of 1810
proteins taken from 181 super-families which have at least 10 proteins, but only 10
proteins are chosen from each super-family. We then selected 181 query structures
that each is randomly chosen from each of 181 above superfamilies. So the size
of the query set Dq is also 181. This is the same dataset used in several previous
indexing studies [1, 9, 10].
Figure 5. SCOP simple hierarchy
We perform two different tests. The retrieval tests is to find the number of
matching structures from the same super-family as the query among the top − k
scoring proteins, and the classification tests try to classify the query at the super-
family and class levels. Our proposed methods are implemented in Java, and all
experiments reported below were done on a PC with core i3 2.4GHz processor and
2GB RAM on windows 7 OS.
39
Luong Van Hieu and Pham Tho Hoan
* Retrieval test
We compare the performance of the proposed features in this paper with the
best previous extracting features and indexing studies in PSISA [10], in finding the
181 structures Dq from set of 1810 structures D. In this experiment, we used the
method of representation and search of features using suffix array [10], using the
code of indexing based suffix array provided by Ahmed Salah [10]. We compare the
results of our SFAS method with PSISA for two different parameter values (b = 2
and b = 4), and the other parameters are fixed (w = 3, ε = 0, l = 15). We ran
experiments with both methods of extracting feature vectors to obtain the number
of proteins found from the same super-family for each of the 181 queries. Since each
super-family has 10 proteins, including the query, there can be at most 10 correct
matching proteins from the same super-family.
There are five parameters used in our approach. W is the size of the window
used to extract the feature vectors, b is the range used to normalize the feature
vector, ε is the distance threshold based on the normalize feature vector. l is the
minimum length of the maximal matches, and k is the number of top scoring proteins
reported. We first show how the two approaches perform for different values of w,
ε, b, l and k.
Figure 6. The ability to accurately match the protein
for varying parameters
Retrieval results are determined by counting the same super-family proteins
with the query from the top k of the proteins found. We in