Feature selection for indexing protein structures

Abstract. Protein is composed of amino acids which are, in turn, made up of mostly carbon, hydrogen, oxygen, nitrogen. A protein structure consists of thousands of coordinates of its atoms. Building structure index tables (often organized by suffix trees or arrays) of proteins is an important phase for quickly searching or classifying protein structures. Most previous studies use only structural features to build the index tables, therefore searching and classifying performances based on these index tables are not good enough. In this paper, we propose two methods of feature selection to create index tables that contain not only structural features but also sequantial features. Experiments on a protein classification dataset (called SCOP) showed that our proposed feature selection methods considerably improve the searching and classifying performances when compared with previous feature selection methods.

pdf12 trang | Chia sẻ: thanhle95 | Lượt xem: 221 | Lượt tải: 0download
Bạn đang xem nội dung tài liệu Feature selection for indexing protein structures, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
JOURNAL OF SCIENCE OF HNUE Natural Sci., 2011, Vol. 56, No. 7, pp. 32-43 FEATURE SELECTION FOR INDEXING PROTEIN STRUCTURES Luong Van Hieu Hanoi Vocational College for Electro-Mechanics Pham Tho Hoan(∗) Hanoi National University of Education (∗)E-mail: hoanpt@hnue.edu.vn Abstract. Protein is composed of amino acids which are, in turn, made up of mostly carbon, hydrogen, oxygen, nitrogen. A protein structure consists of thousands of coordinates of its atoms. Building structure index tables (often organized by suffix trees or arrays) of proteins is an important phase for quickly searching or classifying protein structures. Most previous studies use only structural features to build the index tables, therefore searching and classifying performances based on these index tables are not good enough. In this paper, we propose two methods of feature selection to create index tables that contain not only structural features but also sequantial features. Experiments on a protein classification dataset (called SCOP) showed that our proposed feature selection methods considerably improve the searching and classifying performances when compared with previous feature selection methods. Keywords: Protein structure, indexing, feature selection. 1. Introduction The goal of Life Sciences is to understand the function of biological molecules such as the protein, DNA, RNA. While biology technologies can now easily deter- mine the sequence or 3D structure of biological molecules, it is difficult to discover the functions of the biological molecules. However, the structures of the sequence proteins, especially 3D structures, may be important information to predict their functions based on the previously known functions of other proteins with similar structures. In general, the problem of structural search and comparison can be solved through two main phases: firstly, extracting informative feature vectors of 3D structures; secondly, representing and organizing these feature vectors by some ap- propriate data structures (example: suffix tree or suffix array) for quickly search- ing/classifying them [9, 10]. Of the two phases, only the first one (i.e. extracting informative structural features) influences the accuracy of searching/classifying re- sults. 32 Feature selection for indexing protein structures In recent years many different approaches to extract and index feature vec- tors for searching/classifying protein structures have been developed. For example, ProGreSS [1] extracts feature vectors from both structure and sequence constituents of the proteins. These feature vectors are then combined and indexed using a multi- dimensional indexing method. A better algorithm is PSIST (standing for Protein Structure Indexing using Suffix Trees). It converts the 3D structure to a sequence (string) through some small steps: extracting the feature vector, including the dis- tance between each two residues and the angle between their planes, each component of vector is then converted to a unique symbol. These structural features sequence (string) can be input in a suffix tree to speed up the searching process. Another algorithm, PSISA [10], uses the same approach of extracting feature vectors. How- ever, instead of using the suffix tree, the PSISA use the suffix array to store the feature sequences. The experimental results in PSISA showed that indexing using suffix array improves memory utilization with a factor of more than 35% over the suffix tree as in PSIST [9]. In this paper, we propose to add other kinds of features to improve the struc- ture searching/classifying performance. We propose two methods of extracting fea- ture vectors. The first (called SFAS - Structure Feature Attached Side-Chain) fo- cuses on protein 3D structures. The second (called SSF - Structure and Sequence Feature) is a combination of tertiary structure and primary structure. Our initial experimental results on SCOP classification showed that the proposed feature types improve considerably the searching/classifying performance. 2. Content 2.1. Methods 2.1.1. The approach of protein structure indexing Protein is a macro-molecule that consists of many monomers (called amino acids) linked together in a sequence. Amino acids are a structure of a central carbon (Cα) linked with three constituents: an amin-group (−NH2), cacboxyl group (- COOH) and a tail R that characterize properties of each amino acids. There are only 20 different kinds of amino acids found in the proteins of living organisms [5] and they are named with 20 alphabet symbols. If we only consider protein as a sequence of amino acids, we have a primary structure. Properties (functions) of protein is often determined from its 3D struc- ture, which means the relative position of atoms (C, H, O, N) constitutes the amino acid sequence in a any reference system. The 3D structure of protein is usually called the tertiary structure. In a 3D structure of protein, there are usually three sub-structure types: Alpha-Helices (H), Beta-Strains (B), and Loops (C). If we view the structure of protein as a sequence of three these sub-structure types (H, E, C), we have the secondary structure of protein. For studying protein, using the 3D 33 Luong Van Hieu and Pham Tho Hoan structure (tertiary) to analyse is always the best. However, data of 3D structure is usually very large. Figure 1. Structure of amino acid Currently there are about 73,800 proteins were determined with 3D struc- ture which can be downloaded from the internet network that their 3-D structure have been determined. They have been published in Protein Database Bank - (The number of protein sequences that have been identified is much larger since the sequence of a protein is more easily determined). There- fore, PDB is a very valuable structure library in identifying new protein functions. Whenever we discover a new protein and its structure. We can predict functions of a new protein if we know its 3-D structure by searching proteins in PDB with similar structures. Searching or classifying homologous protein structures can be performed by pairwise structure alignment [6, 8]. However, this method is not feasible. They are not suitable for searching similar structures due to their time complexity [7]. Thus, current structure matching approaches tend to be heuristic-based. Database indexing and scalable searching approaches satisfy this requirement [1, 4]. That’s why we first sought to represent the structure by feature vectors and later using the suffix tree or suffix array structure to represent and search similar structures. The problem of finding/classifying protein structures is especially challenging due to a number of reasons: Feature representation : Each atom is a point in the 3D space, and its coordinates depend on the origin and axis. The relationship between atoms in 3D space includes both the translational and rotational degree. Thus, we need a precise feature representation. Similarity measurement : The similarity of proteins is usually defined by the score accumulated from their residues. For a given query, one of the most com- mon similarity score schemes is the number of votes accumulated from the matching residues [1, 4]. Another scoring scheme is the p-value of the protein based on the number of votes, where smaller p-values imply better similarity [1, 4]. However, cur- rent methods assume that the individual feature vectors are independent from each other. Thus, these scoring schemes are point-based and do not take into account 34 Feature selection for indexing protein structures the local segment similarity. Complexity : Currently, indexing methods such as R∗-trees and hash index- ing can only retrieve the match points/features for a single feature vector of the query. Thus, the matching results might include a lot of random hits which do not contribute to the alignment between the query and template proteins from the database. So we need to retrieve only the significant matching feature vectors to reduce the search complexity. 2.1.2. Feature selection * Comparing and extracting feature vector methods The protein structure can often be represented as a set of points (atoms) in 3D space. For example, PDB (Protein Data Bank) uses this method by arranging a protein on an imaginary Cartesian coordinate frame and assign (x, y, z) coordinates to each atom. This representation serves as a basis of different simplified high level representations. The DALI system [8] uses distance matrix method to compare structural re- lationships between the proteins. It is a 2D representation of the 3D structure, as it contains all pairwise distances between atoms - in this case Cα atoms. The SSAP system [3] uses a distance plot-based method to compare internal geometry between proteins via the Needleman - Wunsch dynamic programming algorithm. The structure of a protein is represented by describing a structural environment for each amino acid as a set of vectors from its Cβ atom to Cβ atoms of all the other amino acids in the protein. If the structural environments in two protein structures are similar, the structures are supposed to be similar. The PSIST algorithm [9] extracts local structural feature vectors using a slid- ing window along the backbone of the protein. For each pair of residues, the Al- gorithm calculates the distance between their Cα atoms and the angle between the planes formed by the Cα, N and C atoms of each residue. The protein structure is converted to a sequence (called the structure-feature sequence or SF-sequence) of symbols. All the above mentioned methods are based on measuring internal Cα-Cα (residue-residue) or Cβ-Cβ (sidechain-sidechain) distances and the angle between two norms of two planes formed by the Cα, N and C. This paper will adopt two different ways to represent a protein’s structure. The method 1 uses the distance d as vectors of Cα − Cβ atoms and the angle (θ) between two normal vectors formed by the two planes of two residues. The θ angle is used to give more information for representing torsion angles of protein structure. Since positions of Cα atoms are variance from time to time, the sidechain’s Cβ atoms are introduced to give more information for representing a protein’s structure. Because of description of the relationship between residue and sidechain, this method is called SFAS method (Structure Feature Attached Sidechain). The second method is feature combination 35 Luong Van Hieu and Pham Tho Hoan between structure and chemical characteristics of protein. In the structure part, we extract the features that are distances between the Cα atoms using a sliding window along the backbone. There are 20 different amino acids that divided into three types according to chemical characteristics of them (except Gly). In each window used to extract feature structure, we put features about chemical properties of the amino acids of the window by assigning for them a score. This method is called SSF (Structure and Sequence Feature), because it has the combination of structure and sequence. * The proposed methods • The SFAS method (standing for Structure Feature Attached Sidechain): For a given query, we need to retrieve all the proteins that have a similar structure. In this method, we will extract the local features from the protein tertiary structure. A protein is composed of an ordered sequence of residues linked by peptide bonds. Each residue has Cα, N and C, which constitutes the backbone of the protein. Although the backbone is topologically linear, it has a complex geometrically. Bond lengths, bond angles and torsion angles completely define the conformation and geometry of the protein (see Figure 2). Figure 2. The torsion angles φ, ψ, ω, χ and the bond lengths of any amino acid in a peptid We define distance d between two residues as the Euclidean distance between Cα atom of first residue and Cβ atom of second one. The angle θ between the two planes defined by the N-Cα-C triangles, is calculated between their norms having Cα as the origin (see Figure 3). Figure 3. The distance and the angle between two residues 36 Feature selection for indexing protein structures The Euclidean distance between Cα atom and Cβ atom is calculated by their 3D coordinates directly. Equation (2.1) is used to calculate the distance between two residues. d(x, y) = ‖x− y‖ = √√√√ n∑ i=1 (xi − yi)2 (2.1) The angle between any two residues is calculated by finding the normal vec- tor of each residue’s plane. Firstly the plane is defined by three points which are presented by residue’s three atoms N, Cα and C, equation (2.2) is used to calculate the normal vector of the residue. −→n = −−→ NCα ∗ −−→CαC ‖−−→NCα ∗ −−→CαC‖ (2.2) Next we calculate the normal vector for each plane. Finally, the angle be- tween the two normal vectors which present the two residues is considered the angle between the two residues. Equation (2.3) is used to calculate angle between two residues. cosθ = ‖−→n1‖2 + ‖−→n2‖2 − ‖−→n2−−→n1‖2 2 ∗ ‖−→n1‖ ∗ ‖−→n2‖ (2.3) Each of the windows is combined with a feature vector. To describe the local features between a set of residues, we slide a window of length w along the backbone of protein. The distances and the angles between the first residue i and all the other residues j (with j ∈ [i + 1, i + w - 1]) within the window are computed and added to a feature vector. Using this method the protein is described as a vector of sub vectors: P = {pi|1 ≤ i ≤ n} where pi is the ith-residue along the backbone. The feature vector of protein is defined as P v = {pvi |1 ≤ i ≤ n−w + 1}, where w is the sliding window size, and pvi is a feature vector which include several constituents. {cosθ(pi, pi+1), d(pi, pi+1), ..., cosθ(pi, pi+w−1), d(pi, pi+w−1)} where d(pi, pj) is the distance between the residues pi and pj , and θ(pi, pj) gives the angle between the residues pi and pj . Since the constituents of the extracted feature vector have a great degree of freedom, we apply normalization as used in PSIST for both angle and distance to reduce the vector space. 37 Luong Van Hieu and Pham Tho Hoan • The SSF method (Standing for Structure and Sequence Feature): This method is proposed to extract feature vector including both structure and features of a protein. The protein is represented as a list of sequence and structure con- stituents, where each amino acid has a sequence and a structure component. The sequence component of amino acid at a residue is its amino acid name indicated by one or three letter codes in the letter alphabet. The structure component consists of Secondary Structure Element (SSE) types of that residue (α - helix, β - sheet, or turn), and a 3D vector which shows the position of its carbon-alpha (Cα) atom. We extract a number of feature vectors on sequence and structure constituents of each protein in the database by sliding a window. These vectors normalized by converting to a sequence of symbols. Structure feature constituents : For a given protein, we slide a window of a pre-specified size, w, along the proteins. For each window, the distance between Cα atom of the first residue and the Cα atoms of the other residues in the window is calculated. It is Euclidean distance which is calculated as described in equation 2.1 in the SFAS method. Sequence feature constituents : In each sliding window used to extract the structure feature, we extract additional sequence features by using a chemical clas- sification property of amino acid [5] based on their name. Thus, each window will contain w amino acid. These amino acids are assigned a value according to their chemical property: 0 with hydrophobic amino acid, 1 with polar property, 2 with charged one. The chemical property of amino acid in protein is showed in Table 1 [5]. Table 1. Classifying according to chemical property of amino acid Property Amino acid Hydrophobic Ala, Val, Phe, Ile, Leu, Pro, Met Polar Ser, Thr, Cys, Tyr, Asn, Gln, His, Trp Charged Asp, Glu, Lys, Arg Using this method the protein is described as a vector of vectors. Each sub vector Pi contains 2w - 1 constituents , where w refers to the window size and Pi refers to the following feature vector {tci, d(pi, pi+1), tci+1, ..., d(pi, pi+w−1, tci+w−1)}. where d(pi, pj) is distance between pi and pj residues, and tci is scored with classifi- cation property of ith amino acid. Protein is described as a set of vectors Pi (from P1 to Pn). Finally, We have a set of vectors Pj , one Pj for each protein in dataset. Each Pj contains n vector Pi where 1 ≤ j ≤ m and m is the number of proteins in dataset. Since the constituents d of feature vector have a great degree of freedom, we apply the normalization step as used in PSIST to reduce the vector space. 38 Feature selection for indexing protein structures Figure 4. Feature vectors for structure and sequence of protein 2.1.3. Results and discussion * The Evaluation method To evaluate the representative powerfulness of features of the protein struc- tures, we have used the classification database of protein structure: SCOP [2]. SCOP classifies proteins with four levels, namely: family, super-family, fold and class. For our tests, the target database D, has proteins from all four classes of SCOP: all α, all β, α + β and α/β. Our dataset D also includes a total of 1810 proteins taken from 181 super-families which have at least 10 proteins, but only 10 proteins are chosen from each super-family. We then selected 181 query structures that each is randomly chosen from each of 181 above superfamilies. So the size of the query set Dq is also 181. This is the same dataset used in several previous indexing studies [1, 9, 10]. Figure 5. SCOP simple hierarchy We perform two different tests. The retrieval tests is to find the number of matching structures from the same super-family as the query among the top − k scoring proteins, and the classification tests try to classify the query at the super- family and class levels. Our proposed methods are implemented in Java, and all experiments reported below were done on a PC with core i3 2.4GHz processor and 2GB RAM on windows 7 OS. 39 Luong Van Hieu and Pham Tho Hoan * Retrieval test We compare the performance of the proposed features in this paper with the best previous extracting features and indexing studies in PSISA [10], in finding the 181 structures Dq from set of 1810 structures D. In this experiment, we used the method of representation and search of features using suffix array [10], using the code of indexing based suffix array provided by Ahmed Salah [10]. We compare the results of our SFAS method with PSISA for two different parameter values (b = 2 and b = 4), and the other parameters are fixed (w = 3, ε = 0, l = 15). We ran experiments with both methods of extracting feature vectors to obtain the number of proteins found from the same super-family for each of the 181 queries. Since each super-family has 10 proteins, including the query, there can be at most 10 correct matching proteins from the same super-family. There are five parameters used in our approach. W is the size of the window used to extract the feature vectors, b is the range used to normalize the feature vector, ε is the distance threshold based on the normalize feature vector. l is the minimum length of the maximal matches, and k is the number of top scoring proteins reported. We first show how the two approaches perform for different values of w, ε, b, l and k. Figure 6. The ability to accurately match the protein for varying parameters Retrieval results are determined by counting the same super-family proteins with the query from the top k of the proteins found. We in