Mining Patterns from Protein Structures
Wei Wang, University of North Carolina at Chapel Hill, USA
Abstract
One of the next great frontiers in molecular biology is to understand
and predict protein function. Proteins are simple linear chains of
polymerized amino acids (residues) whose biological functions are
determined by the three-dimensional shapes that they fold into. Hence,
understanding proteins requires a unique combination of chemical and
geometric analysis. A popular approach to understanding proteins is to
break them down into structural sub-components called motifs. Motifs
are recurring structural and spatial units that are frequently
correlated with specific protein functions. Traditionally, the discovery
of motifs has been a laborious task of scientific exploration.
In this talk, I will discuss recent data-mining algorithms for
automatically identifying potential spatial motifs. These methods
find frequently occurring substructures within graph-based
representations of proteins. We represent each protein's structure as a
graph, where vertices correspond to residues. Two types of edges connect
residues: sequence edges connect pairs of adjacent residues in the
primary sequence, and proximity edges connect pairs of residues that are
physically close in the folded structure, which is indicative of
intramolecular interactions. Such interactions
are believed to be key indicators of the protein's function.
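As an illustration of the representation described above, the following is a minimal sketch rather than the construction used in the talk. It assumes each residue is reduced to a single representative 3D coordinate (for example, its alpha carbon) and that two residues are joined by a proximity edge when they lie within a distance cutoff. The function name build_protein_graph, the input layout, and the 8.0 angstrom default are hypothetical choices made for this example; the abstract does not state a threshold.

```python
from itertools import combinations
from math import dist


def build_protein_graph(residues, cutoff=8.0):
    """Build a labeled graph from residues listed in sequence order.

    residues: list of (amino_acid_label, (x, y, z)) tuples.
    cutoff:   assumed distance threshold for a proximity edge (hypothetical).
    Returns (labels, edges), where edges maps (i, j) with i < j to
    either "sequence" or "proximity".
    """
    labels = [name for name, _ in residues]
    edges = {}

    # Sequence edges: residues adjacent in the primary sequence.
    for i in range(len(residues) - 1):
        edges[(i, i + 1)] = "sequence"

    # Proximity edges: non-adjacent residues that are close in space.
    for i, j in combinations(range(len(residues)), 2):
        if (i, j) in edges:
            continue  # already connected by a sequence edge
        if dist(residues[i][1], residues[j][1]) <= cutoff:
            edges[(i, j)] = "proximity"

    return labels, edges
```

The all-pairs distance check is quadratic in the number of residues, which is acceptable for a sketch; a spatial index would be the usual choice for large structures.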
This representation allows us to apply innovative graph mining
techniques to explore protein databases and associated protein families.
The complexity of protein structures and corresponding graphs poses
significant computational challenges. The kernel of this approach is an
efficient subgraph-mining algorithm that detects all (maximal) frequent
subgraphs in a graph database that meet a user-specified minimum frequency.
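To make the notion of a frequency threshold concrete, the sketch below counts support for the smallest possible patterns only: single labeled edges. It is not the subgraph-mining algorithm discussed in the talk. A full miner would grow such patterns into larger subgraphs and would need subgraph isomorphism tests. The function name frequent_edge_patterns and the (labels, edges) graph layout are assumptions made for this example. Support is taken here to mean the number of graphs in the database that contain the pattern at least once.

```python
from collections import Counter


def frequent_edge_patterns(graph_db, min_support):
    """Return single-edge patterns occurring in at least min_support graphs.

    graph_db: list of (labels, edges) pairs, where labels is a list of
              vertex labels and edges maps (i, j) to an edge type.
    A pattern is (label_a, label_b, edge_type) with the two labels sorted,
    so the same edge is recognized regardless of vertex numbering.
    """
    support = Counter()
    for labels, edges in graph_db:
        # Use a set so each graph contributes at most once per pattern.
        patterns = set()
        for (i, j), kind in edges.items():
            a, b = sorted((labels[i], labels[j]))
            patterns.add((a, b, kind))
        support.update(patterns)

    return {p: n for p, n in support.items() if n >= min_support}
```

Counting each pattern once per graph, rather than once per occurrence, keeps a single large protein from dominating the result.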