Protein structure prediction is one of the most significant problems in computational molecular biology. Its aim is to determine the three-dimensional structure of a protein from its amino acid sequence; in more formal terms, to predict tertiary structure from primary structure. Because known protein structures are valuable in tasks such as rational drug design, this is a highly active field of research.

Table of contents
1 Overview
2 De novo protein modelling
3 Comparative protein modelling


Overview

Protein structure prediction is now more important in practice than ever. Modern large-scale DNA sequencing efforts, such as the Human Genome Project, yield massive amounts of protein sequence data. Experimental structure determination, typically by time-consuming and relatively expensive X-ray crystallography or NMR spectroscopy, lags far behind.

Several factors make protein structure prediction a very difficult task, including:

  • The number of possible structures that proteins may possess is extremely large, as highlighted by the Levinthal paradox.
  • The physical basis of protein structural stability is not fully understood.
  • The primary sequence may not fully specify the tertiary structure. For example, proteins known as chaperonins assist other proteins in folding correctly.
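The scale of the search space behind the Levinthal paradox can be illustrated with a rough calculation. The sketch below is only a back-of-the-envelope estimate: the figures of 3 conformations per residue and 10^13 conformations sampled per second are illustrative assumptions, not measured values, and the function name is hypothetical.

```python
import math


def levinthal_estimate(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Return (number of conformations, years needed to enumerate them all).

    states_per_residue and samples_per_second are assumed, illustrative values.
    """
    conformations = states_per_residue ** n_residues  # exact integer
    seconds = conformations / samples_per_second
    years = seconds / (365.25 * 24 * 3600)
    return conformations, years


if __name__ == "__main__":
    count, years = levinthal_estimate(100)
    print(f"conformations: roughly 10^{math.log10(count):.1f}")
    print(f"exhaustive search: roughly 10^{math.log10(years):.1f} years")
```

Under these assumptions a 100-residue chain has about 5 × 10^47 conformations, and enumerating them would take on the order of 10^27 years. Real proteins fold far faster than that, which is why exhaustive enumeration cannot be how folding proceeds, nor a practical prediction strategy.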

Despite these hindrances, the many research groups working on the problem are making considerable progress. Predicting the structures of small proteins is now a realistic goal, and a wide range of approaches is routinely applied. These approaches fall into two broad classes: de novo modelling and comparative modelling.

De novo protein modelling

De novo protein modelling methods seek to build three-dimensional protein models "from scratch", without starting from a previously solved structure. Many procedures exist; they either attempt to mimic protein folding or apply a stochastic method to search the space of possible conformations. These procedures tend to require vast computational resources, and protein folding was a stated target application of IBM's Blue Gene supercomputer project.
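As an illustration of the stochastic search idea, the following minimal sketch runs a Metropolis Monte Carlo search on the two-dimensional HP lattice model, a deliberately simplified toy in which each residue is either hydrophobic (H) or polar (P) and the energy is the negative count of non-bonded H-H contacts. It is not the method of any particular research group; the move set, temperature, step count, and all function names are assumptions chosen for brevity.

```python
import math
import random

# Unit moves on a 2D square lattice.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def trace(moves):
    """Turn a sequence of moves into lattice coordinates, or None if the chain overlaps itself."""
    x, y = 0, 0
    coords = [(x, y)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords if len(set(coords)) == len(coords) else None


def energy(sequence, coords):
    """Score -1 for each pair of H residues that are lattice neighbours but not chain neighbours."""
    position = {c: i for i, c in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):  # look right and up only, so each pair is counted once
            j = position.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                total -= 1
    return total


def fold(sequence, steps=20000, temperature=1.0, seed=0):
    """Search for a low-energy conformation; sequence must have at least 2 residues."""
    rng = random.Random(seed)
    moves = ["R"] * (len(sequence) - 1)  # start from a fully extended chain
    current = energy(sequence, trace(moves))
    best_moves, best = list(moves), current
    for _ in range(steps):
        trial = list(moves)
        trial[rng.randrange(len(trial))] = rng.choice("UDLR")  # change one bond direction
        trial_coords = trace(trial)
        if trial_coords is None:
            continue  # reject self-overlapping chains
        trial_energy = energy(sequence, trial_coords)
        delta = trial_energy - current
        # Metropolis criterion: always accept improvements, sometimes accept worse states.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            moves, current = trial, trial_energy
            if current < best:
                best_moves, best = list(moves), current
    return best, "".join(best_moves)


if __name__ == "__main__":
    best_energy, best_path = fold("HPHPPHHPHPPHPHHPPHPH")
    print(best_energy, best_path)
```

Even for this 20-residue toy chain the search is not guaranteed to find the lowest-energy conformation, which hints at why realistic all-atom versions of the problem demand so much computing power.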

Comparative protein modelling

Comparative protein modelling uses previously solved structures as starting points, or templates. These methods fall into two groups:

  • Homology modelling is based on the reasonable assumption that two homologous proteins share very similar structures. Given the amino acid sequence of a protein of unknown structure and the solved structure of a homologous protein, each amino acid in the solved structure is mutated, computationally, into the corresponding amino acid of the unknown structure, as determined by an alignment of the two sequences.
  • Protein threading scans the amino acid sequence of a protein of unknown structure against a database of solved structures. For each candidate structure, a scoring function assesses the compatibility of the sequence with the structure, and the best-scoring structures yield possible three-dimensional models.
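The two comparative approaches can be sketched in a few lines each. The code below is a toy illustration under strong assumptions, not a description of any real modelling package: the homology step simply copies one coordinate per template residue onto the aligned target residue, and the threading score only rewards hydrophobic residues in buried positions and polar residues in exposed ones. The hydrophobic residue set, the "B"/"E" burial strings, and all function and template names are hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed, simplified hydrophobic class


def mutate_template(target_aln, template_aln, template_coords):
    """Homology step: give each target residue the position of its aligned template residue.

    target_aln and template_aln are equal-length aligned strings with '-' for gaps;
    template_coords holds one (x, y, z) tuple per template residue.
    Target residues aligned to a gap get None and would need separate loop modelling.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    model = []
    t = 0  # index of the next template residue
    for target_res, template_res in zip(target_aln, template_aln):
        coord = None
        if template_res != "-":
            coord = template_coords[t]
            t += 1
        if target_res != "-":
            model.append((target_res, coord))
    return model


def thread_score(sequence, burial):
    """Threading score: +1 where residue type suits the environment, -1 where it does not.

    burial is a string of 'B' (buried) and 'E' (exposed), one character per position.
    """
    return sum(
        1 if (res in HYDROPHOBIC) == (env == "B") else -1
        for res, env in zip(sequence, burial)
    )


def thread(sequence, library):
    """Rank equal-length templates by compatibility with the sequence (gapless)."""
    scores = {
        name: thread_score(sequence, burial)
        for name, burial in library.items()
        if len(burial) == len(sequence)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    print(mutate_template("MKAL", "MR-L", coords))
    print(thread("LKDV", {"fold_a": "BEEB", "fold_b": "EBBE"}))
```

Practical methods differ in essentially every detail: homology modelling must rebuild side chains and model insertions, and threading scoring functions allow gaps and use much richer descriptions of each structural environment. The sketch is meant only to show where the alignment and the scoring function enter.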