DNA motif

A DNA motif is a nucleic acid or amino acid sequence pattern that has, or is conjectured to have, some biological significance. Normally, the pattern is fairly short and is known to recur in different genes or several times within a gene. DNA motifs are often associated with structural motifs found in proteins.

An example is the N-glycosylation site motif:

Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but Pro

where the three-letter abbreviations are the conventional designations for amino acids (see Genetic code).

This pattern may be written as N{P}[ST]{P}

where N = Asn, P = Pro, S = Ser and T = Thr; {X} means any amino acid except X; and [XY] means either X or Y.

The notation [XY] does not give any indication of the probability of X or Y occurring in the pattern. Sometimes patterns are defined in terms of a probabilistic model such as a hidden Markov model.
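The N{P}[ST]{P} pattern above maps directly onto regular-expression syntax, which gives a quick way to scan a protein sequence for candidate sites. The following is a minimal sketch; the function name find_sites and the example sequence are invented for illustration, and the character class [^P] accepts any character other than P, not only valid amino acid codes.

```python
import re

# N{P}[ST]{P} written as a regular expression:
# {P} (anything but Pro) becomes [^P]; [ST] is unchanged.
# The lookahead (?=...) lets overlapping sites be reported.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def find_sites(protein):
    """Return (0-based start, matched text) for each candidate site."""
    return [(m.start(), m.group(1)) for m in N_GLYC.finditer(protein)]

# Hypothetical sequence, made up for illustration.
print(find_sites("MKNGSANPTQNNTSA"))
# [(2, 'NGSA'), (10, 'NNTS'), (11, 'NTSA')]
```

The N at position 6 is skipped because it is followed by Pro, while the sites at positions 10 and 11 overlap and are both reported.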

Table of contents

1 Motifs and consensus sequences
2 Discovery of DNA motifs

2.1 Software
2.2 Discovery through evolutionary conservation

3 Pattern Description Notations

3.1 PROSITE Pattern Notation

4 See also
5 References

Motifs and consensus sequences

The notation [XYZ] means X or Y or Z, but does not indicate the likelihood of any particular match. For this reason, two or more patterns are often associated with a single DNA motif: the defining pattern and various typical patterns.

For example, the defining sequence for the IQ motif may be taken to be:

[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]

where x signifies any amino acid, and the square brackets indicate an alternative (see below for further details about notation).

Usually, however, the first letter is I, and both [RK] choices resolve to R. Because the last choice is so wide, the pattern IQxxxRGxxxR is sometimes equated with the IQ motif itself, but it is more accurately described as a consensus sequence for the IQ motif.
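The difference between the defining pattern and the consensus sequence can be checked mechanically. The sketch below writes both IQ patterns as regular expressions (x becomes ".") and tests two segments against them; the function name classify and both segments are hypothetical, invented only to illustrate the point.

```python
import re

# x (any amino acid) becomes "." in regular-expression syntax.
DEFINING = re.compile(r"[FILV]Q...[RK]G...[RK]..[FILVWY]")
CONSENSUS = re.compile(r"IQ...RG...R")

def classify(segment):
    """Report which of the two patterns occur in the segment."""
    return {
        "defining": DEFINING.search(segment) is not None,
        "consensus": CONSENSUS.search(segment) is not None,
    }

# Hypothetical segments, invented for illustration.
typical = "AIQAAARGAAARAAF"   # I, R and R at the variable positions
atypical = "ALQAAAKGAAARAAW"  # L and K: allowed, but not the usual choices

print(classify(typical))   # {'defining': True, 'consensus': True}
print(classify(atypical))  # {'defining': True, 'consensus': False}
```

The second segment satisfies the defining pattern but not the consensus, which is why the consensus alone should not be treated as the motif.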

Discovery of DNA motifs

Software

Some software programs, given multiple input sequences, attempt to identify one or more candidate motifs. One example is MEME (see References below), which generates statistical information for each candidate.
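To make the task concrete, the toy sketch below looks for fixed-length substrings (k-mers) that occur in every input sequence. This is a deliberately naive illustration and is not MEME's method: it only counts exact matches and produces no statistics. The function names and the input sequences are assumptions made up for the example.

```python
from collections import Counter

def shared_kmers(sequences, k):
    """Count, for each k-mer, how many input sequences contain it."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each sequence votes at most once per k-mer.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def candidates(sequences, k):
    """k-mers present in every sequence, sorted alphabetically."""
    counts = shared_kmers(sequences, k)
    return sorted(kmer for kmer, n in counts.items() if n == len(sequences))

# Hypothetical nucleotide sequences sharing the substring TTGACA.
seqs = ["ACGTTGACA", "TTTGACAGG", "CCTTGACAT"]
print(candidates(seqs, 6))  # ['TTGACA']
```

Real motif finders must also cope with mismatches, variable spacing and chance occurrences, which exact counting ignores.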

Discovery through evolutionary conservation

DNA motifs have been discovered by studying similar genes in different species. For example, by aligning the amino acid sequences specified by the GCM (glial cells missing) gene in human, mouse and Drosophila melanogaster, Akiyama and others discovered a pattern which they called the GCM motif. It spans about 150 amino acid residues, and begins as follows:

WDIND*.*P..*...D.F.*W***.**.IYS**...A.*H*S*WAMRNTNNHN

Here each . signifies a single amino acid or a gap, and each * indicates one member of a closely related family of amino acids.

The authors were able to show that the motif has DNA binding activity.
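A conservation pattern in this style can be derived from an alignment column by column. The sketch below is one possible way to do so, not the authors' procedure: the residue families, the function name conservation and the three aligned fragments are all assumptions chosen for illustration, and gaps are written as "-".

```python
# Illustrative residue families; the grouping used by the authors is not
# given here, so these sets are an assumption.
FAMILIES = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("ST"), set("NQ")]

def conservation(aligned):
    """Summarise equal-length aligned sequences, one symbol per column."""
    out = []
    for column in zip(*aligned):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            out.append(column[0])   # identical residue in all sequences
        elif any(residues <= fam for fam in FAMILIES):
            out.append("*")         # all residues from one family
        else:
            out.append(".")         # anything else, including gaps
    return "".join(out)

# Hypothetical aligned fragments, invented for illustration.
aligned = ["WDINDKL", "WDIND-V", "WDLNDRI"]
print(conservation(aligned))  # WD*ND.*
```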

Pattern Description Notations

Several notations for describing motifs are in use, but most are variants of the standard notations for regular expressions and use the following conventions:

there is an alphabet of single characters, each denoting a specific amino acid or a set of amino acids;
a string of characters drawn from the alphabet denotes a sequence of the corresponding amino acids;
any string of characters drawn from the alphabet enclosed in square brackets matches any one of the corresponding amino acids; e.g. [abc] matches any of the amino acids represented by a or b or c.

The fundamental idea behind all these notations is the matching principle, which assigns a meaning to a sequence of elements of the pattern notation:

a sequence of elements of the pattern notation matches a sequence of amino acids if and only if the latter sequence can be partitioned into subsequences in such a way that each pattern element matches the corresponding subsequence in turn.

Thus the pattern [AB] [CDE] F matches the six amino acid sequences corresponding to ACF, ADF, AEF, BCF, BDF, and BEF.
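For patterns made only of single letters and bracketed alternatives, the set of matching sequences is the Cartesian product of the choices at each position. The short sketch below enumerates it; the function name expand and the representation of a pattern as a list of strings are assumptions made for the example.

```python
from itertools import product

def expand(elements):
    """List every sequence matched by a pattern of fixed-length elements.

    Each element is a string of the letters allowed at that position.
    """
    return ["".join(choice) for choice in product(*elements)]

# The pattern [AB] [CDE] F, one string per pattern element.
print(expand(["AB", "CDE", "F"]))
# ['ACF', 'ADF', 'AEF', 'BCF', 'BDF', 'BEF']
```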

Different pattern description notations have other ways of forming pattern elements. One of these notations is the PROSITE notation, described in the following subsection.

PROSITE Pattern Notation

The PROSITE notation uses the IUPAC one-letter codes and conforms to the above description, except that a concatenation symbol, '-', is used between pattern elements. The symbol is often dropped between letters of the pattern alphabet.

PROSITE allows the following pattern elements in addition to those described previously:

The lower case letter 'x' can be used as a pattern element to denote any amino acid.
A string of characters drawn from the alphabet and enclosed in braces (curly brackets) denotes any amino acid except for those in the string. For example, {ST} denotes any amino acid other than S or T.
If a pattern is restricted to the N-terminus of a sequence, the pattern is prefixed with '<'.
If a pattern is restricted to the C-terminus of a sequence, the pattern is suffixed with '>'.
The character '>' can also occur inside a terminating square bracket pattern, so that S[T>] matches both "ST" and "S>", the latter meaning an S at the C-terminus.
If e is a pattern element, and m and n are two decimal integers with m <= n, then:
e(m) is equivalent to the repetition of e exactly m times;
e(m,n) is equivalent to the repetition of e exactly k times for any integer k satisfying: m <= k <= n.

Some examples:

x(3) is equivalent to x-x-x
x(2,4) matches any sequence that matches x-x or x-x-x or x-x-x-x

The signature of the C2H2-type zinc finger domain is: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
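Because the PROSITE elements listed above correspond closely to regular-expression constructs, a pattern can be translated mechanically. The following is a minimal sketch under stated assumptions, not an official PROSITE tool: the function name prosite_to_regex is invented, only the elements described in this section are handled, and characters the tokenizer does not recognise are silently ignored rather than reported.

```python
import re

# One pattern element, optionally followed by a repeat count (m) or (m,n).
ELEMENT = re.compile(r"(\[[A-Z>]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into Python regular-expression syntax."""
    start = "^" if pattern.startswith("<") else ""
    end = "$" if pattern.endswith(">") else ""
    if start:
        pattern = pattern[1:]
    if end:
        pattern = pattern[:-1]
    body = pattern.replace("-", "")  # the '-' separator carries no meaning
    parts = []
    for m in ELEMENT.finditer(body):
        elem, lo, hi = m.groups()
        if elem == "x":
            regex = "."                              # any amino acid
        elif elem.startswith("{"):
            regex = "[^" + elem[1:-1] + "]"          # anything except these
        elif elem.startswith("[") and ">" in elem:
            # '>' inside brackets: one of the residues, or end of sequence
            regex = "(?:[" + elem[1:-1].replace(">", "") + "]|$)"
        else:
            regex = elem                             # letter or [..] as is
        if lo is not None:
            regex += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(regex)
    return start + "".join(parts) + end

zinc = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(zinc)  # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

# Hypothetical sequence constructed to fit the signature.
print(bool(re.search(zinc, "CAACAAALAAAAAAAAHAAAH")))  # True
```

The same function accepts patterns written without the '-' separator, so N{P}[ST]{P} and N-{P}-[ST]-{P} give the same result.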

References

Akiyama, Y. et al. The gcm-motif: a novel DNA-binding motif conserved in Drosophila and mammals. Proc. Natl. Acad. Sci. USA (1996) 93:14912-14916.

PROSITE Database of protein families and domains

The MEME/MAST System for Motif Discovery and Search

MEME Documentation