Formal language

In mathematics, logic and computer science, a formal language is a set of finite-length words (or "strings") over some finite alphabet. The phrase "formal language" is also used in many other contexts (scientific, legal and so on) to mean a mode of expression more careful and accurate than everyday speech. A formal language in the sense intended here is the extreme case of that usage: one possible criterion is that it is formal enough to be used in written form for automatic computation.

A typical alphabet would be {a, b}, a typical string over that alphabet would be "ababba", and a typical language over that alphabet containing that string would be the set of all strings which contain the same number of a's as b's. The empty word is allowed and is usually denoted by e, ε or λ. Note that while the alphabet is a finite set and every string has finite length, a language may very well have infinitely many member strings.
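As an illustration of the example above, a membership test for the language of strings with equally many a's and b's could be sketched in Python as follows. The function name in_equal_ab is made up for this sketch, and treating any character other than a or b as grounds for rejection is an assumption.

```python
def in_equal_ab(word: str) -> bool:
    """Return True if word is over {a, b} and has as many a's as b's."""
    # Reject anything that is not a string over the alphabet {a, b}.
    if any(ch not in ("a", "b") for ch in word):
        return False
    return word.count("a") == word.count("b")


print(in_equal_ab("ababba"))  # three a's and three b's
print(in_equal_ab(""))        # the empty word has zero of each
print(in_equal_ab("aab"))     # two a's, one b
```

The test always terminates, even though the language it describes has infinitely many members.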

Some examples of formal languages:

the set of all words over {a, b},
the set { aⁿ | n is a prime number },
the set of syntactically correct programs in some programming language, or
the set of inputs upon which a certain Turing machine halts.
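The first two examples can be made concrete with a short Python sketch. The helper names words_over, is_prime and in_prime_a are hypothetical, and since the set of all words over {a, b} is infinite, the enumeration is cut off at a chosen maximum length. No comparable always-terminating test is offered for the last example, because for some Turing machines none exists.

```python
from itertools import product


def words_over(alphabet, max_len):
    """Yield every word over alphabet with length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def in_prime_a(word: str) -> bool:
    """Membership test for { a^n | n is a prime number }."""
    return set(word) <= {"a"} and is_prime(len(word))


print(list(words_over("ab", 2)))
# Lengths of the members of { a^n | n prime } with n <= 8:
print([len(w) for w in words_over("a", 8) if in_prime_a(w)])  # [2, 3, 5, 7]
```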

A formal language can be specified in a great variety of ways, such as:

Strings produced by some formal grammar (see Chomsky hierarchy)
Strings described by a regular expression
Strings accepted by some automaton, such as a Turing machine or finite state automaton
Those questions, from a set of related YES/NO questions, for which the answer is YES (see decision problem)
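To illustrate that different specifications can describe the same language, the following Python sketch specifies one language twice, once by a regular expression and once by a finite state automaton. The chosen language (strings over {a, b} with an even number of a's), the state names and the function names are all assumptions made for this example.

```python
import re
from itertools import product

# Regular expression: blocks containing exactly two a's, then trailing b's.
EVEN_A = re.compile(r"(?:b*ab*a)*b*")


def regex_accepts(word: str) -> bool:
    return EVEN_A.fullmatch(word) is not None


# Finite state automaton: the state records the parity of a's read so far.
DELTA = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}


def dfa_accepts(word: str) -> bool:
    state = "even"  # start state, which is also the only accepting state
    for ch in word:
        state = DELTA.get((state, ch))
        if state is None:  # symbol outside the alphabet {a, b}
            return False
    return state == "even"


# Compare the two specifications on every word of length at most 6.
words = ["".join(t) for n in range(7) for t in product("ab", repeat=n)]
print(all(regex_accepts(w) == dfa_accepts(w) for w in words))
```

Agreement on short words is only a spot check, not a proof that the two specifications define the same language.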

Several operations can be used to produce new languages from given ones. Suppose L₁ and L₂ are languages over some common alphabet.

The concatenation L₁L₂ consists of all strings of the form vw where v is a string from L₁ and w is a string from L₂.
The intersection of L₁ and L₂ consists of all strings which are contained in L₁ and also in L₂.
The union of L₁ and L₂ consists of all strings which are contained in L₁ or in L₂.
The complement of the language L₁ consists of all strings over the alphabet which are not contained in L₁.
The right quotient L₁/L₂ of L₁ by L₂ consists of all strings v for which there exists a string w in L₂ such that vw is in L₁.
The Kleene star L₁* consists of all strings which can be written in the form w₁w₂...wₙ with strings wᵢ in L₁ and n ≥ 0. Note that this includes the empty string ε because n = 0 is allowed.
The reverse L₁^R contains the reversed versions of all the strings in L₁.
The shuffle of L₁ and L₂ consists of all strings which can be written in the form v₁w₁v₂w₂...vₙwₙ where n ≥ 1 and v₁,...,vₙ are strings such that the concatenation v₁...vₙ is in L₁ and w₁,...,wₙ are strings such that w₁...wₙ is in L₂.
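For finite languages, the operations above can be tried out directly by representing a language as a Python set of strings. This is only a sketch under that assumption: all function names are made up here, intersection and union are simply the set operators & and |, and because the complement and the Kleene star are usually infinite, they are truncated at a maximum word length.

```python
from itertools import product


def concat(l1, l2):
    return {v + w for v in l1 for w in l2}


def complement_up_to(lang, alphabet, max_len):
    """Words over alphabet of length <= max_len that are not in lang."""
    every = {"".join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n)}
    return every - set(lang)


def right_quotient(l1, l2):
    """All v such that vw is in l1 for some w in l2."""
    return {u[:len(u) - len(w)] for u in l1 for w in l2 if u.endswith(w)}


def star_up_to(lang, max_len):
    """Members of the Kleene star of lang with length <= max_len."""
    result = {""}    # n = 0 contributes the empty string
    frontier = {""}
    while frontier:
        frontier = {v + w for v in frontier for w in lang
                    if len(v + w) <= max_len} - result
        result |= frontier
    return result


def reverse(lang):
    return {w[::-1] for w in lang}


def shuffle_words(v, w):
    """All interleavings of v and w that keep each word's letter order."""
    if not v:
        return {w}
    if not w:
        return {v}
    return ({v[0] + s for s in shuffle_words(v[1:], w)}
            | {w[0] + s for s in shuffle_words(v, w[1:])})


def shuffle(l1, l2):
    return set().union(*(shuffle_words(v, w) for v in l1 for w in l2))


L1, L2 = {"ab", "a"}, {"b", ""}
print(sorted(concat(L1, L2)))          # ['a', 'ab', 'abb']
print(sorted(right_quotient(L1, L2)))  # ['a', 'ab']
print(sorted(star_up_to({"ab"}, 4)))   # ['', 'ab', 'abab']
print(sorted(shuffle({"ab"}, {"c"})))  # ['abc', 'acb', 'cab']
```

The recursive shuffle_words enumerates interleavings one letter at a time, which yields the same strings as the block form v₁w₁...vₙwₙ when the blocks are allowed to be empty.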

A typical question asked about a formal language is how difficult it is to decide whether a given word belongs to the language. This is the domain of computability theory and complexity theory.