Index of coincidence

In cryptography, coincidence counting is the technique of putting two texts side-by-side and counting the number of times that identical letters appear in the same position in both texts. This count, as a ratio of the total number of positions compared, is known as the index of coincidence. The technique is used, for example, in the cryptanalysis of the Vigenère cipher.
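The counting procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name coincidence_index is hypothetical, and the choices to ignore non-letters, fold case, and truncate to the shorter text are assumptions not stated in the text.

```python
def coincidence_index(text_a: str, text_b: str) -> float:
    """Fraction of aligned positions at which both texts show the same letter."""
    # Assumption: only letters count, and case is ignored.
    a = [c.upper() for c in text_a if c.isalpha()]
    b = [c.upper() for c in text_b if c.isalpha()]
    # Assumption: compare only as many positions as the shorter text provides.
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("need at least one letter in each text")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


# The first two positions match (A/A, B/B); the last two do not.
print(coincidence_index("ABBA", "ABAB"))  # 0.5
```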

Coincidence counting can help determine when two texts are written in the same language, using the same alphabet. For such texts, the coincidence count will be distinctly higher than the coincidence count for texts in different languages, or using different alphabets, or gibberish texts. The technique has been applied to examine the purported Bible code.

To see why, imagine an "alphabet" of only the two letters A and B. Suppose that in our "language", the letter A is used 75% of the time, and the letter B is used 25% of the time. If two texts in this "language" are laid side-by-side, then the following pairs can be expected:

Pair | Probability
-----+-------------
 AA  | 56.25%
 BB  |  6.25%
 AB  | 18.75%
 BA  | 18.75%

Overall, the probability of a "coincidence" (the pair AA or the pair BB) is 56.25% + 6.25% = 62.5%.
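The 62.5% figure can be checked numerically. When both texts follow the same letter distribution and the letters at each position are treated as independent (an assumption of this sketch), the coincidence probability is the sum of the squared letter probabilities. The function name below is hypothetical.

```python
def self_coincidence_probability(dist):
    """Chance that two independent draws from `dist` give the same letter."""
    return sum(p * p for p in dist.values())


# The two-letter "language" from the text: A 75%, B 25%.
language = {"A": 0.75, "B": 0.25}
print(self_coincidence_probability(language))  # 0.625
```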

Now suppose that we place two messages side-by-side: one message in this "language"; one message encrypted using the substitution cipher which replaces A with B and vice versa. The following pairs can now be expected:

Pair | Probability
-----+-------------
 AA  | 18.75%
 BB  | 18.75%
 AB  | 56.25%
 BA  |  6.25%

Now the probability of a coincidence is only 37.5%, noticeably lower than the probability when same-language, same-alphabet texts were used. Coincidences are more likely in the first case because the most frequent letter is the same in both texts, which maximizes the chance that it appears at the same position in both.
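The substituted case can be reproduced the same way. The sketch below relabels the letter distribution according to the swap cipher and then sums, over all letters, the product of the two texts' probabilities for that letter. Both helper names are hypothetical, and positions are again assumed to be independent.

```python
def substitute_distribution(dist, mapping):
    """Letter distribution of a text after a simple substitution cipher."""
    return {mapping[letter]: prob for letter, prob in dist.items()}


def coincidence_probability(p, q):
    """Chance that a letter drawn from p equals a letter drawn from q."""
    return sum(prob * q.get(letter, 0.0) for letter, prob in p.items())


plain = {"A": 0.75, "B": 0.25}
swap = {"A": "B", "B": "A"}  # the cipher that replaces A with B and vice versa
cipher = substitute_distribution(plain, swap)

print(coincidence_probability(plain, cipher))  # 0.375
```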

The same principle applies to real languages like English. Certain letters, like E, occur much more frequently than other letters, a fact which is used in frequency analysis of substitution ciphers. Coincidences involving the letter E, for example, are rather likely. So when any two English texts are compared, the coincidence count will be higher than when an English text and a foreign-language text are compared.

This effect can be subtle. For example, similar languages will have a higher coincidence count than dissimilar languages. It is also not hard to generate random text with a frequency distribution similar to that of real text, which artificially raises the coincidence count. Nevertheless, this technique can be used effectively to identify when two texts contain meaningful information in the same language using the same alphabet.
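The caveat about random text can be illustrated with a small simulation. The sketch below reuses the two-letter "language" (A 75%, B 25%) and generates meaningless random strings with Python's standard random module. The sample size, the seed, and the helper name coincidence_rate are arbitrary choices made for illustration.

```python
import random


def coincidence_rate(a, b):
    """Fraction of aligned positions where the two sequences agree."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


rng = random.Random(0)  # fixed seed so that runs are repeatable
n = 100_000

# Two meaningless texts that merely share the language's letter frequencies.
biased_1 = rng.choices("AB", weights=[75, 25], k=n)
biased_2 = rng.choices("AB", weights=[75, 25], k=n)
# A text with no frequency bias at all.
uniform = rng.choices("AB", k=n)

print(coincidence_rate(biased_1, biased_2))  # close to 0.625
print(coincidence_rate(biased_1, uniform))   # close to 0.5
```

Although neither biased text carries any meaning, the pair should score about as high as two genuine texts in the "language" would, which is why a high coincidence count alone does not prove that a text is meaningful.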