N-Grams

Definitions

n-gram
- sequence of n words
bigram
- two-word sequence
trigram
- three-word sequence
n-gram model
- using probabilistic methods to predict what the nth word will be given the previous n-1 words
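
To make the definitions concrete, here is a minimal Python sketch that extracts the n-grams from a list of tokens. The function name ngrams and the whitespace tokenization are illustrative assumptions, not part of these notes.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a list of tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows of size n (none if L < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace splitting is a simplifying assumption about tokenization.
tokens = "We have met the enemy".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A five-token string yields four bigrams and three trigrams; asking for n-grams longer than the string yields an empty list.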

N-grams play an important role in speech recognition, mimicking how humans discern words in context.
Cf. Lady Mondegreen or kissthisguy.com.

Another application is the grammar-checking component of automatic spell checkers.

Aspects

Problem Specification

E.g., given the string of tokens "We have met the enemy and ...", what is the next token likely to be?

P(he | We have met the enemy and ) =

   Count(We have met the enemy and he)
   -----------------------------------
     Count(We have met the enemy and)

Problem: this only gives a probability for the current corpus, and it cannot be extended even to related corpora.
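
The count ratio above can be sketched in Python as follows. The toy corpus, the function names count_sequence and history_estimate, and the lowercased tokens are all assumptions made up for illustration. The sketch also shows the weakness just described: a history that never occurs in the corpus has no estimate at all.

```python
def count_sequence(tokens, seq):
    """Count how often the token list seq occurs contiguously in tokens."""
    k = len(seq)
    return sum(1 for i in range(len(tokens) - k + 1) if tokens[i:i + k] == seq)


def history_estimate(tokens, history, word):
    """Estimate P(word | history) as Count(history + word) / Count(history)."""
    denom = count_sequence(tokens, history)
    if denom == 0:
        return None  # the history was never seen, so the ratio is undefined
    return count_sequence(tokens, history + [word]) / denom


# Hypothetical toy corpus, invented for illustration only.
corpus = ("we have met the enemy and he is us . "
          "we have met the enemy and they are gone .").split()

history = "we have met the enemy and".split()
p_he = history_estimate(corpus, history, "he")
p_unseen = history_estimate(corpus, "we have seen the enemy and".split(), "he")
print(p_he, p_unseen)
```

In this toy corpus the history occurs twice and is followed by he once, so the estimate is 0.5; changing a single word of the history leaves the estimate undefined.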

We know:
P(X_1 X_2 ... X_n) =
P(X_1) × P(X_2 | X_1) × P(X_3 | X_1 X_2) × ... × P(X_n | X_1 X_2 ... X_{n-1})

I.e., in terms of word tokens:
P(w_1 w_2 ... w_n) =
P(w_1) × P(w_2 | w_1) × P(w_3 | w_1 w_2) × ... × P(w_n | w_1 w_2 ... w_{n-1})
=
Π_{k=1}^{n} P(w_k | w_1 w_2 ... w_{k-1})
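
As a worked instance of this expansion, here are the first five tokens of the earlier example string, written in LaTeX (assuming the amsmath package for the aligned environment):

```latex
\[
\begin{aligned}
P(\text{We have met the enemy})
  &= P(\text{We}) \times P(\text{have} \mid \text{We})
     \times P(\text{met} \mid \text{We have}) \\
  &\quad \times P(\text{the} \mid \text{We have met})
     \times P(\text{enemy} \mid \text{We have met the})
\end{aligned}
\]
```

Each factor conditions on the entire prefix, which is why the later factors are so hard to estimate from counts.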

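The bigram model assumes that each word depends only on the word immediately before it:
P(w_n | w_1 w_2 ... w_{n-1}) ≈ P(w_n | w_{n-1})

so that
P(w_1 w_2 ... w_n) ≈ Π_{k=1}^{n} P(w_k | w_{k-1})
where w_0 is taken to be the start-of-sentence marker <s>.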

This can be computed as:
P(w_n | w_{n-1}) =

C(w_{n-1} w_n)
--------------
  C(w_{n-1})

i.e., the number of times the consecutive words w_{n-1} w_n appear, divided by the total number of times the word w_{n-1} appears.
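
A minimal Python sketch of this count ratio follows. The function name bigram_mle and the six-token stream are hypothetical; Fraction is used only to keep the ratio exact.

```python
from collections import Counter
from fractions import Fraction


def bigram_mle(tokens, prev, word):
    """Return C(prev word) / C(prev) as an exact fraction, or None if prev is unseen."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs
    if unigram_counts[prev] == 0:
        return None
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])


# Hypothetical token stream, for illustration only.
tokens = "a b a c a b".split()
p_b_given_a = bigram_mle(tokens, "a", "b")  # "a b" occurs twice, "a" three times
p_c_given_b = bigram_mle(tokens, "b", "c")  # "b c" never occurs
print(p_b_given_a, p_c_given_b)
```

Note that the denominator counts every occurrence of the previous word, including one at the very end of the stream that has no successor.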

Example

<s> To be is to do </s>
<s> To do is to be </s>
<s> Do be do be do </s>

Total count = 21 tokens, including the <s> and </s> markers and ignoring case (To and to count as the same word).

C(<s>) = 3
C(be) = 4
C(do) = 5
C(is) = 2
C(to) = 4
C(</s>) = 3
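
These unigram counts can be reproduced with a short Python sketch. Lowercasing every token is an assumption, chosen because it matches the counts listed here.

```python
from collections import Counter

sentences = [
    "<s> To be is to do </s>",
    "<s> To do is to be </s>",
    "<s> Do be do be do </s>",
]

# Lowercasing is an assumption that matches the counts above ("To" counts as "to").
tokens = [tok.lower() for line in sentences for tok in line.split()]
counts = Counter(tokens)

print(len(tokens))
for word in ["<s>", "be", "do", "is", "to", "</s>"]:
    print(f"C({word}) = {counts[word]}")
```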

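Here C(x|y) is the number of times x immediately follows y, i.e., C(y x). The three sentences are read as one continuous stream, so <s> follows </s> twice.

C(<s>|<s>) = 0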
C(<s>|be) = 0
C(<s>|do) = 0
C(<s>|is) = 0
C(<s>|to) = 0
C(<s>|</s>) = 2

C(be|<s>) = 0
C(be|be) = 0
C(be|do) = 2
C(be|is) = 0
C(be|to) = 2
C(be|</s>) = 0

C(do|<s>) = 1
C(do|be) = 2
C(do|do) = 0
C(do|is) = 0
C(do|to) = 2
C(do|</s>) = 0

C(is|<s>) = 0
C(is|be) = 1
C(is|do) = 1
C(is|is) = 0
C(is|to) = 0
C(is|</s>) = 0

C(to|<s>) = 2
C(to|be) = 0
C(to|do) = 0
C(to|is) = 2
C(to|to) = 0
C(to|</s>) = 0

C(</s>|<s>) = 0
C(</s>|be) = 1
C(</s>|do) = 2
C(</s>|is) = 0
C(</s>|to) = 0
C(</s>|</s>) = 0
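
The bigram counts can be tabulated with the following sketch. It assumes that the three sentences form one continuous lowercased stream, which is what a nonzero count for <s> following </s> implies. The helper C is a hypothetical name that mirrors the notation of this list.

```python
from collections import Counter

text = ("<s> To be is to do </s> "
        "<s> To do is to be </s> "
        "<s> Do be do be do </s>")
tokens = text.lower().split()

# Count each adjacent pair (previous, next) across the whole stream.
pair_counts = Counter(zip(tokens, tokens[1:]))


def C(nxt, prev):
    """C(nxt|prev) in the notation above: how often nxt directly follows prev."""
    return pair_counts[(prev, nxt)]


vocab = ["<s>", "be", "do", "is", "to", "</s>"]
for nxt in vocab:
    for prev in vocab:
        print(f"C({nxt}|{prev}) = {C(nxt, prev)}")
```

A stream of 21 tokens has 20 adjacent pairs, so the 36 counts sum to 20.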

P(<s>|<s>) = 0/3
P(<s>|be) = 0/4
P(<s>|do) = 0/5
P(<s>|is) = 0/2
P(<s>|to) = 0/4
P(<s>|</s>) = 2/3
P(be|<s>) = 0/3
P(be|be) = 0/4
P(be|do) = 2/5
P(be|is) = 0/2
P(be|to) = 2/4
P(be|</s>) = 0/3
P(do|<s>) = 1/3
P(do|be) = 2/4
P(do|do) = 0/5
P(do|is) = 0/2
P(do|to) = 2/4
P(do|</s>) = 0/3
P(is|<s>) = 0/3
P(is|be) = 1/4
P(is|do) = 1/5
P(is|is) = 0/2
P(is|to) = 0/4
P(is|</s>) = 0/3
P(to|<s>) = 2/3
P(to|be) = 0/4
P(to|do) = 0/5
P(to|is) = 2/2
P(to|to) = 0/4
P(to|</s>) = 0/3
P(</s>|<s>) = 0/3
P(</s>|be) = 1/4
P(</s>|do) = 2/5
P(</s>|is) = 0/2
P(</s>|to) = 0/4
P(</s>|</s>) = 0/3
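
As a consistency check, the whole table can be rebuilt with exact fractions and each conditional distribution summed. This is a sketch under the same assumptions as the counts (one continuous, lowercased stream); the nested dictionary P is an illustrative layout.

```python
from collections import Counter
from fractions import Fraction

# The example corpus, already lowercased and joined into one stream.
tokens = ("<s> to be is to do </s> "
          "<s> to do is to be </s> "
          "<s> do be do be do </s>").split()

unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))
vocab = ["<s>", "be", "do", "is", "to", "</s>"]

# P[prev][nxt] = C(prev nxt) / C(prev), kept exact with Fraction.
P = {prev: {nxt: Fraction(bigram[(prev, nxt)], unigram[prev]) for nxt in vocab}
     for prev in vocab}

for prev in vocab:
    row_total = sum(P[prev].values())
    print(f"sum over w of P(w|{prev}) = {row_total}")
```

Every distribution sums to 1 except the one conditioned on </s>, which sums to 2/3 because the final </s> in the stream has no successor.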

So if the word do is encountered, we can compute the most likely next word. Its probability is the maximum over the n distinct tokens word_1, ..., word_n in the vocabulary (here n = 6):

max_{1 <= i <= n} { P(word_i | do) } =

max { P(<s> | do), P(be | do), P(do | do), P(is | do), P(to | do), P(</s> | do) } =

max { 0, 2/5, 0, 1/5, 0, 2/5 } = 2/5

This maximum is reached by two tokens: the end-of-sentence marker </s> and the word be. I.e., when the word do is encountered, it is more likely to be at the end of a sentence or just before the word be than anywhere else.

Similarly, given the word to, the most likely next word can be found from

max { P(<s> | to), P(be | to), P(do | to), P(is | to), P(to | to), P(</s> | to) } =

max { 0, 2/4, 2/4, 0, 0, 0 } = 2/4 = 1/2

corresponding to either the word be or the word do. I.e., when the word to is encountered, it is equally likely to be followed by be or by do, and neither is favored over the other.
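
The two maximizations above can be sketched in Python as follows. Because ties occur in both cases, the hypothetical helper most_likely_next returns every word that reaches the maximum rather than a single winner.

```python
from collections import Counter
from fractions import Fraction

# The example corpus, lowercased and read as one continuous stream (an assumption).
tokens = ("<s> to be is to do </s> "
          "<s> to do is to be </s> "
          "<s> do be do be do </s>").split()

unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))


def most_likely_next(prev):
    """Return (best probability, sorted list of all words that reach it)."""
    followers = {nxt: Fraction(c, unigram[prev])
                 for (p, nxt), c in bigram.items() if p == prev}
    if not followers:
        return None, []  # prev never appears with a successor
    best = max(followers.values())
    return best, sorted(w for w, prob in followers.items() if prob == best)


best_do, words_do = most_likely_next("do")
best_to, words_to = most_likely_next("to")
print(best_do, words_do)
print(best_to, words_to)
```

Returning all tied words keeps the result independent of dictionary ordering; a real predictor would need some rule for breaking such ties.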

One can get more reliable predictions using trigrams or higher-order n-grams, in the same way that a person can predict phrasal patterns given adequate contextual information: