\chapter{SEQUENCES OF WORDS}
\section{Words}
Words surround us. Not just in the literal sense of the words on
billboards, road signs, cereal packets, and in books and
magazines, but also in a more abstract sense. Our DNA is defined
by a word over the language of nucleotides. The bar codes on our
groceries are words in the computer language of zeroes and ones.
Further, in mathematics there are words that avoid certain
patterns, such as repeating blocks which can be explored purely
for their own interest, and some that have applications in such
areas as the study of linear polymer molecules in chemical
physics.
In order to explore the behavior of such a wide range of words we must first
introduce a format by which words are defined, and some basic terminology that
will be used throughout this work. My choice of notation is based on my
frequent reliance on Maple to perform calculations.
\begin{notation} Let $V$ be the alphabet over which our
language is defined.
\end{notation}
E.g. in English $V=\{a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s,
t, u, v, w, x, y, z\}$.
In computing $V=\{0,1\}$.
\begin{def1}
A {\bf word}, $w$, over the alphabet $V$ is an ordered sequence of
letters from $V$, $w=[w_1, w_2, \ldots, w_n]$ where $w_i \in V$ for $1 \leq i \leq
n$.
\end{def1}
E.g. the English word ``alphabet'' becomes $[a, l, p, h, a, b, e, t]$.
\begin{notation}$V^*$ is the set of all possible words
over the alphabet $V$.
\end{notation}
\begin{def1}
A {\bf sub-word} of $w$ is any of the $\binom{n+1}{2}$ possible
contiguous sub-sequences $[w_i, w_{i+1},\ldots,w_j]$ where $1\leq i \leq j
\leq n$.
\end{def1}
Thus $[a, l, p], [h, a]$ and $[b, e, t]$ are all sub-words of $[a,
l, p, h, a, b, e, t]$.
\begin{notation}
The {\bf empty word} is considered to be a sub-word of all words
and belongs to $V^*$ for every $V$. It will be denoted $[ \; ]$.
\end{notation}
\begin{def1}
The {\bf length} of a word $w$, denoted $l(w)$, is the number of letters in the
word, counting multiplicity.
\end{def1}
E.g. $l([a, l, p, h, a, b, e, t])=8$.
Note $l([ \; ])=0$.
One of the main areas of research into words is their limiting
behavior. That is, if $a_n$ is the number of words of length $n$ in
our language, we want to find
$\mu:=\lim_{n\rightarrow\infty}{a_n}^{1/n}$, if it exists.
Clearly, if no constraints are put upon our choice of words, and if
$k$ is the number of letters in our alphabet $V$, then $a_n=k^n$
and hence $\mu=k$. This leads us to believe our quest for limits
will not prove fruitless.
Often it is useful to use the model $a_n=n^\theta \mu ^n$. Zinn's
method can be used to obtain good approximations of $\mu$ and $\theta$
under this model.
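To give a feel for how such estimates work, here is a minimal ratio-extrapolation sketch in Python; it is a simplified stand-in, not Zinn's method in full detail, and the synthetic data and helper name are my own.

```python
def ratio_estimates(a):
    """Given a[1..N] following a_n ~ n**theta * mu**n, estimate mu and theta.

    The ratios r_n = a_n / a_{n-1} behave like mu * (1 + theta/n);
    the combination n*r_n - (n-1)*r_{n-1} cancels the theta/n term,
    leaving mu + O(1/n^2).
    """
    N = len(a) - 1
    r = {n: a[n] / a[n - 1] for n in range(2, N + 1)}
    mu = {n: n * r[n] - (n - 1) * r[n - 1] for n in range(3, N + 1)}
    theta = {n: n * (r[n] / mu[n] - 1) for n in range(3, N + 1)}
    return mu, theta

# Synthetic data following the model exactly, with mu = 3, theta = 1/2:
a = [0.0] + [n ** 0.5 * 3.0 ** n for n in range(1, 41)]
mu, theta = ratio_estimates(a)
```

On this synthetic sequence the estimates at $n=40$ are already within about $10^{-3}$ of $\mu=3$.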
\section{Avoiding the Bad}
Most of the sequences $a_n$ considered in this text are ones whose
words avoid specific sub-words. We consider the sub-words we wish
to avoid as the bad words (or mistakes), and the set of all such
words will be denoted $B$. The set of all bad words up to length
$k$ will be denoted $B_k$.
As an example consider the case of binary square-free words. That
is, words over a two-letter alphabet that avoid any non-trivial
sub-word being repeated directly after itself. In this case $B_4=
\{ [0,0], [1,1], [0,1,0,1], [1,0,1,0] \}$.
It should be noted that $B$ and $B_k$ are always minimal in the
sense that no member of $B$ (or $B_k$) is a sub-word of any other
member of $B$ (or $B_k$). In the above example, note that $[1,1,1,1]$ is
omitted from $B_4$ because it contains $[1,1]$ as a sub-word.
In fact, in this case $B=B_4$ and
$\{a_n\}_{n\geq0}=[1,2,2,2,0,0,0,\ldots]$,
which is not a very interesting sequence. The more interesting
case of ternary square-free words is discussed by Noonan
\cite{NZ99}.
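The sequence above is easy to confirm by brute force; the following short Python check is my own illustration, not part of the thesis machinery.

```python
from itertools import product

def is_square_free(w):
    """True if no non-empty block w[i:i+L] is immediately repeated."""
    n = len(w)
    return not any(w[i:i + L] == w[i + L:i + 2 * L]
                   for L in range(1, n // 2 + 1)
                   for i in range(n - 2 * L + 1))

counts = [sum(is_square_free(w) for w in product((0, 1), repeat=n))
          for n in range(6)]
print(counts)   # [1, 2, 2, 2, 0, 0]
```

Only the empty word, $[0]$, $[1]$, $[0,1]$, $[1,0]$, $[0,1,0]$, and $[1,0,1]$ survive; every binary word of length four or more contains a square.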
\section{Substantive Sequences}
Many of the sequences we will be discussing are
sub-multiplicative; that is, $a_{n+m}\leq a_n a_m$. In
sequences where $a_n \neq 0$ we have that $\log(a_{n+m})\leq
\log(a_n)+\log(a_m)$, which shows that the sequence $\{\log(a_n)\}$ is
subadditive ($c_{n+m}\leq c_n+c_m$). This fact can be used to
show that $\mu$ exists and is in fact $\inf_{n\geq1} {a_n}^{1/n}$.
\begin{lem} Let $\{c_n\}$ be a subadditive sequence of real
numbers. Then the $\lim_{n\rightarrow\infty}\frac{c_n}{n}$ exists
and equals $\inf_{n\geq1}\frac{c_n}{n}$.
\end{lem}
The above lemma is attributed to Hammersley and Morton (1954).
\noindent {\bf Proof of Lemma}: Let $C_k=\max_{1 \leq r \leq
k}c_r$. Then for any given $n$ we can find $j$ such that $n=jk+r$
with $1 \leq r \leq k$.
Using the subadditivity of $\{c_n\}$ we obtain
\begin{equation}
c_n \leq j c_k+c_r \leq \frac{n}{k}c_k+C_k
\end{equation}
Then we divide both sides by $n$ and take the
$\limsup_{n\rightarrow\infty}$ to obtain
\begin{equation}
\limsup_{n\rightarrow\infty} \frac{c_n}{n} \leq
\limsup_{n\rightarrow\infty} \left(\frac{c_k}{k}+\frac{C_k}{n}\right) \leq
\frac{c_k}{k}
\end{equation}
Finally we take the $\liminf_{k\rightarrow\infty}$ of the right-hand
side and obtain that the $\limsup \leq \liminf$, thus proving the
limit exists.
As the limit exists it equals the $\limsup$, and since this is at
most $\frac{c_k}{k}$ for all $k$, we obtain
\begin{equation}
\lim_{n\rightarrow\infty}\frac{c_n}{n}=\inf_{n\geq1}\frac{c_n}{n}
\end{equation}
\begin{tem}\label{submul}
If $\{a_n\}$ is a sequence of positive terms for which $a_{n+m}
\leq a_n a_m$ then
$\mu=\lim_{n\rightarrow\infty}{a_n}^{\frac{1}{n}}$ exists. Further
$\mu \leq {a_n}^{\frac{1}{n}}$ for all $n$.
\end{tem}
\noindent {\bf Proof}: As discussed above, $a_{n+m} \leq a_n a_m$
implies that the sequence $\{\log a_n\}$ is subadditive. This
means $\log \mu = \lim_{n\rightarrow\infty} \frac{\log a_n}{n}
=\lim_{n\rightarrow\infty} \log a_n^{\frac{1}{n}}$ exists, and
further $\log \mu = \inf_{n \geq 1}\frac{\log a_n}{n} =\inf_{n
\geq 1}\log {a_n}^{\frac{1}{n}} \leq \log {a_n}^{\frac{1}{n}}$ for
all $n$. This gives the required results.
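The theorem can be watched converging on a concrete family: binary words avoiding the sub-word $[1,1]$ are sub-multiplicative (cutting a good word of length $n+m$ at position $n$ yields a pair of good words), and for this family $\mu$ is the golden ratio. The following Python check is my own illustration, not part of the thesis's calculations.

```python
import math

def counts(N):
    """a_n = number of binary words of length n with no [1,1] sub-word."""
    a = [1]                   # a_0 = 1: the empty word
    e0, e1 = 1, 1             # length-1 words ending in 0, in 1
    for _ in range(N):
        a.append(e0 + e1)
        e0, e1 = e0 + e1, e0  # 0 may follow either letter; 1 only follows 0
    return a

a = counts(30)
phi = (1 + math.sqrt(5)) / 2     # the growth constant mu for this family
# Theorem: mu <= a_n^(1/n) for every n, and a_n^(1/n) -> mu.
assert all(a[n] ** (1 / n) >= phi for n in range(1, 31))
```

By $n=30$ the root $a_n^{1/n}$ agrees with $\mu=\frac{1+\sqrt5}{2}$ to within about $0.01$, always approaching from above as the theorem requires.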
\section{The Naive Approach}
At this point we are only considering linear sequences. Later in
this thesis we will investigate the case of cyclic sequences.
For any given word we define its $k$-weight as follows:
\begin{equation}
W_k([w_1,w_2,\ldots,w_n])=s^n \prod_{i=1}^k
\prod_{j=1}^{n-i+1}x[w_j, \ldots, w_{j+i-1}]
\end{equation}
For example, the 3-weight of the word ``apple'' would be
$s^5
x[a]x[p]^2x[l]x[e]x[a,p]x[p,p]x[p,l]x[l,e]x[a,p,p]x[p,p,l]x[p,l,e]$.
Thus the coefficient of $s^n$ will give us the number of words
of length $n$ and, when necessary, their form. This means our goal
becomes to find the generating function that has all words of
length $n$ (or often just the number of them) that meet our
criteria as the coefficient of $s^n$.
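The bookkeeping in the $k$-weight is easy to mechanize. The following Python sketch, an illustration of the definition rather than the thesis's Maple code, collects the multiset of $x$-factors (the $s^n$ prefactor is simply $s^{l(w)}$):

```python
from collections import Counter

def k_weight_factors(word, k):
    """Multiset of the x[...] factors in the k-weight of `word`:
    one factor x[w_j, ..., w_{j+i-1}] for each block length i <= k
    and each starting position j."""
    n = len(word)
    return Counter(tuple(word[j:j + i])
                   for i in range(1, k + 1)
                   for j in range(n - i + 1))

w3 = k_weight_factors("apple", 3)
print(w3[('p',)], w3[('a', 'p', 'p')])   # 2 1  (x[p] squared, x[a,p,p] once)
```

Running this on ``apple'' with $k=3$ reproduces exactly the twelve factors in the example above.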
One method for doing this is to use a matrix, $A$, to analyze the
interaction between all possible blocks of length $k$; then, by
taking $(1-A)^{-1}$ and adding all the resulting entries, we obtain
a generating function for all words over the chosen alphabet. We
then set any blocks that are disallowed equal to zero and obtain
the generating function for the desired set of words.
I call this method the Naive Approach because it produces all
possible words without taking the bad words into account until the
very end. For example, if we were to take the English alphabet and
look for all words that did not contain any bad ``4-letter'' words,
we would need a matrix that was $26^4$ by $26^4$ and, worse yet,
would need to find the inverse of such a matrix, a very slow task
even for a computer. Thus this approach is only useful in very small
cases and as a check for our clever techniques, like the
Goulden-Jackson Method.
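To make the scheme concrete on a tiny case, take the alphabet $\{0,1\}$, blocks of length $k=1$, and disallow the 2-block that creates the sub-word $[1,1]$. The symbolic inverse $(1-A)^{-1}$ is best left to a CAS such as Maple; the Python sketch below (the function name and toy mistake set are my own) iterates the same matrix numerically instead:

```python
def naive_count(n, alphabet=(0, 1), bad=((1, 1),)):
    """Transfer-matrix count of length-n words, blocks of length k = 1.

    A[u] lists the letters v that may follow u, i.e. the allowed
    2-blocks (u, v); repeatedly applying A plays the role of the
    matrix powers hidden inside (1 - A)^(-1).
    """
    A = {u: [v for v in alphabet if (u, v) not in bad] for u in alphabet}
    if n == 0:
        return 1
    vec = {u: 1 for u in alphabet}      # words of length 1, by last letter
    for _ in range(n - 1):
        vec = {v: sum(vec[u] for u in alphabet if v in A[u])
               for v in alphabet}
    return sum(vec.values())

print([naive_count(n) for n in range(7)])   # [1, 2, 3, 5, 8, 13, 21]
```

The Fibonacci-like counts are the expected answer for binary words avoiding $[1,1]$, and the scheme extends verbatim to larger alphabets and longer blocks, at the $26^4 \times 26^4$ cost described above.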
\section{The Goulden-Jackson Method}
One method used throughout this dissertation is the
Goulden-Jackson Cluster Method \cite{GJ79}. This method can be
used to find the generating function $f(s)=
\sum_{n=0}^{\infty}a_ns^n$. In many cases we cannot find $f(s)$
explicitly, as we are looking at infinite sets of mistakes, but we
can obtain $f_k(s)$, which gives correct values for $a_n$ when $n
\leq k$ and good overestimates for $n>k$.
We will discuss this method briefly; for a more in-depth
explanation see \cite{GJ79}.
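As a preview of the kind of answer the method produces: for the single mistake $[1,1]$ over $\{0,1\}$ the cluster method gives $f(s)=\frac{1+s}{1-s-s^2}$, whose coefficients obey the Fibonacci recurrence. A brute-force count, my own sanity check in Python, confirms the first ten coefficients:

```python
from itertools import product

def avoids(w, bad=((1, 1),)):
    """True if the word w contains no mistake from bad as a sub-word."""
    return all(w[i:i + len(b)] != b
               for b in bad for i in range(len(w) - len(b) + 1))

brute = [sum(avoids(w) for w in product((0, 1), repeat=n)) for n in range(10)]
# Coefficients of (1+s)/(1-s-s^2): c_0 = 1, c_1 = 2, c_n = c_{n-1} + c_{n-2}.
coef = [1, 2]
while len(coef) < 10:
    coef.append(coef[-1] + coef[-2])
print(brute == coef)   # True
```

Such brute-force comparisons are exactly the kind of check the Naive Approach of the previous section was said to provide for the cleverer techniques.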