\[ \newcommand{\C}{\mathbb{C}} \newcommand{\haar}{\mathsf{m}} \DeclareMathOperator{\cont}{\mathsf{C}} \newcommand{\contc}{\cont_\mathsf{c}} \newcommand{\conto}{\cont_\mathsf{0}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\g}{>} \newcommand{\l}{<} \newcommand{\intd}{\,\mathsf{d}} \newcommand{\Re}{\mathsf{Re}} \newcommand{\area}{\mathop{\mathsf{Area}}} \newcommand{\met}{\mathop{\mathsf{d}}} \newcommand{\orb}{\mathop{\mathsf{orb}}} \newcommand{\emptyset}{\varnothing} \newcommand{\B}{\mathscr{B}} \DeclareMathOperator{\borel}{\mathsf{Bor}} \DeclareMathOperator{\lpell}{\mathsf{L}} \newcommand{\lp}[1]{\lpell^{\!\mathsf{#1}}} \newcommand{\Lp}[1][p]{\mathsf{L}^{\!\mathsf{#1}}} \renewcommand{\|}{|\!|} \newcommand{\ent}{\operatorname{\mathsf{H}}} \]

Markov chains

Consider the set \[ Y = \{ x \in \{0,1\}^\N : x(n) = 1 \Rightarrow x(n+1) = 0 \} \] of sequences in which every $1$ is immediately followed by a $0$. This set is certainly invariant under the shift map: if $y \in Y$ then $T(y)$ belongs to $Y$ as well.

To study the dynamics of $T$ on $Y$ using ergodic theory we want a measure on $Y$ that is $T$ invariant. For each $0 \le p \le 1$ we have the $(1-p,p)$ coin measure $\mu_p$ on $\{0,1\}^\N$ and each is $T$ invariant. However, only one of them - the point mass $\mu_0$ at the constant zero sequence - gives positive measure to $Y$. For example, since the number $w_n$ of cylinders of length $n$ that intersect $Y$ satisfies the recurrence \[ w_1 = 2 \quad w_2 = 3 \quad w_{n+2} = w_{n+1} + w_n \] for all $n \in \N$ we have \[ \mu_{1/2}(Y) \le \dfrac{w_n}{2^n} \to 0 \] as $n \to \infty$. How can we equip $Y$ with a probability measure $\mu$ that is $T$ invariant?
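
For concreteness, here is a minimal sketch in Python (not part of the original argument) computing $w_n$ from the recurrence and showing the bound $w_n / 2^n$ decaying to zero.

```python
# w_n counts the cylinders of length n that intersect Y, via the recurrence
# w_1 = 2, w_2 = 3, w_{n+2} = w_{n+1} + w_n from the text.

def w(n):
    a, b = 2, 3
    if n == 1:
        return a
    for _ in range(n - 2):
        a, b = b, a + b
    return b

for n in [1, 2, 5, 10, 20, 40]:
    print(n, w(n), w(n) / 2**n)   # the ratio w_n / 2^n tends to 0
```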

Walks on directed graphs

The coin measures were inappropriate because they see all cylinders: whenever $0 \l p \l 1$ they assign both $[01]$ and $[11]$ positive measure, whereas only the former intersects $Y$. We can think of the points in $Y$ as the outcomes of all possible infinite walks on the directed graph with vertices $0$ and $1$ and directed edges $0 \to 0$, $0 \to 1$ and $1 \to 0$.

To any endless journey on the graph one associates a sequence in $Y$ by recording the labels of the visited vertices. As the vertex labelled $1$ cannot be visited twice in succession, and as there are no other restrictions, we get every sequence in $Y$ this way.

Let us assign probabilities to each traversal. We encode these in a matrix \[ \begin{bmatrix} q(0,0) & q(0,1) \\ q(1,0) & q(1,1) \end{bmatrix} \] with $0 \le q(i,j) \le 1$ representing the probability that one moves in a single step from vertex $i$ to vertex $j$. We must have \[ q(0,0) + q(0,1) = 1 \qquad q(1,0) + q(1,1) = 1 \] and we must also have $q(1,1) = 0$ as we forbid consecutive ones. To entirely determine the measure we also need the probability of the starting location. Fix $0 \le p(i) \le 1$ with \[ p(0) + p(1) = 1 \] where $p(i)$ is the probability that one begins at vertex $i$. With this information - values for all $p(i)$ and all $q(i,j)$ - we define \[ \nu([\epsilon_1 \cdots \epsilon_r]) = p(\epsilon_1) \prod_{i=1}^{r-1} q(\epsilon_i, \epsilon_{i+1}) \] on all cylinder sets $[\epsilon_1 \cdots \epsilon_r]$. For example \[ \nu([01]) = p(0) q(0,1) \] is the probability of beginning at 0 multiplied by the probability that the first step is from 0 to 1.
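
The cylinder formula is straightforward to implement. Here is a sketch; the values of $p$ and $q$ below are illustrative choices satisfying the total probability constraints, with $q(1,1) = 0$ (they are in fact the fair-coin choice examined later).

```python
# Markov measure of a cylinder [e_1 ... e_r]: nu = p(e_1) prod q(e_i, e_{i+1}).
p = {0: 2/3, 1: 1/3}
q = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 0.0}

def nu(word):
    result = p[word[0]]
    for a, b in zip(word, word[1:]):
        result *= q[(a, b)]
    return result

print(nu((0, 1)))   # p(0) q(0,1) = 1/3
```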

Note that if $\nu$ is to be an invariant measure then we must have \[ p(i) = \nu([i]) = \nu(T^{-1} [i]) = p(0) q(0,i) + p(1) q(1,i) \] which is to say \[ \begin{bmatrix} p(0) & p(1) \end{bmatrix} \begin{bmatrix} q(0,0) & q(0,1) \\ q(1,0) & q(1,1) \end{bmatrix} = \begin{bmatrix} p(0) & p(1) \end{bmatrix} \] holds, so we assume this of our parameters.

We will take for granted that the above formula defines a measure $\nu$ on $\{0,1\}^\N$. Let us verify that it is $T$ invariant. Recall that it suffices to check that $\nu(C)$ and $\nu(T^{-1} C)$ agree for all cylinder sets $C$ because such sets form a π system generating the Borel σ algebra. Write $C = [\epsilon_1 \cdots \epsilon_r]$. We calculate \[ \begin{align*} \nu(T^{-1} C) &{} = \nu([0 \epsilon_1 \cdots \epsilon_r]) + \nu( [1 \epsilon_1 \cdots \epsilon_r]) \\ &{} = \Big( p(0) q(0,\epsilon_1) + p(1) q(1,\epsilon_1) \Big) \prod_{i=1}^{r-1} q(\epsilon_i, \epsilon_{i+1}) \\ &{} = p(\epsilon_1) \prod_{i=1}^{r-1} q(\epsilon_i, \epsilon_{i+1}) = \nu(C) \end{align*} \] so $\nu$ is $T$ invariant.
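
The same computation can be checked numerically. Here is a sketch verifying $\nu(T^{-1}C) = \nu(C)$ on all short cylinders, using the illustrative parameters from before (which satisfy the stationarity equation above).

```python
# T^{-1}[e_1 ... e_r] = [0 e_1 ... e_r] u [1 e_1 ... e_r], so invariance on
# cylinders amounts to nu((0,) + word) + nu((1,) + word) == nu(word).
from itertools import product

p = {0: 2/3, 1: 1/3}
q = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 0.0}

def nu(word):
    result = p[word[0]]
    for a, b in zip(word, word[1:]):
        result *= q[(a, b)]
    return result

for r in range(1, 7):
    for word in product([0, 1], repeat=r):
        assert abs(nu((0,) + word) + nu((1,) + word) - nu(word)) < 1e-12
print("invariance checked on all cylinders of length at most 6")
```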

Having imposed stationarity we are left with freedom only in the parameter $q(0,0)$: fixing its value determines $q(0,1)$ by the law of total probability, and then $p(0)$ and $p(1)$ are determined by the stationarity equations. What is the best way to choose its value? Absent any other information about the dynamics, or any other quantity that we might be interested in, it is often reasonable to choose the value that maximizes the entropy.

Proposition

For the measure $\nu$ above the quantity \[ - p(0) q(0,0) \log q(0,0) - p(0) q(0,1) \log q(0,1) \] is the entropy of $T$ with respect to $\nu$.

Proof:

As $\xi = ( [0], [1] )$ is a generator for the Borel σ algebra on $Y$ the limit \[ \lim_{N \to \infty} \dfrac{1}{N} \ent \left( \, \bigvee_{n=0}^{N-1} T^{-n} \xi \right) \] is the entropy $\ent(T)$ we want to calculate. With the convention $0 \log 0 = 0$ we have \[ \begin{align*} & - \ent \left( \, \bigvee_{n=0}^{N-1} T^{-n} \xi \right) \\ = {}& \sum_{\epsilon \in \{0,1\}^N} \nu( [\epsilon_1 \cdots \epsilon_N]) \log \nu( [\epsilon_1 \cdots \epsilon_N]) \\ = {}& \sum_{\epsilon \in \{0,1\}^N} p(\epsilon_1) q(\epsilon_1,\epsilon_2) \cdots q(\epsilon_{N-1},\epsilon_N) \log \Big( p(\epsilon_1) q(\epsilon_1,\epsilon_2) \cdots q(\epsilon_{N-1},\epsilon_N) \Big) \\ = {}& \sum_{i \in \{0,1\}} p(i) \log p(i) + (N-1) \sum_{i,j \in \{0,1\}} p(i) q(i,j) \log q(i,j) \end{align*} \] by writing the logarithm of the product as a sum of logarithms and then using repeatedly both \[ p(i) = p(0) q(0,i) + p(1) q(1,i) \qquad q(i,0) + q(i,1) = 1 \] for $i \in \{0,1\}$. Dividing by $N$ and taking the limit as $N \to \infty$ gives the desired result because $q(1,1) = 0$ and $q(1,0) = 1$ in our special case, so the terms with $i = 1$ in the second sum vanish.
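
To see the limit concretely, the following sketch computes $\frac{1}{N} \ent \left( \bigvee_{n=0}^{N-1} T^{-n} \xi \right)$ for the illustrative parameters used earlier and compares it with the closed form; base-10 logarithms are used to match the numerical values appearing later in the text.

```python
# The atoms of the join of xi, T^{-1}xi, ..., T^{-(N-1)}xi are the cylinders
# of length N, so its entropy is -sum nu(w) log nu(w) over such cylinders.
from itertools import product
from math import log10

p = {0: 2/3, 1: 1/3}
q = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 0.0}

def nu(word):
    result = p[word[0]]
    for a, b in zip(word, word[1:]):
        result *= q[(a, b)]
    return result

def H(N):
    # convention 0 log 0 = 0: skip the cylinders of measure zero
    return -sum(nu(w) * log10(nu(w))
                for w in product([0, 1], repeat=N) if nu(w) > 0)

target = -sum(p[i] * q[(i, j)] * log10(q[(i, j)])
              for i in (0, 1) for j in (0, 1) if q[(i, j)] > 0)
for N in (2, 4, 8, 16):
    print(N, H(N) / N, target)   # H(N)/N approaches target ~ 0.2007
```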

The Parry measure

We would like to maximize \[ - p(0) q(0,0) \log q(0,0) - p(0) q(0,1) \log q(0,1) \] for values in $[0,1]$ subject to \[ \begin{gather*} p(0) + p(1) = 1 \qquad q(0,0) + q(0,1) = 1 \\ p(0) q(0,0) + p(1) = p(0) \qquad p(0) q(0,1) = p(1) \end{gather*} \] which is not so simple an optimization problem.
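
Since the constraints leave only one free parameter, the optimization can at least be sketched by brute force. In the sketch below the reduction $p(0) = 1/(2 - q(0,0))$ follows from the displayed constraints together with $q(1,0) = 1$ and $q(1,1) = 0$.

```python
# Entropy as a function of the single free parameter q00 = q(0,0),
# maximized over a grid (base-10 logs to match the numbers in the text).
from math import log10

def entropy(q00):
    q01 = 1.0 - q00
    p0 = 1.0 / (2.0 - q00)   # forced by the stationarity constraints above
    return -p0 * (q00 * log10(q00) + q01 * log10(q01))

print(entropy(0.5))   # the fair-coin choice below: ~ 0.2007
best = max(range(1, 100000), key=lambda k: entropy(k / 100000))
print(best / 100000, entropy(best / 100000))   # ~ 0.61803, ~ 0.20899
```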

If we attempt to be as unbiased as possible in our walk on the graph, choosing which edge to traverse from vertex $0$ each time by a fair coin toss, then we can assert \[ q(0,0) = \tfrac{1}{2} = q(0,1) \] which then forces $p(0) = \tfrac{2}{3}$ and $p(1) = \tfrac{1}{3}$. For these particular choices we get an entropy value of \[ \ent(T) = -2 \cdot \dfrac{2}{3} \cdot \dfrac{1}{2} \log \dfrac{1}{2} = \dfrac{2}{3} \log 2 \approx 0.200686664 \] computing logarithms in base ten, but it is not clear that this is maximal.

Theorem (Parry)

Let $B$ be a $k \times k$ matrix with entries from $\{0,1\}$. Suppose there is $r \in \N$ such that all entries of $B^r$ are positive. Let $\lambda$ be the largest positive eigenvalue of $B$. Fix left and right eigenvectors $u$ and $v$ of $B$ for the eigenvalue $\lambda$, with positive entries, normalized so that \[ u(1) v(1) + \cdots + u(k) v(k) = 1 \] holds. With \[ p(i) = u(i) v(i) \] and \[ q(i,j) = \dfrac{B(i,j)}{\lambda} \dfrac{v(j)}{v(i)} \] the corresponding Markov measure maximizes the entropy for $T$ on the set \[ Y = \{ x \in \{1,\dots,k \}^\N : B(x(n),x(n+1)) = 1 \textup{ for all } n \in \N \} \] of sequences whose transitions are permitted by $B$.

Proof:

This is Theorem 8.10 in Walters' An Introduction to Ergodic Theory.

The resulting measure is the Parry measure on the Markov chain. For our example we have \[ B = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \] with eigenvalues \[ \dfrac{1 - \sqrt{5}}{2} \qquad \dfrac{1 + \sqrt{5}}{2} \] and the latter must be $\lambda$. The vectors \[ U = \begin{bmatrix} 1 + \sqrt{5} & 2 \end{bmatrix} \qquad V = \begin{bmatrix} 1 + \sqrt{5} \\ 2 \end{bmatrix} \] are left and right eigenvectors respectively of $B$ with eigenvalue $\lambda$. Since the ratios $V(j)/V(i)$ of eigenvector entries are unchanged by scaling we need not normalize the eigenvectors to conclude that \[ q(0,0) = \dfrac{B(0,0)}{\lambda} \dfrac{V(0)}{V(0)} = \dfrac{1}{\lambda} \] and that \[ q(0,1) = \dfrac{B(0,1)}{\lambda} \dfrac{V(1)}{V(0)} = \dfrac{1}{\lambda^2} \] are the values that will maximize the entropy. As \[ p(0) q(0,1) = p(1) = 1 - p(0) \] we conclude that \[ p(0) = \dfrac{\lambda^2}{1 + \lambda^2} \qquad p(1) = \dfrac{1}{1 + \lambda^2} \] giving an entropy value of \[ \ent(T) = \log \lambda \approx 0.20898764 \] which is indeed larger than the entropy we got from our naive guess $q(0,0) = \tfrac{1}{2}$. In fact, the entropy of the Parry measure is always $\log \lambda$.
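
As a check on the worked example, here is a sketch using numpy that carries out Parry's recipe for our $B$ and recovers the values above; the eigenvector normalization is handled numerically.

```python
# Parry measure from the Perron eigendata of B.
import numpy as np

B = np.array([[1, 1], [1, 0]], dtype=float)

vals, vecs = np.linalg.eig(B)
k = np.argmax(vals.real)                 # the Perron eigenvalue lambda
lam = vals[k].real
v = np.abs(vecs[:, k].real)              # positive right eigenvector

vals_t, vecs_t = np.linalg.eig(B.T)      # left eigenvectors of B
u = np.abs(vecs_t[:, np.argmax(vals_t.real)].real)
u = u / (u @ v)                          # normalize so sum_i u(i) v(i) = 1

p = u * v                                # stationary distribution
q = B * np.outer(1.0 / v, v) / lam       # q(i,j) = B(i,j) v(j) / (lam v(i))

print(lam)            # ~ 1.6180, the golden ratio
print(p)              # ~ [0.7236, 0.2764] = [lam^2/(1+lam^2), 1/(1+lam^2)]
print(q)              # row 0: [1/lam, 1/lam^2]; row 1: [1, 0]
print(np.log10(lam))  # ~ 0.20899, the entropy in base-10 logs as above
```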