\[ \newcommand{\C}{\mathbb{C}} \newcommand{\haar}{\mathsf{m}} \DeclareMathOperator{\cont}{\mathsf{C}} \newcommand{\contc}{\cont_\mathsf{c}} \newcommand{\conto}{\cont_\mathsf{0}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\g}{>} \newcommand{\l}{<} \newcommand{\intd}{\,\mathsf{d}} \newcommand{\Re}{\mathsf{Re}} \newcommand{\area}{\mathop{\mathsf{Area}}} \newcommand{\met}{\mathop{\mathsf{d}}} \newcommand{\orb}{\mathop{\mathsf{orb}}} \newcommand{\emptyset}{\varnothing} \newcommand{\B}{\mathscr{B}} \DeclareMathOperator{\borel}{\mathsf{Bor}} \DeclareMathOperator{\lpell}{\mathsf{L}} \newcommand{\lp}[1]{\lpell^{\!\mathsf{#1}}} \newcommand{\Lp}[1][p]{\mathsf{L}^{\!\mathsf{#1}}} \renewcommand{\|}{|\!|} \newcommand{\ent}{\operatorname{\mathsf{H}}} \]

Entropy

In the previous section we saw that an irrational rotation $x \mapsto x + \alpha$ on $[0,1)$ and the doubling map $x \mapsto 2x \bmod 1$ are essentially different dynamical systems because the former has lots of eigenfunctions while the latter has no non-constant ones. Let us further pursue the topic of distinguishing dynamical systems by analysing the shift map \[ (T x)(n) = x(n+1) \] on $X = \{0,1\}^\N$ with respect to two different coin measures $\mu_p$ and $\mu_q$. Are the measure-preserving systems \[ (X,\B,\mu_p, T) \qquad (X,\B,\mu_q,T) \] distinguishable? Both are mixing, so neither has any non-constant eigenfunctions, and we therefore cannot distinguish them by spectral properties of their Koopman operators.
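Concretely, and fixing the convention that the digit 1 appears with probability $p$ and the digit 0 with probability $1-p$ independently in each coordinate, the coin measure $\mu_p$ is determined by its values \[ \mu_p( [\epsilon_1 \epsilon_2 \cdots \epsilon_k] ) = \prod_{i=1}^{k} p^{\epsilon_i} (1-p)^{1-\epsilon_i} \] on cylinder sets.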

It turns out that these transformations are not the same, and that they can be distinguished by an invariant originating in statistical physics and information theory - entropy - that is not detected by the Koopman operator. For some of the history on the extent to which the Koopman operator determines the point dynamics, see this article.

Defining entropy

Fix a measure space $(X,\B,\mu)$ with $\mu(X) = 1$. By a partition of $(X,\B,\mu)$ we mean a finite tuple \[ \xi = (A_1,\dots,A_r) \] of sets in $\B$ that are pairwise disjoint and cover $X$. We think of a partition as classifying the points of $X$ according to some rule. For example \[ \xi = ([00],[01],[10],[11]) \] is a partition of $\{0,1\}^\N$ classifying points according to their first two terms, and \[ \eta = ( [0,\tfrac{1}{3}), [\tfrac{1}{3},\tfrac{2}{3}), [\tfrac{2}{3},1) ) \] is a partition of $[0,1)$ classifying points according to the first digit of their ternary expansions.

The measure $\mu$ assigns a size to each set in a given partition. Since a $\mu$-random point is less likely to be found in a set of small measure, we can regard learning that the point lies in a small set of the partition as gaining more information than learning that it lies in a large one.
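One way to make this precise is to declare that the information gained upon learning that a point belongs to a set $A$ is \[ -\log \mu(A) \] so that improbable sets carry a lot of information while sets of full measure carry none; the entropy defined below is then the expected information gained from the experiment corresponding to $\xi$.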

Definition

Given a partition \[ \xi = (A_1,\dots,A_r) \] of a probability space $(X,\B,\mu)$ the quantity \[ \ent(\xi) = - \sum_{i=1}^r \mu(A_i) \log \mu(A_i) \] is called its entropy.

To make sense of the above definition we adopt the convention $0 \log 0 = 0$.

We imagine that the entropy of a partition tells us the amount of information we gain by finding out to which member of our partition a random point in $X$ belongs.

Example

The partition \[ \xi = ( [00], [01], [10], [11] ) \] has, for the fair coin measure $\mu$, entropy \[ \begin{align*} \ent(\xi) &{} = - \mu([00]) \log \mu([00]) - \mu([01]) \log \mu([01]) \\ &{} \quad\quad - \mu([10]) \log \mu([10]) - \mu([11]) \log \mu([11]) \\ &{} = - 4 ( \tfrac{1}{4} \log \tfrac{1}{4} ) \\ &{} = \log 4 \end{align*} \] and for any other partition $\eta$ of $\{0,1\}^\N$ into four sets we have \[ 0 \le \ent(\eta) \le \log 4 \] by convexity of $t \mapsto t \log t$.
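For a biased coin the entropy is strictly smaller. For instance, taking the coin measure $\mu_{1/3}$, under which the digit 1 appears with probability $\tfrac{1}{3}$, the same partition has entropy \[ \begin{align*} \ent(\xi) &{} = - \tfrac{4}{9} \log \tfrac{4}{9} - 2 \cdot \tfrac{2}{9} \log \tfrac{2}{9} - \tfrac{1}{9} \log \tfrac{1}{9} \\ &{} = 2 \log 3 - \tfrac{4}{3} \log 2 \end{align*} \] which is strictly less than $\log 4$.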

Given a partition $\xi$ we can form a new partition \[ T^{-1} \xi = ( T^{-1} A_1, \dots, T^{-1} A_r) \] using $T$. The new partition $T^{-1} \xi$ corresponds to performing the experiment described by $\xi$ after one iteration of the dynamics. Since $T$ is measure-preserving the partitions $\xi$ and $T^{-1} \xi$ have the same entropy. How much information do we gain from performing the experiment corresponding to $\xi$ both now and after one iteration of the dynamics? This is the same as performing the experiment corresponding to the partition \[ \xi \vee T^{-1} \xi = ( A_i \cap T^{-1} A_j : 1 \le i,j \le r) \] where $\vee$ denotes the join of two partitions.
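In general the join of two partitions $\xi$ and $\eta$ satisfies \[ \ent(\xi \vee \eta) \le \ent(\xi) + \ent(\eta) \] with equality precisely when $\xi$ and $\eta$ are independent, so performing two experiments never yields more information than the two experiments performed in isolation.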

It may be that the entropy gained \[ \ent(\xi \vee T^{-1} \xi) - \ent(\xi) \] by performing the experiment a second time is zero; this happens, for instance, when $T$ is the identity map, since then $T^{-1} \xi = \xi$ and the join is just $\xi$ again. On the other hand the gain can be quite large.

Example

For the partition $\xi = ([00],[10],[01],[11])$ we have \[ \xi \vee T^{-1} \xi = ( [000], [100], [010], [110], [001], [101], [011], [111] ) \] and its entropy for the fair coin measure is $\log 8$. Note that \[ \ent(\xi \vee T^{-1} \xi) - \ent(\xi) = \log 2 \] and we can think of this difference as representing the extra bit needed to store the outcome of the experiment $\xi \vee T^{-1} \xi$ compared with the outcome of experiment $\xi$.
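Note that here \[ \ent(\xi \vee T^{-1} \xi) = \log 8 \l \log 16 = \ent(\xi) + \ent(T^{-1} \xi) \] in accordance with the inequality for joins above: both experiments reveal the second coordinate of the point, so they are not independent and the join yields strictly less than the sum of their entropies.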

Definition

Fix a measure-preserving transformation $T$ on a probability space $(X,\B,\mu)$. The quantity \[ \ent(T, \xi) = \lim_{N \to \infty} \dfrac{1}{N} \ent \left( \bigvee_{n=0}^{N-1} T^{-n} \xi \right) \] is the entropy of $T$ for the partition $\xi$. The limit exists because the sequence \[ N \mapsto \ent \left( \bigvee_{n=0}^{N-1} T^{-n} \xi \right) \] is subadditive, by the inequality for joins above and the fact that $T$ preserves $\mu$, so Fekete's lemma applies.

The entropy of $T$ for the partition $\xi$ is the asymptotic average amount of information per iteration obtained by repeatedly performing the experiment corresponding to $\xi$ along more and more iterations of the dynamics. We imagine that positivity of $\ent(T, \xi)$ tells us something about the randomness of $T$. If $\ent(T,\xi)$ is positive then $T$ is unpredictable in the sense that no matter how many iterations of the experiment one carries out, there is still information to be gained by performing the experiment again.

Example

Let us calculate the entropy of the full shift $T$ on $\{0,1\}^\N$ for the partition $\xi = \{ [0], [1] \}$ and the fair coin measure $\mu$. Since \[ \xi \vee T^{-1} \xi \vee \cdots \vee T^{-(N-1)} \xi \] is equal to the partition of $\{ 0,1\}^\N$ into cylinder sets of length $N$ and each such cylinder has measure $1/2^N$ we see that \[ \ent \left( \bigvee_{n=0}^{N-1} T^{-n} \xi \right) = \log(2^N) \] and therefore \[ \ent(T,\xi) = \lim_{N \to \infty} \dfrac{1}{N} \log(2^N) = \log 2 \] is the entropy of $T$ with respect to $\xi$.

The transformation $T$ may very well have different entropies with respect to different partitions of $X$. To form an invariant of the transformation itself we look for the experiment that maximizes the entropy.

Definition

The entropy of a measure-preserving map $T$ on a probability space $(X,\B,\mu)$ is the supremum \[ \ent(T) = \sup \{ \ent(T,\xi) : \xi \textup{ a partition of } (X,\B) \textup{ with finite entropy} \} \] of all possible entropies of $T$ with respect to partitions that themselves have finite entropy.
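For example, for an irrational rotation $T x = x + \alpha$ on $[0,1)$ and any partition $\xi$ of $[0,1)$ into $r$ intervals, the join $\bigvee_{n=0}^{N-1} T^{-n} \xi$ is a partition of the circle into at most $rN$ arcs, so its entropy is at most $\log(rN)$ and \[ \ent(T,\xi) \le \lim_{N \to \infty} \dfrac{1}{N} \log (rN) = 0 \] and with a little more work one can conclude that $\ent(T) = 0$: rotations are entirely predictable in the sense of entropy.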

The following important theorem gives us a way to calculate the entropy of a measure-preserving transformation.

Theorem (Kolmogorov-Sinai)

If $\xi$ is a partition such that \[ \sigma \left( \, \bigvee_{n=0}^{\infty} T^{-n} \xi \right) = \B \] up to $\mu$-null sets then $\ent(T) = \ent(T,\xi)$.
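A partition satisfying the hypothesis of the theorem is called a generator for $T$.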

Since \[ \bigvee_{n=0}^{N-1} T^{-n} \xi \] is the partition of $\{0,1\}^\N$ into cylinder sets of length $N$ when $\xi = ( [0], [1] )$, and since the cylinder sets generate the Borel σ-algebra on $\{0,1\}^\N$, the partition $( [0], [1] )$ is a generator for the shift, and we can use it to calculate the entropy of the shift for any shift-invariant measure. If $\mu_p$ is the $(1-p,p)$ coin measure then the coordinates are independent, so \[ \ent \left( \bigvee_{n=0}^{N-1} T^{-n} \xi \right) = N \ent(\xi) = N \bigl( -(1-p) \log (1-p) - p \log p \bigr) \] and therefore \[ \ent(T) = \ent(T,\xi) = -(1-p) \log (1-p) - p \log p \] by the Kolmogorov-Sinai theorem. Since this quantity is strictly increasing in $p$ on $[0,\tfrac{1}{2}]$ and entropy is an isomorphism invariant, we conclude that for no two values $0 \le p \l q \le \tfrac{1}{2}$ are the measure-preserving systems $(X,\B,\mu_p,T)$ and $(X,\B,\mu_q,T)$ isomorphic.
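To see the difference numerically, here is a minimal Python sketch (not part of the original argument; the function name and structure are illustrative only) that computes the entropy of the length-$N$ cylinder partition under a coin measure directly from the definition and divides by $N$.

```python
from itertools import product
from math import log

def cylinder_entropy_rate(p, N):
    # Entropy of the partition of {0,1}^N into length-N cylinder sets under
    # the (1-p, p) coin measure, divided by N.  This is the quantity whose
    # limit in N defines H(T, xi) for xi = ([0], [1]) and the shift T.
    total = 0.0
    for word in product((0, 1), repeat=N):
        # measure of the cylinder [word]: each 1 contributes p, each 0 contributes 1-p
        mu = 1.0
        for digit in word:
            mu *= p if digit == 1 else 1.0 - p
        if mu > 0:
            total -= mu * log(mu)
    return total / N

# Because the coordinates are independent the rate already equals
# -(1-p) log(1-p) - p log p for every N; increasing N just confirms the limit.
print(cylinder_entropy_rate(0.5, 10))   # ~0.6931 = log 2
print(cylinder_entropy_rate(0.25, 10))  # ~0.5623, strictly smaller
```

For the fair coin the output is $\log 2 \approx 0.6931$, while for $p = \tfrac{1}{4}$ it is approximately $0.5623$, so the two shifts cannot be isomorphic.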