2015-10-13-dirichlet-process-nomenclature




Nomenclature of Dirichlet Processes

Much of the literature on Dirichlet Processes makes assertions similar to the following:

  • "DP is the Dirichlet process, a distribution over distributions." (Neal, 2000)
  • "[The Dirichlet process] is a distribution over distributions, i.e. each draw from a Dirichlet process is itself a distribution." (Teh, 2010)
  • "The Dirichlet process (DP) is a distribution over distributions." (Gershman and Blei, 2012)
  • The "Dirichlet process defines a distribution on random probability measures..." (Sudderth, 2006)
  • "Dirichlet processes define a distribution over distributions..." (Ghahramani, 2005)

Michael Jordan makes an equivalent statement.

Each of these sources claims that a Dirichlet process is a distribution over probability distributions. That is, given a base distribution $G_0$ and a concentration parameter $\alpha > 0$, $DP(\alpha, G_0)$ is a Dirichlet process and, equivalently, a distribution over probability distributions. A sample $G\sim DP(\alpha, G_0)$ is therefore itself a probability distribution. With probability one, $G$ is a discrete probability distribution whose support is a countable subset of the support of $G_0$.
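To make "a sample $G$ is itself a distribution" concrete, here is a minimal sketch that draws an approximate $G$ using the stick-breaking construction. Note that stick-breaking, the truncation to a finite number of atoms, the function name `sample_dp_truncated`, and the choice of a standard normal for $G_0$ are my assumptions for illustration; none of them come from the sources quoted above.

```python
import numpy as np

def sample_dp_truncated(alpha, base_sampler, num_atoms, rng):
    """Approximate draw G ~ DP(alpha, G_0) via truncated stick-breaking.

    Returns (atoms, weights): G places mass weights[i] on atoms[i].
    """
    # Stick-breaking proportions v_i ~ Beta(1, alpha).
    v = rng.beta(1.0, alpha, size=num_atoms)
    # Stick length remaining before each break: prod_{j < i} (1 - v_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    weights = v * remaining
    # Truncation leaves a little mass unassigned; renormalizing is an approximation.
    weights /= weights.sum()
    # Atom locations are i.i.d. draws from the base distribution G_0.
    atoms = base_sampler(rng, num_atoms)
    return atoms, weights

rng = np.random.default_rng(0)
# Hypothetical choice: G_0 is a standard normal and alpha = 2.
atoms, weights = sample_dp_truncated(
    alpha=2.0,
    base_sampler=lambda r, n: r.standard_normal(n),
    num_atoms=500,
    rng=rng,
)
# G is the discrete distribution sum_i weights[i] * delta(atoms[i]).
```

The returned pair `(atoms, weights)` is one (approximate) sample from $DP(\alpha, G_0)$: a whole discrete distribution, not a single point of the support of $G_0$.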

Confusingly, many of the sources that call the DP a distribution over distributions use the phrase "sample from a Dirichlet process" to mean a sample from $G$, not from $DP(\alpha, G_0)$. These authors appear to equivocate on the term "Dirichlet process": it means both a distribution over distributions and a distribution sampled from that distribution over distributions.
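The other sense of "sample", points drawn from $G$ rather than $G$ itself, can be sketched with the Pólya urn scheme, which generates $\theta_1, \ldots, \theta_n$ from $G$ with $G \sim DP(\alpha, G_0)$ integrated out. The urn scheme, the function name `polya_urn_sample`, and the standard normal $G_0$ are assumptions I am bringing in for illustration; they are not taken from the sources quoted above.

```python
import numpy as np

def polya_urn_sample(alpha, base_sampler, n, rng):
    """Draw theta_1..theta_n from G, with G ~ DP(alpha, G_0) integrated out."""
    thetas = []
    for i in range(n):
        # With probability alpha / (alpha + i), draw a fresh value from G_0.
        if rng.random() < alpha / (alpha + i):
            thetas.append(base_sampler(rng))
        else:
            # Otherwise repeat one of the i earlier values, chosen uniformly.
            thetas.append(thetas[rng.integers(i)])
    return np.array(thetas)

rng = np.random.default_rng(1)
# Hypothetical choice: G_0 is a standard normal and alpha = 1.
thetas = polya_urn_sample(1.0, lambda r: r.standard_normal(), 200, rng)
num_distinct = len(np.unique(thetas))
```

Each `thetas[i]` is a point in the support of $G_0$, and values repeat because the underlying $G$ is discrete. This is the kind of object many authors have in mind when they speak of a "sample from a Dirichlet process".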

After being confused by this point for some time, I prepared these notes arguing that the Dirichlet process is a distribution over distributions. I argued that the term "sample from a Dirichlet process" should refer to a distribution sampled from the DP, not to a point sampled from the support of $G_0$.

In response to my notes, Dan Roy briefly argued that "The Dirichlet process is a distribution on the space of probability measures" is a misstatement. In fact, Roy argues that $DP(\alpha, G_0)$ is not a Dirichlet process at all! Instead $G$ (the distribution sampled from $DP(\alpha, G_0)$) is the Dirichlet process.

Thomas Ferguson first defined the Dirichlet process in his 1973 paper. Charles Antoniak (a student of Ferguson) repeats the definition in his 1974 paper. Antoniak's definition is as follows:

Let $\Theta$ be a set, and $\mathcal{A}$ a $\sigma$-field of subsets of $\Theta$. Let $\beta$ be a finite, nonnull, nonnegative, finitely additive measure on $(\Theta, \mathcal{A})$. We say a random probability measure $P$ on $(\Theta, \mathcal{A})$ is a Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\beta$, denoted $P\in \mathcal{D}(\beta)$, if for every $k=1, 2, \ldots$ and measurable partition $B_1,\ldots,B_k$ of $\Theta$, the joint distribution of the random probabilities $(P(B_1),\ldots,P(B_k))$ is Dirichlet with parameters $(\beta(B_1),\ldots,\beta(B_k))$, denoted $(P(B_1),\ldots,P(B_k))\in \mathcal{D}(\beta(B_1),\ldots,\beta(B_k))$.

Let's unpack this dense, measure-theoretic definition.

Let $\Theta$ be a set, and $\mathcal{A}$ a $\sigma$-field of subsets of $\Theta$. Let $\beta$ be a finite, nonnull, nonnegative, finitely additive measure on $(\Theta, \mathcal{A})$.

First, note that $\beta$ here is a finite measure, i.e. a not-necessarily-normalized probability distribution. Antoniak's $\beta$ is equivalent to $\alpha\cdot G_0$ in our notation. Essentially, this means we have a probability distribution over some set $\Theta$, obtained by dividing $\beta$ by its total mass $\beta(\Theta)$. (The business about a $\sigma$-field of subsets just restricts us to measurable sets, which allows us to avoid things like the Banach–Tarski paradox.)
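As a worked example of this correspondence (a sketch in this post's notation, with a hypothetical choice of $\beta$), the pair $(\alpha, G_0)$ can be read off from $\beta$ by normalizing:

$$\alpha = \beta(\Theta), \qquad G_0(B) = \frac{\beta(B)}{\beta(\Theta)}, \qquad \text{so that} \qquad \beta(B) = \alpha\, G_0(B) \quad \text{for all } B \in \mathcal{A}.$$

For instance, if $\Theta = [0, 1]$ and $\beta$ is three times Lebesgue measure, then $\alpha = 3$ and $G_0$ is the uniform distribution on $[0, 1]$. The set $B = [0, 0.2)$ then has $\beta(B) = 0.6$ and $G_0(B) = 0.2$.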

We say a random probability measure $P$ on $(\Theta, \mathcal{A})$ is a Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\beta$, denoted $P\in \mathcal{D}(\beta)$

TODO

...if for every $k=1, 2, \ldots$ and measurable partition $B_1,\ldots,B_k$ of $\Theta$, the joint distribution of the random probabilities $(P(B_1),\ldots,P(B_k))$ is Dirichlet with parameters $(\beta(B_1),\ldots,\beta(B_k))$, denoted $(P(B_1),\ldots,P(B_k))\in \mathcal{D}(\beta(B_1),\ldots,\beta(B_k))$.
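The partition condition can be checked numerically, at least roughly. The sketch below assumes $\beta = \alpha \cdot \mathrm{Uniform}(0, 1)$ with $\alpha = 3$, draws many approximate random measures $P$ by truncated stick-breaking (my choice of construction, not part of Antoniak's definition), and compares the empirical mean and variance of $(P(B_1), P(B_2), P(B_3))$ with those of a Dirichlet distribution with parameters $(\beta(B_1), \beta(B_2), \beta(B_3))$. The function name `dp_partition_masses` and the particular partition are hypothetical.

```python
import numpy as np

def dp_partition_masses(alpha, edges, num_atoms, rng):
    """One draw of (P(B_1), ..., P(B_k)) for P ~ DP with beta = alpha * Uniform(0, 1).

    The cells B_i are the half-open intervals [edges[i], edges[i + 1]).
    """
    # Truncated stick-breaking weights (an approximation to the full process).
    v = rng.beta(1.0, alpha, size=num_atoms)
    weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    weights /= weights.sum()
    # Atom locations drawn from G_0 = Uniform(0, 1).
    atoms = rng.random(num_atoms)
    # Index of the partition cell each atom falls in.
    cells = np.searchsorted(edges, atoms, side="right") - 1
    # P(B_i) is the total weight of the atoms that landed in B_i.
    return np.bincount(cells, weights=weights, minlength=len(edges) - 1)

rng = np.random.default_rng(0)
alpha = 3.0
edges = np.array([0.0, 0.2, 0.7, 1.0])  # B_1 = [0, 0.2), B_2 = [0.2, 0.7), B_3 = [0.7, 1)
draws = np.array([dp_partition_masses(alpha, edges, 200, rng) for _ in range(2000)])

beta_masses = alpha * np.diff(edges)  # beta(B_i) = alpha * G_0(B_i)
total = beta_masses.sum()             # beta(Theta) = alpha
dirichlet_mean = beta_masses / total
dirichlet_var = dirichlet_mean * (1.0 - dirichlet_mean) / (total + 1.0)

print("empirical mean:", draws.mean(axis=0), "Dirichlet mean:", dirichlet_mean)
print("empirical var: ", draws.var(axis=0), "Dirichlet var: ", dirichlet_var)
```

Matching the first two moments does not prove that the joint distribution is Dirichlet, but it is a cheap sanity check of what the definition asserts: the random object is $P$, and the Dirichlet distribution describes the random vector of masses that $P$ assigns to the cells of any finite measurable partition.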