{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'%.2f'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%matplotlib inline\n",
"%precision 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Nonparametric Latent Dirichlet Allocation\n",
"\n",
"_Latent Dirichlet Allocation_ is a [generative](https://en.wikipedia.org/wiki/Generative_model) model for topic modeling. Given a collection of documents, an LDA inference algorithm attempts to determined (in an unsupervised manner) the topics discussed in the documents. It makes the assumption that each document is generated by a probability model, and, when doing inference, we try to find the parameters that best fit the model (as well as unseen/latent variables generated by the model). If you are unfamiliar with LDA, Edwin Chen has a [friendly introduction](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/) you should read.\n",
"\n",
"\n",
"Because LDA is a _generative_ model, we can simulate the construction of documents by forward-sampling from the model. The generative algorithm is as follows (following [Heinrich](http://www.arbylon.net/publications/text-est.pdf)):\n",
"\n",
"* for each topic $k\\in [1,K]$ do\n",
" * sample term distribution for topic $\\overrightarrow \\phi_k \\sim \\text{Dir}(\\overrightarrow \\beta)$\n",
"* for each document $m\\in [1, M]$ do\n",
" * sample topic distribution for document $\\overrightarrow\\theta_m\\sim \\text{Dir}(\\overrightarrow\\alpha)$\n",
" * sample document length $N_m\\sim\\text{Pois}(\\xi)$\n",
" * for all words $n\\in [1, N_m]$ in document $m$ do\n",
" * sample topic index $z_{m,n}\\sim\\text{Mult}(\\overrightarrow\\theta_m)$\n",
" * sample term for word $w_{m,n}\\sim\\text{Mult}(\\overrightarrow\\phi_{z_{m,n}})$\n",
" \n",
"You can implement this with [a little bit of code](https://gist.github.com/tdhopper/521006b60e1311d45509) and start to simulate documents.\n",
"\n",
"In LDA, we assume each word in the document is generated by a two-step process:\n",
"\n",
"1. Sample a topic from the topic distribution for the document.\n",
"2. Sample a word from the term distribution from the topic. \n",
"\n",
"When we fit the LDA model to a given text corpus with an inference algorithm, our primary objective is to find the set of topic distributions $\\underline \\Theta$, term distributions $\\underline \\Phi$ that generated the documents, and latent topic indices $z_{m,n}$ for each word.\n",
"\n",
"To run the generative model, we need to specify each of these parameters:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"vocabulary = ['see', 'spot', 'run']\n",
"num_terms = len(vocabulary)\n",
"num_topics = 2 # K\n",
"num_documents = 5 # M\n",
"mean_document_length = 5 # xi\n",
"term_dirichlet_parameter = 1 # beta\n",
"topic_dirichlet_parameter = 1 # alpha"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The term distribution vector $\\underline\\Phi$ is a collection of samples from a Dirichlet distribution. This describes how our 3 terms are distributed across each of the two topics."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from scipy.stats import dirichlet, poisson\n",
"from numpy import round\n",
"from collections import defaultdict\n",
"from random import choice as stl_choice"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0.41 0.02 0.57]\n",
" [ 0.38 0.36 0.26]]\n"
]
}
],
"source": [
"term_dirichlet_vector = num_terms * [term_dirichlet_parameter]\n",
"term_distributions = dirichlet(term_dirichlet_vector, 2).rvs(size=num_topics)\n",
"print(term_distributions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each document corresponds to a categorical distribution across this distribution of topics (in this case, a 2-dimensional categorical distribution). This categorical distribution is a _distribution of distributions_; we could look at it as a Dirichlet process!\n",
"\n",
"The base base distribution of our Dirichlet process is a uniform distribution of topics (remember, topics are term distributions). "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"count: 5066 topic: [0.40999999999999998, 0.02, 0.56999999999999995]\n",
"count: 4934 topic: [0.38, 0.35999999999999999, 0.26000000000000001]\n"
]
}
],
"source": [
"base_distribution = lambda: stl_choice(term_distributions)\n",
"# A sample from base_distribution is a distribution over terms\n",
"# Each of our two topics has equal probability\n",
"from collections import Counter\n",
"for topic, count in Counter([tuple(base_distribution()) for _ in range(10000)]).most_common():\n",
" print(\"count:\", count, \"topic:\", [round(prob, 2) for prob in topic])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall that a sample from a Dirichlet process is a distribution that approximates (but varies from) the base distribution. In this case, a sample from the Dirichlet process will be a distribution over topics that varies from the uniform distribution we provided as a base. If we use the stick-breaking metaphor, we are effectively breaking a stick one time and the size of each portion corresponds to the proportion of a topic in the document.\n",
"\n",
"To construct a sample from the DP, we need to [again define our DP class](/dirichlet-distribution/):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from scipy.stats import beta\n",
"from numpy.random import choice\n",
"\n",
"class DirichletProcessSample():\n",
" def __init__(self, base_measure, alpha):\n",
" self.base_measure = base_measure\n",
" self.alpha = alpha\n",
" \n",
" self.cache = []\n",
" self.weights = []\n",
" self.total_stick_used = 0.\n",
"\n",
" def __call__(self):\n",
" remaining = 1.0 - self.total_stick_used\n",
" i = DirichletProcessSample.roll_die(self.weights + [remaining])\n",
" if i is not None and i < len(self.weights) :\n",
" return self.cache[i]\n",
" else:\n",
" stick_piece = beta(1, self.alpha).rvs() * remaining\n",
" self.total_stick_used += stick_piece\n",
" self.weights.append(stick_piece)\n",
" new_value = self.base_measure()\n",
" self.cache.append(new_value)\n",
" return new_value\n",
" \n",
" @staticmethod \n",
" def roll_die(weights):\n",
" if weights:\n",
" return choice(range(len(weights)), p=weights)\n",
" else:\n",
" return None"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each document, we will draw a topic distribution from the Dirichlet process:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"topic_distribution = DirichletProcessSample(base_measure=base_distribution, \n",
" alpha=topic_dirichlet_parameter)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sample from this _topic_ distribution is a _distribution over terms_. However, unlike our base distribution which returns each term distribution with equal probability, the topics will be unevenly weighted."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"count: 9589 topic: [0.38, 0.35999999999999999, 0.26000000000000001]\n",
"count: 411 topic: [0.40999999999999998, 0.02, 0.56999999999999995]\n"
]
}
],
"source": [
"for topic, count in Counter([tuple(topic_distribution()) for _ in range(10000)]).most_common():\n",
" print(\"count:\", count, \"topic:\", [round(prob, 2) for prob in topic])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To generate each word in the document, we draw a sample topic from the topic distribution, and then a term from the term distribution (topic). "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"topic_index = defaultdict(list)\n",
"documents = defaultdict(list)\n",
"\n",
"for doc in range(num_documents):\n",
" topic_distribution_rvs = DirichletProcessSample(base_measure=base_distribution, \n",
" alpha=topic_dirichlet_parameter)\n",
" document_length = poisson(mean_document_length).rvs()\n",
" for word in range(document_length):\n",
" topic_distribution = topic_distribution_rvs()\n",
" topic_index[doc].append(tuple(topic_distribution))\n",
" documents[doc].append(choice(vocabulary, p=topic_distribution))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the documents we generated:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['see', 'run', 'see', 'spot', 'see', 'spot']\n",
"['see', 'run', 'see']\n",
"['see', 'run', 'see', 'see', 'run', 'spot', 'spot']\n",
"['run', 'run', 'run', 'spot', 'run']\n",
"['run', 'run', 'see', 'spot', 'run', 'run']\n"
]
}
],
"source": [
"for doc in documents.values():\n",
" print(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see how each topic (term-distribution) is distributed across the documents:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Doc: 0\n",
" count: 6 topic: [0.38, 0.35999999999999999, 0.26000000000000001]\n",
"Doc: 1\n",
" count: 3 topic: [0.40999999999999998, 0.02, 0.56999999999999995]\n",
"Doc: 2\n",
" count: 5 topic: [0.40999999999999998, 0.02, 0.56999999999999995]\n",
" count: 2 topic: [0.38, 0.35999999999999999, 0.26000000000000001]\n",
"Doc: 3\n",
" count: 5 topic: [0.38, 0.35999999999999999, 0.26000000000000001]\n",
"Doc: 4\n",
" count: 5 topic: [0.40999999999999998, 0.02, 0.56999999999999995]\n",
" count: 1 topic: [0.38, 0.35999999999999999, 0.26000000000000001]\n"
]
}
],
"source": [
"for i, doc in enumerate(Counter(term_dist).most_common() for term_dist in topic_index.values()):\n",
" print(\"Doc:\", i)\n",
" for topic, count in doc:\n",
" print(5*\" \", \"count:\", count, \"topic:\", [round(prob, 2) for prob in topic])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To recap: for each document we draw a _sample_ from a Dirichlet _Process_. The base distribution for the Dirichlet process is a categorical distribution over term distributions; we can think of the base distribution as an $n$-sided die where $n$ is the number of topics and each side of the die is a distribution over terms for that topic. By sampling from the Dirichlet process, we are effectively reweighting the sides of the die (changing the distribution of the topics).\n",
"\n",
"For each word in the document, we draw a _sample_ (a term distribution) from the distribution (over term distributions) _sampled_ from the Dirichlet process (with a distribution over term distributions as its base measure). Each term distribution uniquely identifies the topic for the word. We can sample from this term distribution to get the word.\n",
"\n",
"Given this formulation, we might ask if we can roll an _infinite_ sided die to draw from an unbounded number of topics (term distributions). We can do exactly this with a _Hierarchical_ Dirichlet process. Instead of the base distribution of our Dirichlet process being a _finite_ distribution over topics (term distributions) we will instead make it an infinite Distribution over topics (term distributions) by using yet another Dirichlet process! This base Dirichlet process will have as its base distribution a Dirichlet _distribution_ over terms. \n",
"\n",
"We will again draw a _sample_ from a Dirichlet _Process_ for each document. The base distribution for the Dirichlet process is itself a Dirichlet process whose base distribution is a Dirichlet distribution over terms. (Try saying that 5-times fast.) We can think of this as a countably infinite die each side of the die is a distribution over terms for that topic. The sample we draw is a topic (distribution over terms).\n",
"\n",
"For each word in the document, we will draw a _sample_ (a term distribution) from the distribution (over term distributions) _sampled_ from the Dirichlet process (with a distribution over term distributions as its base measure). Each term distribution uniquely identifies the topic for the word. We can sample from this term distribution to get the word.\n",
"\n",
"These last few paragraphs are confusing! Let's illustrate with code."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"term_dirichlet_vector = num_terms * [term_dirichlet_parameter]\n",
"base_distribution = lambda: dirichlet(term_dirichlet_vector).rvs(size=1)[0]\n",
"\n",
"base_dp_parameter = 10\n",
"base_dp = DirichletProcessSample(base_distribution, alpha=base_dp_parameter)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This sample from the base Dirichlet process is our infinite sided die. It is a probability distribution over a countable infinite number of topics. \n",
"\n",
"The fact that our die is countably infinite is important. The sampler `base_distribution` draws topics (term-distributions) from an uncountable set. If we used this as the base distribution of the Dirichlet process below each document would be constructed from a _completely unique set of topics_. By feeding `base_distribution` into a Dirichlet Process (stochastic memoizer), we allow the topics to be shared across documents. \n",
"\n",
"In other words, `base_distribution` will never return the same topic twice; however, every topic sampled from `base_dp` would be sampled an infinite number of times (if we sampled from `base_dp` forever). At the same time, `base_dp` will also return an _infinite number_ of topics. In our formulation of the the LDA sampler above, our base distribution only ever returned a finite number of topics (`num_topics`); there is no `num_topics` parameter here.\n",
"\n",
"Given this setup, we can generate documents from the _hierarchical Dirichlet process_ with an algorithm that is essentially identical to that of the original _latent Dirichlet allocation_ generative sampler:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nested_dp_parameter = 10\n",
"\n",
"topic_index = defaultdict(list)\n",
"documents = defaultdict(list)\n",
"\n",
"for doc in range(num_documents):\n",
" topic_distribution_rvs = DirichletProcessSample(base_measure=base_dp, \n",
" alpha=nested_dp_parameter)\n",
" document_length = poisson(mean_document_length).rvs()\n",
" for word in range(document_length):\n",
" topic_distribution = topic_distribution_rvs()\n",
" topic_index[doc].append(tuple(topic_distribution))\n",
" documents[doc].append(choice(vocabulary, p=topic_distribution))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the documents we generated:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['spot', 'spot', 'spot', 'spot', 'run']\n",
"['spot', 'spot', 'see', 'spot']\n",
"['spot', 'spot', 'spot', 'see', 'spot', 'spot', 'spot']\n",
"['run', 'run', 'spot', 'spot', 'spot', 'spot', 'spot', 'spot']\n",
"['see', 'run', 'see', 'run', 'run', 'run']\n"
]
}
],
"source": [
"for doc in documents.values():\n",
" print(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And here are the latent topics used:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Doc: 0\n",
" count: 2 topic: [0.17999999999999999, 0.79000000000000004, 0.02]\n",
" count: 1 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]\n",
" count: 1 topic: [0.089999999999999997, 0.54000000000000004, 0.35999999999999999]\n",
" count: 1 topic: [0.22, 0.40000000000000002, 0.38]\n",
"Doc: 1\n",
" count: 2 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]\n",
" count: 1 topic: [0.17999999999999999, 0.79000000000000004, 0.02]\n",
" count: 1 topic: [0.35999999999999999, 0.55000000000000004, 0.089999999999999997]\n",
"Doc: 2\n",
" count: 4 topic: [0.11, 0.65000000000000002, 0.23999999999999999]\n",
" count: 2 topic: [0.070000000000000007, 0.65000000000000002, 0.27000000000000002]\n",
" count: 1 topic: [0.28999999999999998, 0.65000000000000002, 0.070000000000000007]\n",
"Doc: 3\n",
" count: 2 topic: [0.17999999999999999, 0.79000000000000004, 0.02]\n",
" count: 2 topic: [0.25, 0.55000000000000004, 0.20000000000000001]\n",
" count: 2 topic: [0.28999999999999998, 0.65000000000000002, 0.070000000000000007]\n",
" count: 1 topic: [0.23000000000000001, 0.58999999999999997, 0.17999999999999999]\n",
" count: 1 topic: [0.089999999999999997, 0.54000000000000004, 0.35999999999999999]\n",
"Doc: 4\n",
" count: 3 topic: [0.40000000000000002, 0.23000000000000001, 0.37]\n",
" count: 2 topic: [0.42999999999999999, 0.17999999999999999, 0.40000000000000002]\n",
" count: 1 topic: [0.23000000000000001, 0.29999999999999999, 0.46000000000000002]\n"
]
}
],
"source": [
"for i, doc in enumerate(Counter(term_dist).most_common() for term_dist in topic_index.values()):\n",
" print(\"Doc:\", i)\n",
" for topic, count in doc:\n",
" print(5*\" \", \"count:\", count, \"topic:\", [round(prob, 2) for prob in topic])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our documents were generated by an unspecified number of topics, and yet the topics were shared across the 5 documents. This is the power of the hierarchical Dirichlet process!\n",
"\n",
"This non-parametric formulation of Latent Dirichlet Allocation was first published by [Yee Whye Teh et al](http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf). \n",
"\n",
"Unfortunately, forward sampling is the easy part. Fitting the model on data requires [complex MCMC](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf) or [variational inference](http://www.cs.princeton.edu/~chongw/papers/WangPaisleyBlei2011.pdf). There are a [limited](http://www.stats.ox.ac.uk/~teh/software.html) [number](https://github.com/shuyo/iir/blob/master/lda/hdplda2.py) of [implementations](https://github.com/renaud/hdp-faster) [of HDP-LDA](http://www.arbylon.net/resources.html) available, and none of them are great. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"nikola": {
"slug": "nonparametric-lda",
"title": "Nonparametric Latent Dirichlet Allocation"
}
},
"nbformat": 4,
"nbformat_minor": 1
}