In the words of John von Neumann:

In mathematics you don't understand things. You just get used to them.

In [1]:
# import libraries and set plotting/display options
%matplotlib inline
import numpy as np
from scipy import stats
from scipy.stats import norm
import pandas as pd

import matplotlib.pyplot as plt
pd.set_option('display.width',500)
pd.set_option('display.max_columns',100)
import seaborn as sns

Introduction

Motivation

Markov Chain Monte Carlo (MCMC) techniques are applied to solve integration and optimization problems in high-dimensional spaces. Problems that are intractable with analytic approaches often become tractable with some form of MCMC.

MCMC algorithm

MCMC is a strategy for generating samples $x^{(i)}$ using a Markov chain mechanism. We emphasize that MCMC is used when we cannot sample directly from $p(x)$, but can evaluate $p(x)$ up to a normalizing constant.

Markov Chains

For simplicity we introduce Markov chains on a finite state space, where each $x^{(i)}$ can take only one of $s$ discrete values, $x^{(i)} \in \{x_1, x_2, \cdots, x_s\}$. A stochastic process is called a Markov chain if we transition from one state to another using a simple sequential procedure. We start a Markov chain at some state $x^{(1)}$, and use a transition function $T(x^{(i)}|x^{(i-1)})$ to determine the next state, $x^{(2)}$, conditional on the last state. We then keep iterating to create a sequence of states: \begin{equation} x^{(1)}\rightarrow x^{(2)} \rightarrow \cdots \rightarrow x^{(i)} \rightarrow \cdots \end{equation}

Each such sequence of states is called a Markov chain, or simply a chain. The procedure for generating a sequence of $T$ states from a Markov chain is the following (a code sketch appears after the list):

  • Set i = 1
  • Generate an initial value $u$, and set $x^{(1)}=u$
  • Repeat
    • i = i+1
    • Sample a new value $u$ from the transition function $T(x^{(i)}|x^{(i-1)})$
    • Set $x^{(i)}=u$
  • Until $i=T$
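A minimal code sketch of this procedure (an illustration, not part of the original text): it simulates a single chain on a three-state space, drawing each new state from the row of the transition matrix indexed by the current state. The matrix anticipates the one used in the first code cell below; the state labels 0, 1, 2 are arbitrary.

# Sketch: generate one Markov chain trajectory state by state.
T_mat = np.array([[0.69, 0.30, 0.01],
                  [0.80, 0.10, 0.10],
                  [0.00, 0.00, 1.00]])   # rows sum to 1

n_steps = 20
chain = np.zeros(n_steps, dtype=int)
chain[0] = np.random.choice(3)           # random initial state
for i in np.arange(1, n_steps):
    # sample x^(i) from T(x^(i) | x^(i-1)): the row for the current state
    chain[i] = np.random.choice(3, p=T_mat[chain[i - 1]])
print(chain)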

Importantly, in this iterative procedure, the next state of the chain at step $i+1$ is based only on the previous state at step $i$. Each Markov chain therefore wanders around the state space, and the transition to a new state depends only on the current state. It is this local dependency that makes the procedure Markov, or memoryless. Formally,

\begin{equation} p(x^{(i)}|x^{(i-1)},\cdots, x^{(1)}) = T(x^{(i)}|x^{(i-1)}) \end{equation}

The chain is called homogeneous if $T(x^{(i)}|x^{(i-1)})$ remains unchanged for all $i$.

When a Markov chain is initialized, it wanders in state space around the starting state. Therefore, if we start a number of chains, each with different initial conditions, the chains will initially stay close to their starting states. This period is called burn-in. An important property of Markov chains is that the starting state no longer affects the state of the chain after a sufficiently long sequence of transitions (assuming certain conditions on the Markov chain are met). At this point the chain is said to have reached its steady state, and the states are samples from its stationary distribution. This property, that Markov chains converge to a stationary distribution regardless of where they started (provided certain regularity conditions of the transition function are met), is quite important.

In [2]:
# Simple example of a Markov chain
# Transition matrix (each row sums to 1; note the third state is absorbing)
T = np.array([[0.69, 0.3, 0.01],
              [0.8, 0.1, 0.1],
              [0, 0, 1]])

# Random starting distribution over the three states
S = np.random.uniform(low=0, high=1, size=3)
S = S / np.sum(S)
print(S)
Q = np.zeros([100, 3])

# Distribution over states after i transitions: S T^i
for i in np.arange(0, 100):
    Q[i, :] = np.dot(S, np.linalg.matrix_power(T, i))

plt.plot(Q)
plt.ylim([-0.1, 1.1])
plt.legend(('H', 'S', 'D'), loc=2)
plt.show()
[ 0.27422575  0.09983014  0.6259441 ]

This stability result plays a fundamental role in MCMC simulation. For any starting point, the chain will converge to the invariant distribution $p(x)$, as long as $T$ is a stochastic transition matrix that obeys the following properties:

  • Irreducibility: From any state of the Markov chain, there is a positive probability of visiting all other states.
    • A chain is irreducible if for all $i,j \in \Omega$ there exists an integer $n$ such that $P^n(i,j)>0$. In other words, every state can eventually be reached starting from any other state.
  • Aperiodicity: The chain should not get trapped in cycles.
  • Invariance: \begin{equation} p(x^{(i)})=\sum_{x^{(i-1)}}p(x^{(i-1)})T(x^{(i)}|x^{(i-1)}) \end{equation} A numerical check of this condition follows the list.
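As a quick numerical check (a sketch, not part of the original text): for a finite chain, the stationary distribution is the left eigenvector of the transition matrix with eigenvalue 1. For the matrix $T$ from the cell above, all mass ends up in the absorbing state D, which also shows that this particular chain is not irreducible (D cannot reach the other states).

# Numerical check of the invariance condition for the matrix T above:
# the stationary distribution is the left eigenvector of T with
# eigenvalue 1, normalized to sum to one.
eigvals, eigvecs = np.linalg.eig(T.T)       # left eigenvectors of T
idx = np.argmin(np.abs(eigvals - 1.0))      # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()

print(pi)                # (0, 0, 1): all mass in the absorbing state D
print(np.dot(pi, T))     # invariance: pi T equals pi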

The Metropolis-Hastings algorithm

The Metropolis-Hastings (MH) algorithm is the most popular of the MCMC methods.

Suppose our goal is to sample from the target density $p(x)$, with $-\infty < x < \infty$, using a proposal distribution $q(x^*|x^{(i)})$. The algorithm proceeds as follows:

  • Initialize $x^{(1)}$
  • For $i = 1$ to $N-1$
    • Sample $u \sim U(0,1)$
    • Sample $x^* \sim q(x^*|x^{(i)})$
    • If $u < A(x^{(i)}, x^*)=\min\left[1,\frac{p(x^*)q(x^{(i)}|x^*)}{p(x^{(i)})q(x^*|x^{(i)})}\right]$, set $x^{(i+1)}=x^*$
    • else set $x^{(i+1)}=x^{(i)}$

Note that if the proposal distribution is symmetric, $q(x^*|x^{(i)})=q(x^{(i)}|x^*)$, then the acceptance probability becomes \begin{equation} A(x,x^*) = \min\left[1, \frac{p(x^*)}{p(x)}\right] \end{equation} which is the original algorithm proposed by Metropolis.
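To make the pseudo-code concrete, here is a generic sketch of the sampler (an illustration, not from the original text); `p` need only be proportional to the target density, `q_sample(x)` draws a proposal given the current state, and `q_pdf(a, b)` evaluates $q(a|b)$:

def metropolis_hastings(p, q_sample, q_pdf, x0, n):
    """Draw n samples from the (unnormalized) density p via MH."""
    x = np.zeros(n)
    x[0] = x0
    for i in np.arange(1, n):
        x_star = q_sample(x[i - 1])                 # propose x* ~ q(.|x)
        A = min(1.0, p(x_star) * q_pdf(x[i - 1], x_star)
                     / (p(x[i - 1]) * q_pdf(x_star, x[i - 1])))
        if np.random.uniform() < A:                 # accept with probability A
            x[i] = x_star
        else:                                       # otherwise repeat the state
            x[i] = x[i - 1]
    return x

With a symmetric proposal the two `q_pdf` factors cancel and the rule reduces to the Metropolis form above; the examples below are special cases of this template.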

Example 1

Sample from $p(x)=6x(1-x)$ using the Metropolis algorithm with proposal distribution \begin{equation} x^* \sim N(x, 0.6) \end{equation} i.e., a normal centered at the current state with standard deviation 0.6.

In [14]:
plt.figure(figsize=[5, 5])
# target density, p(x) = 6x(1-x) on [0, 1]
p = lambda x: 6 * x * (1 - x)

# initialize the sample
x_now = stats.uniform.rvs(0, 1, 1)

# parameters
N = 5000
k = 0

x = np.zeros(N)
x[0] = x_now

while k < N:
    # u must be uniform on (0, 1) for the accept/reject step
    u = stats.uniform.rvs(0, 1, 1)
    # propose from a normal centered at the current state
    x_pot = stats.norm.rvs(loc=x_now, scale=0.6, size=1)

    # redraw proposals that fall outside the support [0, 1]
    while x_pot > 1 or x_pot < 0:
        x_pot = stats.norm.rvs(loc=x_now, scale=0.6, size=1)

    # Metropolis acceptance: accept with probability min(1, p(x*)/p(x))
    if u < p(x_pot) / p(x_now):
        x[k] = x_pot
        x_now = x_pot
    else:
        x[k] = x_now
    k += 1

x_bins = np.linspace(0, 1, 100)
plt.hist(x, bins=20, alpha=0.4, density=True, label='MCMC distribution')
plt.plot(x_bins, p(x_bins), label='True distribution')
plt.legend(loc=2)
plt.show()

The normal distribution

Sample from a normal distribution using the Metropolis algorithm with a proposal distribution \begin{equation} x^* \sim U(x-\delta, x+\delta) \end{equation} where $\delta=0.5$ is the random-walk step size.

In [16]:
# Initialize the parameters
N = 5000
x_now = np.random.uniform()
delta = 0.5

x = np.zeros(N)

k = 0

while k < N:
    # symmetric uniform random-walk proposal
    x_pot = np.random.uniform(x_now - delta, x_now + delta, 1)
    u = stats.uniform.rvs(0, 1, 1)

    # Metropolis acceptance for the standard normal target
    if u < stats.norm.pdf(x_pot) / stats.norm.pdf(x_now):
        x[k] = x_pot
        x_now = x_pot
    else:
        x[k] = x_now

    k += 1
In [20]:
plt.figure(figsize=[5,5])
plt.hist(x, alpha=0.6, density=True, lw=1.5)
x_bins = np.linspace(-5,5,100)
plt.plot(x_bins,stats.norm.pdf(x_bins))
plt.show()

Metropolis-Hastings for Multivariate Distributions

Blockwise updating

Blockwise updating uses a proposal distribution with the same dimensionality as the target distribution. If the target distribution involves $N$ variables, we design an $N$-dimensional proposal distribution, and we either accept or reject the proposal as a block. In the following, we use the vector notation $\textbf{x}=(x_1,x_2,\cdots,x_N)$ to represent a random variable with $N$ components, and $\textbf{x}^{(i)}$ to represent the $i$-th state of our sampler. The procedure is as follows (a code sketch appears after the list):

  • Set i = 1
  • Generate $\textbf{u}=(u_1,u_2,\cdots,u_N)$ and set $\textbf{x}^{(1)}=\textbf{u}$
  • Repeat
    • Propose $\textbf{x}^*$ from $q(\textbf{x}^*|\textbf{x}^{(i)})$
    • Evaluate the acceptance probability $A=\min\left(1, \frac{p(\textbf{x}^*)}{p(\textbf{x}^{(i)})}\frac{q(\textbf{x}^{(i)}|\textbf{x}^*)}{q(\textbf{x}^{*}|\textbf{x}^{(i)})}\right)$
    • Generate $u \sim U(0,1)$
    • If $u \leq A$ then accept $\textbf{x}^{(i+1)}=\textbf{x}^*$, else $\textbf{x}^{(i+1)}=\textbf{x}^{(i)}$
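As an illustration (a sketch under assumed settings, not from the original notebook), here is a blockwise sampler for a correlated bivariate normal target; the spherical Gaussian step is symmetric, so the proposal ratio cancels:

# Blockwise MH for a bivariate normal target with correlation 0.8
# (an assumed example target); the 2-D proposal is accepted or rejected as one block.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
p = lambda z: stats.multivariate_normal.pdf(z, mean=[0, 0], cov=cov)

N = 5000
x = np.zeros((N, 2))
for i in np.arange(1, N):
    x_star = x[i - 1] + stats.norm.rvs(scale=0.5, size=2)  # propose whole block
    if np.random.uniform() < min(1.0, p(x_star) / p(x[i - 1])):
        x[i] = x_star        # accept both components together
    else:
        x[i] = x[i - 1]      # reject both components together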

Component-wise updating

For a bivariate distribution $\textbf{x}=(x_1,x_2)$, the sampler updates one component at a time, as follows (a code sketch follows the list):

  • Set i = 1
  • Generate $\textbf{u} = (u_1, u_2)$ and set $\textbf{x}^{(1)}=\textbf{u}$
  • Repeat
    • Propose $x_1^*$ from $q(x_1^*|x_1^{(i)})$
    • Evaluate the acceptance probability $A = \min\left(1, \frac{p(x_1^*,x_2^{(i)})}{p(x_1^{(i)},x_2^{(i)})}\frac{q(x_1^{(i)}|x_1^*)}{q(x_1^{*}|x_1^{(i)})}\right)$
    • Generate $u \sim U(0,1)$
    • If $u \le A$ then accept $x_1^{(i+1)}=x_1^*$, else set $x_1^{(i+1)}=x_1^{(i)}$
    • Propose $x_2^*$ from $q(x_2^*|x_2^{(i)})$
    • Evaluate the acceptance probability $A = \min\left(1, \frac{p(x_1^{(i+1)},x_2^{*})}{p(x_1^{(i+1)},x_2^{(i)})}\frac{q(x_2^{(i)}|x_2^*)}{q(x_2^{*}|x_2^{(i)})}\right)$
    • Generate $u \sim U(0,1)$
    • If $u \le A$ then accept $x_2^{(i+1)}=x_2^*$, else set $x_2^{(i+1)}=x_2^{(i)}$
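A matching component-wise sketch for the same assumed bivariate normal target, again with symmetric random-walk proposals:

# Component-wise MH for the same bivariate normal target: update x1
# holding x2 fixed, then update x2 using the freshly updated x1.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
p = lambda x1, x2: stats.multivariate_normal.pdf([x1, x2], mean=[0, 0], cov=cov)

N = 5000
x = np.zeros((N, 2))
for i in np.arange(1, N):
    x1, x2 = x[i - 1]
    x1_star = x1 + stats.norm.rvs(scale=0.5)   # propose first component
    if np.random.uniform() < min(1.0, p(x1_star, x2) / p(x1, x2)):
        x1 = x1_star
    x2_star = x2 + stats.norm.rvs(scale=0.5)   # propose second component
    if np.random.uniform() < min(1.0, p(x1, x2_star) / p(x1, x2)):
        x2 = x2_star
    x[i] = [x1, x2]

Component-wise updates avoid having to design a good joint proposal, at the cost of evaluating the target once per component per iteration.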