In the words of John von Neumann:
In mathematics you don't understand things, you just get used to them.
# import library
%matplotlib inline
import numpy as np
from scipy import stats
from scipy.stats import norm
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.width',500)
pd.set_option('display.max_columns',100)
import seaborn as sns
Markov Chain Monte Carlo (MCMC) techniques are applied to solve integration and optimization problems in high-dimensional spaces. Problems that are intractable using analytic approaches often become tractable using some form of MCMC.
MCMC is a strategy for generating samples $x^{(i)}$ using a Markov chain mechanism. We should emphasize that we use MCMC when we cannot sample directly from $p(x)$, but can evaluate $p(x)$ up to a normalization constant.
For simplicity, we first introduce Markov chains on a finite state space, where $x^{(i)}$ can only take one of $s$ discrete values, $x^{(i)} \in \{x_1, x_2, \cdots, x_s\}$. The stochastic process is called a Markov chain if we transition from one state to another using a simple sequential procedure. We start a Markov chain at some state $x^{(1)}$ and use a transition function $T(x^{(i)}|x^{(i-1)})$ to determine the next state, $x^{(2)}$, conditional on the last state. We then keep iterating to create a sequence of states: \begin{equation} x^{(1)}\rightarrow x^{(2)} \rightarrow \cdots \rightarrow x^{(i)} \rightarrow \cdots \end{equation}
Each such sequence of states is called a Markov chain, or simply a chain. The procedure for generating a sequence of $T$ states from a Markov chain is simple: set $t=1$, generate an initial value $u$, and set $x^{(t)}=u$; then, repeatedly, increment $t$, sample a new value $u$ from the transition function $T(x^{(t)}|x^{(t-1)})$, and set $x^{(t)}=u$, stopping when $t=T$.
Importantly, in this iterative procedure, the next state of the chain at $i+1$ is based only on the previous state at $i$. Therefore, each Markov chain wanders around the state space, and the transition to a new state depends only on the last state. It is this local dependency that makes the procedure Markov, or memoryless. Formally,
\begin{equation} p(x^{(i)}|x^{(i-1)},\cdots, x^{(1)}) = T(x^{(i)}|x^{(i-1)}) \end{equation} The chain is called homogeneous if $T(x^{(i)}|x^{(i-1)})$ remains unchanged for all $i$.
When a Markov chain is initialized, it will wander in state space around the starting state. Therefore, if we start a number of chains, each with different initial conditions, the chains will initially stay close to their starting states. This period is called burn-in. An important property of Markov chains is that the starting state no longer affects the state of the chain after a sufficiently long sequence of transitions (assuming that certain conditions about the Markov chain are met). At this point, the chain is said to have reached its steady state, and its states reflect samples from the stationary distribution. This property, that Markov chains converge to a stationary distribution regardless of where we started (if certain regularity conditions of the transition function are met), is quite important.
# Simple samples of Markov Chain
# Transition matrix
T = np.array([[0.69,0.3,0.01],[0.8,0.1,0.1],[0,0,1]])
# Random initial distribution over the states
S = np.random.uniform(low=0,high=1,size=3)
S = S/np.sum(S)
print(S)
Q = np.zeros([100,3])
# Propagate the initial distribution forward: Q[i] = S * T^i
for i in np.arange(0,100):
    Q[i,:] = np.dot(S,np.linalg.matrix_power(T,i))
plt.plot(Q)
plt.ylim([-0.1,1.1])
plt.legend(('H','S','D'),loc=2)
plt.show()
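The plot above propagates the whole probability vector forward with matrix powers. It can also be instructive to simulate individual chains state by state, using the transition function directly as described earlier: start at some state, draw the next state from the corresponding row of $T$, and repeat. The sketch below does this for the same transition matrix (the number of chains and steps are arbitrary illustrative choices); because the third state ('D') is absorbing, every chain eventually settles there, matching the convergence seen above.
# Simulate a few individual chains by sampling one transition at a time
n_chains = 5
n_steps = 100
states = np.zeros([n_chains, n_steps], dtype=int)
for c in range(n_chains):
    s = np.random.choice(3, p=S)            # draw the starting state from S
    for i in range(n_steps):
        states[c, i] = s
        s = np.random.choice(3, p=T[s])     # next state ~ row s of the transition matrix
plt.figure(figsize=[5,5])
for c in range(n_chains):
    plt.plot(states[c], alpha=0.6)
plt.yticks([0,1,2], ['H','S','D'])
plt.xlabel('step')
plt.show()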
This stability result plays a fundamental role in MCMC simulation. For any starting point, the chain will converge to the invariant distribution $p(x)$, as long as $T$ is a stochastic transition matrix that obeys the following properties: irreducibility (from any state, the chain can reach any other state with positive probability, so the transition graph is connected) and aperiodicity (the chain does not get trapped in deterministic cycles).
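As a quick numerical check, the invariant distribution can be read off directly from the transition matrix: it is the left eigenvector of $T$ associated with eigenvalue 1, normalized to sum to one. A small sketch using the $T$ defined above (for that matrix, all of the mass ends up in the absorbing state):
# Invariant distribution as the left eigenvector of T with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(T.T)   # left eigenvectors of T are right eigenvectors of T.T
idx = np.argmin(np.abs(eigvals - 1))    # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / np.sum(pi)                    # normalize to a probability vector
print(pi)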
The Metropolis-Hastings (MH) algorithm is the most popular of the MCMC methods.
Suppose our goal is to sample from the target density $p(x)$, with $-\infty < x < \infty$. At each step, the MH sampler proposes a candidate $x^*$ from a proposal distribution $q(x^*|x^{(i)})$ and accepts it with probability $A(x^{(i)}, x^*) = \min\left[1, \frac{p(x^*)\,q(x^{(i)}|x^*)}{p(x^{(i)})\,q(x^*|x^{(i)})}\right]$; otherwise the chain stays at $x^{(i)}$. Note that if the proposal distribution is symmetric, $q(x^*|x^{(i)})=q(x^{(i)}|x^*)$, then the acceptance probability becomes
\begin{equation}
A(x^{(i)},x^*) = \min\left[1, \frac{p(x^*)}{p(x^{(i)})}\right]
\end{equation}
which was the original algorithm proposed by Metropolis.
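To make the general (possibly asymmetric) case concrete, here is a minimal sketch of an MH sampler; the function and argument names (target, propose, proposal_pdf) are illustrative, not part of the original examples. target only needs to be known up to a normalization constant, and proposal_pdf(a, b) stands for $q(a|b)$.
# Minimal Metropolis-Hastings sketch for a generic (possibly asymmetric) proposal
def mh_sample(target, propose, proposal_pdf, x0, n):
    samples = np.zeros(n)
    x = x0
    for i in range(n):
        x_star = propose(x)                                # candidate from q(.|x)
        ratio = (target(x_star) * proposal_pdf(x, x_star)) / (target(x) * proposal_pdf(x_star, x))
        if stats.uniform.rvs() < min(1.0, ratio):          # accept with probability A(x, x*)
            x = x_star
        samples[i] = x                                     # on rejection, repeat the current state
    return samples
For a symmetric proposal the $q$ terms cancel and the ratio reduces to $p(x^*)/p(x)$, which is exactly what the two Metropolis examples below use.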
Sample from $p(x)=6x(1-x)$ using the Metropolis algorithm with the proposal distribution \begin{equation} x^* \sim N(x, 0.6) \end{equation}
plt.figure(figsize=[5,5])
# density function
p = lambda x: 6*x*(1-x)
# initialize the sample
x_now = stats.uniform.rvs(0,1,1)
# parameters
N = 5000
k = 0
x = np.zeros(N)
x[0] = x_now
while k < N:
    # Draw a candidate from the Gaussian proposal centered at the current state
    x_pot = stats.norm.rvs(loc=x_now,scale=0.6,size=1)
    # Keep proposals inside the support [0, 1]
    while x_pot > 1 or x_pot < 0:
        x_pot = stats.norm.rvs(loc=x_now,scale=0.6,size=1)
    # Accept with probability min(1, p(x_pot)/p(x_now)); u must be Uniform(0,1)
    u = stats.uniform.rvs(0,1,1)
    if u < p(x_pot)/p(x_now):
        x[k] = x_pot
        x_now = x_pot
    else:
        x[k] = x_now
    k += 1
x_bins = np.linspace(0,1,100)
plt.hist(x,bins=20,alpha=0.4,density=True,label='MCMC distribution')
plt.plot(x_bins,p(x_bins),label='True distribution')
plt.legend(loc=2)
plt.show()
Sample from a standard normal distribution using the Metropolis algorithm with the proposal distribution \begin{equation} x^* \sim U(x-\delta, x+\delta) \end{equation} where $\delta=0.5$ is the random-walk step size.
# Initialize the parameters
N = 5000
x_now = np.random.uniform()
delta = 0.5
x = np.zeros(N)
k = 0
while k < N:
    # Uniform random-walk proposal around the current state
    x_pot = np.random.uniform(x_now-delta,x_now+delta,1)
    u = stats.uniform.rvs(0,1,1)
    # Accept with probability min(1, p(x_pot)/p(x_now)); the proposal is symmetric
    if u < stats.norm.pdf(x_pot)/stats.norm.pdf(x_now):
        x[k] = x_pot
        x_now = x_pot
    else:
        x[k] = x_now
    k += 1
plt.figure(figsize=[5,5])
plt.hist(x,alpha=0.6,density=True,lw=1.5)
x_bins = np.linspace(-5,5,100)
plt.plot(x_bins,stats.norm.pdf(x_bins))
plt.show()
Blockwise updating uses a proposal distribution that has the same dimensionality as the target distribution. If the probability distribution involves $N$ variables, we design an $N$-dimensional proposal distribution, and we either accept or reject the proposal as a block. In the following, we will use the vector notation $\textbf{x}=(x_1,x_2,\cdots,x_N)$ to represent a random variable with $N$ components, and $\textbf{x}^{(i)}$ to represent the $i$-th state in our sampler.
For a bivariate distribution $\textbf{x}=(x_1,x_2)$, the proposal generates a candidate pair $(x_1^*, x_2^*)$ jointly, and the pair is accepted or rejected as a single block, as sketched below.
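The following is a small sketch of the blockwise idea; the target (an unnormalized bivariate normal with correlation 0.8) and the proposal width are illustrative choices, not taken from the examples above. Both components are proposed together with a symmetric two-dimensional random walk and accepted or rejected as one block.
# Blockwise Metropolis sketch: propose and accept/reject both components at once
rho = 0.8
p_xy = lambda v: np.exp(-(v[0]**2 - 2*rho*v[0]*v[1] + v[1]**2)/(2*(1 - rho**2)))
N = 5000
x = np.zeros([N, 2])
v_now = np.array([0.0, 0.0])
for k in range(N):
    v_pot = v_now + stats.norm.rvs(0, 0.5, 2)              # symmetric 2-D random-walk proposal
    if stats.uniform.rvs() < p_xy(v_pot)/p_xy(v_now):      # accept the whole block
        v_now = v_pot
    x[k,:] = v_now
plt.figure(figsize=[5,5])
plt.scatter(x[:,0], x[:,1], s=2, alpha=0.3)
plt.show()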