$W(b) \;=\; \frac{M!}{b_1!\,b_2!\cdots b_N!}$    (1)

where $b_i$ is the number of photons falling in pixel $i$ of the $N$-pixel image $b$, and $M=\sum_i b_i$ is the total number of photons.
The a priori probability of getting a given image b is therefore
this number divided by the number of all possible maps:
$P(b) \;=\; \frac{1}{N^{M}}\,\frac{M!}{b_1!\,b_2!\cdots b_N!}$    (2)
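For a concrete (if tiny) illustration of this counting: with $N=2$ pixels and $M=4$ photons, the map $b=(2,2)$ can be formed in $4!/(2!\,2!)=6$ of the $2^4=16$ equally likely photon arrangements, so its a priori probability is $6/16$, whereas the map $(4,0)$ can be formed in only one way and has probability $1/16$. Evenly spread maps are a priori favoured.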
If there are many photons in each pixel ($b_i \gg 1$) we can use Stirling's approximation for the factorial, $\ln n! \approx n\ln n - n$. Using this, $\ln P(b) \approx M\ln M - \sum_i b_i\ln b_i - M\ln N$.
It is much more convenient to normalize all the pixel intensities to the so-called ``grey map'', where all the intensities equal the mean value $m = M/N$.
$P(b) \;=\; P_0 \prod_i \left(\frac{m}{b_i}\right)^{b_i}$    (3)
where $P_0$ is a normalizing constant that guarantees that $P(b)$ sums to 1 over all possible maps $b$. $P_0$ is also the probability of the ``grey'' map, since all the factors in equation (3) are 1 in that case. It can be shown that $P_0$ is the maximum value of $P(b)$.
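As a quick numerical sanity check of the Stirling step (a sketch only; the function names and the 5-pixel test map are made up for illustration), the exact multinomial form of equation (2) can be compared with the approximate form of equation (3), using the fact that $\ln P(b) - \ln P({\rm grey}) \approx -\sum_i b_i \ln(b_i/m)$:

    import numpy as np
    from scipy.special import gammaln

    def ln_prob_exact(b):
        """Exact ln P(b) from equation (2): ln M! - sum_i ln b_i! - M ln N."""
        b = np.asarray(b, dtype=float)
        M, N = b.sum(), len(b)
        return gammaln(M + 1) - gammaln(b + 1).sum() - M * np.log(N)

    def entropy(b):
        """S(b) = -sum_i b_i ln(b_i/m), the exponent in equations (3)-(4)."""
        b = np.asarray(b, dtype=float)
        return -(b * np.log(b / b.mean())).sum()

    b = np.array([90.0, 110.0, 100.0, 105.0, 95.0])   # made-up 5-pixel map
    grey = np.full(5, 100.0)                          # grey map with the same total
    print(ln_prob_exact(b) - ln_prob_exact(grey))     # exact log-probability ratio
    print(entropy(b))                                 # Stirling approximation to it

The two printed numbers agree to a few parts in a thousand, as expected when there are of order a hundred photons per pixel.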
None of this looks at all like the ``entropy'' term we all know and love in MEM, but equation (3) can be re-written in an exponential form which gives us the so-called ``entropy'' term $S$.
$P(b) \;=\; P_0\,e^{\alpha S}, \qquad S \;=\; -\sum_i b_i \ln\!\left(\frac{b_i}{m}\right)$    (4)
In our derivation, $\alpha = 1$, but Skilling and Gull (1989) and Skilling (1989) argued that it is an indeterminate free parameter of the probability distribution, and this free parameter is used in virtually all MEM programs.
Now that we have the ``prior'' probability, we need to find the ``likelihood''. The ``likelihood'' $P(D|b)$ (``P of D given b'') is the probability that the observed data D could have been generated from a given image b.
This probability is computed from the difference between the simulated data, predicted from some guess at the image b, and the real data D. For Gaussian statistics,
$P(D|b) \;\propto\; e^{-\chi^2/2}, \qquad \chi^2 \;=\; \sum_k \frac{\left(D_k - \hat{D}_k(b)\right)^2}{\sigma_k^2}$    (5)

where $\hat{D}_k(b)$ is the $k$-th simulated datum and $\sigma_k$ the noise on the measured datum $D_k$.
Somewhat more generally, if the noise on different measurements is correlated with covariance matrix $C$,

$P(D|b) \;\propto\; \exp\!\left[-\tfrac{1}{2}\,\big(D - \hat{D}(b)\big)^{\mathsf T} C^{-1}\,\big(D - \hat{D}(b)\big)\right]$    (6)
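The two likelihood forms translate directly into code. This is only a sketch, assuming the simulated data $\hat{D}(b)$ have already been computed from the trial image, with independent errors $\sigma_k$ in the first case and a noise covariance matrix C in the second; the function names are illustrative:

    import numpy as np

    def ln_likelihood(D, D_sim, sigma):
        """ln P(D|b) up to a constant, for independent Gaussian errors (equation 5)."""
        chi2 = np.sum(((D - D_sim) / sigma) ** 2)
        return -0.5 * chi2

    def ln_likelihood_cov(D, D_sim, C):
        """ln P(D|b) up to a constant, for correlated Gaussian errors (equation 6)."""
        r = D - D_sim
        return -0.5 * r @ np.linalg.solve(C, r)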
But what we want for the deconvolution is not $P(D|b)$ but $P(b|D)$, that is, the probability of the image b given the data D. This is given by Bayes' theorem:
$P(b|D) \;=\; \frac{P(D|b)\,P(b)}{P(D)}$    (7)
At last we can write down the quantity to be maximized. Inserting the expressions (4) and (5) for $P(b)$ and $P(D|b)$ into equation (7), we get:
$P(b|D) \;\propto\; \exp\!\left(-\frac{\chi^2}{2} + \alpha S\right)$    (8)

or, taking the logarithm,

$\ln P(b|D) \;=\; -\frac{\chi^2}{2} + \alpha S + {\rm const.}$    (9)
The first term is the constraint of the data; it is 0 when the image matches the data perfectly, but in realistic cases $\chi^2$ is comparable to the number of data points when the data are fitted but not ``over fitted''. The second term is commonly called the ``entropy'' term, which guarantees the ``simplicity'' of the image. (It is not really the entropy in the sense used by Shannon (1948), but the weight of common use forces us to call it that.) The ``entropy'' term is maximum (0) when $b_i = m$ in every pixel. That is, the ``grey'' map is a priori most probable.
Note the arbitrary coefficient $\alpha$ giving relative weight to the ``entropy'' term. If $\alpha = 1$, Bayesian logic would give equal weight to each term, but MEM algorithms usually weight them differently to provide simpler maps (larger $\alpha$) or maps more faithful to the data (smaller $\alpha$). It is possible, using Bayesian arguments, to estimate the most probable $\alpha$ (Skilling 1989).
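A short side calculation makes the role of $\alpha$ concrete. Differentiating equation (9) with respect to a pixel value (holding the total flux, and hence $m$, fixed with a Lagrange multiplier) and using $\partial S/\partial b_i = -\left(1 + \ln(b_i/m)\right)$, the maximum satisfies

$b_i \;\propto\; m\,\exp\!\left(-\frac{1}{\alpha}\,\frac{\partial(\chi^2/2)}{\partial b_i}\right),$

so the image is automatically positive; as $\alpha \to \infty$ the exponential tends to 1 and $b$ tends to the grey map, while for small $\alpha$ the data gradient dominates and $b$ tends to the map that fits the data exactly.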
Since the dimensionality of the parameter space for MEM is the number of map pixels, we cannot show contours of the function $\ln P(b|D)$ in a realistic case. Instead we show a case for 3 pixels, $\{b_1, b_2, b_3\}$, where the sum of the pixel brightnesses equals 100. This reduces it to a 2-dimensional problem. In the figure below, the dashed contours represent the entropy function $S$, and the solid contours show the $\chi^2$ function. The squares are located at the maximum of $\ln P(b|D)$ for different values of $\alpha$. When $\alpha$ is large, the maximum is located at the entropy peak, and when $\alpha$ is small, it is located at the minimum of $\chi^2$.
This 3-pixel example is more general than one might at first think. In fact, for an arbitrary number of pixels (say N), any two-dimensional cross-section of the N-dimensional entropy function through parameter space will look exactly like this. The cross-sections of the $\chi^2$ function will always be ellipses with, of course, different centroids and axial ratios.
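The behaviour described above is easy to reproduce numerically. The sketch below sets up a deliberately artificial 3-pixel problem in which the ``data'' are direct, unit-noise measurements of the three pixel values (true map (70, 20, 10), total flux 100); all names and numbers are made up for illustration and are not taken from the figure itself:

    import numpy as np
    from scipy.optimize import minimize

    # Toy 3-pixel MEM problem: total flux fixed at 100, so only 2 free parameters.
    D = np.array([70.0, 20.0, 10.0])   # made-up "observed" values, one per pixel
    sigma = 1.0                        # assumed noise on each measurement
    total = 100.0
    m = total / 3.0                    # the "grey" map value

    def entropy(b):
        return -np.sum(b * np.log(b / m))

    def chi2(b):
        return np.sum(((D - b) / sigma) ** 2)

    def neg_ln_posterior(x, alpha):
        # Parametrize by the first two pixels; the third is fixed by the total flux.
        b = np.array([x[0], x[1], total - x[0] - x[1]])
        if np.any(b <= 0):
            return np.inf
        return -(alpha * entropy(b) - 0.5 * chi2(b))

    for alpha in [0.01, 1.0, 100.0, 1e4]:
        res = minimize(neg_ln_posterior, x0=[m, m], args=(alpha,), method="Nelder-Mead")
        b = np.array([*res.x, total - res.x.sum()])
        print(f"alpha = {alpha:>8}:  b = {np.round(b, 1)}")
    # Small alpha -> b close to the data (70, 20, 10); large alpha -> b close to the grey map.

The printed maps trace out the same curve of maxima that the squares mark in the figure, moving from the $\chi^2$ minimum at small $\alpha$ to the grey map at large $\alpha$.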
The goal of MEM is to find the optimum value of $\alpha$ along this curve, somewhere in parameter space between the ``prior'' grey map and the map which fits the data exactly. Skilling has argued that the ``most probable'' $\alpha$ is the one that makes the number of ``good'' measurements equal to $-2\alpha S$, which is the (dimensionless) amount of structure in the map $b$. The number of ``good'' measurements is defined by a sum over curvature terms in the likelihood matrix (Skilling 1989).
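In the usual notation of the classic-MaxEnt papers this criterion is written explicitly (the form is quoted here from that literature, since the text only describes it in words) as

$-2\alpha S \;=\; N_{\rm good} \;=\; \sum_i \frac{\lambda_i}{\lambda_i + \alpha},$

where the $\lambda_i$ are the eigenvalues of the curvature matrix of $\chi^2/2$ expressed in the entropy metric: a measurement counts as ``good'' when $\lambda_i \gg \alpha$, i.e. when the data constrain that direction in image space more strongly than the entropy prior does.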