Jacobian adjustment for a simple model

Introduction

In the previous texts (Part 1 and Part 2) we sampled from a known distribution. That is, of course, not a realistic scenario. In this tutorial I will show how we can apply the skills learned in those previous tutorials to a real modeling situation.

I'll first walk you through examples using a simple and non-efficient (but hopefully easy-to-read) Metropolis algorithm, and then discuss how this generalizes to Stan. Even if you are only interested in seeing the Stan versions, at least skim through the Metropolis examples, since it is in those examples that I explain in greater detail when Jacobians are needed and when they are not.

We will be looking at a simple model in which the data y are assumed to come from a normal distribution with unknown mean (μ) and standard deviation (σ):

y ∼ N(μ, σ)
dnorm(y, mu, sigma)

The aim is to approximate the joint posterior density for the unknown parameters, but before we go into that, let us briefly talk about what kind of priors the parameters are assigned.

We can give μ a normal prior

μ ∼ N(0, 5)
dnorm(mu, 0, 5)

but in the case of σ we have to take into account that it's constrained to positive real numbers. We can give it a lognormal prior:

σ ∼ lognormal(1.5, 0.5)
dlnorm(sigma, 1.5, 0.5)

As before, the Metropolis algorithm samples values on the unconstrained scale, so we will want to apply the log transformation to σ and exponentiate the sampled values when needed.
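
As a quick sanity check of this transform pair (a minimal standalone sketch, not part of the model code), log and exp are inverses of each other, so we can move freely between the constrained and unconstrained scales:

```r
# A positive-constrained value, e.g. a candidate standard deviation:
sigma = 2.5

# Transform to the unconstrained scale the sampler works on:
theta = log(sigma)

# The inverse transform recovers the original value:
exp(theta)  # 2.5
```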

But what is the lognormal distribution? As the name implies, it is related to the normal distribution. Like the Gamma distribution, it has support only on the positive real numbers.

If x is distributed normally with the parameters μ and σ:

x ~ N(μ, σ)

then

exp(x) ~ lognormal(μ, σ)

In other words, if we draw a random sample from a normal distribution and then exponentiate those samples, the resulting samples will be lognormally distributed. Try this in R:

# First we draw samples from a normal distribution
x = rnorm(1e5, -2, 0.3)
# Then we exponentiate them
exp_x = exp(x)
# And plot the distribution against the lognormal distribution
hist(exp_x, prob = T)
curve(dlnorm(x, -2, 0.3), add = T)

This figure shows a histogram of the exponentiated samples against the lognormal density:

You can try this for different values of the parameters.
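
Beyond eyeballing the histogram, one way to check the match is to compare the empirical mean of the exponentiated samples to the theoretical lognormal mean, exp(μ + σ²/2). This is a minimal sketch using the same parameter values as above (the seed is arbitrary):

```r
set.seed(1)
# Draw normal samples and exponentiate them, as before:
x = rnorm(1e5, -2, 0.3)
exp_x = exp(x)

# Theoretical mean of the lognormal(-2, 0.3) distribution:
theoretical_mean = exp(-2 + 0.3^2 / 2)

# The empirical mean should be close to the theoretical one:
mean(exp_x)
theoretical_mean
```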

Examples using a basic Metropolis algorithm

We will begin by coding a simple Metropolis algorithm. It's not coded in the most efficient way, and hard-coding the initial values, number of steps and so on is not good practice, but it will suffice for this example; the code really just needs to be readable. The R code is shown below:

# INPUT
# posterior : a function for evaluating the unnormalized posterior
# y         : a vector of data
runChain = function(posterior, y){
  # The number of steps is hard-coded, and the initial values
  # are always drawn randomly from the std normal distribution:
  NSteps = 25000
  samples = matrix(NaN, ncol = 2, nrow = NSteps)
  samples[1,] = rnorm(2)
  
  for(i in 2:NSteps){
    proposal = rnorm(2, samples[i-1,], 1)
    P_accept = posterior(proposal, y) / posterior(samples[i-1,], y)
    if(is.nan(P_accept)) P_accept = 0
    
    if(runif(1) < P_accept){
      samples[i,] = proposal
    } else {
      samples[i,] = samples[i-1,]
    }
  }
  return(samples)
}

Next, we'll define two posterior distributions. First we define a posterior distribution without the Jacobian:

# INPUT
# theta : c(mu, sigma)
# y     : vector of data
# Note y can be of length zero: in that case we 
# simply are sampling from the prior distribution.
posteriorWoJacobian = function(theta, y){
  
  # Next we get the prior probabilities for the parameters.
  # Note that since theta[2] has been transformed, we
  # use the inverse transform to get it back to the
  # original scale:
  prior = dnorm(theta[1], 0, 5) * dlnorm(exp(theta[2]), 1.5, 0.5)
  
  # This is just so we can run the Metropolis algorithm
  # with no data:
  if(length(y) == 0){
    likelihood = 1
  } else {
    likelihood = prod(dnorm(y, theta[1], exp(theta[2])))
  }
  
  # We return the product of the prior and likelihood:
  # there's no Jacobian adjustment!
  return(prior * likelihood)
}

We define another posterior, this time with the Jacobian adjustment:

# INPUT
# theta : c(mu, sigma)
# y     : vector of data
# Note y can be of length zero: in that case we 
# simply are sampling from the prior distribution.
posteriorJacobian = function(theta, y){
  
  # We do pretty much all of the same things here as before:
  prior = dnorm(theta[1], 0, 5) * dlnorm(exp(theta[2]), 1.5, 0.5)
  if(length(y) == 0){
    likelihood = 1
  } else {
    likelihood = prod(dnorm(y, theta[1], exp(theta[2])))
  }
  
  # But now we do the Jacobian adjustment, that is,
  # multiply the density by the absolute value of the
  # derivative of the inverse transformation. Remember that
  # theta[2] is still on the unconstrained scale!
  return((prior * likelihood) * exp(theta[2]))
}

Let us begin by simply sampling from the prior distribution.

Since we want to sample from the prior, we initialize a vector of length zero as data. I do it this way since it is compatible with Stan:

y = array(dim = 0)
samplesWoJacobian = runChain(posteriorWoJacobian, y)
samplesJacobian = runChain(posteriorJacobian, y)

You can compare the different versions by plotting histograms of the samples against the prior distributions. Remember to add prob = T to the call to hist() and to exponentiate the samples in the second column:

hist(samplesWoJacobian[,1], prob = T)
curve(dnorm(x, 0, 5), add = T)
  
hist(exp(samplesWoJacobian[,2]), prob = T)
curve(dlnorm(x, 1.5, .5), add = T)  

# And the same for the samples from the run
# with the Jacobian

But in case you don't feel like trying that yourself, I've included some figures below: the red shaded area is the target distribution. Solid lines show the approximation using the posterior without the Jacobian, and dashed lines the one with it.

For the μ parameter things look fine, apart from some random error, but things are different for the σ parameter. It is probably not surprising that the version without the Jacobian doesn't converge to the prior distribution while the one with the Jacobian does. We saw as much when we were sampling from the Gamma distribution: this example is an almost one-to-one copy of that. But let us take a look at a third example!

Bring back to mind what we discussed about the relationship between the normal and lognormal distributions. We have, so far, defined the prior for exp(theta[2]) to be a lognormal distribution: dlnorm(exp(theta[2]), 1.5, 0.5). Mathematically this is the same as saying that the non-exponentiated theta[2] is distributed normally with parameters μ = 1.5 and σ = 0.5: dnorm(theta[2], 1.5, 0.5).
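
We can verify this relationship numerically: the lognormal density of exp(theta), multiplied by the Jacobian exp(theta), equals the normal density of theta itself. This is a standalone check, not part of the model code:

```r
theta = seq(-1, 3, by = 0.5)

# Density of the lognormal prior evaluated at exp(theta),
# multiplied by the Jacobian exp(theta):
with_jacobian = dlnorm(exp(theta), 1.5, 0.5) * exp(theta)

# Density of the normal prior evaluated at theta directly:
on_log_scale = dnorm(theta, 1.5, 0.5)

# The two agree (up to floating point error):
all.equal(with_jacobian, on_log_scale)  # TRUE
```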

Let us use this information to modify our model in this way:

# INPUT
# theta : c(mu, sigma)
# y     : vector of data
# Note y can be of length zero: in that case we 
# simply are sampling from the prior distribution.
posteriorLogPrior = function(theta, y){
  
  # We now have a prior for the non-exponentiated theta[2]:
  prior = dnorm(theta[1], 0, 5) * dnorm(theta[2], 1.5, 0.5)
  
  if(length(y) == 0){
    likelihood = 1
  } else {
    # We still have to remember to exponentiate theta[2]
    # when we are calculating the likelihood:
    likelihood = prod(dnorm(y, theta[1], exp(theta[2])))
  }
  
  # Now we don't include the Jacobian:
  return(prior * likelihood)
}

Since we've already seen that the difference (whether or not the Jacobian is included) shows up in the σ parameter, let us look just at that. As before, the shaded area represents the target distribution (lognormal(1.5, 0.5)) and the solid line the Metropolis approximation:

Seems like a match – even though there was no Jacobian adjustment! Why is that?

Jacobian adjustments are needed when the scale on which the sampler (here: the Metropolis algorithm) works is transformed to some other scale. Or in other words, when the inverse-transformed parameter is present in a call to a density function: somePDF(exp(theta)). Note that it doesn't matter where the parameter is transformed: we could modify the earlier examples so that theta[2] is exponentiated just once, when the function is entered, and we would still need to include the Jacobian.

But given that in this last example the exponentiated theta[2] is not used in any calls to prior density functions, only in the likelihood, no Jacobian adjustment is needed. Nice and quick!
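
To make the point concrete, the Jacobian-adjusted posterior from the earlier example and this log-scale-prior posterior are in fact the same function of theta. Here is a minimal sketch that re-defines both (slightly condensed, assuming non-empty data) and compares them at one point; the data vector and parameter values are made up:

```r
# Jacobian-adjusted posterior (prior on exp(theta[2])):
posteriorJacobian = function(theta, y){
  prior = dnorm(theta[1], 0, 5) * dlnorm(exp(theta[2]), 1.5, 0.5)
  likelihood = prod(dnorm(y, theta[1], exp(theta[2])))
  (prior * likelihood) * exp(theta[2])
}

# Posterior with the prior on the log scale, no Jacobian:
posteriorLogPrior = function(theta, y){
  prior = dnorm(theta[1], 0, 5) * dnorm(theta[2], 1.5, 0.5)
  likelihood = prod(dnorm(y, theta[1], exp(theta[2])))
  prior * likelihood
}

y = c(1.2, -0.4, 2.1)  # made-up data
theta = c(0.5, 0.3)    # made-up parameter values

# The two parameterizations give the same density:
all.equal(posteriorJacobian(theta, y), posteriorLogPrior(theta, y))  # TRUE
```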

Let us now take a look at how this generalizes to Stan models.

Examples using Stan

I'll show you three models.

  1. Model 1: the exponentiated standard deviation is given a lognormal prior (analogous to the first two examples earlier). Inclusion of the Jacobian is controlled through the data variable includeJacobian.
  2. Model 2: the standard deviation is again given a lognormal prior, but we use Stan's constraints.
  3. Model 3: the prior is given for the log-transformed standard deviation (analogous to the last example in the previous section).

It's important to note that the expression exp(logSD) ~ lognormal(1.5, 0.5) is analogous to dlnorm(exp(logSD), 1.5, 0.5).

Model 1:

  data{
    int N;
    vector[N] y;
    int includeJacobian;
  }
  
  parameters{
    real mu;
    real logSD;
  }
  
  model{
    mu ~ normal(0, 5);
    exp(logSD) ~ lognormal(1.5, 0.5);
    
    // For sampling from the prior. In R use "y = array(dim=0)"
    if(N > 0){
      y ~ normal(mu, exp(logSD));
    } 
    // I don't explicitly take the absolute value here
    // because the values are always positive anyway. Note that we
    // add the Jacobian to the target, since Stan works with the log
    // posterior: this is why we also take the log of the Jacobian.
    // (Adding the log of the Jacobian to the target is the same as
    // multiplying the density by the Jacobian.)
    // Here log(exp(logSD)) simplifies to just logSD.
    if(includeJacobian == 1){
      target += log(exp(logSD));
    }
  }

Model 2, using Stan's constraints:

  data{
    int N;
    vector[N] y;
  }
  
  parameters{
    real mu;
    // With the <lower = 0> constraint this parameter is the
    // standard deviation itself, not its logarithm:
    real<lower = 0> sigma;
  }
  
  model{
    mu ~ normal(0, 5);
    sigma ~ lognormal(1.5, 0.5);
    
    // For sampling from the prior. In R use "y = array(dim=0)"
    if(N > 0){
      y ~ normal(mu, sigma);
    }
  }

Model 3, prior on log scale:

  data{
    int N;
    vector[N] y;
  }
  
  parameters{
    real mu;
    real logSD;
  }
  
  model{
    mu    ~ normal(0, 5);
    logSD ~ normal(1.5, 0.5);
    
    // For sampling from the prior. In R use "y = array(dim=0)"
    if(N > 0){
      y ~ normal(mu, exp(logSD));
    } 
  }
  

Let us sample from the prior and see how the different models (with/without Jacobian) do.

  1. Model 1 without Jacobian
  2. Model 1 with Jacobian
  3. Model 2: here Stan does Jacobian adjustment under the hood
  4. Model 3: prior is on log scale, Jacobian is not needed

The results are similar to those obtained using the Metropolis algorithm. When the prior is explicitly defined for the (inverse) transformed parameter (Model 1), we need to include the Jacobian. If we use Stan's constraints (Model 2), Stan makes these adjustments for us. If the prior is on the transformed scale, in this case the log scale (Model 3), no adjustment is needed, even though the inverse-transformed parameter (exp(logSD)) is used in the likelihood.

Effect of Jacobian as N gets larger

The last example we'll look at is how the number of observations affects the difference between estimates that use the Jacobian adjustment and estimates that don't. We'll look at the estimated marginal posterior mean of the standard deviation parameter. The figure below shows the absolute difference between the marginal mean when the Jacobian is included versus when it is not.

Observations are generated from normal distribution with μ = 2.0 and σ = 8.0.

For each N a couple of simulations were run:

As you can see, the difference approaches zero. This is because the Jacobian affects the prior, and as usual, when N gets larger, the influence of the prior diminishes.
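
On the log scale it is easy to see why: the Jacobian contributes a constant term, log(σ) = theta[2], to the log posterior regardless of N, while the log likelihood is a sum over N terms that grows in magnitude with the data. A minimal sketch, with made-up parameter values and simulated data:

```r
set.seed(42)
theta = c(2.0, log(8.0))

# The Jacobian term on the log scale does not depend on N:
jacobian_term = log(exp(theta[2]))  # simplifies to theta[2]

# The magnitude of the log likelihood grows with N:
for(N in c(10, 100, 10000)){
  y = rnorm(N, 2.0, 8.0)
  loglik = sum(dnorm(y, theta[1], exp(theta[2]), log = TRUE))
  cat("N =", N, " log likelihood =", loglik,
      " Jacobian term =", jacobian_term, "\n")
}
```

So relative to the likelihood, the constant Jacobian term matters less and less as N grows, which is why the two sets of estimates converge.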

It is of course good to remember that this depends on the information content of the observations and the model used. See this text for further details.

Conclusion

Jacobian adjustments are needed when we transform the sampled values inside a call to a prior density function. Using transformed values in the likelihood does not, by itself, imply the need for a Jacobian adjustment. If the prior is on a scale on which the samples need not be transformed, there's no need for a Jacobian either. And in Stan, if we declare explicit constraints, Stan will handle the Jacobian adjustments for us.