Hash-based Matched Pseudo-Random Number Generation • hashprng

Overview

This vignette demonstrates the practical use of hashprng::hash_seed; if you want a detailed explanation of why to use this approach, read the paper associated with the package citation: (TODO).

Briefly, hash-based matched pseudo-random number generation (HBM-PRNG) can ensure that the same “events” across different simulation scenarios are resolved consistently, while also still permitting stochastic simulation, and incurring minimal additional computational overhead. Matched samples can then be analyzed pair-wise, leading to higher resolution calculation of the overall impact of the cross-scenario varying features.

However, researchers still need to specify what constitutes “the same” event, both in terms of what features are hashed and how they implement the model structure. In this vignette, we focus on how to use hashprng::hash_seed, but do so using two different implementations of the same model world and two different definitions of “same” for events.

Standards of Matching

Throughout this vignette we consider a few matching regimes:

Non-matched (NON): this is equivalent to drawing an item from this random process and then one from the same-but-slightly-perturbed process and comparing them. Note: this kind of simulation should still be seeded, to ensure reproducibile PRNG.
Seed-Matched-Only (SMO): this scheme ensures that compared runs both have the same PRNG seeding. This can guarantee, for example, that the history leading up to an intervention is the same. However, generally once simulation trajectory diverges, the PRNG is exercised differently, and events that should be the same between runs are not guaranteed to see the same PRNG draws.
Fully-Event-Matched (FEM): this scheme ensures that compared runs have all of their identical events resolved consistently. Any event which occurs in both simulations gets the same draw from the PRNG.

Hash-based matched (HBM) PRNG is an approach to acheiving FEM simulations. There are alternatives, for example managing multiple PRNGs or precomputing all possible outcomes. In general, HBM imposes a guaranteed-but-low computational cost, scaling with the number of the PRNG draws in any given simulation, with minimal implementation burden on the modeller.

We’ll be making comparisons across NON, SMO, and HBM in terms of implementation, compute cost, and calculation resolution.

Model World and Scenario

We’ll be considering an infectious disease system of the Susceptible-Infectious- Removed variety and an intervention where, after some delay from initial infection introduction, a portion of the population are spontaneously moved to R (representing e.g. a reactive vaccination campaign).

Model Implementation

We’ll implement this model in terms of discrete time steps, discrete individuals, and event probabilities. At each time step:

We’ll consider if infectious individuals have exposing contact with susceptible individuals - this converts S to I at the end of the step.
We’ll consider if infectious individuals recover - I to R.
If it’s time for the move, we’ll consider which susceptible individuals to remove - S to R.

This is essentially a perturbation to the classic Reed-Frost model, with the modification that people may be infectious for more than a single time step ( as well as introducing an intervention).

We’re also going to implement this model two ways, each with a switch for whether or not to use hash-based matching for the random number draws. We’ll flip this switch to show matching-vs-not changes conclusions comparing across scenarios, as well as to measure the resource cost associated with the higher resolution measurement. We’ll also create a setup where the seed is only set once for the NON comparison.

The other element we need to finally specify the model is what constitutes an event. We’re going to offer two definitions and explore the consequences. Those two definitions hinge on whether the model’s discrete individuals have identity or not.

If individuals have identity, then events are distinguished by who is involved. If they do not, then events are only distinguished by what and when.

# define some variables to differentiate between individual states
# for the non-identity model, these will be compartment indices
# for the identity model, these are individual states
susceptible <- 1L
infectious <- 2L
removed <- 3L

Non-Identity Model Implementation

For the non-identity model, we assume that at each time step only the size of the relative populations should determine event outcomes, e.g. how many people are infected. We can accomplish that by drawing from the binomial distribution.

For HBM, we need to match time and event. But since there are (at most) three draws per step, always occurring in the same order, we can simply match on time and then quantile draws. We are assuming here that runif always advances underlying PRNG state by the number of deviates requested, but that rbinom might advance state by a variable amount depending on both parameters and state, hence the need to draw quantiles and then invert to outcomes.

N.B. ?rbinom indicates draws are done per “Kachitvichyanukul, V. and Schmeiser, B. W. (1988) Binomial random variate generation. Communications of the ACM, 31, 216–222”, which describes and algorithm BPTE which does indeed draw uniform deviates a variable number of times per binomial draw, depending on the \(n\) and \(p\) arguments.

#' @param t an integer scalar, the simulation time
#' @param y an integer vector, the current counts of individuals in S, I, and R
#' @param parameters a list, `transmission_p`, `recovery_p`, `vaccination_p`,
#'   and `vax_t`
#' @param seed optional; if non-null, using HBM
nonidentity_dStep <- function(t, y, parameters, seed = NULL) {
  with(parameters, {
    # extract the S, I population counts
    Ninf <- y[infectious]
    Nsus <- y[susceptible]

    # if we are matching events
    if (!is.null(seed)) {
      # draw the same quantiles for each when
      hash_seed(seed, t)
      evtqs <- runif(2)
      # convert these to the outcomes
      newinf <- qbinom(evtqs[1], Nsus, 1 - (1 - transmission_p)^Ninf)
      newrem <- qbinom(evtqs[2], Ninf, recovery_p)
    } else { # simply make the binomial draws
      newinf <- rbinom(1, Nsus, 1 - (1 - transmission_p)^Ninf)
      newrem <- rbinom(1, Ninf, recovery_p)
    }

    # move infectees S -> I, recoveries I -> R
    dY <- c(S = -newinf, I = newinf - newrem, R = newrem)

    if (t == vax_t) {
      if (!is.null(seed)) {
        # n.b. still on a consistent PRNG state for this time, so can just draw
        newvax <- qbinom(runif(1), Nsus - newinf, vaccination_p)
      } else {
        newvax <- rbinom(1, Nsus - newinf, vaccination_p)
      }
      # move vaccinees
      dY[susceptible] <- dY[susceptible] - newvax
      dY[removed] <- dY[removed] + newvax
    }

    # return the overall state change, as well as new incidence
    return(list(dY, newinf))
  })
}

Higher Resolution Implementation

For the identity based model, we assume who particularly gets infected or vaccinated matters. Thus, for HBM we need to hash both who and when for infection and recovery. We can again take advantage of a consistent number of draws (because we always consider exposing all individuals, susceptible or not) to avoid reseting the PRNG for the recovery draw.

We also set the PRNG for vaccination. This probably isn’t necessary: vaccination is itself the only factor that might change trajectories, so there are no differences in trajectory prior to considering vaccination. However, this is a very fragile arrangement. Adding any other interventions or otherwise perturbing model assumptions could easily change traversal of the PRNG and result in unmatched vaccinee selection.

N.B., for both the matched and unmatched versions we’re generally overdrawing on random number deviates. While it might be possible to reduce computational time with fewer draws, that would entail either other computational work (e.g. calculating who is eligible for infection) or more convoluted data structures. Those would be plausible for the unmatched version, at this model complexity, but likely become less practical with increasing complexity.

#' @param t an integer scalar, the simulation time
#' @param y an integer vector, the individual states: S, I, R
#' @param parameters a list, `transmission_p`, `recovery_p`, `vaccination_p`,
#'   and `vax_t`
#' @param salt an integer scalar, optional; if present, use hash-based matching
identity_dStep <- function(t, y, parameters, salt = NULL) {
  with(parameters, {
    N <- length(y)
    dY <- integer(N)
    incI <- 0
    # consider whether each infectious individual exposes susceptible
    # individuals
    infectable <- which(y == susceptible)
    for (infector in which(y == infectious)) {
      if (length(infectable)) {
        if (!is.null(salt)) {
          # if HBM, need to reseed RNG for each pair => draw
          # easier to salt => draw all possible => slice relevant
          hash_seed(salt, "inf", infector, t)
          infectees <- infectable[
            (runif(tail(infectable, 1)) < transmission_p)[infectable]
          ]
        } else {
          infectees <- sample(
            infectable, rbinom(1, length(infectable), transmission_p)
          )
        }
        dY[infectees] <- 1
        infectable <- setdiff(infectable, infectees)
        # if this infector would recovery, indicate 2 => 3 transition
        if (runif(1) < recovery_p) dY[infector] <- 1
      }
    }

    incI <- sum((dY == 1) & (y != infectious))

    # if its time for vaccination
    if (t == vax_t) {
      if (!is.null(salt)) {
        # if HBM, need to reseed for each individual
        # include the t here - what if the model changed vax_t?
        psalt <- hash_seed(salt, "vax", t)
        vaccinees <- infectable[
          (runif(tail(infectable, 1)) < vaccination_p)[infectable]
        ]
      } else {
        vaccinees <- sample(
          infectable, rbinom(1, length(infectable), vaccination_p)
        )
      }
      # for people still susceptible, indicate 1 => 3 transition
      dY[vaccinees] <- 2
    }

    return(list(dY, incI))
  })
}

Comparison

Let’s run these simulation for a population of 4000 individuals, 75 time steps, and 500 samples. We’ll consider a very small amount of vaccination - only 3.3%. We want to assess the cumulative averted incidence of this scenario compared to no vaccination.

For the “non-matching” version, we will still match on random number seeds; we refer to that as seed-matched-only, or SMO. Because our model starts vaccination after some delay, we should still see absolutely identical outcomes for SMO up to that point. However, after that point, the RNG will see a different traversal.

We’re going to discard outcomes were the epidemic dies out prior to vaccination. This is a debate-worthy choice - e.g. if we want to understand the potential benefit of purchasing some program, that should include the possibility that it was unnecessary. For this model world, however, those sort of considerations apply equally to either HBM or SMO, and we want to focus on the ability to resolve intervention impact when there is an epidemic. Hence, we discard those outcomes.

# setup parameters
samplen <- 500L
pop <- 4000L
maxt <- 75L
initI <- 5L
ps <- list(
  transmission_p = 1 - exp(-0.78 / pop),
  # ~ 0.78 infections produced per day in fully susceptible pop
  recovery_p = 1 - exp(-0.44) # ~ 2.3 days infectious
  # implies R0 ~ 1.8
)
# non-intervention, set vax_t < 0
nonps <- c(ps, list(
  vax_t = -1, vaccination_p = NA
))
# intervention, vax_t == 10, 3.3% chance to happen
intps <- c(ps, list(
  vax_t = 10, vaccination_p = 0.033
))
# create populations
yinit <- rep.int(susceptible, pop)
yinit[1:initI] <- infectious
yinit_nonid <- c(pop - initI, initI, 0L)

stepper <- function(
    maxtime = maxt, y0,
    dFUN, pars, seed, HBM) {
  # if not HBM & a seed provided => SMO => set initial seeding
  if (!missing(seed) && !HBM) set.seed(seed)

  # reserve data structure
  res <- integer(maxtime + 1)
  res[1] <- 1
  y <- y0
  # run the simulation
  for (i in seq_len(maxtime)) {
    dY <- if (HBM) dFUN(i, y, pars, seed) else dFUN(i, y, pars)
    y <- y + dY[[1]]
    res[i + 1] <- dY[[2]]
  }

  return(res)
}

Sampling

Now we run all the samples, using plain sample ID as the SMO seed or HBM salt, for all the models.

set.seed(8675309)
nonid_no_match <- system.time(
  ni_non_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "NON", model = "nonID",
      not_i = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

set.seed(8675309)
id_no_match <- system.time(
  wi_non_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "NON", model = "ID",
      not_i = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = nonps, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = intps, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

nonid_seed_match_only <- system.time(
  ni_smo_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "SMO", model = "nonID",
      not_i = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps, seed = seed, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps, seed = seed, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

id_seed_match_only <- system.time(
  wi_smo_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "SMO", model = "ID",
      not_i = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = nonps, seed = seed, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = intps, seed = seed, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

nonid_hash_based_matching <- system.time(
  ni_hbm_dt <- seq_len(samplen) |> mclapply(function(seed) {
    data.table(
      type = "HBM", model = "nonID",
      not_i = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps, seed = seed, HBM = TRUE
      )),
      withi = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps, seed = seed, HBM = TRUE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

id_hash_based_matching <- system.time(
  wi_hbm_dt <- seq_len(samplen) |> mclapply(function(seed) {
    data.table(
      type = "HBM", model = "ID",
      not_i = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = nonps, seed = seed, HBM = TRUE
      )),
      withi = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = intps, seed = seed, HBM = TRUE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

sampledt <- rbind(ni_non_dt, wi_non_dt, ni_smo_dt, wi_smo_dt, ni_hbm_dt, wi_hbm_dt)

Timings:

Now we compute the difference in series, and determine if there are any to drop:

sampledt[, averted := not_i - withi]
sampledt[, time := seq_len(.N) - 1L, by = .(type, model, sample)]
keepers <- sampledt[, .(keep = not_i[intps$vax_t + 1L] != not_i[.N]), by = .(type, model, sample)]
keepers[, sum(keep) / .N, by = .(model, type)]

reduced_dt <- sampledt[keepers, on = .(type, model, sample)][keep == TRUE]

qtiled_dt <- reduced_dt[,
  {
    qs <- quantile(averted, probs = c(0.25, 0.5, 0.75))
    .(lo = qs[1], md = qs[2], hi = qs[3])
  },
  by = .(type, model, time)
]

… and lastly, we should visualize these