# Hash-based Matched Pseudo-Random Number Generation

`hashprng.Rmd`

## Overview

This vignette demonstrates the practical use of
`hashprng::hash_seed`

; if you want a detailed explanation of
*why* to use this approach, read the paper associated with the
package citation: (TODO).

Briefly, hash-based matched pseudo-random number generation (HBM-PRNG) can ensure that the same “events” across different simulation scenarios are resolved consistently, while also still permitting stochastic simulation, and incurring minimal additional computational overhead. Matched samples can then be analyzed pair-wise, leading to higher resolution calculation of the overall impact of the cross-scenario varying features.

However, researchers still need to specify what constitutes “the
same” event, both in terms of what features are hashed and how they
implement the model structure. In this vignette, we focus on how to use
`hashprng::hash_seed`

, but do so using two different
implementations of the same model world and two different definitions of
“same” for events.

## Standards of Matching

Throughout this vignette we consider a few matching regimes:

- Non-matched (NON): this is equivalent to drawing an item from this random process and then one from the same-but-slightly-perturbed process and comparing them. Note: this kind of simulation should still be seeded, to ensure reproducibile PRNG.
- Seed-Matched-Only (SMO): this scheme ensures that compared runs both
have the same PRNG seeding. This can guarantee, for example, that the
history leading up to an intervention is the same. However, generally
once simulation trajectory diverges, the PRNG is exercised differently,
and events that
*should*be the same between runs are not guaranteed to see the same PRNG draws. - Fully-Event-Matched (FEM): this scheme ensures that compared runs have all of their identical events resolved consistently. Any event which occurs in both simulations gets the same draw from the PRNG.

Hash-based matched (HBM) PRNG is an approach to acheiving FEM simulations. There are alternatives, for example managing multiple PRNGs or precomputing all possible outcomes. In general, HBM imposes a guaranteed-but-low computational cost, scaling with the number of the PRNG draws in any given simulation, with minimal implementation burden on the modeller.

We’ll be making comparisons across NON, SMO, and HBM in terms of implementation, compute cost, and calculation resolution.

## Model World and Scenario

We’ll be considering an infectious disease system of the
*S*usceptible-*I*nfectious- *R*emoved variety and
an intervention where, after some delay from initial infection
introduction, a portion of the population are spontaneously moved to
*R* (representing e.g. a reactive vaccination campaign).

## Model Implementation

We’ll implement this model in terms of discrete time steps, discrete individuals, and event probabilities. At each time step:

- We’ll consider if infectious individuals have exposing contact with
susceptible individuals - this converts
*S*to*I*at the end of the step. - We’ll consider if infectious individuals recover -
*I*to*R*. - If it’s time for the move, we’ll consider which susceptible
individuals to remove -
*S*to*R*.

This is essentially a perturbation to the classic Reed-Frost model, with the modification that people may be infectious for more than a single time step ( as well as introducing an intervention).

We’re also going to implement this model two ways, each with a switch for whether or not to use hash-based matching for the random number draws. We’ll flip this switch to show matching-vs-not changes conclusions comparing across scenarios, as well as to measure the resource cost associated with the higher resolution measurement. We’ll also create a setup where the seed is only set once for the NON comparison.

The other element we need to finally specify the model is what
constitutes an event. We’re going to offer two definitions and explore
the consequences. Those two definitions hinge on whether the model’s
discrete individuals have *identity* or not.

If individuals have identity, then events are distinguished by
*who* is involved. If they do not, then events are only
distinguished by *what* and *when*.

```
# define some variables to differentiate between individual states
# for the non-identity model, these will be compartment indices
# for the identity model, these are individual states
susceptible <- 1L
infectious <- 2L
removed <- 3L
```

### Non-Identity Model Implementation

For the non-identity model, we assume that at each time step only the
*size* of the relative populations should determine event
outcomes, e.g. how many people are infected. We can accomplish that by
drawing from the binomial distribution.

For HBM, we need to match time and event. But since there are (at
most) three draws per step, always occurring in the same order, we can
simply match on time and then quantile draws. We are assuming here that
`runif`

always advances underlying PRNG state by the number
of deviates requested, but that `rbinom`

might advance state
by a variable amount depending on both parameters and state, hence the
need to draw quantiles and then invert to outcomes.

N.B. `?rbinom`

indicates draws are done per
“Kachitvichyanukul, V. and Schmeiser, B. W. (1988) Binomial random
variate generation. Communications of the ACM, 31, 216–222”, which
describes and algorithm BPTE which does indeed draw uniform deviates a
variable number of times per binomial draw, depending on the \(n\) and \(p\) arguments.

```
#' @param t an integer scalar, the simulation time
#' @param y an integer vector, the current counts of individuals in S, I, and R
#' @param parameters a list, `transmission_p`, `recovery_p`, `vaccination_p`,
#' and `vax_t`
#' @param seed optional; if non-null, using HBM
nonidentity_dStep <- function(t, y, parameters, seed = NULL) {
with(parameters, {
# extract the S, I population counts
Ninf <- y[infectious]
Nsus <- y[susceptible]
# if we are matching events
if (!is.null(seed)) {
# draw the same quantiles for each when
hash_seed(seed, t)
evtqs <- runif(2)
# convert these to the outcomes
newinf <- qbinom(evtqs[1], Nsus, 1 - (1 - transmission_p)^Ninf)
newrem <- qbinom(evtqs[2], Ninf, recovery_p)
} else { # simply make the binomial draws
newinf <- rbinom(1, Nsus, 1 - (1 - transmission_p)^Ninf)
newrem <- rbinom(1, Ninf, recovery_p)
}
# move infectees S -> I, recoveries I -> R
dY <- c(S = -newinf, I = newinf - newrem, R = newrem)
if (t == vax_t) {
if (!is.null(seed)) {
# n.b. still on a consistent PRNG state for this time, so can just draw
newvax <- qbinom(runif(1), Nsus - newinf, vaccination_p)
} else {
newvax <- rbinom(1, Nsus - newinf, vaccination_p)
}
# move vaccinees
dY[susceptible] <- dY[susceptible] - newvax
dY[removed] <- dY[removed] + newvax
}
# return the overall state change, as well as new incidence
return(list(dY, newinf))
})
}
```

### Higher Resolution Implementation

For the identity based model, we assume who *particularly*
gets infected or vaccinated matters. Thus, for HBM we need to hash both
*who* and *when* for infection and recovery. We can again
take advantage of a consistent number of draws (because we always
consider exposing all individuals, susceptible or not) to avoid reseting
the PRNG for the recovery draw.

We also set the PRNG for vaccination. This probably isn’t necessary: vaccination is itself the only factor that might change trajectories, so there are no differences in trajectory prior to considering vaccination. However, this is a very fragile arrangement. Adding any other interventions or otherwise perturbing model assumptions could easily change traversal of the PRNG and result in unmatched vaccinee selection.

N.B., for both the matched and unmatched versions we’re generally overdrawing on random number deviates. While it might be possible to reduce computational time with fewer draws, that would entail either other computational work (e.g. calculating who is eligible for infection) or more convoluted data structures. Those would be plausible for the unmatched version, at this model complexity, but likely become less practical with increasing complexity.

```
#' @param t an integer scalar, the simulation time
#' @param y an integer vector, the individual states: S, I, R
#' @param parameters a list, `transmission_p`, `recovery_p`, `vaccination_p`,
#' and `vax_t`
#' @param salt an integer scalar, optional; if present, use hash-based matching
identity_dStep <- function(t, y, parameters, salt = NULL) {
with(parameters, {
N <- length(y)
dY <- integer(N)
incI <- 0
# consider whether each infectious individual exposes susceptible
# individuals
infectable <- which(y == susceptible)
for (infector in which(y == infectious)) {
if (length(infectable)) {
if (!is.null(salt)) {
# if HBM, need to reseed RNG for each pair => draw
# easier to salt => draw all possible => slice relevant
hash_seed(salt, "inf", infector, t)
infectees <- infectable[
(runif(tail(infectable, 1)) < transmission_p)[infectable]
]
} else {
infectees <- sample(
infectable, rbinom(1, length(infectable), transmission_p)
)
}
dY[infectees] <- 1
infectable <- setdiff(infectable, infectees)
# if this infector would recovery, indicate 2 => 3 transition
if (runif(1) < recovery_p) dY[infector] <- 1
}
}
incI <- sum((dY == 1) & (y != infectious))
# if its time for vaccination
if (t == vax_t) {
if (!is.null(salt)) {
# if HBM, need to reseed for each individual
# include the t here - what if the model changed vax_t?
psalt <- hash_seed(salt, "vax", t)
vaccinees <- infectable[
(runif(tail(infectable, 1)) < vaccination_p)[infectable]
]
} else {
vaccinees <- sample(
infectable, rbinom(1, length(infectable), vaccination_p)
)
}
# for people still susceptible, indicate 1 => 3 transition
dY[vaccinees] <- 2
}
return(list(dY, incI))
})
}
```

## Comparison

Let’s run these simulation for a population of 4000 individuals, 75
time steps, and 500 samples. We’ll consider a very small amount of
vaccination - only 3.3%. We want to assess the cumulative averted
*incidence* of this scenario compared to no vaccination.

For the “non-matching” version, we will still match on random number seeds; we refer to that as seed-matched-only, or SMO. Because our model starts vaccination after some delay, we should still see absolutely identical outcomes for SMO up to that point. However, after that point, the RNG will see a different traversal.

We’re going to discard outcomes were the epidemic dies out prior to vaccination. This is a debate-worthy choice - e.g. if we want to understand the potential benefit of purchasing some program, that should include the possibility that it was unnecessary. For this model world, however, those sort of considerations apply equally to either HBM or SMO, and we want to focus on the ability to resolve intervention impact when there is an epidemic. Hence, we discard those outcomes.

```
# setup parameters
samplen <- 500L
pop <- 4000L
maxt <- 75L
initI <- 5L
ps <- list(
transmission_p = 1 - exp(-0.78 / pop),
# ~ 0.78 infections produced per day in fully susceptible pop
recovery_p = 1 - exp(-0.44) # ~ 2.3 days infectious
# implies R0 ~ 1.8
)
# non-intervention, set vax_t < 0
nonps <- c(ps, list(
vax_t = -1, vaccination_p = NA
))
# intervention, vax_t == 10, 3.3% chance to happen
intps <- c(ps, list(
vax_t = 10, vaccination_p = 0.033
))
# create populations
yinit <- rep.int(susceptible, pop)
yinit[1:initI] <- infectious
yinit_nonid <- c(pop - initI, initI, 0L)
stepper <- function(
maxtime = maxt, y0,
dFUN, pars, seed, HBM) {
# if not HBM & a seed provided => SMO => set initial seeding
if (!missing(seed) && !HBM) set.seed(seed)
# reserve data structure
res <- integer(maxtime + 1)
res[1] <- 1
y <- y0
# run the simulation
for (i in seq_len(maxtime)) {
dY <- if (HBM) dFUN(i, y, pars, seed) else dFUN(i, y, pars)
y <- y + dY[[1]]
res[i + 1] <- dY[[2]]
}
return(res)
}
```

## Sampling

Now we run all the samples, using plain sample ID as the SMO seed or HBM salt, for all the models.

```
set.seed(8675309)
nonid_no_match <- system.time(
ni_non_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
data.table(
type = "NON", model = "nonID",
not_i = cumsum(stepper(
y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps, HBM = FALSE
)),
withi = cumsum(stepper(
y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps, HBM = FALSE
))
)
}, mc.cores = 5) |> rbindlist(idcol = "sample")
)
set.seed(8675309)
id_no_match <- system.time(
wi_non_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
data.table(
type = "NON", model = "ID",
not_i = cumsum(stepper(
y0 = yinit, dFUN = identity_dStep, pars = nonps, HBM = FALSE
)),
withi = cumsum(stepper(
y0 = yinit, dFUN = identity_dStep, pars = intps, HBM = FALSE
))
)
}, mc.cores = 5) |> rbindlist(idcol = "sample")
)
nonid_seed_match_only <- system.time(
ni_smo_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
data.table(
type = "SMO", model = "nonID",
not_i = cumsum(stepper(
y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps, seed = seed, HBM = FALSE
)),
withi = cumsum(stepper(
y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps, seed = seed, HBM = FALSE
))
)
}, mc.cores = 5) |> rbindlist(idcol = "sample")
)
id_seed_match_only <- system.time(
wi_smo_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
data.table(
type = "SMO", model = "ID",
not_i = cumsum(stepper(
y0 = yinit, dFUN = identity_dStep, pars = nonps, seed = seed, HBM = FALSE
)),
withi = cumsum(stepper(
y0 = yinit, dFUN = identity_dStep, pars = intps, seed = seed, HBM = FALSE
))
)
}, mc.cores = 5) |> rbindlist(idcol = "sample")
)
nonid_hash_based_matching <- system.time(
ni_hbm_dt <- seq_len(samplen) |> mclapply(function(seed) {
data.table(
type = "HBM", model = "nonID",
not_i = cumsum(stepper(
y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps, seed = seed, HBM = TRUE
)),
withi = cumsum(stepper(
y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps, seed = seed, HBM = TRUE
))
)
}, mc.cores = 5) |> rbindlist(idcol = "sample")
)
id_hash_based_matching <- system.time(
wi_hbm_dt <- seq_len(samplen) |> mclapply(function(seed) {
data.table(
type = "HBM", model = "ID",
not_i = cumsum(stepper(
y0 = yinit, dFUN = identity_dStep, pars = nonps, seed = seed, HBM = TRUE
)),
withi = cumsum(stepper(
y0 = yinit, dFUN = identity_dStep, pars = intps, seed = seed, HBM = TRUE
))
)
}, mc.cores = 5) |> rbindlist(idcol = "sample")
)
sampledt <- rbind(ni_non_dt, wi_non_dt, ni_smo_dt, wi_smo_dt, ni_hbm_dt, wi_hbm_dt)
```

Timings:

Now we compute the difference in series, and determine if there are any to drop:

```
sampledt[, averted := not_i - withi]
sampledt[, time := seq_len(.N) - 1L, by = .(type, model, sample)]
keepers <- sampledt[, .(keep = not_i[intps$vax_t + 1L] != not_i[.N]), by = .(type, model, sample)]
keepers[, sum(keep) / .N, by = .(model, type)]
reduced_dt <- sampledt[keepers, on = .(type, model, sample)][keep == TRUE]
qtiled_dt <- reduced_dt[,
{
qs <- quantile(averted, probs = c(0.25, 0.5, 0.75))
.(lo = qs[1], md = qs[2], hi = qs[3])
},
by = .(type, model, time)
]
```

… and lastly, we should visualize these