
Randomly generate a list of two matrices containing multivariate binary data.

Usage

genMVBinaryData(
  n = c(30, 30),
  d = 2000,
  null_hyp = TRUE,
  r = 0.3,
  epsilon = 0.2,
  sigma = c(0.3, 0.1),
  gamma = 0.3,
  p0 = 0.1
)

Arguments

n

Vector of length 2 containing group size (i.e. number of samples) for each group. Default value is (30, 30).

d

Number of variables (dimension) of the data to be generated. Default value is 2000.

null_hyp

Boolean indicating whether group means should be the same (i.e. null hypothesis is TRUE) or different (i.e. null hypothesis is FALSE). Default value is TRUE.

r

Mean of the distribution of \(U_{ij} \sim Ber(r)\). See details below. Increasing r increases the correlation among the d variables. Default value is 0.3.

epsilon

Mixing probability in the mixture model that generates the probability vectors, \(\beta \sim Ber(\epsilon)\). See details below. Decreasing epsilon increases sparsity, and increasing it decreases sparsity. Default value is 0.2.

sigma

Vector of length 2 giving, for each group, the upper bound \(\sigma_c\) of the uniform distribution used to generate the probability vectors. See details below. Default value is (0.3, 0.1).

gamma

Mean of the distribution of \(Z_i \sim Ber(\gamma)\). See details below. Default value is 0.3.

p0

Baseline probability \(p_0\) in the mixture model that generates the probability vectors. See details below. Default value is 0.1.
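As a rough guide to how r and gamma act, the model in Details, \(X_{cij} = (1 - U_{ij})Y_{icj} + U_{ij}Z_{i}\), can be worked through by hand. The sketch below assumes (this is not stated on this page) that \(U_{ij}\), \(Y_{icj}\) and \(Z_i\) are mutually independent given the probability vector. Under that assumption, and using \(Z_i^2 = Z_i\):

\[
E[X_{cij}] = (1 - r)\,p_{jc} + r\,\gamma,
\qquad
Cov(X_{cij}, X_{cik}) = r^2\,\gamma\,(1 - \gamma) \quad (j \neq k).
\]

So, under this assumption, the covariance between two variables of the same sample grows with \(r^2\) and vanishes when r = 0, which is consistent with the description of r above.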

Value

X: List of two n by d matrices each containing the generated datasets.

p: The probability vectors used to generate the two datasets.

null_hyp: Value of the null_hyp parameter.

r: Value of the r parameter.

epsilon: Value of the epsilon parameter.

Details

The \((i,j)^{th}\) entry of the \(c^{th}\) matrix is \(X_{cij} = (1 - U_{ij})Y_{icj} + U_{ij}Z_{i}\) where

  • \(U_{ij} \sim Ber(r)\),

  • \(Z_i \sim Ber(\gamma)\),

  • \(Y_{icj} \sim Ber(p_{jc})\) where

    • \(p_{jc} = (1 - \beta)p_{0} + \beta h_c\)

    • \(\beta \sim Ber(\epsilon)\)

    • \(h_c \sim Uniform(0,\sigma_c)\)
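
The model above can be sketched in a few lines of base R for a single group. This is only an illustration, not the package's implementation: `sim_group` is a hypothetical helper, it assumes \(\beta\) and \(h_c\) are drawn independently for each variable j, and it does not handle the null_hyp argument.

```r
# Hypothetical helper (not part of the package): simulate one group's
# n x d binary matrix from the mixture model described in Details.
sim_group <- function(n, d, r = 0.3, epsilon = 0.2, sigma_c = 0.3,
                      gamma = 0.3, p0 = 0.1) {
  # Assumption: beta and h are drawn independently for each variable j.
  beta <- rbinom(d, size = 1, prob = epsilon)   # beta ~ Ber(epsilon)
  h <- runif(d, min = 0, max = sigma_c)         # h_c ~ Uniform(0, sigma_c)
  p <- (1 - beta) * p0 + beta * h               # p_{jc}

  # Y_{icj} ~ Ber(p_{jc}); matrices fill column by column, so
  # rep(p, each = n) gives every row of column j the probability p[j].
  Y <- matrix(rbinom(n * d, size = 1, prob = rep(p, each = n)),
              nrow = n, ncol = d)
  # U_{ij} ~ Ber(r)
  U <- matrix(rbinom(n * d, size = 1, prob = r), nrow = n, ncol = d)
  # Z_i ~ Ber(gamma), shared by all d variables of sample i
  Z <- rbinom(n, size = 1, prob = gamma)

  # Z is recycled down each column, so row i is combined with Z[i].
  X <- (1 - U) * Y + U * Z
  list(X = X, p = p)
}

set.seed(1)
g1 <- sim_group(n = 30, d = 2000, sigma_c = 0.3)
dim(g1$X)
#> [1]   30 2000
```

Because \(Z_i\) is shared across all variables of a sample, a larger r makes more entries of a row copy the same value, which is where the dependence among variables comes from in this sketch.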

See also

Amanda Plunkett & Junyong Park (2017), Two-sample tests for sparse high-dimensional binary data, Communications in Statistics - Theory and Methods, 46:22, 11181-11193

Junyong Park & J. Davis (2011), Estimating and testing conditional sums of means in high dimensional multivariate binary data, Journal of Statistical Planning and Inference, 141:1021-1030

Examples

binData <- genMVBinaryData(null_hyp = TRUE)$X

# Check the dimension of each matrix:
lapply(binData, dim)
#> [[1]]
#> [1]   30 2000
#> 
#> [[2]]
#> [1]   30 2000
#>