Generate multinomial data — genMultinomialData • hddtest

Generate two sets of multinomially distributed vectors using rmultinom. Useful for hypothesis testing simulations. Three built-in experiments, each defining its own probability vectors (of length \(k\)), are available in addition to user-specified probabilities p:

Experiment 1: \(p_{1i} = \frac{1/i^\alpha}{\sum_{j=1}^{k} 1/j^\alpha}\). When the null_hyp parameter is FALSE, the probability vector for the 2nd group is generated by swapping the 1st and \(m\)-th entries.
Experiment 2: \(p_{1i} = 1/k\). When the null_hyp parameter is FALSE, \(p_{2i} = 0\) for \(i \in \{1, \dots, b\}\) and \(p_{2,b+1} = \sum_{i=1}^{b+1} p_{1i} = (b+1)/k\), where \(b\) is the number of zeroed entries (the numzero argument).
Experiment 3: \(p_{1i} = 1/k\). When the null_hyp parameter is FALSE, \(p_{2i} = 0\) for \(i \in \{1, \dots, b\}\) and \(p_{2i} = 1/(k - b)\) for \(i > b\).
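
The three constructions can be sketched in base R as follows. This is an illustration of the formulas above, not the package's source: experiment_probs is a hypothetical helper, \(b\) is assumed to correspond to the numzero argument, and in experiment 2 the entries after position \(b+1\) are assumed to stay at \(1/k\) (which is what makes the second row sum to 1).

```r
# Hypothetical helper (not part of hddtest): returns a 2 x k matrix whose
# rows are the probability vectors of group 1 and group 2.
experiment_probs <- function(expID, k, alpha = 0.45, m = 1000, b = 50,
                             null_hyp = FALSE) {
  stopifnot(expID %in% 1:3)
  if (expID == 1) {
    w <- 1 / seq_len(k)^alpha   # 1 / i^alpha for i = 1, ..., k
    p1 <- w / sum(w)            # normalise so the entries sum to 1
  } else {
    p1 <- rep(1 / k, k)         # uniform probabilities
  }
  p2 <- p1                      # under the null, both groups share p1
  if (!null_hyp) {
    if (expID == 1) {
      stopifnot(m >= 2, m <= k)
      p2[c(1, m)] <- p1[c(m, 1)]             # swap 1st and m-th entries
    } else {
      stopifnot(b >= 1, b <= k - 1)
      p2[seq_len(b)] <- 0                    # first b entries set to zero
      if (expID == 2) {
        # Assumption: the removed mass is moved to entry b + 1 and the
        # remaining entries keep their value 1 / k.
        p2[b + 1] <- sum(p1[seq_len(b + 1)])
      } else {
        p2[(b + 1):k] <- 1 / (k - b)         # uniform on the rest
      }
    }
  }
  rbind(p1, p2)
}

probs <- experiment_probs(expID = 2, k = 10, b = 3)
rowSums(probs)  # both rows should sum to 1 (up to floating-point error)
```

A matrix built this way has the 2 by \(k\) shape expected by the p argument, which may be convenient for checking a custom alternative before simulating from it.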

Usage

genMultinomialData(
  null_hyp = TRUE,
  p = NULL,
  k = 2000,
  n = c(8000, 8000),
  sample_size = 30,
  expID = 1,
  alpha = 0.45,
  m = 1000,
  numzero = 50,
  ...
)

Arguments

null_hyp: logical; if TRUE, generate both groups from the same distribution. Default value is TRUE.
p: An optional 2 by \(k\) matrix specifying the probabilities of the \(k\) categories for each of the two groups. Each row of p must sum to 1. If defined, all remaining parameters in the function definition are ignored. Default value is NULL.
k: integer giving the dimension (number of categories). Default is 2000.
n: Vector of length 2 giving, for each group, the size parameter of the multinomial distribution, i.e. the total number of objects distributed among the \(k\) bins in each draw. Default is c(8000, 8000).
sample_size: integer specifying the number of random vectors to generate for each of the two groups. Default is 30.
expID: Experiment number (1, 2, or 3). Default is 1.
alpha: Number between 0 and 1. Used for experiment 1. Default is 0.45.
m: integer between 2 and \(k\). Used in experiment 1 for the alternative hypothesis. Default is 1000.
numzero: integer between 1 and \(k\)-1. Used in experiments 2 and 3 for the alternative hypothesis. Default is 50.
...: Additional parameters.

Value

A list containing two matrices each having dimension sample_size by \(k\).
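
The shape of the return value can be reproduced with rmultinom directly. The following is a minimal sketch under stated assumptions, not the package's implementation: the small values of k, sample_size, n, and p are made up for illustration, and the transpose step is an assumption about how the sample_size by \(k\) layout is obtained.

```r
set.seed(1)  # only to make this sketch reproducible

k <- 5                 # number of categories (illustrative value)
sample_size <- 4       # random vectors per group (illustrative value)
n <- c(100, 100)       # multinomial size parameter for each group
p <- rbind(rep(1 / k, k),            # group 1: uniform probabilities
           c(0, 0, 1, 1, 1) / 3)     # group 2: first two categories zeroed

# rmultinom() returns a k x sample_size matrix (one draw per column),
# so each result is transposed to get one random vector per row.
X <- lapply(1:2, function(g) {
  t(rmultinom(sample_size, size = n[g], prob = p[g, ]))
})

lapply(X, dim)   # each element is sample_size by k
rowSums(X[[1]])  # every row sums to n[1]
```

Each row is one multinomial draw, so its entries sum to the corresponding element of n, and categories with probability 0 never receive any counts.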

Examples

#Generate data when the null hypothesis is FALSE:
X <- genMultinomialData(FALSE)

#Dimension of the two generated datasets:
lapply(X, dim)
#> [[1]]
#> [1]   30 2000
#> 
#> [[2]]
#> [1]   30 2000
#> 

#Proportion of entries less than 5 in the first dataset:
sum(X[[1]]<5)/(nrow(X[[1]])*ncol(X[[1]]))
#> [1] 0.6975333