Test two multivariate binary datasets

Performs a two-sample test on two sets of multivariate binary observations, testing \(H_0:\) the underlying probability vectors are the same vs. \(H_1:\) they are different.

Usage

mvbinary.test(x, y = NULL, numPerms = 5000)

Arguments

x, y: Matrices (or dataframes) containing multiple integer vector observations as rows. x and y must be the same type and dimension. Alternatively, x can be a list of two matrices (or dataframes) to be compared. In this case, y is NULL by default.
numPerms: Number of permutations to use to calculate the p-value. Default value is 5000.

Value

A list containing the computed statistic, a list of statistics (null.statistics) used to construct the null distribution (from the permutation method), and the associated pvalue. The pvalue is the percent of null.statistics that are more extreme than the statistic computed from the original dataset.

Details

The statistic is \(T = \sum_{j=1}^d D_j^2 I( |D_j| \ge \delta(d))\) where \(d\) is the dimension of the data. Additionally:

\(D_j = (\hat{p}_{1j} - \hat{p}_{2j})/\sqrt{ \hat{p}_j (1 - \hat{p}_j)(1/n_1 + 1/n_2) }\)
\(\hat{p}_{cj}\) is the estimate of \(p_{cj}\) for the \(c^{th}\) group calculated by the \(j^{th}\) column mean
\(\hat{p}_j\) is the pooled estimate for the \(j^{th}\) variable.
\(\delta(d) = \sqrt{2 \log (a_d d)}\) where \(a_d = (\log d)^{-2}\)
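
The formulas above can be sketched in base R as follows. This is an illustrative sketch, not the package's own code: the function name threshold_stat is hypothetical, the pooled estimate is assumed to be the sample-size-weighted average of the two group column means, and columns with zero pooled variance are assumed to contribute nothing to the sum.

```r
# Sketch of the thresholded statistic T (hypothetical helper, not the package source).
# x, y: 0/1 matrices (or data frames) with observations as rows.
threshold_stat <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n1 <- nrow(x)
  n2 <- nrow(y)
  d  <- ncol(x)

  p1 <- colMeans(x)                          # group 1 estimates
  p2 <- colMeans(y)                          # group 2 estimates
  p  <- (n1 * p1 + n2 * p2) / (n1 + n2)      # pooled estimate (assumed weighting)

  se <- sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
  # Assumption: a column that is constant across both groups gets D_j = 0.
  D  <- ifelse(se > 0, (p1 - p2) / se, 0)

  a_d   <- log(d)^(-2)
  delta <- sqrt(2 * log(a_d * d))            # threshold delta(d); infinite when d = 1

  sum(D^2 * (abs(D) >= delta))
}

# Column 1 differs completely between groups; column 2 is identical.
x <- matrix(c(1, 1, 1, 1,  1, 0, 1, 0), ncol = 2)
y <- matrix(c(0, 0, 0, 0,  1, 0, 1, 0), ncol = 2)
threshold_stat(x, y)
```

In this toy input only the first column exceeds the threshold, so the statistic reduces to \(D_1^2\).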

The p-value associated with the statistic is calculated using the permutation method. The observation vectors are repeatedly shuffled between groups, and the statistic is recalculated after each shuffle. The resulting statistics form a null distribution, which is used to calculate the p-value.
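
The shuffling procedure described above might look roughly like the sketch below. The names perm_pvalue, stat_fn, and toy_stat are hypothetical and chosen for illustration; the sketch also assumes that ties count as "more extreme" and that the p-value is returned as a proportion, neither of which is confirmed by this page.

```r
# Generic permutation p-value sketch (hypothetical helper, not the package source).
# stat_fn: any function taking two matrices and returning a single number.
perm_pvalue <- function(x, y, stat_fn, numPerms = 5000) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  pooled <- rbind(x, y)
  n1 <- nrow(x)
  n  <- nrow(pooled)

  observed <- stat_fn(x, y)

  # Reassign rows to the two groups at random and recompute the statistic.
  null_stats <- replicate(numPerms, {
    idx <- sample.int(n, n1)
    stat_fn(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE])
  })

  list(statistic       = observed,
       null.statistics = null_stats,
       # Assumption: ties are counted as at least as extreme.
       pvalue          = mean(null_stats >= observed))
}

# Toy statistic for demonstration only: squared distance between column means.
toy_stat <- function(a, b) sum((colMeans(a) - colMeans(b))^2)

set.seed(1)
x <- matrix(rbinom(40, 1, 0.5), ncol = 4)
y <- matrix(rbinom(40, 1, 0.5), ncol = 4)
res <- perm_pvalue(x, y, toy_stat, numPerms = 200)
res$pvalue
```

Passing the statistic as a function keeps the shuffling logic separate from the choice of statistic, so the same loop works with the thresholded statistic defined in this section.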

Warning

As described in the reference below, this method may not perform well (low power) on highly correlated variables.

Also, note that for large values of numPerms, run time may be long. However, larger values of numPerms produce more accurate estimates of the p-value.
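
To get a rough sense of the trade-off, a p-value estimated from randomly sampled permutations behaves like a binomial proportion, so its Monte Carlo standard error is approximately \(\sqrt{p(1-p)/B}\) for \(B\) permutations. The helper name mc_se below is hypothetical, and the sketch assumes permutations are drawn independently at random.

```r
# Approximate Monte Carlo standard error of a permutation p-value estimate
# (hypothetical helper; assumes independently sampled permutations).
mc_se <- function(p, numPerms) sqrt(p * (1 - p) / numPerms)

# Standard error near p = 0.05 for increasing numbers of permutations.
mc_se(0.05, c(500, 5000, 50000))
```

Each tenfold increase in numPerms shrinks the standard error by a factor of about 3.2, while run time grows roughly tenfold.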

Examples
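
The following is a minimal usage sketch based only on the signature shown under Usage. The simulated data, the sample sizes, and the probability vectors are illustrative assumptions; the call is guarded so that the code runs even when the function is not loaded.

```r
# Simulate two groups of binary observations (illustrative values only).
set.seed(42)
n <- 30                      # observations per group
d <- 10                      # number of binary variables
p1 <- rep(0.3, d)            # success probabilities for group 1
p2 <- p1
p2[1:2] <- 0.7               # group 2 differs in the first two variables

# Matrices are filled column by column, so each column gets its own probability.
x <- matrix(rbinom(n * d, 1, rep(p1, each = n)), nrow = n)
y <- matrix(rbinom(n * d, 1, rep(p2, each = n)), nrow = n)

# Run the test only if mvbinary.test is available in the session.
if (exists("mvbinary.test")) {
  res <- mvbinary.test(x, y, numPerms = 1000)
  str(res)

  # Equivalent call with both matrices supplied as a list in x.
  res_list <- mvbinary.test(list(x, y), numPerms = 1000)
}
```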