Test two multivariate binary datasets
Performs a two-sample test for two multivariate binary datasets, testing \(H_0:\) the underlying probability vectors are the same vs. \(H_1:\) they are different.
Arguments
- x, y
Matrices (or dataframes) containing multiple integer vector observations as rows.
x and y must be the same type and dimension. Alternatively, x can be a list of two matrices (or dataframes) to be compared. In this case, y is NULL by default.
- numPerms
Number of permutations to use to calculate the p-value. Default value is 5000.
Value
A list containing the computed statistic, a list of statistics
(null.statistics) used to construct the null distribution (from the
permutation method), and the associated pvalue. The pvalue is
the proportion of null.statistics that are more extreme than the
statistic computed from the original dataset.
Details
The statistic is \(T = \sum_{j=1}^{d} D_j^2 \, I(|D_j| \ge \delta(d))\) where \(d\) is the dimension of the data. Additionally:
\(D_j = (\hat{p}_{1j} - \hat{p}_{2j}) / \sqrt{\hat{p}_j (1 - \hat{p}_j)(1/n_1 + 1/n_2)}\)
\(\hat{p}_{cj}\) is the estimate of \(p_{cj}\) for the \(c^{th}\) group calculated by the \(j^{th}\) column mean
\(\hat{p}_j\) is the pooled estimate for the \(j^{th}\) variable.
\(\delta(d) = \sqrt{2 \log(a_d d)}\) where \(a_d = (\log d)^{-2}\)
The p-value associated with the statistic is calculated using the permutation method. The observation vectors are repeatedly shuffled between groups, and the statistic is recalculated after each shuffle. The resulting values form a null distribution, which is used to calculate the p-value.
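The statistic and the permutation step described above can be sketched in a few lines of base R. This is an illustrative sketch, not the package's implementation: the function names sketch_statistic and sketch_perm_test are hypothetical, n1 and n2 are taken to be the row counts of the two groups, the pooled estimate is assumed to be the column mean of the combined data, log is assumed to be the natural logarithm, constant columns (where the denominator is zero) are assumed to contribute nothing, and "more extreme" is read as greater than or equal to the observed statistic.

```r
# Hypothetical helper: thresholded statistic T for two 0/1 matrices
# with observations as rows. Assumes at least two columns (d >= 2).
sketch_statistic <- function(x, y) {
  x <- as.matrix(x) * 1  # coerce logical values to 0/1
  y <- as.matrix(y) * 1
  n1 <- nrow(x)
  n2 <- nrow(y)
  d <- ncol(x)
  p1 <- colMeans(x)                          # group 1 estimates
  p2 <- colMeans(y)                          # group 2 estimates
  p_pool <- (n1 * p1 + n2 * p2) / (n1 + n2)  # pooled estimate (assumption)
  D <- (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  D[!is.finite(D)] <- 0  # assumption: constant columns contribute nothing
  a_d <- log(d)^(-2)
  delta <- sqrt(2 * log(a_d * d))            # threshold delta(d)
  sum(D^2 * (abs(D) >= delta))
}

# Hypothetical helper: permutation p-value for the statistic above.
sketch_perm_test <- function(x, y, numPerms = 200) {
  x <- as.matrix(x) * 1
  y <- as.matrix(y) * 1
  pooled <- rbind(x, y)
  n1 <- nrow(x)
  n <- nrow(pooled)
  observed <- sketch_statistic(x, y)
  null_stats <- replicate(numPerms, {
    idx <- sample(n, n1)  # shuffle observations between the two groups
    sketch_statistic(pooled[idx, , drop = FALSE],
                     pooled[-idx, , drop = FALSE])
  })
  list(statistic = observed,
       null.statistics = null_stats,
       pvalue = mean(null_stats >= observed))  # ">=" is an assumption
}

# Usage on simulated data in which both groups share the same probabilities
set.seed(1)
x <- matrix(rbinom(30 * 50, 1, 0.2), nrow = 30)
y <- matrix(rbinom(40 * 50, 1, 0.2), nrow = 40)
res <- sketch_perm_test(x, y, numPerms = 200)
res$pvalue
```

Because only columns with \(|D_j| \ge \delta(d)\) enter the sum, the statistic is zero whenever no column clears the threshold, so the null distribution can contain many exact zeros.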
Warning
As described in the reference below, this method may not perform well (low power) on highly correlated variables.
Also, note that for large values of numPerms, run time may be long.
However, larger values of numPerms produce more accurate estimates
of the p-value.
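The trade-off between numPerms and accuracy can be made concrete with a rough calculation. If the null statistics are treated as independent draws, the estimated p-value behaves like a binomial proportion, so its standard error is approximately \(\sqrt{p(1-p)/\text{numPerms}}\). The snippet below is plain R arithmetic rather than anything provided by the package, and the true p-value of 0.05 is an arbitrary illustrative choice.

```r
# Approximate Monte Carlo standard error of a permutation p-value,
# assuming a true p-value of 0.05 (illustrative value only)
p <- 0.05
numPerms <- c(100, 1000, 5000)
se <- sqrt(p * (1 - p) / numPerms)
data.frame(numPerms = numPerms, std.error = signif(se, 2))
```

Under these assumptions the standard error falls from roughly 0.022 at 100 permutations to roughly 0.0031 at 5000. The smallest nonzero p-value that can be reported is 1/numPerms, so a p-value of 0 from 100 permutations only indicates that the true value is likely below about 0.01.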
See also
Amanda Plunkett & Junyong Park (2017), Two-sample Tests for Sparse High-Dimensional Binary Data, Communications in Statistics - Theory and Methods, 46:22, 11181-11193
Examples
# Binarize the twoNewsGroups dataset:
data(twoNewsGroups)
binData <- list(twoNewsGroups[[1]] > 0, twoNewsGroups[[2]] > 0)
names(binData) <- names(twoNewsGroups)
# Perform the test:
result <- mvbinary.test(binData, numPerms = 100)
result$pvalue
#> [1] 0
# The following are equivalent to the previous test:
result <- mvbinary.test(binData[[1]], binData[[2]], numPerms = 100)
result <- binData |> mvbinary.test(numPerms = 100)