Test two multivariate binary datasets
mvbinary.test.Rd
Performs a two-sample test for two multivariate binary datasets, testing \(H_0:\) the underlying probability vectors are the same vs. \(H_1:\) they are different.
Arguments
- x, y
Matrices (or data frames) containing multiple integer vector observations as rows. x and y must be the same type and dimension. Alternatively, x can be a list of two matrices (or data frames) to be compared. In this case, y is NULL by default.
- numPerms
Number of permutations to use to calculate the p-value. Default value is 5000.
Value
A list containing the computed statistic, a list of statistics (null.statistics) used to construct the null distribution (from the permutation method), and the associated pvalue. The pvalue is the percent of null.statistics that are more extreme than the statistic computed from the original dataset.
Details
The statistic is \(T = \sum_{j=1}^d D_j^2 I(|D_j| \ge \delta(d))\) where \(d\) is the dimension of the data. Additionally:
\(D_j = (\hat{p}_{1j} - \hat{p}_{2j})/\sqrt{\hat{p}_j (1 - \hat{p}_j)(1/n_1 + 1/n_2)}\), where \(n_1\) and \(n_2\) are the numbers of observations in the two groups
\(\hat{p}_{cj}\) is the estimate of \(p_{cj}\) for the \(c^{th}\) group, calculated as the \(j^{th}\) column mean
\(\hat{p}_j\) is the pooled estimate for the \(j^{th}\) variable.
\(\delta(d) = \sqrt{2 \log (a_d d)}\) where \(a_d = (\log d)^{-2}\)
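The definitions above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the package's implementation: `sketch_statistic` is a hypothetical helper name, the pooled estimate is taken to be the column mean of the stacked data, columns with zero pooled variance are assumed to contribute \(D_j = 0\), and \(d \ge 2\) is required so that \(a_d\) is finite.

```r
# Hypothetical helper (not part of the package): computes T for two
# matrices of 0/1 (or logical) observations with the same number of columns.
sketch_statistic <- function(x, y) {
  x <- as.matrix(x) * 1            # coerce logical entries to numeric
  y <- as.matrix(y) * 1
  n1 <- nrow(x)
  n2 <- nrow(y)
  d  <- ncol(x)
  stopifnot(ncol(y) == d, d >= 2)

  p1     <- colMeans(x)            # \hat{p}_{1j}
  p2     <- colMeans(y)            # \hat{p}_{2j}
  p_pool <- colMeans(rbind(x, y))  # \hat{p}_j (assumed pooled estimate)

  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  # Assumption: a column with zero pooled variance contributes D_j = 0.
  D <- ifelse(se > 0, (p1 - p2) / se, 0)

  a_d   <- log(d)^(-2)
  delta <- sqrt(2 * log(a_d * d))  # threshold delta(d)

  sum(D^2 * (abs(D) >= delta))     # T: thresholded sum of squares
}

# Toy input: the groups differ only in the first column.
toy_x <- matrix(c(1, 1, 1, 1,  0, 1, 0, 1), ncol = 2)
toy_y <- matrix(c(0, 0, 0, 0,  0, 1, 0, 1), ncol = 2)
toy_T <- sketch_statistic(toy_x, toy_y)
```

In this toy input only the first column differs between the groups, so it is the only column that can pass the threshold and contribute to \(T\).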
The p-value associated with the statistic is calculated using the permutation method. The observation vectors are repeatedly shuffled between groups, and the statistic is recalculated after each shuffle. The resulting statistics form the null distribution, which is used to calculate the p-value.
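The permutation step can be sketched generically as follows. This is a minimal illustration, not the package's code: `sketch_perm_pvalue` and `stat_fn` are hypothetical names, the stand-in statistic in the demo is a simple sum of squared differences in column means rather than \(T\), "more extreme" is assumed to mean strictly greater, and the p-value is reported here as a proportion between 0 and 1.

```r
# Hypothetical helper (not part of the package): permutation p-value for
# any two-sample statistic stat_fn(x, y) where larger means more extreme.
sketch_perm_pvalue <- function(x, y, stat_fn, numPerms = 5000) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  pooled <- rbind(x, y)
  n1 <- nrow(x)
  n  <- nrow(pooled)

  observed <- stat_fn(x, y)

  # Reassign rows to the two groups at random and recompute the statistic.
  null_stats <- replicate(numPerms, {
    idx <- sample.int(n, n1)
    stat_fn(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE])
  })

  list(statistic       = observed,
       null.statistics = null_stats,
       # Assumption: "more extreme" is read as strictly greater.
       pvalue          = mean(null_stats > observed))
}

# Demo with a stand-in statistic and simulated binary data.
set.seed(1)
sim_x <- matrix(rbinom(200, 1, 0.2), ncol = 10)
sim_y <- matrix(rbinom(200, 1, 0.6), ncol = 10)
mean_diff_stat <- function(a, b) sum((colMeans(a) - colMeans(b))^2)
sim_result <- sketch_perm_pvalue(sim_x, sim_y, mean_diff_stat, numPerms = 200)
```

With this estimator the smallest nonzero p-value is one over the number of permutations, which is one reason a small numPerms can return exactly 0.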
Warning
As described in the reference below, this method may not perform well (low power) on highly correlated variables.
Also, note that for large values of numPerms, run time may be long. However, larger values of numPerms produce more accurate estimates of the p-value.
See also
Amanda Plunkett & Junyong Park (2017), Two-sample Tests for Sparse High-Dimensional Binary Data, Communications in Statistics - Theory and Methods, 46:22, 11181-11193
Examples
# Binarize the twoNewsGroups dataset:
data(twoNewsGroups)
binData <- list(twoNewsGroups[[1]] > 0, twoNewsGroups[[2]] > 0)
names(binData) <- names(twoNewsGroups)
# Perform the test:
result <- mvbinary.test(binData, numPerms = 100)
result$pvalue
#> [1] 0
# The following are equivalent to the previous test:
result <- mvbinary.test(binData[[1]], binData[[2]], numPerms = 100)
result <- binData |> mvbinary.test(numPerms = 100)