Title: | Fast, Exact Bootstrap Principal Component Analysis for High Dimensional Data |
---|---|
Description: | Implements fast, exact bootstrap Principal Component Analysis and Singular Value Decompositions for high dimensional data, as described in <doi:10.1080/01621459.2015.1062383> (see also <arXiv:1405.0922> ). For data matrices that are too large to operate on in memory, users can input objects with class 'ff' (see the 'ff' package), where the actual data is stored on disk. In response, this package will implement a block matrix algebra procedure for calculating the principal components (PCs) and bootstrap PCs. Depending on options set by the user, the 'parallel' package can be used to parallelize the calculation of the bootstrap PCs. |
Authors: | Aaron Fisher <[email protected]> |
Maintainer: | Aaron Fisher <[email protected]> |
License: | GPL-2 |
Version: | 1.1 |
Built: | 2025-02-06 03:55:34 UTC |
Source: | https://github.com/cran/bootSVD |
Let B be the number of bootstrap samples, indexed by b = 1, 2, ..., B. As2Vs is a simple function that converts the list of principal component (PC) matrices for the bootstrap scores into a list of principal component matrices on the original high dimensional space. Both of these lists, the input and the output of As2Vs, are indexed by b.
As2Vs(AsByB, V, pattern = NULL, ...)
AsByB |
a list of the PC matrices for each bootstrap sample, indexed by b |
V |
a tall (p by n) matrix containing the PCs of the original sample |
pattern |
if V has class ff, the returned PC matrices will also have class ff; pattern is passed to ff when creating the files for the output |
... |
passed to mclapply |
a B-length list of (p by K) PC matrices on the original sample coordinate space (denoted here as V^b). This is achieved by the matrix multiplication V^b = V A^b. Note that here, V^b denotes the b^th bootstrap PC matrix, not V raised to the power b. This notation is the same as the notation used in (Fisher et al., 2014).
Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922
#use small n, small B, for a quick illustration
set.seed(0)
Y<-simEEG(n=100, centered=TRUE, wide=TRUE)
svdY<-fastSVD(Y)
DUt<- tcrossprod(diag(svdY$d),svdY$u)
bInds<-genBootIndeces(B=50,n=dim(DUt)[2])
bootSVD_LD_output<-bootSVD_LD(DUt=DUt,bInds=bInds,K=3,verbose=interactive())
Vs<-As2Vs(As=bootSVD_LD_output$As,V=svdY$v)
# Yields the high dimensional bootstrap PCs (left singular
# vectors of the bootstrap sample Y),
# indexed by b = 1,2...B, where B is the number of bootstrap samples
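The relationship between the input and output lists can be checked directly; a minimal sketch continuing the example above:

# Each high dimensional PC matrix is just V times the corresponding
# low dimensional PC matrix (V^b = V A^b); verify for the first draw
all.equal(Vs[[1]], svdY$v %*% bootSVD_LD_output$As[[1]])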
All arguments are passed to bootSVD. This function should be used in exactly the same way as bootSVD. The only difference is that PCA typically involves re-centering each bootstrap sample, whereas calculations involving the SVD might not.
bootPCA(...)
... |
passed to bootSVD |
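A minimal usage sketch (the arguments shown are simply forwarded to bootSVD):

Y <- simEEG(n=100, centered=TRUE, wide=TRUE)
pcs <- bootPCA(Y=Y, K=2, B=50)   # same output structure as bootSVD
names(pcs)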
bootSVD(...)
Applies fast bootstrap PCA, using the method from (Fisher et al., 2014). The dimension of the sample is denoted by p, and the sample size is denoted by n, with p typically much larger than n.
bootSVD(Y = NULL, K, V = NULL, d = NULL, U = NULL, B = 50, output = "HD_moments", verbose = getOption("verbose"), bInds = NULL, percentiles = c(0.025, 0.975), centerSamples = TRUE, pattern_V = "V_", pattern_Vb = "Vb_")
Y |
initial data sample, which can be either a standard matrix or a matrix of class ff |
K |
number of PCs to calculate the bootstrap distribution for. |
V |
(optional) the (p by n) matrix of p-dimensional PCs of the original data sample |
d |
(optional) the vector of singular values of the original data sample |
U |
(optional) the (n by n) matrix of n-dimensional singular vectors of the original data sample |
B |
number of bootstrap samples to compute. |
output |
a vector telling which descriptions of the bootstrap distribution should be calculated. Can include any of the following: 'initial_SVD', 'HD_moments', 'full_HD_PC_dist', and 'HD_percentiles'. See below for explanations of these outputs. |
verbose |
if TRUE, the function will print progress during the calculation. |
bInds |
a (B by n) matrix of bootstrap indices, with rows indexed by b; if not supplied, it is generated by genBootIndeces |
percentiles |
a vector containing percentiles to be used to calculate element-wise percentiles across the bootstrap distribution (both across the distribution of the high dimensional components and across the distribution of the low dimensional components) |
centerSamples |
whether each bootstrap sample should be centered before calculating the SVD. |
pattern_V |
if Y has class ff, the PCs of the original sample are also saved as ff objects; pattern_V is passed to ff when creating the corresponding files |
pattern_Vb |
if Y has class ff, the high dimensional bootstrap PCs are also saved as ff objects; pattern_Vb is passed to ff when creating the corresponding files |
Users might also consider changing the global options ffbatchbytes, from the ff package, and mc.cores, from the parallel package. When ff objects are entered as arguments for bootSVD, the required matrix algebra is done using block matrix algebra. The ffbatchbytes option determines the size of the largest block matrix that will be held in memory at any one time. The mc.cores option (set to 1 by default) determines the level of parallelization to use when calculating the high dimensional distribution of the bootstrap PCs (see mclapply).
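For example, these options can be set before calling bootSVD (the values below are purely illustrative):

options(ffbatchbytes = 16777216)  # hold at most ~16 MB per 'ff' block in memory
options(mc.cores = 2)             # let mclapply() use 2 cores for the bootstrap PCs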
bootSVD returns a list that can include any of the following elements, depending on what is specified in the output argument:

initial_SVD: The singular value decomposition of the centered, original data matrix. initial_SVD is a list containing V, the matrix of p-dimensional principal components, d, the vector of singular values of Y, and U, the matrix of n-dimensional singular vectors of Y.
HD_moments: A list containing the bootstrap expected value (EPCs), element-wise bootstrap variance (varPCs), and element-wise bootstrap standard deviation (sdPCs) for each of the p-dimensional PCs. Each of these three elements of HD_moments is also a list, which contains K vectors, one for each PC. HD_moments also contains momentCI, a K-length list of (p by 2) matrices containing element-wise moment based confidence intervals for the PCs.
full_HD_PC_dist: A B-length list of matrices (or ff matrices), with the b^th list element equal to the (p by K) matrix of high dimensional PCs for the b^th bootstrap sample. For especially high dimensional cases, when the output is returned as ff matrices, caution should be used if requesting 'full_HD_PC_dist' due to potential storage limitations. To reindex these PCs by k (the PC index) as opposed to b (the bootstrap index), see reindexMatricesByK. Again though, caution should be used when reindexing PCs stored as ff objects, as this will double the number of files stored.
HD_percentiles: A list of K matrices, each of dimension (p by q), where q is the number of percentiles requested (i.e. q = length(percentiles)). The k^th matrix in HD_percentiles contains element-wise percentiles for the k^th, p-dimensional PC.
In addition, the following results are always included in the output, regardless of what is specified in the output argument:
full_LD_PC_dist |
A B-length list of matrices, with the b^th list element equal to the (n by K) matrix of low dimensional PCs for the b^th bootstrap sample |
d_dist |
A B-length list of vectors, with the b^th element containing the singular values of the b^th bootstrap sample |
U_dist |
A B-length list of matrices, with the b^th element containing the n-dimensional singular vectors of the b^th bootstrap sample |
LD_moments |
A list that is comparable to HD_moments, but for the low dimensional (n-dimensional) PCs |
LD_percentiles |
A list of K matrices, comparable to HD_percentiles, but for the n-dimensional PCs |
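The elements actually returned by a given call can be checked with names(); a minimal sketch (using simEEG, as in the example below):

Y <- simEEG(n=100, centered=TRUE, wide=TRUE)
b <- bootSVD(Y, B=50, K=2, output=c('initial_SVD','HD_moments'))
names(b)  # includes LD_moments, d_dist, U_dist, etc., even when not requested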
Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922
#use small n, small B, for a quick illustration
set.seed(0)
Y<-simEEG(n=100, centered=TRUE, wide=TRUE)
b<-bootSVD(Y, B=50, K=2, output=
 c('initial_SVD', 'HD_moments', 'full_HD_PC_dist', 'HD_percentiles'),
 verbose=interactive())
b

#explore results
matplot(b$initial_SVD$V[,1:4],type='l',main='Fitted PCs',lty=1)
legend('bottomright',paste0('PC',1:4),col=1:4,lty=1,lwd=2)

######################
# look specifically at 2nd PC
k<-2

######
#looking at HD variability
#plot several draws from bootstrap distribution
VsByK<-reindexMatricesByK(b$full_HD_PC_dist)
matplot(t(VsByK[[k]][1:20,]),type='l',lty=1,
 main=paste0('20 Draws from bootstrap\ndistribution of HD PC ',k))

#plot pointwise CIs
matplot(b$HD_moments$momentCI[[k]],type='l',col='blue',lty=1,
 main=paste0('CIs For HD PC ',k))
matlines(b$HD_percentiles[[k]],type='l',col='darkgreen',lty=1)
lines(b$initial_SVD$V[,k])
legend('topright',c('Fitted PC','Moment CIs','Percentile CIs'),
 lty=1,col=c('black','blue','darkgreen'))
abline(h=0,lty=2,col='darkgrey')

######
# looking at LD variability
# plot several draws from bootstrap distribution
AsByK<-reindexMatricesByK(b$full_LD_PC_dist)
matplot(t(AsByK[[k]][1:50,]),type='l',lty=1,
 main=paste0('50 Draws from bootstrap\ndistribution of LD PC ',k),
 xlim=c(1,10),xlab='PC index (truncated)')

# plot pointwise CIs
matplot(b$LD_moments$momentCI[[k]],type='o',col='blue',
 lty=1,main=paste0('CIs For LD PC ',k),xlim=c(1,10),
 xlab='PC index (truncated)',pch=1)
matlines(b$LD_percentiles[[k]],type='o',pch=1,col='darkgreen',lty=1)
abline(h=0,lty=2,col='darkgrey')
legend('topright',c('Moment CIs','Percentile CIs'),lty=1,
 pch=1,col=c('blue','darkgreen'))
#Note: variability is mostly due to rotations with the third and fourth PC.

# Bootstrap eigenvalue distribution
dsByK<-reindexVectorsByK(b$d_dist)
boxplot(dsByK[[k]]^2,main=paste0('Covariance Matrix Eigenvalue ',k),
 ylab='Bootstrap Distribution',
 ylim=range(c(dsByK[[k]]^2,b$initial_SVD$d[k]^2)))
points(b$initial_SVD$d[k]^2,pch=18,col='red')
legend('bottomright','Sample Value',pch=18,col='red')

##################
#Example with ff input
library(ff)
Yff<-as.ff(Y, pattern='Y_')

# If desired, change options in 'ff' package to
# adjust the size of matrix blocks held in RAM.
# For example:
# options('ffbatchbytes'=100000)

ff_dir<-tempdir()
pattern_V <- paste0(ff_dir,'/V_')
pattern_Vb <- paste0(ff_dir,'/Vb_')
bff <- bootSVD(Yff, B=50, K=2, output=
 c('initial_SVD', 'HD_moments', 'full_HD_PC_dist', 'HD_percentiles'),
 pattern_V= pattern_V, pattern_Vb=pattern_Vb, verbose=interactive())

# Note that elements of full_HD_PC_dist and initial_SVD
# have class 'ff'
str(lapply(bff,function(x) class(x[[1]])))

#Show some results of bootstrap draws
plot(bff$full_HD_PC_dist[[1]][,k],type='l')

#Reindexing by K will create a new set of ff files.
VsByKff<-reindexMatricesByK(bff$full_HD_PC_dist,
 pattern=paste0(ff_dir,'/Vk_'))
physical(bff$full_HD_PC_dist[[1]])$filename
physical(VsByKff[[1]])$filename
matplot(t(VsByKff[[k]][1:10,]),type='l',lty=1,
 main=paste0('Bootstrap Distribution of PC',k))

# Saving and moving results:
saveRDS(bff,file=paste0(ff_dir,'/bff.rds'))
close(bff$initial_SVD$V)
physical(bff$initial_SVD$V)$filename

# If the 'ff' files on disk are moved or renamed,
# this filename attribute can be changed:
old_ff_path <- physical(bff$initial_SVD$V)$filename
new_ff_path <- paste0(tempdir(),'/new_V_file.ff')
file.rename(from= old_ff_path, to= new_ff_path)
physical(bff$initial_SVD$V)$filename <- new_ff_path
matplot(bff$initial_SVD$V[,1:4],type='l',lty=1)
bootSVD_LD: calculate the bootstrap distribution of the n-dimensional PCs
Calculates the bootstrap distribution of the principal components (PCs) of a low dimensional matrix. If the score matrix is inputted, the output of bootSVD_LD can be used to calculate bootstrap standard errors, confidence regions, or the full bootstrap distribution of the high dimensional components. Most users may want to instead consider using bootSVD, which also calculates descriptions of the high dimensional components. Note that bootSVD calls bootSVD_LD.
bootSVD_LD(UD, DUt = t(UD), bInds = genBootIndeces(B = 1000, n = dim(DUt)[2]), K, warning_type = "silent", verbose = getOption("verbose"), centerSamples = TRUE)
UD |
(optional) the (n by n) matrix of scores (UD); if supplied, DUt defaults to t(UD) |
DUt |
the transpose of the score matrix UD |
bInds |
a (B by n) matrix of bootstrap indices; by default it is generated by genBootIndeces |
K |
the number of PCs to be estimated. |
warning_type |
passed to qrSVD |
verbose |
if TRUE, the function will print progress during the calculation |
centerSamples |
whether each bootstrap sample should be centered before calculating the SVD. |
For each bootstrap matrix of scores, let A^b D^b (U^b)' denote its SVD, where A^b and U^b are (n by n) orthonormal matrices, and D^b is an (n by n) diagonal matrix. Here we calculate only the first K columns of A^b, but all n columns of U^b. The results are stored as a list containing
As |
a B-length list of the (n by K) bootstrap PC matrices A^b |
ds |
a B-length list of vectors, each containing the singular values of one bootstrap sample |
Us |
a B-length list of the (n by n) matrices U^b |
time |
The computation time required for the procedure, taken using system.time |
If the score matrix is inputted to bootSVD_LD, the results can be transformed to get the PCs on the original space by multiplying each A^b matrix by the PCs of the original sample, V (see As2Vs). The bootstrap scores of the original sample are equal to U^b D^b.
#use small n, small B, for a quick illustration
set.seed(0)
Y<-simEEG(n=100, centered=TRUE, wide=TRUE)
svdY<-fastSVD(Y)
DUt<- tcrossprod(diag(svdY$d),svdY$u)
bInds<-genBootIndeces(B=50,n=dim(DUt)[2])
bootSVD_LD_output<-bootSVD_LD(DUt=DUt,bInds=bInds,K=3,verbose=interactive())
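The structure of the result can then be inspected; for instance:

length(bootSVD_LD_output$As)    # B = 50 low dimensional PC matrices
dim(bootSVD_LD_output$As[[1]])  # n by K, here 100 by 3
bootSVD_LD_output$time          # computation time for the procedure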
This package is based on (Fisher et al., 2014), which uses as an example a subset of the electroencephalogram (EEG) measurements from the Sleep Heart Health Study (SHHS) (Quan et al. 1997). Since we cannot publish the EEG recordings from SHHS participants in this package, we instead include the summary statistics of the PCs from our subsample of the processed SHHS EEG data. These summary statistics were generated from measurements of smoothed Normalized Delta Power. This data is used by the simEEG function to simulate data examples to demonstrate our functions.
Specifically, EEG_leadingV
is a matrix whose columns contain the leading 5 principal components of the EEG dataset.
Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922
Stuart F Quan, Barbara V Howard, Conrad Iber, James P Kiley, F Javier Nieto, George T O'Connor, David M Rapoport, Susan Redline, John Robbins, JM Samet, et al. The sleep heart health study: design, rationale, and methods. Sleep, 20(12):1077-1085, 1997.
This package is based on (Fisher et al., 2014), which uses as an example a subset of the electroencephalogram (EEG) measurements from the Sleep Heart Health Study (SHHS) (Quan et al. 1997). Since we cannot publish the EEG recordings from SHHS participants in this package, we instead include the summary statistics of the PCs from our subsample of the processed SHHS EEG data. These summary statistics were generated from measurements of smoothed Normalized Delta Power. This data is used by the simEEG function to simulate data examples to demonstrate our functions.
Specifically, EEG_mu
is a vector containing the mean normalized delta power function across all subjects, for the first 7.5 hours of sleep.
Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922
Stuart F Quan, Barbara V Howard, Conrad Iber, James P Kiley, F Javier Nieto, George T O'Connor, David M Rapoport, Susan Redline, John Robbins, JM Samet, et al. The sleep heart health study: design, rationale, and methods. Sleep, 20(12):1077-1085, 1997.
This package is based on (Fisher et al., 2014), which uses as an example a subset of the electroencephalogram (EEG) measurements from the Sleep Heart Health Study (SHHS) (Quan et al. 1997). Since we cannot publish the EEG recordings from SHHS participants in this package, we instead include the summary statistics of the PCs from our subsample of the processed SHHS EEG data. These summary statistics were generated from measurements of smoothed Normalized Delta Power. This data is used by the simEEG function to simulate data examples to demonstrate our functions.
Specifically, EEG_score_var is a vector containing the variances of the first 5 empirical score variables. Here, the score variables refer to the uncorrelated variables whose coordinate vectors are the principal components in EEG_leadingV.
Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922
Stuart F Quan, Barbara V Howard, Conrad Iber, James P Kiley, F Javier Nieto, George T O'Connor, David M Rapoport, Susan Redline, John Robbins, JM Samet, et al. The sleep heart health study: design, rationale, and methods. Sleep, 20(12):1077-1085, 1997.
fastSVD
uses the inherent low dimensionality of a wide, or tall, matrix to quickly calculate its SVD. For a matrix A, this function solves svd(A) = UDV'. This function can be applied to either standard matrices or, when the data is too large to be stored in memory, to matrices with class ff. ff objects have a representation in memory, but store their contents on disk. In these cases, fastSVD will implement block matrix algebra to compute the SVD.
fastSVD(A, nv = min(dim(A)), warning_type = "silent", center_A = FALSE, pattern = NULL)
A |
matrix of dimension (m by n); can be either a standard matrix or a matrix with class ff |
nv |
number of high dimensional singular vectors to obtain. If nv is less than the rank of A, only the leading nv high dimensional singular vectors will be calculated |
warning_type |
passed to qrSVD |
center_A |
Whether the matrix A should be centered before taking the SVD |
pattern |
passed to ff. If A has class ff, the high dimensional singular vectors will also be saved as ff objects, with file names determined by pattern |
Users might also consider changing the global option ffbatchbytes, from the ff package. When an ff object is entered, the ffbatchbytes option determines the maximum block size in the block matrix algebra used to calculate the SVD.
Let r be the rank of the matrix A. fastSVD solves svd(A) = UDV', where U is an (m by r) orthonormal matrix, D is an (r by r) diagonal matrix, and V is an (n by r) orthonormal matrix. When A is entered as an ff object, the high dimensional singular vectors of A will be returned as an ff object as well. For matrices where one dimension is substantially larger than the other, calculation times are considerably faster than with the standard svd function.
Y<-simEEG(n=100,centered=TRUE,wide=TRUE)
svdY<-fastSVD(Y)
svdY
matplot(svdY$v[,1:5],type='l',lty=1)
#sample PCs for a wide matrix are the right singular vectors
#Note: For a tall, demeaned matrix Y, with columns corresponding
#to subjects and rows to measurements,
#the PCs are the high dimensional left singular vectors.

#Example with 'ff'
dev.off()
library(ff)
Yff<-as.ff(Y)
svdYff<-fastSVD(Yff)
svdYff
matplot(svdYff$v[,1:5],type='l',lty=1)
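As a quick check of the decomposition, the original matrix can be reconstructed from the returned factors; a minimal sketch continuing the example above:

# Y should be recovered (up to numerical error) as U D V'
range(Y - svdY$u %*% diag(svdY$d) %*% t(svdY$v))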
A function for crossprod(x,y), for tcrossprod(x,y), or for regular matrix multiplication, that is compatible with ff matrices. Multiplication is done without creating new matrices for the transposes of x or y. Note that the crossprod function cannot be applied directly to objects with class ff.
ffmatrixmult(x, y = NULL, xt = FALSE, yt = FALSE, ram.output = FALSE, override.big.error = FALSE, ...)
x |
a matrix or ff_matrix |
y |
a matrix or ff_matrix. If NULL, this is set equal to x, although a second copy of the matrix x is not actually stored. |
xt |
should the x matrix be transposed before multiplying |
yt |
should the y matrix be transposed before multiplying (e.g. yt=TRUE yields tcrossprod(x,y)) |
ram.output |
force output to be a normal matrix, as opposed to an object with class ff |
override.big.error |
If the dimension of the final output matrix is especially large, ffmatrixmult will stop with an error, to avoid accidentally creating an extremely large matrix; setting override.big.error=TRUE removes this check |
... |
passed to ff |
A standard matrix, or a matrix with class ff if one of the input matrices has class ff.
## Not run: 
library(ff)
#Tall data
y_tall<-matrix(rnorm(5000),500,10) #y tall
x_tall<-matrix(rnorm(5000),500,10)
y_wide<-t(y_tall)
x_wide<-t(x_tall)
y_tall_ff<-as.ff(y_tall) #y tall and ff
x_tall_ff<-as.ff(x_tall)
y_wide_ff<-as.ff(y_wide) #y wide and ff
x_wide_ff<-as.ff(x_wide)

#Set options to ensure that block matrix algebra is actually done,
#and the entire algebra isn't just done in one step.
#Compare ffmatrixmult against output from standard methods
options('ffbytesize'=100)

#small final matrices
#x'x
range( crossprod(x_tall) - ffmatrixmult(x_tall_ff, xt=TRUE) )
range( tcrossprod(x_wide) - ffmatrixmult(x_wide_ff, yt=TRUE) )
range( crossprod(x_tall,y_tall) - ffmatrixmult(x_tall_ff,y_tall_ff, xt=TRUE) )
range( tcrossprod(x_wide,y_wide) - ffmatrixmult(x_wide_ff,y_wide_ff, yt=TRUE) )
range( (x_wide%*%y_tall) - ffmatrixmult(x_wide_ff,y_tall_ff) )

#ff + small data
s_tall <- matrix(rnorm(80),10,8)
s_wide <- matrix(rnorm(80),8,10)

#tall output
range( crossprod(x_wide, s_tall) - ffmatrixmult(x_wide_ff, s_tall,xt=TRUE)[] )
range( tcrossprod(x_tall, s_wide) - ffmatrixmult(x_tall_ff, s_wide,yt=TRUE)[] )
range( x_tall%*%s_tall - ffmatrixmult(x_tall_ff, s_tall)[])

#Wide output
range( crossprod(s_tall, y_wide) - ffmatrixmult( s_tall, y_wide_ff,xt=TRUE)[] )
range( tcrossprod(s_wide, y_tall) - ffmatrixmult( s_wide,y_tall_ff,yt=TRUE)[] )
range( s_wide%*%y_wide - ffmatrixmult(s_wide,y_wide_ff)[])

#Reset options for more practical use
options('ffbytesize'=16777216)

## End(Not run)
Let n be the original sample size, p be the number of measurements per subject, and B be the number of bootstrap samples. genBootIndeces generates a (B by n) matrix containing B indexing vectors that can be used to create B bootstrap samples, each of size n.
genBootIndeces(B, n)
B |
number of desired bootstrap samples |
n |
size of the original sample from which we'll be resampling. |
A (B by n) matrix of bootstrap indices. Let bInds denote the output of genBootIndeces, and Y denote the original (p by n) sample. Then Y[,bInds[b,]] is the b^th bootstrap sample.
bInds<-genBootIndeces(B=50,n=200)
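For example, the indexing vectors can be used to resample the columns of a (p by n) data matrix; a minimal sketch:

Y <- simEEG(n=200, centered=TRUE, wide=FALSE)  # p by n, columns are subjects
bInds <- genBootIndeces(B=50, n=200)
Yb <- Y[, bInds[1,]]  # first bootstrap sample, formed by resampling columns
dim(Yb)               # same dimension as Y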
genQ
generates a square matrix of random normal noise, and then takes the QR decomposition to return Q, a random orthogonal square matrix.
genQ(n, lim_attempts = 200)
n |
the dimension of the desired random orthonormal matrix |
lim_attempts |
the random matrix of normal noise must be full rank to generate the appropriate QR decomposition; lim_attempts gives the maximum number of attempts at generating such a matrix. |
a random orthonormal (n by n) matrix
A<-genQ(3)
round(crossprod(A),digits=10)
Let K be the number of PCs of interest, let B be the number of bootstrap samples, and let p be the number of measurements per subject, also known as the dimension of the sample. In general, we use k to refer to the principal component (PC) index, where k = 1, 2, ..., K, and use b to refer to the bootstrap index, where b = 1, 2, ..., B.
getMomentsAndMomentCI(AsByK, V, K = length(AsByK), verbose = FALSE)
AsByK |
a list of the bootstrap PC matrices. This list should be indexed by k, rather than by the bootstrap index b (see reindexMatricesByK) |
V |
a (p by n) matrix containing the PCs of the original sample |
K |
the number of leading PCs for which moments and confidence intervals should be obtained. |
verbose |
setting to TRUE will cause the function to print its progress |
a list containing
EVs |
a list containing element-wise bootstrap means for each of the K fitted PCs, indexed by k |
varVs |
a list containing element-wise bootstrap variances for each of the K fitted PCs, indexed by k |
sdVs |
a list containing element-wise bootstrap standard errors for each of the K fitted PCs, indexed by k |
momentCI |
a K-length list of (p by 2) matrices containing element-wise moment-based confidence intervals for the PCs |
#use small n, small B, for a quick illustration
set.seed(0)
Y<-simEEG(n=100, centered=TRUE, wide=TRUE)
svdY<-fastSVD(Y)
V<-svdY$v #right singular vectors of the wide matrix Y
DUt<- tcrossprod(diag(svdY$d),svdY$u)
bInds<-genBootIndeces(B=50,n=dim(DUt)[2])
bootSVD_LD_output<-bootSVD_LD(DUt=DUt,bInds=bInds,K=3,verbose=interactive())

AsByB<-bootSVD_LD_output$As
AsByK<-reindexMatricesByK(AsByB)
moments<-getMomentsAndMomentCI(AsByK,V,verbose=interactive())
plot(V[,1],type='l',ylim=c(-.1,.1),main='Original PC1, with CI in blue')
matlines(moments$momentCI[[1]],col='blue',lty=1)

#Can also use this function to get moments for low dimensional
#vectors A^b[,k], by setting V to the identity matrix.
moments_A<- getMomentsAndMomentCI(As=AsByK,V=diag(ncol(AsByK[[1]])))
Quickly print an R object's size
os(x, units = "Mb")
x |
an object of interest |
units |
measure to print size in |
print(object.size(x),units=units)
Y<-simEEG(n=50)
os(Y)
Wrapper for svd, which uses random preconditioning to restart when svd fails to converge.
In order to generate the SVD of the matrix x, qrSVD calls genQ to generate a random orthonormal matrix, and uses this random matrix to precondition x. The SVD of the preconditioned matrix is calculated, and adjusted to account for the preconditioning process in order to find svd(x).
qrSVD(x, lim_attempts = 50, warning_type = "silent", warning_file = "qrSVD_warnings.txt", ...)
x |
a matrix to calculate the svd for |
lim_attempts |
the number of tries to randomly precondition x. We generally find that one preconditioning attempt is sufficient. |
warning_type |
controls whether the user should be told if an orthogonal preconditioning matrix is required, or if svd still fails after the maximum number of preconditioning attempts |
warning_file |
gives the location of a file to print warnings to, if warning_type is set to 'file' |
... |
parameters passed to svd |
Solves svd(x) = UDV', where U is a matrix containing the left singular vectors of x, D is a diagonal matrix containing the singular values of x, and V is a matrix containing the right singular vectors of x (output follows the same notation convention as the svd function).
qrSVD will attempt the standard svd function before preconditioning the matrix x.
x <-matrix(rnorm(3*5),nrow=3,ncol=5)
svdx <- qrSVD(x)
svdx
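When svd converges normally, qrSVD returns the same factorization; a minimal sketch comparing the two, continuing the example above:

# Singular values should agree up to numerical error
range(svdx$d - svd(x)$d)
# The reconstruction of x from the qrSVD factors should also be exact
range(x - svdx$u %*% diag(svdx$d) %*% t(svdx$v))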
Re-index a list of matrices to be organized by PC index (k) rather than bootstrap index (b).
This function is used as a precursor step for calculating bootstrap standard errors or percentiles. For very high dimensional data, we recommend that this function be applied to the low dimensional components A^b, but the function can also be used to reorder a list of high dimensional bootstrap PCs. It can equivalently be used to reorder a list of scores. In general, we recommend that as many operations as possible be applied to the low dimensional components, as opposed to their high dimensional counterparts. This function is called by getMomentsAndMomentCI.
reindexMatricesByK(matricesByB, pattern)
matricesByB |
a B-length list of matrices, indexed by the bootstrap sample b; the b^th element contains the first K PCs from the b^th bootstrap sample |
pattern |
(optional) passed to ff. If the elements of matricesByB have class ff, pattern determines the file names used when creating the new, re-indexed ff matrices |
a K-length list of matrices, each with B rows; the k^th matrix contains the bootstrap draws of the k^th PC, one draw per row. If elements of matricesByB have class ff, then the returned, reordered matrices will also have class ff.
#use small n, small B, for a quick illustration
set.seed(0)
Y<-simEEG(n=100, centered=TRUE, wide=TRUE)
svdY<-fastSVD(Y)
V<- svdY$v #original sample PCs
DUt<- tcrossprod(diag(svdY$d),svdY$u)
bInds<-genBootIndeces(B=50,n=dim(DUt)[2])
bootSVD_LD_output<-bootSVD_LD(DUt=DUt,bInds=bInds,K=3,verbose=interactive())

########
# to get 'low dimensional PC' moments and lower percentiles
AsByB<-bootSVD_LD_output$As
AsByK<-reindexMatricesByK(AsByB)
meanA1<- apply(AsByK[[1]],2,mean)
seA1<- apply(AsByK[[1]],2,sd)
pA1<- apply(AsByK[[1]],2,function(x) quantile(x,.05))
#can also use lapply to get a list (indexed by k=1,...K) of
#the means, standard errors, or percentiles for each PC.
#See example below, for high dimensional bootstrap PCs.

#Alternatively, moments can be calculated with
seA1_v2<- getMomentsAndMomentCI(As=AsByK,
 V=diag(dim(AsByK[[1]])[2]))$sdPCs[[1]]
all(seA1_v2==seA1)

#Additional examples of exploring the low dimensional bootstrap
#PC distribution are given in the documentation for
#the 'bootSVD' function.
#########

#########
#High dimensional percentiles for each PC
VsByB<-As2Vs(As=AsByB,V=V)
VsByK<-reindexMatricesByK(VsByB)
percentileCI_Vs<-lapply(VsByK,function(mat_k){
 apply(mat_k,2,function(x) quantile(x,c(.025,.975)))
 })
k=2 # the 2nd PC is a little more interesting here.
matplot(t(percentileCI_Vs[[k]]),type='l',lty=1,col='blue')
lines(V[,k])

########
# Note: This function can also be used to reorganize the
#   high dimensional PCs. For 'ff' matrices, this will
#   create a new set of files on disk.
Re-index a list of vectors to be organized by PC index (k) rather than bootstrap index (b).
Used to study the bootstrap distribution of the k^th singular value, by re-indexing the list of vectors so that it is organized by PC index (k) rather than bootstrap index (b).
reindexVectorsByK(vectorsByB)
vectorsByB |
a B-length list of vectors, indexed by the bootstrap sample b (e.g. the vectors of singular values from each bootstrap sample) |
a K-length list of matrices with B rows, where each matrix's rows refer to the values from a different bootstrap sample.
#use small n, small B, for a quick illustration
set.seed(0)
Y<-simEEG(n=100, centered=TRUE, wide=TRUE)
svdY<-fastSVD(Y)
DUt<- tcrossprod(diag(svdY$d),svdY$u)
bInds<-genBootIndeces(B=50,n=dim(DUt)[2])
bootSVD_LD_output<-bootSVD_LD(DUt=DUt,bInds=bInds,K=3,verbose=interactive())

dsByK<-reindexVectorsByK(bootSVD_LD_output$ds)

boxplot(dsByK[[1]],main='Bootstrap distribution of 1st singular value')
Our data from (Fisher et al. 2014) consists of EEG measurements from the Sleep Heart Health Study (SHHS) (Quan et al. 1997). Since we cannot publish the EEG recordings from the individuals in the SHHS, we instead include the summary statistics of the PCs from our subsample of the processed SHHS EEG data. This data is used by the simEEG function to simulate functional data that is approximately similar to the data used in our work. The resulting simulated vectors are always of length 900, and are generated from 5 basis vectors (see EEG_leadingV).
simEEG(n = 100, centered = TRUE, propVarNoise = 0.45, wide = TRUE)
simEEG(n = 100, centered = TRUE, propVarNoise = 0.45, wide = TRUE)
n |
the desired sample size |
centered |
if TRUE, the sample will be centered to have mean zero for each dimension. If FALSE, measurements will be simulated from a population where the mean is equal to that observed in the sample used in (Fisher et al. 2014) (see EEG_mu) |
propVarNoise |
the approximate proportion of total sample variance attributable to random noise. |
wide |
if TRUE, the resulting data is outputted as an (n by p) matrix, with rows corresponding to subjects; if FALSE, it is outputted as a (p by n) matrix, with columns corresponding to subjects |
A matrix containing n simulated measurement vectors of Normalized Delta Power, for the first 7.5 hours of sleep. These vectors are generated according to the equation:

y = sum_{j=1}^{5} B_j v_j + e

where y is the simulated measurement vector for one subject, v_j is the j^th basis vector, B_j is a random normal variable with mean zero, and e is a vector of random normal noise. The specific values for v_j and var(B_j) are determined from the EEG data sample studied in (Fisher et al., 2014), and are respectively equal to the j^th empirical principal component vector (see EEG_leadingV) and the empirical variance of the j^th score variable (see EEG_score_var).
Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922
Stuart F Quan, Barbara V Howard, Conrad Iber, James P Kiley, F Javier Nieto, George T O'Connor, David M Rapoport, Susan Redline, John Robbins, JM Samet, et al. The sleep heart health study: design, rationale, and methods. Sleep, 20(12):1077-1085, 1997.
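For illustration only, the sketch below draws one vector from the generative equation above by hand, assuming EEG_leadingV is a (900 by 5) matrix and EEG_score_var has length 5; the noise variance is chosen informally so that noise makes up roughly 45% of the total variance (in practice, use simEEG itself):

set.seed(1)
scores <- rnorm(5, mean=0, sd=sqrt(EEG_score_var))   # draws of the B_j
noise_var <- (0.45/0.55) * sum(EEG_score_var) / 900  # rough, illustrative choice
y <- drop(EEG_mu + EEG_leadingV %*% scores) + rnorm(900, sd=sqrt(noise_var))
plot(y, type='l', main='One simulated, uncentered EEG vector')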
set.seed(0)

#Low noise example, for an illustration of smoother functions
Y<-simEEG(n=20,centered=FALSE,propVarNoise=.02,wide=FALSE)
matplot(Y,type='l',lty=1)

#Higher noise example, for PCA
Y<-simEEG(n=100,centered=TRUE,propVarNoise=.5,wide=TRUE)
svdY<-fastSVD(Y)
V<-svdY$v #since Y is wide, the PCs are the right singular vectors (svd(Y)$v).
d<-svdY$d
head(cumsum(d^2)/sum(d^2),5) #first 5 PCs explain about half the variation

# Compare fitted PCs to true, generating basis vectors
# Since PCs have arbitrary sign, we match the sign of
# the fitted sample PCs to the population PCs first
V_sign_adj<- array(NA,dim=dim(V))
for(i in 1:5){
 V_sign_adj[,i]<-V[,i] * sign(crossprod(V[,i],EEG_leadingV[,i]))
}
par(mfrow=c(1,2))
matplot(V_sign_adj[,1:5],type='l',lty=1,
 main='PCs from simulated data,\n sign adjusted')
matplot(EEG_leadingV,type='l',lty=1,main='Population PCs')