---
title: 'Testing for the number of interesting components in ICA'
author: "Klaus Nordhausen, Hannu Oja, David E. Tyler, Joni Virta"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: yes
    toc_depth: 2
  rmarkdown::html_vignette:
    fig_caption: yes
    toc: yes
    toc_depth: 2
  pdf_document:
    fig_caption: yes
    toc: yes
    toc_depth: 2
vignette: |
  %\VignetteIndexEntry{ICA}
  %\usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
---

## Fourth order blind identification

In independent component analysis (ICA) it is usually assumed that the observable $p$-vector $x$ is a linear mixture of $p$ mutually independent components, collected in the vector $s$. Hence the model can be written as
$$
x = As + \mu,
$$
where the full rank matrix $A$ represents the linear mixing and $\mu$ the location. It is usually further assumed that $E(s)=0$ and $Cov(s)=I_p$. The goal is then to find an unmixing matrix $W$ such that $Wx$ has independent components.

In ICA the gaussian components are considered the uninteresting ones, whereas the non-gaussian ones are considered interesting. In fact, if there is more than one gaussian component, only the non-gaussian components can be recovered, not the gaussian ones. It is therefore natural to test how many interesting components there are, which is done by testing whether a subvector of $Wx$ is gaussian.

FOBI is one of the first ICA methods. It jointly diagonalizes the regular covariance matrix $Cov$ and the so-called matrix of fourth moments
$$
Cov_4(x) = E(r^2(x-\mu)(x-\mu)^T),
$$
where $\mu = E(x)$ and $r^2=(x-\mu)^T Cov(x)^{-1} (x-\mu)$. Note that under this definition $Cov_4$ is not consistent under the normal model and the scale of a normal component would be $\sqrt{p+2}$. (Therefore, for example, the function `cov4` from the `ICS` package divides $Cov_4$ by $p+2$.)

Hence FOBI is the solution of the generalized eigenvector-eigenvalue problem
$$
W Cov(x) W^T = I_p \quad \text{and} \quad W Cov_4(x) W^T = D,
$$
where $D$ is a diagonal matrix. For the context under consideration it is natural to order the generalized eigenvectors in $W$ such that the eigenvalues $d_1,\ldots,d_p$, the diagonal elements of $D$, fulfill
$$
(d_1 - (p+2))^2 \geq \ldots \geq (d_p - (p+2))^2.
$$
This ordering places the gaussian components at the end of $s$, and the $d_i$'s corresponding to gaussian components should equal $p+2$. Hence, assuming that there are $k$ interesting components, the hypothesis of interest is
$$
H_0: d_{k+1} = \ldots = d_p = p+2.
$$

The package `ICtest` provides different asymptotic and bootstrapping-based tests for this hypothesis.
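The role of $p+2$ as the reference value can be verified with a small numerical check. The following sketch uses base R only (it is not the `ICtest` implementation): it whitens purely gaussian data and computes the matrix of fourth moments in the whitened coordinates, whose eigenvalues should then all be close to $p+2$.

```{r, message = FALSE, warning = FALSE}
# Minimal base R sketch (not the ICtest implementation): for purely gaussian
# data all eigenvalues of the fourth moment matrix of the whitened data
# should be close to p + 2.
set.seed(123)
p <- 4
Z <- matrix(rnorm(2000 * p), ncol = p)
Zc <- sweep(Z, 2, colMeans(Z))            # center the data
Y <- Zc %*% solve(chol(cov(Z)))           # whiten: Cov(Y) is approximately I_p
r2 <- rowSums(Y^2)                        # squared Mahalanobis distances
Cov4 <- crossprod(Y * r2, Y) / nrow(Y)    # matrix of fourth moments of Y
eigen(Cov4, symmetric = TRUE)$values      # all close to p + 2 = 6
```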
### Asymptotic tests for a specific value of $k$

The function for the asymptotic tests is `FOBIasymp`. It provides three different tests, denoted `S1`, `S2` and `S3` and specified using the `type` argument:

1. `type="S1"`: The test statistic $T$ is the variance of the last $p-k$ eigenvalues around $p+2$:
    $$
    T = n \sum_{i=k+1}^p (d_i-(p+2))^2.
    $$
    The limiting distribution of $T$ under the null is the sum of two weighted chisquare distributions with weights $w_1 = 2 \sigma_1 / (p-k)$ and $w_2 = 2 \sigma_1 / (p-k) + \sigma_2$ and degrees of freedom $df_1 = (p-k-1)(p-k+2)/2$ and $df_2 = 1$.

2. `type="S2"`: Another possible version of the test statistic is a scaled sum of the variance of the eigenvalues around their mean plus the variance around the expected value under normality ($p+2$). Denote by $VAR_{dpk}$ the variance of the last $p-k$ eigenvalues and by $VAR2_{dpk}$ the variance of these eigenvalues around $p+2$. Then the test statistic is
    $$
    T = \frac{n (p-k) VAR_{dpk}}{2 \sigma_1} + \frac{n \, VAR2_{dpk}}{2 \sigma_1 / (p-k) + \sigma_2}.
    $$
    This test statistic has a limiting chisquare distribution with $(p-k-1)(p-k+2)/2 + 1$ degrees of freedom.

3. `type="S3"`: The third possible test statistic checks only the equality of the last $p-k$ eigenvalues, using just the first part of the test statistic of `type="S2"`. The test statistic is then
    $$
    T = \frac{n (p-k) VAR_{dpk}}{2 \sigma_1}
    $$
    and has a limiting chisquare distribution with $(p-k-1)(p-k+2)/2$ degrees of freedom.

The constants $\sigma_1$ and $\sigma_2$ depend on the kurtosis values of the independent components and on the dimension $p$, and are estimated from the data.

To demonstrate the function `FOBIasymp` consider the following artificial data:

```{r, message = FALSE, warning = FALSE, fig.show = "hold"}
library(ICtest)

set.seed(1)
n <- 1500
S <- cbind(runif(n), rchisq(n, 2), rexp(n), rnorm(n), rnorm(n), rnorm(n))
A <- matrix(rnorm(36), ncol = 6)
X <- S %*% t(A)
```

Therefore there are three interesting components and three gaussian noise components.

```{r, message = FALSE, warning = FALSE, fig.show = "hold"}
FOBI1 <- FOBIasymp(X, k = 3, model = "ICA")
screeplot(FOBI1)
abline(h = 8) # p = 6, i.e. p + 2 = 8
```

The screeplot shows that the components are in the right order and that three components have an eigenvalue close to the value of interest. The test decision is

```{r, message = FALSE, warning = FALSE, fig.show = "hold"}
FOBI1
```

which we can compare to the other test versions:

```{r, message = FALSE, warning = FALSE, fig.show = "hold"}
FOBIasymp(X, k = 3, type = "S1", model = "ICA")
FOBIasymp(X, k = 3, type = "S2", model = "ICA")
```

Hence none of the three tests rejects the hypothesis that the last three components are gaussian.

The default, `model = "NGCA"`, is also consistent but does not explicitly assume an ICA model; instead it assumes the more general non-gaussian component analysis (NGCA) model, in which the interesting signals may be dependent. To test whether there are four gaussian components using the default `model` argument one could use, for example:

```{r, message = FALSE, warning = FALSE, fig.show = "hold"}
FOBIasymp(X, k = 2)
```

which would be rejected at the $0.05$ level.
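Since the hypotheses for different values of $k$ are nested, these tests can also be turned into an estimate of the number of non-gaussian components by testing $k = 0, 1, 2, \ldots$ in turn and taking the smallest value that is not rejected. The following small helper loop is only a sketch, not an `ICtest` function; it assumes that the p-value of the test is available in the `p.value` element of the returned object.

```{r, message = FALSE, warning = FALSE}
# Sequential testing sketch: estimate the number of non-gaussian components
# as the smallest k that is not rejected at level alpha.
alpha <- 0.05
k_hat <- NA
for (k in 0:(ncol(X) - 1)) {
  if (FOBIasymp(X, k = k)$p.value > alpha) {
    k_hat <- k
    break
  }
}
k_hat
```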
### Bootstrapping tests for a specific value of $k$

The test statistic used here is again the variance of the last $p-k$ eigenvalues around the value of interest $p+2$:
$$
T = n \sum_{i=k+1}^p (d_i-(p+2))^2.
$$

Two possible bootstrap testing strategies, denoted `B1` and `B2`, are provided for testing the null hypothesis of interest. Let $X$ be the centered data matrix of the $n$ $p$-variate observations and $S$ the corresponding matrix of estimated independent components, which can be decomposed into $S_1$, the non-gaussian components, and $S_2$, the gaussian components. Let $W$ be the estimated unmixing matrix.

1. `s.boot="B1"`: The first strategy has the following steps:
    a. Take a bootstrap sample $S_1^*$ of size $n$ from $S_1$.
    b. Take a bootstrap sample $S_2^*$ consisting of a matrix of standard normally distributed elements.
    c. Combine $S^*=(S_1^*, S_2^*)$ and create $X^*= S^* W^{-1}$.
    d. Compute the test statistic $T^*$ based on $X^*$.
    e. Repeat the previous steps `n.boot` times to get the test statistics $T^*_1,\ldots, T^*_{n.boot}$.

    Note that this bootstrapping test does not use the assumption of "independent components"; it only uses the fact that the last $p-k$ components are gaussian and independent of the first $k$ components. Therefore this strategy can be applied both in an independent component analysis (ICA) framework and in a non-gaussian component analysis (NGCA) framework.

2. `s.boot="B2"`: The second strategy has the following steps:
    a. Take a bootstrap sample $S_1^*$ of size $n$ from $S_1$, where the resampling is done separately for each independent component.
    b. Take a bootstrap sample $S_2^*$ consisting of a matrix of standard normally distributed elements.
    c. Combine $S^*=(S_1^*, S_2^*)$ and create $X^*= S^* W^{-1}$.
    d. Compute the test statistic $T^*$ based on $X^*$.
    e. Repeat the previous steps `n.boot` times to get the test statistics $T^*_1,\ldots, T^*_{n.boot}$.

    This second strategy assumes a full ICA model and cannot be used in an NGCA framework.

In both cases the p-value is
$$
\frac{\#(T_i^* \geq T)+1}{n.boot+1}.
$$

To test for the previous data whether there are three gaussian components, the bootstrap tests of `FOBIboot` can be used as follows:

```{r, message = FALSE, warning = FALSE, fig.show = "hold"}
FOBIboot(X, k = 3, s.boot = "B1")
FOBIboot(X, k = 3, s.boot = "B2")
```
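To make the resampling scheme concrete, the following sketch performs a single `B1` resampling step by hand, reusing `FOBI1` from above. It assumes that the returned object stores the estimated components (with the gaussian ones last) in its `S` element and the estimated unmixing matrix in its `W` element; `FOBIboot` of course carries out these steps internally.

```{r, message = FALSE, warning = FALSE}
# One "B1" resampling step by hand (illustration only)
k <- 3
S_hat <- as.matrix(FOBI1$S)   # estimated components, gaussian ones last (assumption)
W_hat <- FOBI1$W              # estimated unmixing matrix (assumption)
n <- nrow(S_hat)
p <- ncol(S_hat)
S1_star <- S_hat[sample(n, replace = TRUE), 1:k, drop = FALSE]  # step a
S2_star <- matrix(rnorm(n * (p - k)), nrow = n)                 # step b
X_star  <- cbind(S1_star, S2_star) %*% t(solve(W_hat))          # step c (rows are observations)
dim(X_star)
```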
## Non-Gaussian projection pursuit

Non-Gaussian projection pursuit (NGPP) assumes that the observed multivariate data are generated as a linear mixture of $k$ signals and $p - k$ channels of Gaussian noise. The package provides both methods for estimating the signals when $k$ is known *a priori* and a technique for testing and estimating the value of $k$ itself. The estimation is based on the maximization of convex combinations of squared objective functions, of which four popular ones are included in the package:

- Third cumulant, `skew`
- Fourth cumulant, `pow3`
- Logarithmic hyperbolic cosine, `tanh`
- Gaussian function, `gauss`

The names of the functions in the package (`skew` etc.) refer to the corresponding derivatives, or *non-linearities*.

### Estimating the signal components

The simplest way to use NGPP is to call the function with the default settings and specify the number of signal components `k` to extract. The default objective function is a particular linear combination of squared third and fourth cumulants, the *Jarque-Bera* test statistic for normality. To use some other objective function, specify a vector of objective function names in `nl` and the corresponding weights in `alpha`.

```{r, message = FALSE, warning = FALSE, fig.show = "hold"}
library(ICtest)

X <- as.matrix(iris[, 1:4])

res1 <- NGPP(X, k = 2)
plot(res1$S, col = as.factor(iris[, 5]), xlab = "signal_1", ylab = "signal_2")

res2 <- NGPP(X, k = 2, nl = c("tanh", "pow3"), alpha = c(0.5, 0.5))
plot(res2$S, col = as.factor(iris[, 5]), xlab = "signal_1", ylab = "signal_2")
```

### Testing for a specific value of $k$

The above example used the default `method = "symm"`, meaning that all `k` signals are estimated simultaneously; the alternative `method = "defl"` estimates the signals one by one, generally leading to different solutions. Perhaps more interestingly, the one-by-one estimation can also be used to test the null hypothesis $H_0: k_{true} \leq k$ with the function `NGPPsim`. Let us next create some mixed toy data and try to infer its true signal dimension.

```{r, message = FALSE, warning = FALSE}
set.seed(2016)

S <- cbind(rexp(100), rnorm(100), rnorm(100))
A <- matrix(rnorm(9), 3, 3)
X <- S %*% A

res0 <- NGPPsim(X, k = 0, N = 200)
res0$p.value

res1 <- NGPPsim(X, k = 1, N = 200)
res1$p.value
```

The obtained $p$-values provide sufficient evidence to conclude that $k_{true} > 0$, while the hypothesis $k_{true} \leq 1$ is not rejected, in line with a true signal dimension of one. The parameter `N` controls the number of repetitions used to simulate the distribution of the test statistic under the null hypothesis; increasing it gives more accurate results but also increases the computation time.

### Estimation of the true dimension

In the above example we had to call `NGPPsim` separately for each value of `k`. This process is automated by the function `NGPPest`, which performs the hypothesis test for all $k = 0, \ldots , p-1$. Let us create another toy data set and try to estimate its dimension, this time using only the non-linearity `pow3`.

```{r, message = FALSE, warning = FALSE}
set.seed(2016)

S <- cbind(rexp(100), runif(100), rnorm(100))
A <- matrix(rnorm(9), 3, 3)
X <- S %*% A

res <- NGPPest(X, nl = "pow3", N = 200)
res$p.value
```

The resulting vector of $p$-values can now be directly interpreted as testing, for each particular component, the null hypothesis that it is just noise. The correct conclusion of two signal components is again reached.
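In the same spirit as the sequential use of `FOBIasymp` above, the vector of $p$-values can also be condensed into a single estimate of the signal dimension. The following one-liner is only a sketch; it assumes that the $p$-values are ordered according to $k = 0, \ldots, p-1$.

```{r, message = FALSE, warning = FALSE}
# Sequential testing sketch: the estimated signal dimension is the smallest
# (0-based) k whose test is not rejected at level alpha = 0.05.
alpha <- 0.05
which(res$p.value > alpha)[1] - 1
```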