
Workshop on High-Dimensional Data Analysis
(27 - 29 Feb 2008)
Jointly organized with Department of Statistics & Applied Probability
~ Abstracts ~
Spectra of large dimensional random matrices (LDRM) Arup Bose, Indian Statistical Institute, India
We shall consider (square) matrices with random entries (real or complex), for example the sample variance-covariance matrix, the IID matrix, the Wigner matrix and the Toeplitz matrix, where the dimension grows to infinity. Properties of the eigenvalues of such matrices are of interest.
In this talk we will mostly look at real symmetric matrices and discuss broadly the limiting spectral distribution (LSD) of these matrices under suitable conditions.
We shall provide some simulations with these matrices, give a loose description of some results on the LSD, and pose some questions which should be of interest to statisticians and probabilists.
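As a flavor of the simulations mentioned above (a minimal numpy sketch, not taken from the talk), the eigenvalues of a suitably scaled real symmetric Wigner matrix concentrate on [-2, 2] as the dimension grows, in accordance with Wigner's semicircle law:

```python
import numpy as np

def wigner_eigenvalues(n, seed=0):
    """Eigenvalues of an n x n real symmetric Wigner (GOE-type) matrix,
    scaled by sqrt(n) so the spectrum converges to the interval [-2, 2]."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    w = (a + a.T) / np.sqrt(2)          # symmetrize; off-diagonal variance 1
    return np.linalg.eigvalsh(w / np.sqrt(n))

eigs = wigner_eigenvalues(1000)
# The bulk of the empirical spectral distribution lies in [-2, 2],
# with only edge fluctuations outside.
```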
Clustering curves via subspace projection Jeng-Min Chiou, Institute of Statistical Science, Academia Sinica, Taiwan
This study considers a functional clustering method, k-centers functional clustering, for random curves. The k-centers functional clustering approach accounts for both the mean and the modes-of-variation differentials among clusters, and predicts cluster memberships via projection and reclassification. The distance measures considered include the L2 distance and the functional correlation defined in this study, which are embedded in the clustering criteria. The cluster membership predictions are based on nonparametric random effect models of the truncated Karhunen-Loève expansion, coupled with a nonparametric iterative mean and covariance updating scheme. The properties of the proposed clustering methods shed light on the cluster qualities. Simulation studies and practical examples illustrate the performance of the proposed methods.
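The basic projection idea can be pictured with a toy numpy sketch (this is a loose illustration of clustering curves through their leading eigenfunction scores, not the authors' k-centers procedure): simulate two groups of noisy curves, project onto the first functional principal components, and cluster the scores.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)
# Two groups of noisy curves with different mean functions (toy data)
g1 = np.sin(2 * np.pi * t) + 0.3 * rng.standard_normal((40, 50))
g2 = np.cos(2 * np.pi * t) + 0.3 * rng.standard_normal((40, 50))
curves = np.vstack([g1, g2])

# Functional PCA via SVD of the centered data matrix
centered = curves - curves.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt[:2].T           # project onto first two eigenfunctions

# Minimal 2-means (Lloyd's algorithm) on the projected scores
centers = scores[[0, -1]]
for _ in range(20):
    d = np.linalg.norm(scores[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([scores[labels == k].mean(axis=0) for k in range(2)])
```

With well-separated mean functions the two groups fall cleanly into the two clusters of the score space.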
RKHS formulations of some functional data analysis problems Tailen Hsing, University of Michigan, USA
We discuss the inference of two processes in the context of functional data analysis, including canonical correlations and regression. The common approach defines canonical variables or regressors in terms of projections in a Hilbert space. While this is conceptually straightforward, it has a number of weaknesses. We describe an approach that does not require the specification of a Hilbert space, which leads to new theory and more general inference procedures.
Nonlinear dimension reduction with kernel methods Su-Yun Huang, Institute of Statistical Science, Academia Sinica, Taiwan
Dimension reduction has long been an important technique for high-dimensional data analysis. Principal component analysis (PCA), canonical correlation analysis (CCA) and sliced inverse regression (SIR) are important tools in classical statistical analysis for linear dimension reduction. In this talk we will introduce their nonlinear extensions using kernel methods.
The essence of kernel-based nonlinear dimension reduction is to map the pattern data, originally observed in Euclidean space, to a high-dimensional Hilbert space, called the feature space, by an appropriate kernel transformation. Low-dimensional projections of high-dimensional feature data are approximately elliptically contoured and approximately Gaussian distributed. Notions of PCA, CCA and SIR can be extended to the framework of the kernel-associated feature Hilbert space, known as a reproducing kernel Hilbert space, for nonlinear dimension reduction. Computing algorithms, including large-data handling, and numerical examples will be presented.
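A minimal sketch of kernel PCA, the simplest of these nonlinear extensions (assuming a Gaussian kernel; the talk covers more): the centered Gram matrix plays the role of the feature-space covariance, and its leading eigenvectors give the nonlinear principal component scores.

```python
import numpy as np

def kernel_pca(x, n_components=2, gamma=1.0):
    """Kernel PCA with a Gaussian (RBF) kernel: map data implicitly into a
    reproducing kernel Hilbert space and diagonalize the centered Gram matrix."""
    sq = np.sum((x[:, None] - x[None]) ** 2, axis=2)
    k = np.exp(-gamma * sq)                  # Gram matrix K_ij = k(x_i, x_j)
    n = len(x)
    h = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    kc = h @ k @ h                           # center in the feature space
    vals, vecs = np.linalg.eigh(kc)
    idx = np.argsort(vals)[::-1][:n_components]
    # Scores: projections of feature-space data onto leading eigendirections
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 5))
z = kernel_pca(x)                            # 100 x 2 nonlinear component scores
```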
Functional mixture regression Thomas Lee, The Chinese University of Hong Kong, Hong Kong
This talk introduces Functional Mixture Regression (FMR), a natural and useful extension of the classical functional linear regression (FLR) model. FMR generalizes FLR in essentially the same way as linear mixture regression generalizes linear regression. That is, the observed predictor random processes are allowed to form subgroups in such a way that each subgroup has its own regression parameter function. In this talk both theoretical and empirical properties of FMR will be discussed.
This is joint work with Yuejiao Fu and Fang Yao.
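The underlying mixture-of-regressions idea can be sketched in its simplest scalar form (a toy EM illustration under assumed two-component Gaussian errors, not the functional setting of the talk): each subgroup has its own slope, and EM alternates between soft membership assignment and weighted least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-1, 1, n)
z = rng.random(n) < 0.5                     # latent subgroup labels
y = np.where(z, 2.0, -2.0) * x + 0.2 * rng.standard_normal(n)

# EM for a two-component mixture of simple linear regressions
b = np.array([1.0, -1.0])                   # one slope per component
pi, sigma = np.array([0.5, 0.5]), 0.5
for _ in range(100):
    # E-step: responsibilities under Gaussian errors
    resid = y[:, None] - x[:, None] * b[None]
    dens = pi * np.exp(-0.5 * (resid / sigma) ** 2) / sigma
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted least squares per component, then scale update
    b = (r * x[:, None] * y[:, None]).sum(axis=0) / (r * x[:, None] ** 2).sum(axis=0)
    pi = r.mean(axis=0)
    resid = y[:, None] - x[:, None] * b[None]
    sigma = np.sqrt((r * resid ** 2).sum() / n)
```

The fitted slopes recover the two subgroup regression parameters (here roughly +2 and -2).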
Variable selection and coefficient estimation via regularized rank regression Chenlei Leng, National University of Singapore
The penalized least squares method with an appropriately defined penalty is widely used for simultaneous variable selection and coefficient estimation in linear regression. However, the least squares (LS) based methods may be adversely affected by outlying observations and heavy-tailed distributions.
On the other hand, the least absolute deviation (LAD) estimator is more robust, but may be inefficient for many distributions of interest.
To overcome these issues, we propose a novel method, termed the regularized rank regression estimator, by combining the LAD and penalized LS methods for variable selection. We show that the proposed estimator has attractive theoretical properties and is easy to implement.
Simulations and real data analysis both show that the proposed method performs well in finite samples.
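To see why rank-based losses resist outliers and heavy tails, here is a toy numpy sketch of Jaeckel's rank dispersion with Wilcoxon scores, a classical loss behind rank regression (this illustrates only the unpenalized loss in one dimension; the talk's regularized estimator and its penalty are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.standard_normal(n)
y = 2.0 * x + rng.standard_t(2, n)       # heavy-tailed errors, true slope 2
y[:10] += 30.0                           # gross outliers in the response

def jaeckel(beta):
    """Jaeckel's rank dispersion with Wilcoxon scores: sum of residuals
    weighted by their centered ranks; bounded influence of outlying y."""
    e = y - beta * x
    score = (np.argsort(np.argsort(e)) + 1) / (n + 1) - 0.5
    return np.sum(score * e)

# One-dimensional grid minimization of the rank dispersion
grid = np.linspace(0.0, 4.0, 401)
beta_rank = grid[np.argmin([jaeckel(b) for b in grid])]
beta_ls = (x @ y) / (x @ x)              # ordinary least squares, for contrast
```

The rank-based slope stays near the true value 2 despite the contamination that can pull the LS slope away.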
Model selection, dimension reduction and liquid association: a trilogy via Stein's lemma Ker-Chau Li, Institute of Statistical Science, Academia Sinica, Taiwan
University of California, Los Angeles, USA
In this talk, I will describe how a basic idea from Stein's monumental work in decision theory has led to my earlier research in model selection (generalized cross validation, honest confidence regions), dimension reduction (sliced inverse regression and principal Hessian directions) and, more recently, to the development of liquid association for bioinformatics applications.
References:
Li, K. C. (1985). From Stein's unbiased risk estimates to the method of generalized cross validation. Ann. Statist. 13, 1352-1377.
Li, K. C. (1992). On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. J. Amer. Statist. Assoc. 87, 1025-1039.
Li, K. C., Palotie, A., Yuan, S., Bronnikov, D., Chen, D., Wei, X., Choi, O., Saarela, J. and Peltonen, L. (2007). Finding candidate disease genes by liquid association. Genome Biology 8, R205. doi:10.1186/gb-2007-8-10-r205
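Stein's lemma itself, the common thread of the talk, is easy to check numerically (a Monte Carlo sketch with one assumed test function): for X ~ N(0,1) and differentiable g, E[g(X) X] = E[g'(X)].

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

# Stein's lemma: E[g(X) X] = E[g'(X)] for X ~ N(0,1).
# With g(t) = t^3 both sides equal E[3 X^2] = 3.
lhs = np.mean(x ** 3 * x)     # Monte Carlo estimate of E[g(X) X] = E[X^4]
rhs = np.mean(3.0 * x ** 2)   # Monte Carlo estimate of E[g'(X)] = 3 E[X^2]
```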
Dimension reduction for unsupervised and partially supervised learning Debasis Sengupta, Indian Statistical Institute, India
Machine learning is often attempted through clustering and/or classification of multidimensional input data. While classification and clustering are used in supervised and unsupervised learning, respectively, there are also clustering problems in partially supervised learning, where the classes represented in the training data are far from exhaustive. In all these cases, the problem of high dimensionality has to be addressed. We consider dimension reduction for clustering on the basis of a mixture model, where observations are normally distributed around a cluster center, and the cluster centers themselves have a multivariate normal distribution. We propose an intuitively appealing objective function for this problem, and work out solutions for the unsupervised and partially supervised clustering cases.
We apply the methods to the problem of pugmark-based estimation of the total tiger population, and to that of clustering organisms in terms of the tetranucleotide content pattern of ribosomal DNA sequences.
Supervised singular value decomposition and its application to independent component analysis for fMRI Young Truong, The University of North Carolina, USA
Functional magnetic resonance imaging (fMRI) has been used by neuroscientists as a powerful tool to study brain function. Independent component analysis (ICA) is an effective method for exploring spatiotemporal features in fMRI data. It has been especially successful in recovering brain-function-related signals from recorded mixtures of unrelated signals. Due to the high sensitivity of MR scanners, spikes are commonly observed in fMRI data, and they degrade the analysis. No particular method exists yet to address this problem. In this paper, we introduce a supervised singular value decomposition technique into the data reduction step of ICA. Two major advantages are discussed: first, the proposed method improves the robustness of ICA against spikes; second, the method uses the particular fMRI experimental design to guide the fully data-driven ICA, and makes the computation more efficient. The advantages are demonstrated using a spatiotemporal simulation study as well as a real data analysis. This is joint work with Bai, P., Shen, H. and Huang, X.
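The data reduction step can be pictured with a plain (unsupervised) truncated SVD on simulated fMRI-like data; the supervision via the experimental design described in the talk is not reproduced in this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy "fMRI-like" data: a few latent time courses mixed across many voxels
sources = rng.standard_normal((3, 500))          # 3 latent time courses
mixing = rng.standard_normal((2000, 3))          # mixing weights over 2000 voxels
data = mixing @ sources + 0.1 * rng.standard_normal((2000, 500))

# Data reduction before ICA: keep only the leading singular subspace
u, s, vt = np.linalg.svd(data, full_matrices=False)
k = 3
reduced = np.diag(s[:k]) @ vt[:k]                # k x 500 reduced time series
explained = (s[:k] ** 2).sum() / (s ** 2).sum()  # energy kept by the truncation
```

The sharp gap between the k-th and (k+1)-th singular values reflects the low-rank signal sitting above the noise floor.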
Sliced regression for dimension reduction Hansheng Wang, Peking University, China
By slicing the region of the response (Li, 1991) and applying local kernel regression (MAVE; Xia et al., 2002) to each slice, a new dimension reduction method is proposed.
Compared with the traditional inverse regression methods, e.g. sliced inverse regression (Li, 1991), the new method is free of the linearity condition (Li, 1991) and enjoys much improved estimation accuracy. Compared with the direct estimation methods (e.g., MAVE), the new method is much more robust against extreme values and can capture the entire central subspace (Cook, 1998) exhaustively. To determine the dimension of the central subspace, a consistent cross-validation (CV) criterion is developed. Extensive numerical studies, including one real example, confirm our theoretical findings.
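For readers unfamiliar with slicing, here is the classical sliced inverse regression estimator in a few lines of numpy (a sketch of the baseline SIR of Li, 1991, not of the new sliced regression method): slice the sorted response, average the predictors within slices, and eigendecompose the covariance of the slice means.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, h = 1000, 5, 10
x = rng.standard_normal((n, p))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y = (x @ beta) ** 3 + 0.5 * rng.standard_normal(n)   # one-dimensional central subspace

# SIR: sort by y, average x within each of h slices, then eigendecompose
# the covariance of the slice means (Sigma_x = I for this simulated x)
slices = np.array_split(np.argsort(y), h)
means = np.array([x[idx].mean(axis=0) for idx in slices]) - x.mean(axis=0)
m = means.T @ means / h                              # SIR kernel matrix
direction = np.linalg.eigh(m)[1][:, -1]              # estimated direction
```

The leading eigenvector closely aligns with the true index direction beta.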
Central limit theorem for linear spectral statistics of large dimensional F matrix Shurong Zheng, Northeast Normal University, China
A central limit theorem (CLT) for linear spectral statistics (LSS) of a product of a large dimensional sample covariance matrix and a nonnegative definite Hermitian matrix was established in Bai and Silverstein (2004). However, their results do not cover the case of a product of one sample covariance matrix and the inverse of another, independent sample covariance matrix (the F matrix). This is because, for the F matrix, their CLT establishes the asymptotic normality of the difference of two dependent statistics, defined by the empirical spectral distribution (ESD) of the F matrix and by the ESD of the inverse of the second sample covariance matrix. In many applications of the F matrix, however, one is interested in making statistical inference for a parameter defined by the limiting spectral distribution (LSD) of the F matrix, and hence in the asymptotic distribution of the difference between this parameter and the estimator defined by the LSS of the F matrix. In this paper, we establish the CLT for the LSS of the F matrix. As a consequence, we also establish the CLT for the LSS of the beta matrix.
Key words and phrases: Linear spectral statistics, central limit theorem, large dimensional random matrix, large dimensional data analysis.
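The objects involved are easy to simulate (a numpy sketch with assumed dimensions, purely illustrative): form two independent sample covariance matrices, compute the spectrum of their "F matrix" ratio via a symmetric similar matrix, and evaluate a linear spectral statistic on it.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n1, n2 = 100, 300, 400
x1 = rng.standard_normal((p, n1))
x2 = rng.standard_normal((p, n2))
s1 = x1 @ x1.T / n1                  # two independent sample covariance matrices
s2 = x2 @ x2.T / n2

# Spectrum of the F matrix S1 S2^{-1}: use the symmetric matrix
# L^{-1} S1 L^{-T} (with S2 = L L^T), which has the same eigenvalues
li = np.linalg.inv(np.linalg.cholesky(s2))
f_eigs = np.linalg.eigvalsh(li @ s1 @ li.T)

# A linear spectral statistic of the F matrix: the mean log-eigenvalue
lss = np.mean(np.log(f_eigs))
```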
A binary response transformation-expectation estimation in dimension reduction Lixing Zhu, The Hong Kong Baptist University, Hong Kong
Slicing estimation is one of the most popular methods in the sufficient dimension reduction area. However, the efficacy of slicing estimation for many inverse regression methods depends heavily on the choice of the number of slices when the response variable is continuous. This choice is similar to, but more difficult than, classical tuning parameter selection in nonparametric function estimation. Thus, how to select the number of slices is a long-standing and still open problem. In this paper, we propose a binary response transformation-expectation (BRTE) method. It completely avoids selecting the number of slices, and meanwhile preserves the integrity of the original central subspace. This generic method also ensures the root-n consistency and asymptotic normality of slicing estimators for many inverse regression methods, and can be applied to multivariate response cases. Finally, BRTE is compared with existing estimators via extensive simulations and an illustrative real data example.
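The dichotomization idea can be sketched loosely in numpy (this is an illustration of replacing slices with binary indicators I(Y <= t) averaged over thresholds, not the authors' BRTE procedure): no slice number is chosen, yet the central subspace direction is recovered.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 1000, 4
x = rng.standard_normal((n, p))
beta = np.array([1.0, -1.0, 0.0, 0.0]) / np.sqrt(2.0)
y = np.exp(x @ beta) + 0.3 * rng.standard_normal(n)

# Slicing-free kernel: average the outer products of E[X | Y <= t]
# over a grid of thresholds t (each t dichotomizes the response)
xbar = x.mean(axis=0)
m = np.zeros((p, p))
for t in np.quantile(y, np.linspace(0.1, 0.9, 17)):
    ind = y <= t
    d = x[ind].mean(axis=0) - xbar
    m += np.outer(d, d) * ind.mean() * (1 - ind.mean())
direction = np.linalg.eigh(m)[1][:, -1]    # estimated central-subspace direction
```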

