(i) Dimension reduction: Dimensionality reduction involves the determination of the intrinsic dimensionality q of the input space, where q << p. This can be done by orthogonalization techniques on the input space, which reduce the problem to a lower-dimensional, orthogonal input space and lead to variance reduction for the estimator. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are standard methods for dimensionality reduction. However, if p >> n, then most of these techniques cannot be used directly.
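As an illustration, the following is a minimal sketch of PCA via the SVD in Python (using NumPy; the simulated data matrix and the choice q = 5 are placeholders for the example):

    import numpy as np

    def pca_reduce(X, q):
        # Center the columns of the n x p data matrix X.
        Xc = X - X.mean(axis=0)
        # Thin SVD: Xc = U diag(s) Vt; the rows of Vt are the principal directions.
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        # Project the data onto the first q principal components.
        return Xc @ Vt[:q].T

    # Example: reduce n = 200 observations in p = 50 dimensions to q = 5 components.
    Z = pca_reduce(np.random.randn(200, 50), q=5)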
(ii) Kernelization: In applications such as signal processing, it is often the case that p >> n in the time domain. A ten-second audio recording at a 44100 Hz sampling rate generates a vector of dimension p = 441000 in the time domain, and one usually has only a few hundred or perhaps a thousand (= n) tracks for analysis.
In image processing there are similar problems of dimensionality, with a face image of size 640 × 512 generating a p = 327680-dimensional input space. In both these cases it is not possible to use PCA or SVD directly because p >> n. Here one uses the method of kernelization. Given a data set with n input vectors x_i ∈ X from some p-dimensional space, the main component of kernelization is a bivariate function K(·, ·) defined on X × X with values in R. The matrix K given by
    K(x_1, x_1)   . . .   K(x_1, x_n)
    K(x_2, x_1)   . . .   K(x_2, x_n)
       . . .      . . .      . . .
    K(x_n, x_1)   . . .   K(x_n, x_n)
is called a Gram matrix. The Gram matrix is of order n × n and does not depend on p. One can compute the eigenvalues and eigenvectors of this n × n matrix K and analyze the data.
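A minimal sketch of this computation, assuming a Gaussian (RBF) kernel as the choice of K(·, ·) and simulated inputs (both are placeholders, not from the source):

    import numpy as np

    def gram_matrix(X, kernel):
        # n x n matrix with (i, j)-th entry K(x_i, x_j); its size does not depend on p.
        n = X.shape[0]
        return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

    # A Gaussian (RBF) kernel, one common choice of K(., .).
    rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)

    X = np.random.randn(100, 5000)        # n = 100 inputs from a p = 5000-dimensional space
    K = gram_matrix(X, rbf)
    eigvals, eigvecs = np.linalg.eigh(K)  # eigen-analysis of the n x n matrix only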
(iii) Bagging: As observed earlier, it is common with massive data that a single selected model does not lead to optimal prediction. If there is multicollinearity between the variables, which is bound to happen when p is very large, the estimators are unstable and have large variances. Bootstrap aggregation (also called bagging) reduces the variance of the estimators by aggregating bootstrapped versions of the base estimators.
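A minimal sketch of bagging, with ordinary least squares as the base learner (the base learner and the number of bootstrap samples B are illustrative choices):

    import numpy as np

    def bagged_predict(X, y, X_new, fit, predict, B=100, rng=None):
        # Average the predictions of a base learner fitted on B bootstrap samples.
        rng = rng or np.random.default_rng(0)
        n = len(y)
        preds = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)     # bootstrap sample (drawn with replacement)
            model = fit(X[idx], y[idx])          # fit the base estimator on the resample
            preds.append(predict(model, X_new))  # predict on the new inputs
        return np.mean(preds, axis=0)            # aggregation reduces the variance

    # Example base learner: least squares via the pseudo-inverse.
    fit_ols = lambda X, y: np.linalg.pinv(X) @ y
    predict_ols = lambda beta, X_new: X_new @ beta
    # Usage: y_hat = bagged_predict(X, y, X_new, fit_ols, predict_ols)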
(iv) Parallelization: When the computational complexity of building the base learner is high, the method of bagging becomes inefficient and impractical. One way to avoid this problem is to use parallel processing. Big Data analytics needs parallel processing, or parallelization, to speed up computation or to handle massive data that cannot fit into a single computer's memory. One way to make statistical procedures more efficient in the analysis of Big Data is to parallelize them, that is, to write algorithms that can run on many computers or many processors at the same time. The bootstrap is a standard method for inferring the probability distribution from a sample. It is computationally intensive; however, it is ideally suited to parallelization because it involves generating numerous independent rounds of simulated data. One can use the "Bag of Little Bootstraps" (BLB), which produces results comparable to the regular bootstrap but much faster.
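A minimal sketch of parallelizing the independent bootstrap rounds with Python's multiprocessing module (the data and the statistic, here the sample mean, are placeholders; BLB itself would additionally resample from small subsets of the data):

    import numpy as np
    from multiprocessing import Pool

    # A fixed seed so every worker process reconstructs the same sample.
    DATA = np.random.default_rng(0).standard_normal(10000)

    def bootstrap_replicate(seed):
        # One independent bootstrap round: resample with replacement, recompute the statistic.
        rng = np.random.default_rng(seed)
        resample = rng.choice(DATA, size=DATA.size, replace=True)
        return resample.mean()

    if __name__ == "__main__":
        with Pool() as pool:                     # independent rounds run in parallel
            stats = pool.map(bootstrap_replicate, range(1000))
        print("bootstrap standard error of the mean:", np.std(stats))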
(v) Regularization: With large p and small n, there exists a multiplicity of solutions for any optimization problem involving Big Data, and hence the problem becomes ill-posed. Regularization methods are used to find a feasible optimal solution, and one method of regularization is the Lagrangian formulation of a constrained version of the problem. LASSO (Tibshirani (1996)) is one such method in high-dimensional data analysis.
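In Lagrangian form the LASSO minimizes ||y - X beta||_2^2 / (2n) + lambda ||beta||_1, where the penalty weight lambda replaces an explicit constraint on ||beta||_1. A minimal sketch using scikit-learn's Lasso (the simulated data and the penalty value are placeholders):

    import numpy as np
    from sklearn.linear_model import Lasso

    n, p = 100, 1000                              # p >> n
    X = np.random.randn(n, p)
    beta_true = np.zeros(p)
    beta_true[:5] = 3.0                           # only a few nonzero coefficients
    y = X @ beta_true + 0.1 * np.random.randn(n)

    model = Lasso(alpha=0.1).fit(X, y)            # alpha plays the role of the Lagrange penalty
    print("nonzero coefficients:", np.count_nonzero(model.coef_))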
(vi) Assumption of sparsity: As we described earlier, thousands of irrelevant parameters will appear to be statistically significant if we use small-data statistics for Big Data. In classical statistics, if the data imply the occurrence of an event that has a one-in-a-million chance of occurring, then we are sure it is not by chance and hence consider it statistically significant. But if we are considering Big Data with a large number of parameters, it is possible for the event to occur by chance and not because of a significant relationship. Most data sets have only a few strong relationships between variables, and everything else is noise; thus most of the parameters do not matter. This leads to the sparsity assumption, which is to assume that all but a few parameters are negligible. This allows a way of extracting information from Big Data. One such method is L1-minimization, called LASSO, due to Tibshirani (1996). It has been used in the field of image processing to extract an image in sharp focus from blurry or noisy data.
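To show why the L1 penalty produces sparse solutions, here is a minimal sketch of iterative soft-thresholding (ISTA), one standard way to carry out the L1-minimization; the simulated data and the penalty value are placeholders:

    import numpy as np

    def soft_threshold(z, t):
        # Proximal operator of the L1 norm: entries smaller than t are set exactly to zero.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def ista(X, y, lam, steps=500):
        # Iterative soft-thresholding for min_b ||y - X b||^2 / 2 + lam * ||b||_1.
        L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the gradient
        b = np.zeros(X.shape[1])
        for _ in range(steps):
            b = soft_threshold(b + X.T @ (y - X @ b) / L, lam / L)
        return b

    # Example: recover a sparse coefficient vector from noisy measurements.
    X = np.random.randn(80, 400)
    b_true = np.zeros(400)
    b_true[:4] = 2.0
    b_hat = ista(X, X @ b_true + 0.01 * np.random.randn(80), lam=0.1)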
(vii) False Discovery Rate (FDR): Another technique that is applied in the analysis of Big Data, especially in genomic and neuroimaging research, is the false discovery rate (FDR) suggested by Benjamini and Hochberg (1995). If a study finds 20 locations in the human genome with a statistically significant association with cancer and it has a false discovery rate of ten percent, then we can expect, on average, two of the 20 discoveries to be false. The FDR does not indicate which discoveries are spurious, but that can sometimes be determined by a follow-up study.
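A minimal sketch of the Benjamini-Hochberg step-up procedure (the level q = 0.10 matches the ten percent false discovery rate in the example above; the p-values are assumed given):

    import numpy as np

    def benjamini_hochberg(pvals, q=0.10):
        # Benjamini-Hochberg step-up procedure controlling the FDR at level q.
        p = np.asarray(pvals)
        m = p.size
        order = np.argsort(p)
        thresholds = q * np.arange(1, m + 1) / m   # the BH line q * i / m
        below = p[order] <= thresholds
        if not below.any():
            return np.zeros(m, dtype=bool)         # no discovery is declared
        k = np.max(np.nonzero(below)[0])           # largest i with p_(i) <= q * i / m
        reject = np.zeros(m, dtype=bool)
        reject[order[:k + 1]] = True               # reject the k smallest p-values
        return reject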
(viii) The problem of "Big n, Big p, Little t": The speed at which one can process data is an important element in analyzing Big Data. Classical statistics was always done in an off-line mode: the sample size was small and the time for analysis was essentially unlimited. However, in the era of Big Data things are different. For a web company that is trying to predict user reaction and elicit user behaviour, such as clicking on an advertisement sponsored by a client, time is important. The web company might have only milliseconds to decide how to respond to a given user's click. Furthermore, the model constantly has to change to adapt to new users and new products. The objective of the person analyzing the data may not be to deliver a perfect answer but to deliver a good answer fast.
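One common way to deliver a good answer fast is an online update that revises the model one observation at a time instead of refitting from scratch. A minimal sketch of online logistic regression by stochastic gradient descent (the simulated click stream, the 20 features and the learning rate are all illustrative assumptions, not from the source):

    import numpy as np

    rng = np.random.default_rng(0)

    def sgd_update(w, x, y, lr=0.01):
        # One online step: revise the weights from a single (features, click / no-click) pair.
        p = 1.0 / (1.0 + np.exp(-w @ x))   # predicted click probability
        return w - lr * (p - y) * x        # gradient step on the logistic loss

    w = np.zeros(20)                        # model with 20 features
    for _ in range(10000):                  # simulated stream of user events
        x = rng.standard_normal(20)
        y = float(rng.random() < 1.0 / (1.0 + np.exp(-x[0])))  # clicks driven by one feature
        w = sgd_update(w, x, y)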
(ix) Privacy and Confidentiality: How does one maintain privacy and confidentiality in the era of Big Data? Public concern about privacy, confidentiality, and the misuse and abuse of individual data is a central issue in the collection of Big Data. There are ways of masking Big Data. One way is to anonymize the records after they are collected, either by adding random noise or by matrix masking of the data matrix through a known mathematical operation, so that individual information is difficult to retrieve. Cryptography is another discipline that applies mathematical transformations to data that are either irreversible, reversible only with a password, or reversible only at an expense so great that an opponent can ill afford to pay it.
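A minimal sketch of the two masking ideas mentioned above, additive random noise and matrix masking of the form A X B + C with known matrices (the noise scale and the matrix choices are placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 10))      # data matrix: 500 individuals, 10 attributes

    # Masking by additive noise: release X + E instead of X.
    X_noised = X + rng.normal(scale=0.5, size=X.shape)

    # Matrix masking: release A @ X @ B + C for known matrices A, B and C.
    A = rng.standard_normal((500, 500))
    B = rng.standard_normal((10, 10))
    C = rng.standard_normal((500, 10))
    X_masked = A @ X @ B + C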
Reference: B.L.S. Prakasa Rao, Brief Notes on Big Data: A Cursory Look.