Saturday, October 1, 2016

METHODS OF HANDLING BIG DATA


(i) Dimension reduction: Dimensionality reduction involves determining the intrinsic dimensionality q of the input space, where q << p. This can be done by orthogonalization techniques on the input space, which reduce the problem to a lower-dimensional, orthogonal input space and lead to variance reduction for the estimator. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are standard methods for dimensionality reduction. However, if p >> n, then most of these techniques cannot be used directly.
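
As a minimal sketch of this idea, the snippet below projects synthetic n × p data onto its first q principal components using scikit-learn; the data, the choice of q, and all parameter values are illustrative assumptions, not part of the original notes.

```python
# Minimal sketch: reducing a p-dimensional input space to q dimensions via PCA (computed by SVD).
# The data X and the target dimension q are synthetic and illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, p, q = 500, 50, 5                        # n samples, p input dimensions, intrinsic dimension q << p

# Synthetic data lying near a q-dimensional subspace, plus noise
latent = rng.normal(size=(n, q))
mixing = rng.normal(size=(q, p))
X = latent @ mixing + 0.1 * rng.normal(size=(n, p))

pca = PCA(n_components=q)
Z = pca.fit_transform(X)                    # n x q representation of the original n x p data
print(Z.shape, pca.explained_variance_ratio_.sum())
```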

(ii) Kernelization: In applications such as signal processing, it is usually the case that p >> n in the time domain. A ten-second audio recording at a 44100 Hz sampling rate generates a vector of dimension p = 441000 in the time domain, and one usually has only a few hundred or perhaps a thousand (= n) tracks for analysis. In image processing there are similar problems of dimensionality, with a face image of size 640 × 512 generating a p = 327680 dimensional input space. In both these cases it is not possible to use PCA or SVD because p >> n. Here one uses the method of kernelization. Given a data set with n input vectors x1, ..., xn ∈ X from some p-dimensional space X, the main component of kernelization is a bivariate function K(·, ·) defined on X × X with values in R. The matrix K given by
[ K(x1, x1)  ...  K(x1, xn) ]
[ K(x2, x1)  ...  K(x2, xn) ]
[    ...     ...     ...    ]
[ K(xn, x1)  ...  K(xn, xn) ]

is called a Gram matrix. The Gram matrix is of order n × n and does not depend on p. One can compute the eigenvalues and eigenvectors of the matrix K of lower dimension n and analyze the data.
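
A small sketch of this construction follows, assuming a Gaussian (RBF) kernel and synthetic data: the n × n Gram matrix is formed and its eigenvalues computed, at a cost that depends on n rather than p. The kernel choice and all sizes are illustrative.

```python
# Minimal sketch: form the n x n Gram matrix K and analyze its eigenvalues,
# even though each input vector has a large dimension p >> n.
# The synthetic data and the Gaussian (RBF) kernel are illustrative choices.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
n, p = 200, 5120                        # p kept moderate so the sketch runs quickly; in practice p can be huge
X = rng.normal(size=(n, p))

K = rbf_kernel(X, gamma=1.0 / p)        # K[i, j] = exp(-gamma * ||x_i - x_j||^2), an n x n matrix
eigvals, eigvecs = np.linalg.eigh(K)    # the eigen-analysis depends only on n, not on p
print(K.shape, eigvals[-5:])            # a few of the largest eigenvalues of the Gram matrix
```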

(iii) Bagging: As observed earlier, it is common with massive data that a single selected model does not lead to optimal prediction. If there is multicollinearity between the variables, which is bound to happen when p is very large, the estimators are unstable and have large variances. Bootstrap aggregation (also called bagging) reduces the variance of the estimators by aggregating bootstrapped versions of the base estimators.
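
The sketch below illustrates bagging on synthetic data using scikit-learn's BaggingRegressor, whose default base learner is a decision tree (a high-variance estimator); the data and the number of bootstrap replicates are illustrative.

```python
# Minimal sketch of bootstrap aggregation (bagging): average many base estimators,
# each fit on a bootstrap resample, to reduce variance. The data are synthetic.
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + 0.3 * rng.normal(size=300)

bagged = BaggingRegressor(
    n_estimators=100,        # number of bootstrapped versions of the base estimator (a decision tree by default)
    bootstrap=True,
    random_state=0,
).fit(X, y)
print(bagged.predict([[0.5]]))
```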

(iv) Parallelization: When the computational complexity of building the base learner is high, the method of bagging becomes inefficient and impractical. One way to avoid this problem is to use parallel processing. Big Data analytics needs parallel processing, or parallelization, to speed up computation or to handle massive data that cannot fit into a single computer's memory. One way to make statistical procedures more efficient in the analysis of Big Data is to parallelize them, that is, to write algorithms that can run on many computers or many processors at the same time. The bootstrap is a standard method for inferring the probability distribution from a sample. It is computationally intensive, but it is ideally suited to parallelization because it involves generating numerous independent rounds of simulated data. One can use the “Bag of Little Bootstraps” (BLB), which generates results comparable to the regular bootstrap but much faster.
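
Below is a rough sketch of the Bag of Little Bootstraps run in parallel with Python's ProcessPoolExecutor; the statistic (the mean of the data), the subset size b = n^0.6, and the numbers of subsets and resamples are illustrative choices, not prescriptions from the original notes.

```python
# Minimal sketch of the Bag of Little Bootstraps (BLB) for the standard error of the mean,
# with the independent subset computations run in parallel.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=100_000)
n, b, s, r = data.size, int(data.size ** 0.6), 10, 50   # full size, subset size, #subsets, #resamples

def little_bootstrap(subset):
    """Resample a small subset up to full size n (via multinomial weights) r times
    and return the standard error of the mean estimated from that subset."""
    local_rng = np.random.default_rng()
    means = []
    for _ in range(r):
        weights = local_rng.multinomial(n, [1.0 / len(subset)] * len(subset))
        means.append(np.dot(weights, subset) / n)
    return np.std(means)

if __name__ == "__main__":
    subsets = [rng.choice(data, size=b, replace=False) for _ in range(s)]
    with ProcessPoolExecutor() as pool:                  # each subset is processed independently, in parallel
        errors = list(pool.map(little_bootstrap, subsets))
    print("BLB estimate of the standard error of the mean:", np.mean(errors))
```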

(v) Regularization: With large p and small n, there exists a multiplicity of solutions for any optimization problem involving Big Data, and hence the problem becomes ill-posed. Regularization methods are used to find a feasible optimal solution, and one method of regularization is a Lagrangian formulation of a constrained version of the problem. The LASSO (Tibshirani (1996)) is one such method in high-dimensional data analysis.
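
A minimal LASSO sketch follows, fitting a regularized regression with scikit-learn on synthetic data with p > n; the penalty level alpha and the data-generating setup are illustrative.

```python
# Minimal sketch: the LASSO as a regularized (Lagrangian) formulation of least squares when p > n.
# The synthetic data and the penalty level alpha are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 50, 200                               # more parameters than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]       # only a few coefficients are truly nonzero
y = X @ beta + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)           # minimizes ||y - Xb||^2 / (2n) + alpha * ||b||_1
print("nonzero coefficients:", np.flatnonzero(lasso.coef_))
```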

(vi) Assumption of sparsity: As described earlier, thousands of irrelevant parameters will appear to be statistically significant if we use small-data statistics for Big Data. In classical statistics, if the data imply the occurrence of an event that has a one-in-a-million chance of occurring, then we are confident it did not happen by chance and hence consider it statistically significant. But if we are considering Big Data with a large number of parameters, it is possible for such an event to occur by chance and not because of a significant relationship. Most data sets have only a few strong relationships between variables, and everything else is noise; thus most of the parameters do not matter. This leads to the sparsity assumption, namely that all but a few parameters are negligible, which provides a way of extracting information from Big Data. One such method is L1-minimization, the LASSO of Tibshirani (1996). It has been used in image processing to extract an image in sharp focus from blurry or noisy data.
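
The sketch below illustrates the sparsity idea through L1-minimization by iterative soft-thresholding (ISTA), a standard algorithm for LASSO-type recovery problems; the measurement operator, the sparsity level, the step size, and the penalty are all illustrative assumptions.

```python
# Minimal sketch of L1-minimization via iterative soft-thresholding (ISTA),
# the kind of sparse recovery underlying LASSO-type denoising and deblurring.
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink toward zero and set small entries exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(5)
n, p, k = 80, 256, 8                          # few observations, many parameters, only k of them nonzero
A = rng.normal(size=(n, p)) / np.sqrt(n)
x_true = np.zeros(p)
x_true[rng.choice(p, k, replace=False)] = 5.0 * rng.normal(size=k)
y = A @ x_true + 0.01 * rng.normal(size=n)

lam, step = 0.05, 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(p)
for _ in range(500):                          # ISTA iterations: gradient step followed by soft-thresholding
    x = soft_threshold(x - step * A.T @ (A @ x - y), step * lam)
print("recovered support:", np.flatnonzero(np.abs(x) > 1e-3))
```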

(vii) False Discovery Rate (FDR): Another technique applied in the analysis of Big Data, especially in genome and neuroimaging research, is the false discovery rate (FDR) suggested by Benjamini and Hochberg (1995). If a study finds 20 locations in the human genome with a statistically significant association with cancer and it has a false discovery rate of ten percent, then we can expect two of the 20 discoveries to be false on average. The FDR does not indicate which discoveries are spurious, but that can sometimes be determined by a follow-up study.
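
A short sketch of the Benjamini-Hochberg step-up procedure applied to simulated p-values is given below; the FDR level q = 0.10 and the simulated mixture of null and non-null tests are illustrative.

```python
# Minimal sketch of the Benjamini-Hochberg procedure for controlling the false discovery rate at level q.
import numpy as np

def benjamini_hochberg(pvalues, q=0.10):
    """Return the indices of the hypotheses rejected at FDR level q."""
    p = np.asarray(pvalues)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m       # step-up thresholds q * i / m
    passing = np.nonzero(p[order] <= thresholds)[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    return order[: passing.max() + 1]              # reject everything up to the largest passing p-value

rng = np.random.default_rng(6)
null_p = rng.uniform(size=980)                      # most tests are null ...
signal_p = rng.uniform(size=20) * 1e-4              # ... a few carry a real signal
discoveries = benjamini_hochberg(np.concatenate([null_p, signal_p]), q=0.10)
print("number of discoveries:", discoveries.size)
```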

(viii) The problem of “Big n, Big p, Little t”: The speed at which one can process data is an important element in analyzing Big Data. Classical statistics was always done in an off-line mode: the data size was small and the time for analysis was essentially unlimited. In the era of Big Data, however, things are different. For a web company that is trying to predict user reaction and elicit user behaviour, such as clicking on an advertisement sponsored by a client, time is important; the company might have only milliseconds to decide how to respond to a given user’s click. Furthermore, the model constantly has to change to adapt to new users and new products. The objective of the person analyzing the data may not be to deliver a perfect answer but to deliver a good answer fast.
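
As a rough illustration of answering quickly while adapting the model continuously, the sketch below updates a linear classifier incrementally with scikit-learn's SGDClassifier.partial_fit on simulated streaming batches; the streaming setup is hypothetical, and the loss name "log_loss" assumes a recent scikit-learn release.

```python
# Minimal sketch: update a click-prediction model incrementally so predictions stay fast
# and the model can adapt to new data, instead of refitting from scratch. Data are simulated.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(7)
model = SGDClassifier(loss="log_loss")               # logistic regression fit by stochastic gradient descent

for batch in range(100):                             # each batch mimics newly arriving click data
    X = rng.normal(size=(64, 20))
    y = (X[:, 0] + 0.5 * rng.normal(size=64) > 0).astype(int)
    model.partial_fit(X, y, classes=[0, 1])          # cheap incremental update, no full refit

x_new = rng.normal(size=(1, 20))
print("predicted click probability:", model.predict_proba(x_new)[0, 1])
```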

(ix) Privacy and Confidentiality: How can privacy and confidentiality be maintained in the era of Big Data? Public concern about privacy, confidentiality, and the misuse and abuse of individual data is a serious issue in the collection of Big Data. There are ways of masking Big Data. One way is to anonymize the records after they are collected, either by adding random noise or by matrix masking of the data matrix through a known mathematical operation, so that individual information is difficult to retrieve. Cryptography is another discipline that applies mathematical transformations to data that are either irreversible, reversible only with a password, or reversible only at an expense so great that an opponent can ill afford to pay it.
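
The sketch below illustrates the two masking ideas just mentioned, noise addition and matrix masking, on synthetic records; the noise scale and the masking matrix are illustrative and do not by themselves constitute a formal privacy guarantee.

```python
# Minimal sketch of two masking ideas: adding random noise to records, and matrix masking
# of the data matrix by a known mathematical operation. The records are synthetic.
import numpy as np

rng = np.random.default_rng(8)
data = rng.normal(loc=50_000, scale=10_000, size=(1_000, 5))   # stand-in for individual-level records

# (a) Noise addition: perturb each record; aggregate quantities remain roughly unchanged
noisy = data + rng.laplace(scale=500.0, size=data.shape)
print("true mean vs masked mean:", data.mean(axis=0)[0], noisy.mean(axis=0)[0])

# (b) Matrix masking: release B @ data for a known matrix B that mixes rows together,
#     so that individual rows are hard to recover from the released product alone
B = rng.normal(size=(100, 1_000)) / np.sqrt(1_000)
released = B @ data
print("released shape:", released.shape)
```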


Reference: B.L.S. Prakasa Rao, Brief Notes on BIG DATA: A Cursory Look.
