Tuesday, October 4, 2016

Masters Engineering Programs on BIG DATA in INDIA



Recent years have witnessed tremendous technological growth in managing, organizing and harnessing the power of large-scale data. Big Data and data analytics play an important role in shaping the vision of every sphere of life. Hardly any human institution remains untouched by big data and analytics. Business, government, healthcare, education and society as a whole derive insights from historical data. Analytics helps in identifying potential opportunities as well as predicting possible futures.
Big Data is about providing efficient technological “solution stacks” to organize and access large scale data. Data Analytics is about combining principles and techniques from mathematics, computer science and machine learning to predict possibilities as well as to prescribe actions.

Several universities, among them Anna University, VIT University and Manipal University, have started Master's engineering programmes in Big Data Analytics:

Anna University - M.E. Computer Science and Engineering with specialization in Big Data Analytics

VIT University - M.Tech. Computer Science and Engineering with specialization in Big Data Analytics

Manipal University - M.E. in Big Data and Data Analytics

Reva University - M. Tech. in Data Engineering and Cloud Computing

Saturday, October 1, 2016

R Programming Language



R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows and Mac. The language was named R partly after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka), and partly as a play on the name of the Bell Labs language S.
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

The R environment

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes
  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either on-screen or on hardcopy, and
  • a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.
Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.
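Below is a minimal sketch of an interactive R session illustrating the facilities listed above (data handling, matrix calculation, a user-defined function, graphics and package extension). The small data frame, the values and the package name are made up purely for illustration.

# Data handling: a small data frame (in practice this might come from read.csv()).
scores <- data.frame(student = c("A", "B", "C", "D"),
                     mark    = c(67, 82, 74, 91))
summary(scores$mark)                        # basic descriptive statistics

# Calculations on arrays and matrices.
m <- matrix(1:9, nrow = 3)
m %*% t(m)                                  # matrix multiplication

# A user-defined recursive function (conditionals, recursion).
fact <- function(n) if (n <= 1) 1 else n * fact(n - 1)
fact(5)                                     # returns 120

# Graphics: a simple plot with a mathematical annotation in the title.
plot(scores$mark, type = "b", ylab = "Mark",
     main = expression(bar(x) == sum(x[i], i == 1, n) / n))

# Extending R via packages from CRAN (commented out; requires network access).
# install.packages("ggplot2")
# library(ggplot2)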

Code Editors for R

Several excellent code editors are available that provide functionalities like R syntax highlighting, auto code indenting and utilities to send code/functions to the R console.


METHODS OF HANDLING BIG DATA


(i) Dimension reduction: Dimensionality reduction involves determining the intrinsic dimensionality q of the input space, where q << p. This can be done by orthogonalization techniques on the input space, which reduce the problem to a lower-dimensional, orthogonal input space and lead to variance reduction for the estimator. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are standard methods for dimensionality reduction. However, if p >> n, then most of these techniques cannot be used directly.
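A minimal R sketch of this idea on simulated data, using base R's prcomp() and svd(); the data, the choice of q and the variable names are illustrative assumptions only.

# Dimension reduction of an n x p input matrix X with PCA / SVD (simulated data).
set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), nrow = n)          # n observations in p dimensions

# Principal Component Analysis: rotate to orthogonal directions of maximum variance.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)                                  # proportion of variance per component

# Keep only the first q components (the intrinsic dimension, q << p).
q <- 3
X_reduced <- pca$x[, 1:q]                     # n x q matrix of component scores

# The same reduction via the Singular Value Decomposition X = U D V'.
s <- svd(scale(X))
X_reduced_svd <- s$u[, 1:q] %*% diag(s$d[1:q])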

(ii) Kernelization: In applications such as signal processing, it is typically the case that p >> n in the time domain. Audio sampled at 44100 Hz generates 44100 samples per second in the time domain, so even a short clip yields a very high-dimensional vector, while one usually has only a few hundred or maybe a thousand (= n) tracks for analysis. In image processing there are similar problems of dimensionality, with a face image of size 640 × 512 generating a p = 327680-dimensional input space. In both these cases it is not possible to use PCA or SVD because p >> n. Here one uses the method of kernelization. Given a data set with n input vectors x_i in X from some p-dimensional space, the main component of kernelization is a bivariate function K(·, ·) defined on X × X with values in R. The matrix K given by
  K(x_1, x_1)  ...  K(x_1, x_n)
  K(x_2, x_1)  ...  K(x_2, x_n)
      ...      ...      ...
  K(x_n, x_1)  ...  K(x_n, x_n)

is called a Gram matrix. The Gram matrix is of order n × n and does not depend on p. One can compute the eigenvalues and eigenvectors of the matrix K, which has the lower dimension n, and analyze the data.
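A short R sketch of this construction, using a Gaussian (RBF) kernel as an illustrative choice of K; the simulated data and the bandwidth sigma are assumptions made for illustration.

# Build an n x n Gram matrix with a Gaussian (RBF) kernel, then work in dimension n.
set.seed(1)
n <- 100; p <- 5000                       # p >> n, as in audio or image data
X <- matrix(rnorm(n * p), nrow = n)

sigma <- 1 / p                            # assumed bandwidth of the RBF kernel
D2 <- as.matrix(dist(X))^2                # n x n matrix of squared distances
K  <- exp(-sigma * D2)                    # Gram matrix: K[i, j] = K(x_i, x_j)

dim(K)                                    # n x n: does not depend on p
eig <- eigen(K, symmetric = TRUE)         # spectral analysis in dimension n
head(eig$values)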

(iii) Bagging: As observed earlier, it is common with massive data that a single selected model does not lead to optimal prediction. If there is multi-collinearity between the variables, which is bound to happen when p is very large, the estimators are unstable and have large variances. Bootstrap aggregation (also called bagging) reduces the variance of the estimators by aggregating bootstrapped versions of the base estimators.
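A minimal R sketch of bagging, assuming a simple linear model as the base learner on simulated data; the data, grid and number of replicates are illustrative choices.

# Bagging: average the predictions of a base learner fitted to bootstrap resamples.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
train   <- data.frame(x = x, y = y)
newdata <- data.frame(x = seq(-2, 2, length.out = 50))

B <- 200                                        # number of bootstrap replicates
preds <- sapply(1:B, function(b) {
  idx <- sample(n, replace = TRUE)              # bootstrap resample of the rows
  fit <- lm(y ~ x, data = train[idx, ])         # refit the base estimator
  predict(fit, newdata = newdata)               # predict on a fixed grid
})

bagged <- rowMeans(preds)                       # aggregated, variance-reduced prediction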

(iv) Parallelization: When the computational complexity of building the base learner is high, the method of bagging becomes inefficient and impractical. One way to avoid this problem is to use parallel processing. Big Data analytics needs parallel processing, or parallelization, to speed up computation and to handle massive data that cannot fit into a single computer's memory. One way to make statistical procedures more efficient in the analysis of Big Data is to parallelize them, that is, to write algorithms that can run on many computers or many processors at the same time. The "Bootstrap" is a standard method for inferring the probability distribution from a sample. It is computationally intensive, but it is ideally suited to parallelization because it involves generating numerous independent rounds of simulated data. One can use the "Bag of Little Bootstraps" (BLB), which generates results comparable to the regular bootstrap but much faster.
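A sketch of distributing bootstrap replicates across cores with R's base 'parallel' package. This is a plain parallel bootstrap, not the full Bag of Little Bootstraps algorithm (which additionally resamples within small subsets); the data are simulated for illustration.

# Parallel bootstrap of a sample mean using multiple cores.
library(parallel)

set.seed(1)
x <- rexp(10000)                                    # sample whose mean we bootstrap

boot_once <- function(i) mean(sample(x, replace = TRUE))

B <- 2000
cores <- max(1, detectCores() - 1)
# mclapply forks on Unix-like systems; on Windows use parLapply with a cluster.
est <- unlist(mclapply(1:B, boot_once, mc.cores = cores))

quantile(est, c(0.025, 0.975))                      # bootstrap confidence interval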

(v) Regularization: With large p and small n, there exists a multiplicity of solutions for any optimization problem involving Big Data, and hence the problem becomes ill-posed. Regularization methods are used to find a feasible optimal solution; one method of regularization is the Lagrangian formulation of a constrained version of the problem. LASSO (Tibshirani (1996)) is one such method in high-dimensional data analysis; a sketch is given below.
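A minimal sketch of the LASSO using the 'glmnet' package (assumed to be installed) on simulated data with p > n. The sparse true coefficient vector is an assumption made for illustration, and the same L1 penalty also illustrates the sparsity assumption discussed in (vi) below.

# L1-regularized (LASSO) regression on a p > n problem with simulated sparse truth.
# install.packages("glmnet")              # external package, assumed available
library(glmnet)

set.seed(1)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), nrow = n)
beta <- c(3, -2, 1.5, rep(0, p - 3))      # only three truly nonzero coefficients
y <- X %*% beta + rnorm(n)

cvfit    <- cv.glmnet(X, y, alpha = 1)    # alpha = 1 gives the LASSO penalty
coef_hat <- coef(cvfit, s = "lambda.min") # coefficients at the selected lambda
sum(coef_hat != 0)                        # number of nonzero coefficients retained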

(vi) Assumption of sparsity: As described earlier, thousands of irrelevant parameters will appear to be statistically significant if we use small-data statistics on Big Data. In classical statistics, if the data imply the occurrence of an event that has a one-in-a-million chance of occurring, then we are confident it did not happen by chance and hence consider it statistically significant. But if we are considering Big Data with a large number of parameters, it is possible for the event to occur by chance and not because of a significant relationship. Most data sets have only a few strong relationships between variables, and everything else is noise; thus most of the parameters do not matter. This leads to the sparsity assumption, which is to assume that all but a few parameters are negligible. This provides a way of extracting information from Big Data. One such method is the L1-minimization called LASSO, due to Tibshirani (1996), illustrated in the sketch above. It has been used in the field of image processing to extract an image in sharp focus from blurry or noisy data.

(vii) False Discovery Rate (FDR): Another technique that is applied in the analysis of Big Data, especially in genome and neuroimaging research, is the false discovery rate (FDR) suggested by Benjamini and Hochberg (1995). If a study finds 20 locations in the human genome with a statistically significant association with cancer and it has a false discovery rate of ten percent, then we can expect two of the 20 discoveries to be false on average. The FDR does not indicate which discoveries are spurious, but that can sometimes be determined by a follow-up study.
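A short sketch of the Benjamini-Hochberg procedure using base R's p.adjust(); the p-values are simulated so that most hypotheses are null and a few carry a genuine effect.

# Controlling the false discovery rate with the Benjamini-Hochberg procedure.
set.seed(1)
m <- 10000
p_null   <- runif(m - 50)                  # 9,950 true null hypotheses
p_signal <- runif(50, 0, 1e-4)             # 50 genuine effects with small p-values
pvals <- c(p_signal, p_null)

qvals <- p.adjust(pvals, method = "BH")    # BH-adjusted p-values
discoveries <- which(qvals < 0.10)         # declared findings at a 10% FDR
length(discoveries)                        # expected false fraction is controlled at 10%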

(viii) The problem of “Big n, Big p, Little t”: The speed at which one can process data is an important element in analyzing Big Data. Classical statistics was always done in an off-line mode, the data size was small and the time for analysis was essentially unlimited. In the era of Big Data things are different. For a web company trying to predict user reactions and elicit user behaviour, such as clicking on an advertisement sponsored by a client, time is critical. The company might have only milliseconds to decide how to respond to a given user's click. Furthermore, the model has to change constantly to adapt to new users and new products. The objective of the analyst may not be to deliver a perfect answer but to deliver a good answer fast.

(ix) Privacy and Confidentiality: How does one preserve privacy and confidentiality in the era of Big Data? Public concerns about privacy, confidentiality and the misuse and abuse of individual data are a major issue in the collection of Big Data. There are ways of masking Big Data. One way is to anonymize records after they are collected, either by adding random noise or by matrix masking of the data matrix through a known mathematical operation, so that individual information is difficult to retrieve. Cryptography is another discipline that applies mathematical transformations to data that are irreversible, reversible only with a password, or reversible only at a cost an opponent can ill afford to pay.
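A toy R sketch of the two masking ideas just mentioned; the noise level and the masking matrices are illustrative assumptions only, not a recipe for real disclosure control.

# Two simple masking ideas applied to a small toy data matrix.
set.seed(1)
X <- matrix(rnorm(20), nrow = 5)            # 5 records, 4 attributes

# (a) Additive noise: perturb each entry with independent random noise.
X_noisy <- X + matrix(rnorm(length(X), sd = 0.5), nrow = nrow(X))

# (b) Matrix masking: release A %*% X %*% B for known transformation matrices A, B.
A <- matrix(rnorm(25), 5, 5)
B <- matrix(rnorm(16), 4, 4)
X_masked <- A %*% X %*% B                   # individual records are hard to recover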


Reference: B.L.S. Prakasa Rao, Brief Notes on BIG DATA: A Cursory Look.

SOME OF THE LINUX COMPANIES IN KERALA



1.Quintet solutions

2.Vipoint solutions

3.Spark support

4.Ideamine technologies

5.Armia

6.Admin ahead

7.Admod technologies

8.Syntrio technologies

9.N dimensionz

10.On mobile

11.Hash root

12.Hashcod

13.BVS technologies

14.Rmesi

15.X Minds

16.OOPS Matrix (Denoct)

17.KSWAN

18.Vanilla networks

19.Sequires

20.Aigensolutions

21.Xieles

22.Webhostrepo

23.Supportsages

24.Servadm

25.Logicsupport

26.Bobcares

27.Bigserversolutions

28.Cliffsupport

29.Liquidsupport

30.Supportlobby

31.Best value Support

32.Supportresort

33.Asteriskssoft

34.Igloo

35.Takira solutions

36.Active Lobby

Big data analytics

Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. With today's technology, it is possible to analyze your data and get answers from it almost immediately, an effort that is slower and less efficient with more traditional business intelligence solutions.

Why is big data analytics important?

Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers. IIA Director of Research Tom Davenport interviewed more than 50 businesses to understand how they used big data. He found that they got value in the following ways:
  1. Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data – plus they can identify more efficient ways of doing business.
  2. Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new sources of data, businesses are able to analyze information immediately – and make decisions based on what they’ve learned.
  3. New products and services. With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want. Davenport points out that with big data analytics, more companies are creating new products to meet customers’ needs.

The most important research topics in the Big Data field

Here are the major research fields where Big Data is involved:

1) Improving data analytic techniques - Gather all the data, filter it against relevant constraints, and use it to make confident decisions.

2) Natural Language Processing methods - Apply NLP techniques to Big Data to identify the current sentiment trend; this can be used in business, politics, finance, etc.

3) Big Data tools and deployment platforms - Conventional tools are inefficient at handling Big Data; a lot of research is needed in this area.

4) Better data mining techniques - Data mining is the method of gathering data from various platforms. Improved distributed crawling techniques and algorithms are needed to scrape data from multiple platforms.

5) Algorithms for data visualization - In order to visualize the required information from a pool of random data, powerful algorithms are crucial for accurate results.

6) Lots more...


Here are some research topics that might be relevant to healthcare and Big Data:
  1. Sentiment analysis
  2. Live drug response analysis
  3. Heterogeneous information integration at large volume of data
  4. Security and privacy issues related to healthcare information exchange.
  5. Metadata management
  6. Information retrieval tools for efficient data searching.
  7. Fraud detection