Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed,
scalable, Java-based file system adept at storing large volumes of unstructured
data.
MapReduce: MapReduce is a software framework that serves as the
compute layer of Hadoop. MapReduce jobs are divided into two parts, each named
for what it does. The “Map” function splits a job into multiple parts and
processes data at the node level. The “Reduce” function aggregates the results
of the “Map” function to determine the “answer” to the query.
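The two phases described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are made up for this example, and everything runs in a single process rather than across nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    pairs = []
    for chunk in chunks:  # on a cluster, each chunk would be mapped on its own node
        pairs.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big", "data lake"])
print(counts)  # {'big': 2, 'data': 2, 'lake': 1}
```

The map step never needs to see the whole input, which is what lets the work be spread across many machines; only the reduce step brings the partial results together.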
Hive: Hive is a Hadoop-based, data-warehouse-like framework
originally developed by Facebook. It allows users to write queries in a
SQL-like language called HiveQL, which are then converted to MapReduce jobs. This
allows SQL programmers with no MapReduce experience to use the warehouse and
makes it easier to integrate with business intelligence and visualization tools
such as MicroStrategy, Tableau and Revolution Analytics.
Pig: Pig is a high-level scripting platform for Apache Hadoop,
originally developed by Yahoo. It is relatively easy to learn and enables
data workers to write complex data transformations without knowing Java.
Pig's simple SQL-like scripting language is called Pig Latin, and it appeals
to developers already familiar with scripting languages and SQL.
HBase: HBase is a non-relational database that allows for
low-latency lookups in Hadoop. It adds transactional capabilities to
Hadoop, allowing users to perform updates, inserts and deletes. eBay and
Facebook use HBase heavily.
Flume: Apache Flume is a data ingestion service for collecting,
aggregating and transporting large amounts of streaming data, such as log
data and events, from various web servers to a centralized data store. It is
a highly reliable, distributed, and configurable tool that is principally
designed to transfer streaming data from various sources to HDFS.
Oozie: Oozie is a workflow processing system that lets users
define a series of jobs written in multiple languages, such as MapReduce, Pig
and Hive, then intelligently link them to one another. Oozie allows users to
specify, for example, that a particular query is only to be initiated after
the previous jobs on which it relies for data have completed.
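The dependency rule described above, that a job starts only after the jobs it relies on have finished, amounts to ordering a graph of jobs. Here is a minimal Python sketch of that idea using the standard library (Python 3.9 or later). It is not Oozie's API, since Oozie workflows are defined in XML, and the job names and the run_order helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "load_users": set(),
    "join_report": {"clean_logs", "load_users"},
}


def run_order(jobs):
    """Return one valid order in which every job follows its dependencies."""
    return list(TopologicalSorter(jobs).static_order())


order = run_order(workflow)
print(order)
```

More than one order can be valid here, because import_logs and load_users do not depend on each other. The only guarantee is that join_report comes after both of the jobs that feed it.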
Ambari: Ambari is a web-based set of tools for deploying,
administering and monitoring Apache Hadoop clusters. Its development is being
led by engineers from Hortonworks, which includes Ambari in its Hortonworks
Data Platform.
Avro: Avro is a data serialization system that allows for
encoding the schema of Hadoop files. It is adept at parsing data and performing
remote procedure calls.
Mahout: Apache Mahout is a project of the Apache Software
Foundation to produce free implementations of distributed or otherwise scalable
machine learning algorithms, focused primarily on the areas of collaborative
filtering, clustering and classification. Many of the implementations use the
Apache Hadoop platform.
Sqoop: Sqoop is a connectivity tool for moving data from
non-Hadoop data stores – such as relational databases and data warehouses –
into Hadoop. It allows users to specify the target location inside of Hadoop
and instruct Sqoop to move data from Oracle, Teradata or other relational
databases to the target.
HCatalog: HCatalog is a centralized metadata management and sharing
service for Apache Hadoop. It allows for a unified view of all data in Hadoop
clusters and allows diverse tools, including Pig and Hive, to process any data
elements without needing to know physically where in the cluster the data is stored.
BigTop: BigTop is an effort to create a more formal process or
framework for packaging and interoperability testing of Hadoop's sub-projects
and related components, with the goal of improving the Hadoop platform as a whole.
R: R is a programming language and software environment for statistical
analysis, graphical representation and reporting. R was created by Ross
Ihaka and Robert Gentleman at the University of Auckland, New Zealand,
and is currently developed by the R Development Core Team. R is freely
available under the GNU General Public License, and pre-compiled binary
versions are provided for various operating systems such as Linux, Windows
and Mac. The language was named R partly after the first letter of the first
names of its two authors (Robert Gentleman and Ross Ihaka), and partly as a
play on the name of the Bell Labs language S.
YARN: Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.
YARN is one of the key features in the second-generation Hadoop 2
version of the Apache Software Foundation's open source distributed
processing framework. Originally described by Apache as a redesigned
resource manager, YARN is now characterized as a large-scale,
distributed operating system for big data applications.