Hadoop

My boss has been encouraging me to learn some new technologies related to my interest in database development, and he mentioned that Hadoop might be a good candidate. Hadoop is a set of open source utilities for solving problems that involve massive amounts of data and computation, and it relies on distributed storage implemented by HDFS (the Hadoop Distributed File System).

The programming model used by Hadoop is MapReduce. Data files are split into large blocks, and those blocks are distributed across the nodes of the cluster. The code is also shipped to those nodes, so the data is processed in parallel right where it is stored. The result is faster than conventional computing on a single machine, mainframes included, because the code moves to the data rather than the data moving to the code. One of the biggest users of the tech has been Yahoo, for its search.
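To get a feel for the model, here is a minimal sketch of the usual word-count job written against the Hadoop Java MapReduce API (the class name and the input/output paths are just placeholders). The mapper emits a (word, 1) pair for every word it sees, and the reducer sums those counts after the shuffle.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits a (word, 1) pair for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word after the shuffle.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job would be packaged into a jar and launched with something like `hadoop jar wordcount.jar WordCount /input /output`.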

Hadoop is written mainly in Java, with some of it in C and some shell scripts involved. You access it through a Java API, but any programming language can use it via the Thrift API or some other third-party library. It does require the JRE. Files are typically gigabytes or terabytes in size. It does not require RAID, because data replication is built in. Usage is for immutable files; concurrent writes are not supported.
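To make the replication point concrete, here is a small sketch that uses the Java FileSystem API to ask HDFS how many copies of a file it keeps and which nodes hold its blocks. The file path is whatever you pass on the command line; nothing here is specific to a particular cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(args[0]);             // any existing HDFS file, e.g. /data/input.log
    FileStatus status = fs.getFileStatus(file);

    // Every HDFS file carries its own replication factor; 3 is the common default.
    System.out.println("Replication: " + status.getReplication());
    System.out.println("Block size:  " + status.getBlockSize() + " bytes");

    // Each block is stored on several DataNodes, which is why RAID is not needed.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("Block at offset " + block.getOffset()
          + " lives on " + String.join(", ", block.getHosts()));
    }
  }
}
```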

HDFS is a distributed, scalable, and portable file system, also written in Java. It acts as the data store for the cluster. It is not POSIX compliant, which is part of how it achieves its speed. It has a Java API of its own, and there are also shell commands for interacting with it.
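Here is a minimal sketch of writing and then reading a file through that Java API; the /tmp/hello.txt path is just an example. The shell equivalents would be commands like `hdfs dfs -put` and `hdfs dfs -cat`.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and the rest of the cluster settings from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");   // example path

    // Write once; HDFS files are effectively immutable after they are closed.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("hello from HDFS\n");
    }

    // Read it back through the same API.
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}
```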

Apache has many packages that run on top of Hadoop. These include Pig, Hive, HBase, Spark, and Oozie.