All posts for the month March, 2012

Try MongoDB

MongoDB is an open source document-oriented NoSQL database system.

MongoDB is part of the “new” NoSQL family of database systems. Instead of storing data in tables, as a “classical” relational database does, MongoDB stores structured data as JSON-like documents with dynamic schemas, which makes integrating data easier and faster in certain types of applications.
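To make “dynamic schemas” concrete, here is a minimal sketch that uses only Python's standard library and no MongoDB driver. The collection is modeled as a plain list, and the field names (`name`, `email`, `tags`) and the `find` helper are hypothetical, chosen only for illustration.

```python
import json

# A "collection" modeled as a plain Python list (an assumption for
# illustration; a real MongoDB collection lives on the database server).
users = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "Luis", "tags": ["admin", "beta"]},  # different fields, same collection
]


def find(collection, **criteria):
    """Return the documents whose fields equal all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


matches = find(users, name="Luis")
print(json.dumps(matches))
```

The point of the sketch is that the two documents do not share the same set of fields, yet they sit in the same collection and can be queried the same way.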

This 3-minute tutorial lets you scratch the surface of MongoDB. Click here

Graph processing platforms that run large-scale algorithms (such as PageRank, shared connections, personalization-based popularity, etc.) have become quite popular. Some recent examples include Pregel and HaLoop. For general-purpose big data computation, the map-reduce computing model has been widely adopted, and the most deployed map-reduce infrastructure is Apache Hadoop. Apache Giraph implements a graph-processing framework that is launched as a typical Hadoop job, so it can leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph builds upon the graph-oriented nature of Pregel and additionally adds fault tolerance to the coordinator process by using ZooKeeper as its centralized coordination service.
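The Pregel-style model that Giraph builds on can be sketched in a few lines of plain Python. This is a single-machine illustration of the vertex-centric idea applied to PageRank, not Giraph's API: the function name, the tiny sample graph, and the parameters `supersteps` and `damping` are assumptions, and the sketch assumes every vertex has at least one outgoing edge.

```python
def pagerank(graph, supersteps=20, damping=0.85):
    """Vertex-centric PageRank; graph maps each vertex to its out-neighbors."""
    n = len(graph)
    rank = {vertex: 1.0 / n for vertex in graph}
    for _ in range(supersteps):
        # Message phase: each vertex sends rank / out_degree to its neighbors.
        inbox = {vertex: 0.0 for vertex in graph}
        for vertex, neighbors in graph.items():
            for target in neighbors:
                inbox[target] += rank[vertex] / len(neighbors)
        # Compute phase: each vertex updates its value from received messages.
        rank = {
            vertex: (1 - damping) / n + damping * inbox[vertex]
            for vertex in graph
        }
    return rank


# Hypothetical three-vertex graph; every vertex has an outgoing edge.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

In a real deployment each superstep would run in parallel across workers, with the messages travelling over the network between them.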

This video shows a tutorial about Apache Giraph.

Big companies built on “data as a source of value”, like Google, Yahoo, Facebook and LinkedIn, require very large-scale data processing in big data centers. For this purpose, there is a set of stacked technologies and tools that together form a complete ecosystem for efficient data processing. The following tutorial shows the internals of these companies and gives an overview of the different layers used in their ecosystems.

See the Google, Yahoo, Facebook, LinkedIn and Cloudera ecosystems in the following link.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
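As an illustration of the “custom mappers” mentioned above: Hive can stream rows through an external script, handing them over as tab-separated text lines. The sketch below shows what the body of such a mapper could look like in Python. The two-column layout (`user_id`, `url`) and the function name `extract_domain` are hypothetical, and the sample rows stand in for the standard input that a real script would read.

```python
def extract_domain(lines):
    """Map each 'user_id<TAB>url' row to 'user_id<TAB>domain'."""
    for line in lines:
        user_id, url = line.rstrip("\n").split("\t")
        # Drop the scheme, then keep everything before the first '/'.
        domain = url.split("://", 1)[-1].split("/", 1)[0]
        yield f"{user_id}\t{domain}"


# Sample rows (made up for this sketch); a real mapper script would
# iterate over sys.stdin and print each result line instead.
sample = ["1\thttp://example.com/a/b\n", "2\thttps://hive.apache.org/\n"]
output = list(extract_domain(sample))
print(output)
```

Logic like this string manipulation is the kind of step that can be awkward to express in HiveQL alone, which is where a plugged-in script becomes convenient.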

See the Apache Hive Cloudera Tutorial for a really easy-to-understand introduction.

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig’s language layer currently consists of a textual language called Pig Latin, which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
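The “data flow sequence” idea from the list above can be pictured with a small plain-Python analogy. This is not Pig Latin and does not run on Hadoop; the sample records and variable names are made up, and each step only mirrors the kind of operation (filter, group, count) that a Pig script would chain together.

```python
from collections import defaultdict

# Hypothetical input records: (user, page, HTTP status).
records = [
    ("alice", "/home", 200),
    ("bob", "/home", 404),
    ("alice", "/cart", 200),
    ("carol", "/home", 200),
]

# Step 1 (filter): keep only the successful requests.
ok = [record for record in records if record[2] == 200]

# Step 2 (group): bucket the remaining users by page.
groups = defaultdict(list)
for user, page, status in ok:
    groups[page].append(user)

# Step 3 (aggregate): count the hits per page.
hits = {page: len(users) for page, users in groups.items()}
print(hits)
```

Because each step depends only on the output of the previous one, a system like Pig is free to decide how to split and parallelize the work behind the scenes.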

See the following Cloudera tutorial to discover more about its features: Apache Pig Cloudera Tutorial