All posts for the month April, 2012

Apache Stanbol

Apache Stanbol components are meant to be accessed over RESTful interfaces to provide semantic services for content management. Thus, one application is to extend traditional content management systems with (internal or external) semantic services. Additionally, Apache Stanbol lets you create new types of content management systems with semantics at their core. The current code is written in Java and based on the OSGi component framework.

Apache Stanbol’s main features are:

  • Content Enhancement
    Services that add semantic information to “non-semantic” pieces of content.
  • Reasoning
    Services that derive additional semantic information about the content from the information produced by content enhancement.
  • Knowledge Models
    Services that are used to define and manipulate the data models (e.g. ontologies) that are used to store the semantic information.
  • Persistence
    Services that store (or cache) semantic information, i.e. enhanced content, entities, facts, and make it searchable.

Apache Stanbol’s features provide the basics for creating content management systems with semantically advanced user interfaces. Those user interfaces benefit from the semantic information that Apache Stanbol can handle. See the documentation and usage scenarios pages for more details.
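
To illustrate the RESTful style of access, the following Python sketch builds an enhancement request for a Stanbol enhancer endpoint. The base URL, port and the /enhancer path are assumptions for a default local Stanbol launcher, not something stated above; adjust them for your deployment.

```python
import urllib.request

def build_enhancement_request(text, base_url="http://localhost:8080"):
    """Build a POST request for a Stanbol enhancer endpoint.

    The base URL and the /enhancer path are assumptions for a default
    local Stanbol launcher; adjust them for your deployment.
    """
    req = urllib.request.Request(
        url=base_url + "/enhancer",
        data=text.encode("utf-8"),
        method="POST",
    )
    # The enhancer accepts plain text and can return the
    # enhancement results as RDF.
    req.add_header("Content-Type", "text/plain; charset=utf-8")
    req.add_header("Accept", "application/rdf+xml")
    return req

# The request is only constructed here; actually sending it requires
# a running Stanbol instance, e.g.:
#   with urllib.request.urlopen(build_enhancement_request("...")) as r:
#       rdf = r.read()
```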

Online demos of the basic/stable features of Apache Stanbol are available here and here. An experimental/full version of Apache Stanbol is available here.

Apache Hama

Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms. Currently, it has the following features:

  • Job submission and management interface.
  • Multiple tasks per node.
  • Input/Output Formatter.
  • Checkpoint recovery.
  • Support to run in the Cloud using Apache Whirr.
  • Support to run with Hadoop YARN.

The following slides show a basic introduction to Hama 0.4.
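
To make the BSP model concrete, here is a minimal, sequential Python sketch of a superstep loop with a message barrier. The peer names and the broadcast-maximum example are invented for illustration; real Hama runs peers in parallel across a cluster and coordinates the barrier itself.

```python
def bsp_run(peers, num_supersteps):
    """Toy sequential sketch of the BSP model Hama implements.

    Each peer is a function (superstep, inbox) -> list of
    (dest_peer, message) pairs. Messages sent in one superstep become
    visible to their destination only after the barrier, i.e. in the
    next superstep.
    """
    inboxes = {name: [] for name in peers}
    for superstep in range(num_supersteps):
        outboxes = {name: [] for name in peers}
        # Local computation phase: each peer processes its inbox.
        for name, peer in peers.items():
            for dest, msg in peer(superstep, inboxes[name]):
                outboxes[dest].append(msg)
        # Barrier synchronization: swap in the new inboxes.
        inboxes = outboxes
    return inboxes

# Example: every peer broadcasts its local value, then each peer
# independently computes the global maximum after the barrier.
names = ["p0", "p1", "p2"]
values = {"p0": 3, "p1": 7, "p2": 5}
maxima = {}

def make_peer(name):
    def peer(superstep, inbox):
        if superstep == 0:
            return [(dest, values[name]) for dest in names]
        maxima[name] = max(inbox)
        return []
    return peer

final = bsp_run({n: make_peer(n) for n in names}, 2)
```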


CloudStack

CloudStack is open source software written in Java that is designed to deploy and manage large networks of virtual machines as a highly available, scalable cloud computing platform. CloudStack currently supports the most popular hypervisors: VMware, Oracle VM, KVM, XenServer and Xen Cloud Platform. CloudStack offers three ways to manage cloud computing environments: an easy-to-use web interface, a command-line interface and a full-featured RESTful API.

Key Features

This is a summary of CloudStack’s features:

  • One Cloud, Multiple Hypervisors – With CloudStack, a single cloud deployment can run multiple hypervisor implementations of multiple types. Based on a pluggable architecture, CloudStack software works with a variety of hypervisors including Oracle VM, KVM, vSphere and Citrix XenServer to give customers complete freedom to choose the right hypervisor for their workload.
  • Massively scalable infrastructure management – CloudStack lets you manage tens of thousands of servers across geographically distributed datacenters through a linearly scalable, centralized management server that eliminates the need for intermediate cluster-level management servers. No failure of a single component can cause a cluster- or cloud-wide outage, enabling downtime-free management server maintenance and reducing the workload of managing a large-scale cloud deployment.
  • Easy-To-Use Web Interface – CloudStack makes it simple to manage your cloud infrastructure with a feature-rich user interface implemented on top of the CloudStack API. Fully AJAX-based and compatible with most popular web browsers, the solution can be easily integrated with your existing portal for seamless administration. A real-time view of the aggregated storage, IP pools, CPU, memory and other resources in use gives you better visibility and control over your cloud.
  • Robust RESTful API – CloudStack implements industry-standard APIs on top of a low-level CloudStack API with its own unique and innovative features. Although the CloudStack API is documented, maintained and supported, CloudStack does not assert it as your only option—work is underway to create API adapters that implement Amazon EC2/S3 API and the vCloud API on top of the CloudStack API. Future cloud API standards from bodies such as DMTF will be implemented as they become available.
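
As an illustration of working with the RESTful API, the sketch below signs a request using CloudStack’s documented HMAC-SHA1 scheme (sorted parameters, lowercased query string, Base64-encoded digest). The exact parameter name (here assumed to be apikey) and encoding details should be checked against the API documentation for your CloudStack version.

```python
import base64
import hashlib
import hmac
import urllib.parse

def sign_request(params, api_key, secret_key):
    """Sign a CloudStack-style API request with HMAC-SHA1.

    Returns the full query string including the signature parameter.
    The scheme sketched here: add the API key, sort parameters by
    name, URL-encode the values, lowercase the whole string, compute
    an HMAC-SHA1 digest with the secret key, and Base64-encode it.
    """
    params = dict(params, apikey=api_key)
    # Canonical string: keys sorted alphabetically, values URL-encoded.
    query = "&".join(
        f"{k}={urllib.parse.quote(str(v), safe='')}"
        for k, v in sorted(params.items())
    )
    digest = hmac.new(
        secret_key.encode("utf-8"),
        query.lower().encode("utf-8"),
        hashlib.sha1,
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return query + "&signature=" + urllib.parse.quote(signature, safe="")
```

Because the canonical string is sorted before signing, the same parameters always produce the same signature regardless of the order in which they were supplied.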

The following video gives a visual introduction to using CloudStack.

Apache Kafka

Apache Kafka is a distributed publish-subscribe messaging system. It is designed to support:

  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
  • Support for parallel data load into Hadoop.

Kafka can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features on the modern web. Because of the throughput requirements, this data is typically handled by “logging” and ad hoc log aggregation solutions. Such ad hoc solutions are viable for feeding logging data to an offline analysis system like Hadoop, but are very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.
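
The partitioning and per-partition ordering mentioned above can be illustrated with a toy in-memory model. This is not the Kafka client API, just a sketch of the guarantee: messages with the same key land in the same partition and are seen by consumers in publish order, while no ordering is promised across partitions.

```python
class PartitionedTopic:
    """Toy in-memory model of Kafka-style partitioning.

    A real Kafka broker persists each partition as an append-only log
    on disk; this sketch only models the ordering guarantee.
    """

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, message):
        # All messages with the same key hash to the same partition,
        # so they are appended (and later consumed) in publish order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        return p

    def consume(self, partition, offset=0):
        # Consumers pull from an explicit offset, as in Kafka,
        # which also makes re-reading from an old offset trivial.
        return self.partitions[partition][offset:]
```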

The use for activity stream processing makes Kafka comparable to Facebook’s Scribe or Apache Flume (incubating), though the architecture and primitives are very different for these systems and make Kafka more comparable to a traditional messaging system.

To learn more about the architectural design of Kafka, see Kafka’s design page.

HTML Microdata

In the past, semantic web technologies have not fit well into web pages because there were few ways to integrate ontologies with HTML. The HTML microdata mechanism allows machine-readable data to be embedded in HTML documents in an easy-to-write manner, with an unambiguous parsing model. It is compatible with numerous other data formats, including RDF and JSON. It can attach concepts of a given taxonomy/ontology to words, paragraphs or complete HTML pages in a machine-readable format, allowing for further inference.
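
As a minimal illustration of consuming microdata, the sketch below reads itemscope/itemprop annotations with Python’s standard html.parser. It handles only flat, non-nested items (a full implementation would also cover nested itemscopes, itemref, and attributes such as href that carry the value for some elements), and the schema.org vocabulary in the sample is just one possible choice of taxonomy.

```python
from html.parser import HTMLParser

class MicrodataExtractor(HTMLParser):
    """Very small microdata reader for flat (non-nested) items."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._prop = None  # itemprop waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # A new item; itemtype names its vocabulary term.
            self.items.append({"type": attrs.get("itemtype"),
                               "properties": {}})
        if "itemprop" in attrs:
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._prop and self.items and data.strip():
            self.items[-1]["properties"][self._prop] = data.strip()
            self._prop = None

html_doc = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Tim Berners-Lee</span>
  works at <span itemprop="affiliation">W3C</span>.
</div>
"""
parser = MicrodataExtractor()
parser.feed(html_doc)
```

Note how the annotations stay invisible to ordinary rendering: the page still reads as plain prose, while a machine recovers a typed item with named properties.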

If you want to read the current draft, see the HTML5 microdata W3C specification.