Sunday, August 12, 2012

Map/Reduce Input and Output



The Map/Reduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

Input and Output types of a Map/Reduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
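
To make the type flow concrete, here is a minimal plain-Java sketch of a word count that mimics the map, group, and reduce stages. It deliberately does not use the Hadoop API: KeyValueFlow, map, reduce, and run are hypothetical names, and String and Integer stand in for the Writable key and value types a real job would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyValueFlow {

    // map: <k1 = line offset, v1 = line text> -> list of <k2 = word, v2 = 1>
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce (also usable as a combiner here): <k2, list of v2> -> v3 = total for that word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Runs map, groups values by key in sorted key order, then reduces each group
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(List.of("hello world", "hello hadoop")));
    }
}
```

The sorted TreeMap is only a stand-in for the framework's sort step, which is the reason keys must be comparable.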

Tuesday, July 17, 2012

Spring Bean Scope Singleton by Default

While studying the difference between the GoF Singleton pattern and the
Spring Framework's singleton scope for a bean/POJO, I decided to write
this post.

As you may know, the default scope for a bean in the Spring container
is singleton, so a given ApplicationContext or BeanFactory hands back
the same instance every time we ask for that bean.

The bean we get from the BeanFactory is therefore a singleton in nature,
but it is quite different from the GoF Singleton pattern.

The GoF Singleton pattern guarantees an instance that is unique per
class, per class loader.

The Spring singleton scope guarantees a single instance only within
the boundary of one Spring container.

So for a particular Spring ApplicationContext or BeanFactory, there is
exactly one bean instance for a bean definition with a given id.
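
The two notions of "single" can be put side by side in a small plain-Java
sketch. It does not use Spring at all: MiniContainer is a hypothetical
stand-in that simply caches one instance per bean id, which is enough to
show why two containers yield two instances while a GoF singleton stays
unique for the class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class ScopeSketch {

    // GoF style: the class itself guarantees one instance per class loader
    static final class GofSingleton {
        private static final GofSingleton INSTANCE = new GofSingleton();
        private GofSingleton() { }
        static GofSingleton getInstance() { return INSTANCE; }
    }

    // Hypothetical stand-in for a container: caches one instance per bean id
    static final class MiniContainer {
        private final Map<String, Supplier<?>> definitions = new HashMap<>();
        private final Map<String, Object> singletons = new HashMap<>();

        void register(String id, Supplier<?> factory) {
            definitions.put(id, factory);
        }

        Object getBean(String id) {
            // Create on first request, then hand back the cached instance
            return singletons.computeIfAbsent(id, key -> definitions.get(key).get());
        }
    }

    public static void main(String[] args) {
        MiniContainer first = new MiniContainer();
        MiniContainer second = new MiniContainer();
        first.register("one", Object::new);
        second.register("one", Object::new);

        System.out.println(first.getBean("one") == first.getBean("one"));    // true: same container
        System.out.println(first.getBean("one") == second.getBean("one"));   // false: two containers
        System.out.println(GofSingleton.getInstance() == GofSingleton.getInstance()); // true
    }
}
```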

Let us explore this with the following example, where I create a single
application context and BeanFactory using the
ClassPathXmlApplicationContext API from the Spring Framework, passing
the example test.xml configuration file as the argument.

I then retrieve a bean instance from the BeanFactory using a bean id
and set a value on an instance variable of that bean.

Next I retrieve a bean instance a second time, using the same bean id
as in the first call.

If the Spring container manages this bean in singleton scope, I should
expect the value set in the first step to still be there in the second.

import org.springframework.beans.factory.*;
import org.springframework.context.*;
import org.springframework.context.support.*;

public class Testclient{
  public Testclient() {
    try{
        ApplicationContext appContext =
            new ClassPathXmlApplicationContext("test.xml");
        BeanFactory factory = appContext;
        //Retrieving bean from the bean factory.
        //Bean id as "one"
        One one1 = (One)factory.getBean("one");
        one1.setName("test name");

        //Bean id as "one"
        One one2 = (One)factory.getBean("one");
        System.out.println("Name from the singleton bean :"+one2.getName());
    } catch(Exception ex) {
        ex.printStackTrace();
    }
  }
  public static void main(String args[]) {
      new Testclient();
  }
}
In the code above, the first getBean("one") call retrieves a bean instance from the Spring container, and setName() stores a name on it. The second getBean("one") call retrieves a bean with the same id, and getName() on that second reference returns the value set in the earlier step. This shows that both references point to the same instance, so the bean is a singleton in the sense of Spring's singleton scope.

The example is then modified slightly to create a second ApplicationContext and BeanFactory. As before, a name is set on the bean obtained from the first context, but this time the value is read from the bean returned by the second context. Now the result is different: the returned value is null, not the value set in the first step.
import org.springframework.beans.factory.*;
import org.springframework.context.*;
import org.springframework.context.support.*;

public class Testclient{
  public Testclient() {
    try{
        ApplicationContext appContext =
            new ClassPathXmlApplicationContext("test.xml");
        BeanFactory factory = appContext;
        //Retrieving bean from the bean factory.
        One one1 = (One)factory.getBean("one");
        one1.setName("test name");

        ApplicationContext appContextNew =
            new ClassPathXmlApplicationContext("test.xml");
        BeanFactory factoryNew = appContextNew;
        One one2 = (One)factoryNew.getBean("one");
        System.out.println("Name from the singleton bean :"+one2.getName());
    } catch(Exception ex) {
        ex.printStackTrace();
    }
  }
  public static void main(String args[]) {
      new Testclient();
  }
}
The modified code uses two different Spring contexts, so we get two different bean instances: the bean in the first context receives the value from the setter, while the bean in the second context returns null. The code for the bean in this example is as follows:
public class One
{
  private String name;
  public void setName(String arg) {
   name = arg;
  }
  public String getName() {
   return name;
  }
}
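The Spring configuration file, test.xml, only needs to declare the bean under the id "one"; with no scope attribute it defaults to singleton:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd">
  <bean id="one" class="One">
  </bean>
</beans>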

Friday, July 13, 2012

Java - Serialization

Java provides a mechanism called object serialization, in which an object can be represented as a sequence of bytes that includes the object's data as well as information about the object's type and the types of data stored in the object.
After a serialized object has been written into a file, it can be read from the file and deserialized; that is, the type information and bytes that represent the object and its data can be used to recreate the object in memory.
Most impressive is that the entire process is JVM independent, meaning an object can be serialized on one platform and deserialized on an entirely different platform.
The ObjectInputStream and ObjectOutputStream classes are high-level streams that contain the methods for serializing and deserializing an object.
The ObjectOutputStream class contains many write methods for writing various data types, but one method in particular stands out:
public final void writeObject(Object x) throws IOException
The above method serializes an Object and sends it to the output stream. Similarly, the ObjectInputStream class contains the following method for deserializing an object:
public final Object readObject() throws IOException, 
                                 ClassNotFoundException
This method retrieves the next Object out of the stream and deserializes it. The return value is Object, so you will need to cast it to its appropriate data type.
To demonstrate how serialization works in Java, I am going to use the Employee class that we discussed early on in the book. Suppose that we have the following Employee class, which implements the Serializable interface:
public class Employee implements java.io.Serializable
{
   public String name;
   public String address;
   public transient int SSN;
   public int number;
   public void mailCheck()
   {
      System.out.println("Mailing a check to " + name
                           + " " + address);
   }
}
Notice that for a class to be serialized successfully, two conditions must be met:
  • The class must implement the java.io.Serializable interface.
  • All of the fields in the class must be serializable. If a field is not serializable, it must be marked transient.
If you are curious to know whether a Java standard class is serializable, check the documentation for the class. The test is simple: if the class implements java.io.Serializable, then it is serializable; otherwise, it's not.
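
The same test can be made at runtime instead of by reading the documentation. The sketch below uses only the standard Class.isAssignableFrom method; the class name SerializableCheck and the helper isSerializable are made up for this example.

```java
import java.io.Serializable;
import java.util.ArrayList;

public class SerializableCheck {

    // True when the given class implements java.io.Serializable
    static boolean isSerializable(Class<?> type) {
        return Serializable.class.isAssignableFrom(type);
    }

    public static void main(String[] args) {
        System.out.println(isSerializable(String.class));    // true
        System.out.println(isSerializable(ArrayList.class)); // true
        System.out.println(isSerializable(Thread.class));    // false
    }
}
```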

Serializing an Object:

The ObjectOutputStream class is used to serialize an Object. The following SerializeDemo program instantiates an Employee object and serializes it to a file.
When the program is done executing, a file named employee.ser is created. The program does not generate any output, but study the code and try to determine what the program is doing.
Note: When serializing an object to a file, the standard convention in Java is to give the file a .ser extension.
import java.io.*;

public class SerializeDemo
{
   public static void main(String [] args)
   {
      Employee e = new Employee();
      e.name = "Reyan Ali";
      e.address = "Phokka Kuan, Ambehta Peer";
      e.SSN = 11122333;
      e.number = 101;
      // try-with-resources closes both streams even if writeObject fails
      try (FileOutputStream fileOut = new FileOutputStream("employee.ser");
           ObjectOutputStream out = new ObjectOutputStream(fileOut))
      {
         out.writeObject(e);
      }catch(IOException i)
      {
         i.printStackTrace();
      }
   }
}

Deserializing an Object:

The following DeserializeDemo program deserializes the Employee object created in the SerializeDemo program. Study the program and try to determine its output:
import java.io.*;

public class DeserializeDemo
{
   public static void main(String [] args)
   {
      Employee e = null;
      // try-with-resources closes both streams even if readObject fails
      try (FileInputStream fileIn = new FileInputStream("employee.ser");
           ObjectInputStream in = new ObjectInputStream(fileIn))
      {
         e = (Employee) in.readObject();
      }catch(IOException i)
      {
         i.printStackTrace();
         return;
      }catch(ClassNotFoundException c)
      {
         System.out.println("Employee class not found");
         c.printStackTrace();
         return;
      }
      System.out.println("Deserialized Employee...");
      System.out.println("Name: " + e.name);
      System.out.println("Address: " + e.address);
      System.out.println("SSN: " + e.SSN);
      System.out.println("Number: " + e.number);
   }
}
This would produce the following result:
Deserialized Employee...
Name: Reyan Ali
Address: Phokka Kuan, Ambehta Peer
SSN: 0
Number: 101
Note the following important points:
  • The try/catch block tries to catch a ClassNotFoundException, which is declared by the readObject() method. For a JVM to be able to deserialize an object, it must be able to find the bytecode for the class. If the JVM can't find a class during the deserialization of an object, it throws a ClassNotFoundException.
  • Notice that the return value of readObject() is cast to an Employee reference.
  • The value of the SSN field was 11122333 when the object was serialized, but because the field is transient, this value was not sent to the output stream. The SSN field of the deserialized Employee object is 0.
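
The effect of transient can also be checked without touching the file system. The following sketch round-trips an object through an in-memory byte array; Badge is a hypothetical cut-down stand-in for Employee, and roundTrip is a helper invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientRoundTrip {

    // Hypothetical cut-down version of Employee, used only for this sketch
    static class Badge implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        transient int ssn;   // not written to the stream
        int number;
    }

    // Serializes to a byte array and immediately reads the object back
    static Badge roundTrip(Badge original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Badge) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Badge badge = new Badge();
        badge.name = "Reyan Ali";
        badge.ssn = 11122333;
        badge.number = 101;

        Badge copy = roundTrip(badge);
        // prints: Reyan Ali 0 101
        System.out.println(copy.name + " " + copy.ssn + " " + copy.number);
    }
}
```

The transient field comes back as 0, the default value for int, while the other fields survive the round trip.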

Sunday, April 15, 2012

Tomcat cannot run on localhost - caused by IPv6

<Connector port="8080" address="0.0.0.0" maxHttpHeaderSize="8192"
               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="8443" acceptCount="100"
               connectionTimeout="20000" disableUploadTimeout="true" />

Tuesday, March 27, 2012

MongoDB: MapReduce Functions for Grouping

SQL GROUP BY allows you to perform aggregate functions on data sets: to count all of the stores in each state, to average a series of related numbers, and so on. MongoDB has some aggregate functions, but they are fairly limited in scope. The MongoDB group function also suffers from the fact that it does not work on sharded configurations. So how do you perform grouped queries using MongoDB? By using MapReduce functions, of course (you read the title, right?).

Understanding MapReduce

Understanding MapReduce requires, or at least is made much easier by, understanding functional programming concepts. map and reduce (fold, inject) are functions that come from Lisp and have been inherited by a lot of languages (Scheme, Smalltalk, Ruby, Python).
map
A higher-order function which transforms a list by applying a function to each of its elements. Its return value is the transformed list. In MongoDB terms, the map is a function that is run for each Document in a collection and can return a value for that row to be included in the transformed list.
reduce
A higher-order function that iterates an arbitrary function over a data structure and builds up a return value. The reduce function takes the values returned by map and allows you to run a function to manipulate those values in some way.
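
For readers who think in Java rather than Lisp, the two definitions above can be sketched with the standard streams API. The class and method names (MapReduceBasics, lengths, total) are made up for this example.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceBasics {

    // map: apply a function to every element and return the transformed list
    static List<Integer> lengths(List<String> words) {
        return words.stream().map(String::length).collect(Collectors.toList());
    }

    // reduce (fold): walk the list while building up a single return value
    static int total(List<Integer> numbers) {
        return numbers.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> mapped = lengths(List.of("Miller", "Pabst", "Blatz"));
        System.out.println(mapped);         // [6, 5, 5]
        System.out.println(total(mapped));  // 16
    }
}
```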

Some Examples

Let’s start with some sample data:
db.factories.insert( { name: "Miller", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Lakefront", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Point", metro: { city: "Steven's Point", state: "WI" } } );
db.factories.insert( { name: "Pabst", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Blatz", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Coors", metro: { city: "Golden Springs", state: "CO" } } );
db.factories.find()
Let's say I want to count the number of factories in each of the cities (ignore the fact that the same city name could appear in more than one state; it doesn't in my data). For a count, I write a function that “emits” the group-by key and a value that can be counted. It can be any value, but for simplicity I’ll make it 1. emit() is a MongoDB server-side function that you use to identify a value in a row that should be added to the transformed list. If emit() is not called, the values for that row are excluded from the results.
mapCity = function () {
    emit(this.metro.city, 1);
}
The next piece is the reduce() function. The reduce function is passed a key and an array of the values that the map() function emitted for that key. I know my map function emits a 1 for each row, keyed by city. So conceptually the reduce function is called with the key “Golden Springs” and a single-element array containing a 1 (in practice MongoDB may skip the call for a key that has only one value and use the emitted value as the result). For “Milwaukee” it will be passed a 4-element array of 1s.
reduceCount = function (k, vals) {
    var sum = 0;
    for (var i = 0; i < vals.length; i++) {
        sum += vals[i];
    }
    return sum;
}
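One reason to sum the emitted values rather than count the array elements: MongoDB may invoke reduce more than once for the same key, feeding earlier reduce results back in as values. The following Java sketch (hypothetical names, no MongoDB driver involved) imitates that re-reduce step to show that summing stays correct while counting elements does not.

```java
import java.util.List;

public class ReReduceSketch {

    // Mirrors reduceCount: adds up the emitted values
    static int sum(List<Integer> vals) {
        int s = 0;
        for (int v : vals) {
            s += v;
        }
        return s;
    }

    // Tempting alternative that only counts how many values arrived
    static int size(List<Integer> vals) {
        return vals.size();
    }

    public static void main(String[] args) {
        List<Integer> ones = List.of(1, 1, 1, 1);   // the four Milwaukee emits

        // One pass over all four values
        System.out.println(sum(ones));                          // 4

        // Two partial passes whose results are reduced again
        int partialA = sum(ones.subList(0, 3));                 // 3
        int partialB = sum(ones.subList(3, 4));                 // 1
        System.out.println(sum(List.of(partialA, partialB)));   // 4, still correct

        // Counting elements breaks under the same split
        int sizeA = size(ones.subList(0, 3));                   // 3
        int sizeB = size(ones.subList(3, 4));                   // 1
        System.out.println(size(List.of(sizeA, sizeB)));        // 2, wrong
    }
}
```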
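With those two functions I can call the mapReduce function to perform my query. Newer MongoDB releases require an out option naming the output collection; the name city_counts below is arbitrary.
res = db.factories.mapReduce(mapCity, reduceCount, { out: "city_counts" })
db[res.result].find()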
This results in:

{ "_id" : "Golden Springs", "value" : 1 }
{ "_id" : "Milwaukee", "value" : 4 }
{ "_id" : "Steven's Point", "value" : 1 }
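For comparison, the same grouped count can be written in plain Java with the streams API. This is only an analogy for what the two shell functions compute, not MongoDB driver code; Factory and countByCity are hypothetical names, and the data repeats the sample documents above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class CityCount {

    // Hypothetical holder for the fields of one sample document
    static final class Factory {
        final String name;
        final String city;
        final String state;

        Factory(String name, String city, String state) {
            this.name = name;
            this.city = city;
            this.state = state;
        }
    }

    // groupingBy plays the role of emit(city, 1); counting() plays the role of reduceCount
    static Map<String, Long> countByCity(List<Factory> factories) {
        return factories.stream().collect(
            Collectors.groupingBy((Factory f) -> f.city, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Factory> factories = List.of(
            new Factory("Miller", "Milwaukee", "WI"),
            new Factory("Lakefront", "Milwaukee", "WI"),
            new Factory("Point", "Steven's Point", "WI"),
            new Factory("Pabst", "Milwaukee", "WI"),
            new Factory("Blatz", "Milwaukee", "WI"),
            new Factory("Coors", "Golden Springs", "CO"));

        // prints {Golden Springs=1, Milwaukee=4, Steven's Point=1}
        System.out.println(countByCity(factories));
    }
}
```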
Counting is not the only thing I can do of course. Anything can be returned by the map function including complex JSON objects. In this example I combine the names of all of the Factories in a given City into a simple comma-separated list.
mapCity = function () {
    emit(this.metro.city, this.name);
}
reduceNames = function (k, vals) {
    return vals.join(",");
}
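// newer MongoDB releases require an out option; "city_names" is an arbitrary collection name
res = db.factories.mapReduce(mapCity, reduceNames, { out: "city_names" })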
db[res.result].find()
This gives you:

{ "_id" : "Golden Springs", "value" : "Coors" }
{ "_id" : "Milwaukee", "value" : "Miller,Lakefront,Pabst,Blatz" }
{ "_id" : "Steven's Point", "value" : "Point" }

Sunday, February 5, 2012

Cassandra Client

Starting the CLI

You can start the CLI using the bin/cassandra-cli script in your Cassandra installation (bin\cassandra-cli.bat on Windows). If you are evaluating a local Cassandra node, be sure that it has been correctly configured and successfully started before starting the CLI.

If successful you will see output similar to this:

Welcome to cassandra CLI.  Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.

You must then specify a system to connect to:

connect localhost/9160;

Creating a Keyspace

We first create a keyspace to run our examples in.

create keyspace Twissandra;

Selecting the Keyspace to Use

We must then select our example keyspace as our new context before we can run any queries.

use Twissandra;

To Create a Column Family

We can then create a column family to play with.

create column family User with comparator = UTF8Type;

For the later examples to work you must also update the schema using the following command. This sets the validation class for the first and last name columns so that their values are human readable. It also adds an index on the age column so that you can filter your gets by age.

update column family User with
        column_metadata =
        [
        {column_name: first, validation_class: UTF8Type},
        {column_name: last, validation_class: UTF8Type},
        {column_name: age, validation_class: UTF8Type, index_type: KEYS}
        ];

To Add Data

Before adding data to our new column family we must first specify our default key type; otherwise we would have to specify it for each key using the format [utf8('keyname')]. Spelling it out each time is probably advisable if you have mixed key types, but it makes simple cases harder to read.

So we run the command below, which lasts for the length of your CLI session. After quitting and restarting we must run it again.

assume User keys as utf8;

and then we add our data.

set User['jsmith']['first'] = 'John';
set User['jsmith']['last'] = 'Smith';
set User['jsmith']['age'] = '38';

If you get an error like "cannot parse 'John' as hex bytes", it is likely that you either haven't set your default key type or haven't updated your schema as described in the create column family example.

The set command uses API#insert

To Update Data

If we need to update a value we simply set it again.

set User['jsmith']['first'] = 'Jack';

To Get Data

Now let's read back the jsmith row to see what it contains:

get User['jsmith'];

The get command uses API#get_slice
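
As a mental model only (this is not the Cassandra client API), the set and get commands above behave like operations on a map from row key to a sorted map of column name to value. The Java sketch below uses hypothetical names throughout and ignores timestamps, validation classes, and everything else a real node does.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ColumnFamilySketch {

    // row key -> (column name -> value); columns stay sorted by name,
    // loosely mirroring "comparator = UTF8Type"
    private final Map<String, SortedMap<String, String>> rows = new TreeMap<>();

    // Like the CLI's set: inserts the column or overwrites its value
    void set(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Like the CLI's get on a whole row: returns every column of that row
    SortedMap<String, String> get(String rowKey) {
        return rows.getOrDefault(rowKey, Collections.emptySortedMap());
    }

    public static void main(String[] args) {
        ColumnFamilySketch user = new ColumnFamilySketch();
        user.set("jsmith", "first", "John");
        user.set("jsmith", "last", "Smith");
        user.set("jsmith", "age", "38");
        user.set("jsmith", "first", "Jack");   // an update is just another set

        // prints {age=38, first=Jack, last=Smith}
        System.out.println(user.get("jsmith"));
    }
}
```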

To Query Data

get User where age = '12';

For help

help;

To Quit

quit;

To Execute Script

bin/cassandra-cli -host localhost -port 9160 -f script.txt

Install Cassandra on Windows

As I said before, you can run Cassandra on any operating system that has a Java runtime. So the first and probably most obvious choice for a Windows developer is running Cassandra on Windows. To install Cassandra on Windows, just follow these steps:

  1. Extract Cassandra to a directory of your choice (I used c:\development\cassandra)
  2. Set the following environment variables
    1. JAVA_HOME (To the directory where you install the JRE, this should not be the bin directory)
    2. CASSANDRA_HOME (To the directory you extracted the files to in step 1)
  3. Modify your Cassandra config file as you like, and don’t forget to change the directory locations from UNIX-like paths to locations on your Windows machine (in my example the config file is located at c:\development\cassandra\conf\storage-conf.xml)
  4. Open cmd and run the cassandra.bat file (in my example the batch file is located at c:\development\cassandra\bin\cassandra.bat)
    cd c:\development\cassandra\bin\
    .\cassandra.bat
  5. You can verify that Cassandra is running by trying to connect to the server. To do this, open a new cmd and run the cassandra-cli.bat file from the bin directory.
    cd c:\development\cassandra\bin\
    .\cassandra-cli.bat
    connect localhost/9160;

This is easy to get running, but there are some manual steps that you have to go through each time to get the server running. In the future, when you want to start up the Cassandra database for development, just repeat step 4.
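
The check in step 5 can also be scripted. The sketch below is a plain-Java TCP probe, not part of Cassandra: PortCheck and isListening are made-up names, and it only tells you that something accepts connections on the port, assuming the 9160 port used in the connect command.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 9160 is the port used in the CLI connect command
        System.out.println(isListening("localhost", 9160, 2000)
            ? "Something is accepting connections on localhost:9160"
            : "Nothing is listening on localhost:9160");
    }
}
```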