Friday, June 15, 2012


In this post we'll see how to build flume-ng from trunk and use it for data aggregation.

Prerequisites :

In order to do a hassle-free build we should have the following two things pre-installed on our box :
1- Thrift
2- Apache Maven-3.x.x

Build the project :

Once we are done with this we have to build flume-ng from the trunk. Use the following commands to do this :

$ svn co flume

This will create a directory flume inside our /home/username/ directory. Now go inside this directory and start the build process.

$ cd flume
$ mvn3 install -DskipTests

NOTE : If everything goes fine you will receive a BUILD SUCCESS message after this. But sometimes the build may fail with an error somewhat like this :

[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:18.308s
[INFO] Finished at: Wed Jun 20 15:56:03 IST 2012
[INFO] Final Memory: 64M/945M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project flume-thrift-source: Compilation failure
[ERROR] Failure executing javac, but could not parse the error:
[ERROR] The system is out of resources.
[ERROR] Consult the following stack trace for details.
[ERROR] java.lang.OutOfMemoryError: PermGen space
[ERROR] at java.lang.ClassLoader.defineClass1(Native Method)
[ERROR] at java.lang.ClassLoader.defineClass(...)
[ERROR] at org.codehaus.plexus.compiler.javac.IsolatedClassLoader.loadClass(...)
[ERROR] at org.codehaus.plexus.compiler.javac.JavacCompiler.compileInProcess(...)
[ERROR] at org.codehaus.plexus.compiler.javac.JavacCompiler.compile(...)
[ERROR] at org.apache.maven.plugin.AbstractCompilerMojo.execute(...)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(...)
[ERROR] at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(...)
[ERROR] ... (remaining stack frames omitted)
[ERROR] -> [Help 1]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :flume-thrift-source

This error means that the JVM running Maven has run out of PermGen space. To fix it we just need to give the JVM more memory through the MAVEN_OPTS environment variable. Use the following command to do that :

export MAVEN_OPTS='-Xmx512m -XX:MaxPermSize=128m'

Now re-run the build :

$ mvn3 install -DskipTests

Once the build succeeds it will create a directory named target inside the flume directory (the distribution tarballs are produced under the flume-ng-dist module's target directory). Look for the following two packages there :

1- flume-ng-dist-1.2.0-incubating-SNAPSHOT-dist.tar.gz - a binary distribution of flume-ng which is ready to run.
2- flume-ng-dist-1.2.0-incubating-SNAPSHOT-src.tar.gz - a source-only distribution of flume-ng.

Since we just want to use flume-ng, the thing of our interest is the binary distribution, so we'll extract it and copy it to some convenient location.
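The extraction can be sketched like this. The tarball name matches the build above, but the destination directory is just an illustrative choice; and since this sketch should run end to end on its own, it first creates a stand-in tarball in place of the real build output :

```shell
# Stand-in for the real build output so this sketch is self-contained;
# in practice the tarball already sits in the build's target directory.
mkdir -p /tmp/flume-build/flume-ng-dist-1.2.0-incubating-SNAPSHOT/bin
touch /tmp/flume-build/flume-ng-dist-1.2.0-incubating-SNAPSHOT/bin/flume-ng
tar -czf /tmp/flume-build/flume-ng-dist-1.2.0-incubating-SNAPSHOT-dist.tar.gz \
    -C /tmp/flume-build flume-ng-dist-1.2.0-incubating-SNAPSHOT

# The actual steps: extract the binary distribution and put it somewhere
# convenient (/tmp/flume-ng here is only an example location).
mkdir -p /tmp/flume-ng
tar -xzf /tmp/flume-build/flume-ng-dist-1.2.0-incubating-SNAPSHOT-dist.tar.gz \
    -C /tmp/flume-ng
ls /tmp/flume-ng
```

After this, the extracted directory is your flume-ng home; the remaining commands in this post are run from inside it.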

Configuration and usage :

In order to collect data through flume-ng we need to configure and start agents. The agent configuration is written in configuration files kept inside the conf directory.

We'll take an example use-case where we'll aggregate data from the Apache web server logs into a directory inside HDFS.

To do that, first we have to write the configuration file for our agent. Here is a simple configuration file :

agent1.sources = tail
agent1.channels = MemoryChannel-2
agent1.sinks = HDFS
agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -F /var/log/apache2/access.log.1
agent1.sources.tail.channels = MemoryChannel-2 = MemoryChannel-2
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
agent1.sinks.HDFS.hdfs.fileType = DataStream
agent1.channels.MemoryChannel-2.type = memory
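By default the HDFS sink rolls over to a new file very aggressively, so you end up with many small files. If you prefer fewer, larger files you can add roll settings to the same configuration file; the values below are illustrative starting points, not recommendations :

```
# Roll a new file every 10 minutes or at ~64 MB, whichever comes first;
# 0 disables rolling on event count.
agent1.sinks.HDFS.hdfs.rollInterval = 600
agent1.sinks.HDFS.hdfs.rollSize = 67108864
agent1.sinks.HDFS.hdfs.rollCount = 0

# Give the memory channel some headroom (events are held in RAM,
# so size this to what your box can afford).
agent1.channels.MemoryChannel-2.capacity = 10000
```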

Save this file as agent1.conf inside the conf directory. Now use the command below to start the agent; it will tail data from /var/log/apache2/access.log.1 and dump it into the /flume directory inside your HDFS :

$ bin/flume-ng agent -n agent1 -c conf -f conf/agent1.conf

This command will start an agent named agent1, which will tail the Apache web server log and put the data into the /flume directory. To verify the result, point your web browser to http://localhost:50070 (the NameNode web UI). There you can browse the /flume directory, and inside it you will find files containing the log data, with names somewhat similar to FlumeData.1339411663333.

If you want to go further, you can visit my other blog post, which shows how we can use flume-ng to move data into an HBase table, at this link :

I hope this post helped you build and use flume-ng successfully. Please provide me with your valuable comments and suggestions if you have any.
