The first HBase sink was committed to the Flume 1.2.x trunk a few days ago. In this post we'll see how we can use this sink to collect data from a file stored on the local filesystem and dump that data into an HBase table. You'll need Flume built from the trunk in order to follow along. If you haven't built it yet and are looking for some help, you can visit my other post that shows how to build and use Flume-NG at this link :
First of all we have to write the configuration file for our agent. This agent will collect data from the file and dump it into the HBase table. A simple configuration file might look like this :
hbase-agent.sources = tail
hbase-agent.sinks = sink1
hbase-agent.channels = ch1
hbase-agent.sources.tail.type = exec
hbase-agent.sources.tail.command = tail -F /home/mohammad/demo.txt
hbase-agent.sources.tail.channels = ch1
hbase-agent.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
hbase-agent.sinks.sink1.channel = ch1
hbase-agent.sinks.sink1.table = demo
hbase-agent.sinks.sink1.columnFamily = cf
hbase-agent.sinks.sink1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
hbase-agent.sinks.sink1.serializer.payloadColumn = col1
hbase-agent.sinks.sink1.serializer.keyType = timestamp
hbase-agent.sinks.sink1.serializer.rowPrefix = 1
hbase-agent.sinks.sink1.serializer.suffix = timestamp
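The last three serializer properties control how row keys are built: with rowPrefix = 1 and suffix = timestamp, each event's row key is the prefix followed by the current time in milliseconds. A rough shell sketch of the idea (my own illustration, not Flume code):

```shell
# Approximation of SimpleHbaseEventSerializer's row key with
# rowPrefix=1 and suffix=timestamp: prefix + epoch milliseconds.
# (%3N truncates nanoseconds to milliseconds; requires GNU date.)
rowkey="1$(date +%s%3N)"
echo "$rowkey"
```

This matches the row keys like 11339770815331 you'll see in the scan output below: a leading 1 followed by a 13-digit millisecond timestamp.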
Save this configuration in a file called hbase-agent.conf inside the conf directory of your Flume distribution. Now start Hadoop and HBase, and create a table called demo with a column family called cf. Then open another terminal, change to your Flume home directory, and start the agent with the following command :
$ bin/flume-ng agent -n hbase-agent -c conf/ -f conf/hbase-agent.conf
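For reference, the demo table mentioned above can be created from the HBase shell like this (assuming HBase is already running with default settings):

```
hbase(main):001:0> create 'demo', 'cf'
```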
Now go back to your HBase shell and scan the demo table. If everything went well, you should see something like this :
hbase(main):004:0> scan 'demo'
11339770815331 column=cf:col1, timestamp=1339770818340, value=value1
11339770815332 column=cf:col1, timestamp=1339770818342, value=value6
2 row(s) in 0.0500 seconds
NOTE : I have used a small text file called demo.txt here, which has the following few lines in it :
We all know how cool Spark is when it comes to fast, general-purpose cluster computing. Apart from the core APIs Spark also provides a rich ...
Hive is a wonderful tool for those who like to perform batch operations to process their large amounts of data residing on a Hadoop cluster ...
SSH (Secure Shell) is a network protocol secure data communication, remote shell services or command execution and other secure network ser...
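Because the source runs tail -F, the agent keeps shipping new lines to HBase as they are appended to the file while it is running. For example, you can append a couple of lines like this (the path here is relative for illustration; adjust it to match the one in hbase-agent.conf):

```shell
# Append two more lines to the tailed file; tail -F picks each
# one up and the agent writes it to the demo table as a new row.
DEMO=demo.txt   # use the same path as in hbase-agent.conf
echo "Another line for the demo table" >> "$DEMO"
echo "And one more" >> "$DEMO"
tail -n 2 "$DEMO"
```

Scanning the demo table again afterwards should show two additional rows.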