Sunday, May 3, 2015

Uploading Files or Streaming Log Data from Windows to a Hadoop Cluster (HDFS)

There are several ways to upload data into a Hadoop cluster (HDFS). In this post I will focus on uploading files from Windows to HDFS using Flume. We will drop files into a Windows folder, and Flume will automatically pick them up and create them in HDFS.

You can follow the same steps if you want to capture IIS logs into HDFS for processing; a sketch of that variation follows.
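For example, to capture IIS logs, the spoolDir in Step 4 below could point at the IIS log folder instead of the demo folder. A sketch (the W3SVC1 site ID and path are assumptions for a default IIS install; also note the spooling-directory source requires files that are no longer being written to, so spool only rotated log files):

    a1.sources.r1.spoolDir = /cygdrive/c/inetpub/logs/LogFiles/W3SVC1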

Please read the Flume User Guide for an explanation of the configuration settings if you are not familiar with them.

There are also other ways to upload files to HDFS (the last two are sketched below):

- HUE web page
- WebHDFS REST API
- NFS mount on a Linux box, then run the hdfs dfs -put command
- FTP files to a Linux machine, then run the hdfs dfs -put command
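For reference, here is a minimal sketch of the last two alternatives (hostname, port, and paths are placeholders, not part of this setup):

    # WebHDFS REST API: the first call returns a 307 redirect to a datanode,
    # the second call uploads the file to the returned Location URL
    $ curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/cloudera/in/test.txt?op=CREATE&user.name=cloudera"
    $ curl -i -X PUT -T test.txt "<Location URL from the previous response>"

    # After copying a file to a Linux node via NFS or FTP:
    $ hdfs dfs -put /tmp/test.txt /user/cloudera/in/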
 

FLUME Architecture for this Presentation.

[Architecture diagram: spooldir source -> memory channel -> Avro sink (Windows agent "a1") --network--> Avro source -> memory channel -> HDFS sink (Linux agent "collector")]
Step 1: Download and Install Cygwin:
              Here is a link to download Cygwin.
              Run the installer and install Cygwin into the c:\cygwin64 location.

Step 2: Download JRE:
            Download Java 7 update 80 and copy the downloaded file "jre-7u80-windows-x64.tar.gz" to the c:\cygwin64\home\vineet.kumar\java folder.

            Double-click the Cygwin icon on your desktop, then extract the archive:

            $ cd /home/vineet.kumar/java
            $ gunzip jre-7u80-windows-x64.tar.gz
            $ tar -xvf jre-7u80-windows-x64.tar
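            A quick sanity check that the JRE extracted correctly (the version banner below is what I would expect here):

            $ /home/vineet.kumar/java/jre1.7.0_80/bin/java -version
            java version "1.7.0_80"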


Step 3: Download and Install Flume from Flume Download Link
                Copy the downloaded file "apache-flume-1.5.2-bin.tar.gz" to the c:\cygwin64\home\vineet.kumar\ folder, then extract it:

                $ gunzip apache-flume-1.5.2-bin.tar.gz
                $ tar -xvf apache-flume-1.5.2-bin.tar

Verify the directory structure:

vineet.kumar@WindowsPC ~/apache-flume-1.5.2-bin
$ ls -ltr
total 197
-rw-r--r--  1 vineet.kumar Domain Users  1779 Nov 12 14:41 README
-rw-r--r--  1 vineet.kumar Domain Users 22517 Nov 12 14:41 LICENSE
-rw-r--r--  1 vineet.kumar Domain Users  6172 Nov 12 14:41 DEVNOTES
-rw-r--r--  1 vineet.kumar Domain Users  1586 Nov 12 15:02 RELEASE-NOTES
-rw-r--r--  1 vineet.kumar Domain Users 62228 Nov 12 15:02 CHANGELOG
-rw-r--r--  1 vineet.kumar Domain Users   249 Nov 12 15:13 NOTICE
drwxr-xr-x+ 1 vineet.kumar Domain Users     0 Nov 12 15:49 docs
drwxr-xr-x+ 1 vineet.kumar Domain Users     0 Apr 27 12:13 tools
drwxr-xr-x+ 1 vineet.kumar Domain Users     0 Apr 27 12:13 lib
drwxr-xr-x+ 1 vineet.kumar Domain Users     0 Apr 27 12:32 bin
drwxr-xr-x+ 1 vineet.kumar Domain Users     0 Apr 27 13:53 logs
drwxr-xr-x+ 1 vineet.kumar Domain Users     0 May  1 15:56 conf

Now, create a folder on the Windows machine for the input files (run this from the apache-flume-1.5.2-bin directory so the folder matches the location below):

$ mkdir logs_input

Windows Explorer location: C:\cygwin64\home\vineet.kumar\apache-flume-1.5.2-bin\logs_input\


Open .profile or .bash_profile and add the following environment variables (note: no spaces around the = signs, and export PATH rather than $PATH; the JAVA_HOME path matches where the JRE was extracted in Step 2):

    JAVA_HOME=/home/vineet.kumar/java/jre1.7.0_80; export JAVA_HOME
    FLUME_HOME=/home/vineet.kumar/apache-flume-1.5.2-bin; export FLUME_HOME
    PATH=$JAVA_HOME/bin:$JAVA_HOME/lib:$PATH; export PATH
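Reload the profile and verify the environment (a quick check; the output shown is what this setup should print):

    $ source ~/.bash_profile
    $ echo $FLUME_HOME
    /home/vineet.kumar/apache-flume-1.5.2-bin
    $ java -version
    java version "1.7.0_80"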

Go to the FLUME_HOME/bin folder (/home/vineet.kumar/apache-flume-1.5.2-bin/bin), open the flume-ng file, and change the following line.

From :

 $EXEC $JAVA_HOME/bin/java $JAVA_OPTS $FLUME_JAVA_OPTS "${arr_java_props[@]}" -cp "$FLUME_CLASSPATH" \
      -Djava.library.path=$FLUME_JAVA_LIBRARY_PATH "$FLUME_APPLICATION_CLASS" $*


To:

  $EXEC $JAVA_HOME/bin/java $JAVA_OPTS $FLUME_JAVA_OPTS "${arr_java_props[@]}" -cp "`cygpath -wp $FLUME_CLASSPATH`" \
      -Djava.library.path=$FLUME_JAVA_LIBRARY_PATH "$FLUME_APPLICATION_CLASS" $*
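This change is needed because the Windows JVM cannot parse a Unix-style, colon-separated classpath; cygpath -wp converts it to the Windows, semicolon-separated form. For illustration (the jar names are made up):

    $ cygpath -wp /home/vineet.kumar/a.jar:/home/vineet.kumar/b.jar
    C:\cygwin64\home\vineet.kumar\a.jar;C:\cygwin64\home\vineet.kumar\b.jar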



Step 4: Configure the agent configuration file on the Windows PC:

For directory monitoring we use the spooldir source type, and Avro will be the sink for this source. The Avro sink on the Windows machine points to an Avro source on the Linux server. Here are the contents of the configuration file on the Windows machine.

Configuration file name: hdfs_client.conf. I placed it under the FLUME_HOME/conf directory (C:\cygwin64\home\vineet.kumar\apache-flume-1.5.2-bin\conf folder).
Agent name on Windows: a1


a1.channels = c1
a1.sources = r1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = logs_input
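# Note: a relative path here resolves against the directory the agent is
# started from; an absolute path such as
# /home/vineet.kumar/apache-flume-1.5.2-bin/logs_input is safer.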

a1.sources.r1.fileHeader = true
a1.sources.r1.fileHeaderKey = file

a1.sources.r1.basenameHeader = true
a1.sources.r1.basenameHeaderKey = basename

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type=timestamp



# Describe the sink
a1.sinks.k1.type = avro

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname=192.168.159.128
a1.sinks.k1.port=34343


The timestamp interceptor (i1) adds a timestamp header to each event; the HDFS sink on the Linux side uses it to expand the %Y/%m escapes and create year/month folders on HDFS. For example, an event picked up on May 3, 2015 lands under /user/cloudera/in/2015/05/.


Step 5: Configure the agent on the Linux server (a node in the HDFS cluster):

Configuration file name on the Linux server: collector.conf, location: /etc/flume-ng/conf
Agent name: collector (this must match the property prefix below and the --name argument in Step 6)

collector.sources=av1
collector.sources.av1.type=avro
collector.sources.av1.bind=0.0.0.0
collector.sources.av1.port=34343
collector.sources.av1.channels=ch1
collector.channels=ch1
collector.channels.ch1.type=memory
collector.channels.ch1.capacity=10000
collector.channels.ch1.transactionCapacity=1000
collector.sinks=k1
collector.sinks.k1.type=hdfs
collector.sinks.k1.channel=ch1
collector.sinks.k1.hdfs.path=/user/cloudera/in/%Y/%m
collector.sinks.k1.hdfs.fileType = DataStream
collector.sinks.k1.hdfs.writeFormat = Text
collector.sinks.k1.hdfs.rollSize = 104857600
collector.sinks.k1.hdfs.rollCount = 0
collector.sinks.k1.hdfs.rollInterval = 900
collector.sinks.k1.hdfs.batchSize = 1000

collector.sinks.k1.hdfs.filePrefix = %{basename}
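
Before starting the collector, it does not hurt to make sure the HDFS target directory exists and is writable by the user running the agent (the sink normally creates the %Y/%m subfolders itself, permissions allowing):

 $ hdfs dfs -mkdir -p /user/cloudera/in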


Step 6: Start the agents on Windows and the Linux server.

Agent on Windows (start it from the apache-flume-1.5.2-bin directory so the relative logs_input spoolDir resolves):

  $FLUME_HOME/bin/flume-ng agent --conf conf --conf-file $FLUME_HOME/conf/hdfs_client.conf --name a1 -Dflume.root.logger=INFO,console

You will see output like the following after the agent starts up.

2015-05-03 11:18:53,580 (pool-4-thread-1) [INFO - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:254)] Spooling Directory Source runner has shutdown.
2015-05-03 11:18:54,082 (pool-4-thread-1) [INFO - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:254)] Spooling Directory Source runner has shutdown.



Agent on Linux

 flume-ng agent --conf conf --conf-file /etc/flume-ng/conf/collector.conf --name collector -Dflume.root.logger=INFO,console

You will see output like the following after agent startup:

15/05/03 08:23:04 INFO source.AvroSource: Avro source av1 started.
15/05/03 08:23:07 INFO ipc.NettyServer: [id: 0xa3c0311e, /192.168.159.1:61201 => /192.168.159.128:34343] OPEN
15/05/03 08:23:07 INFO ipc.NettyServer: [id: 0xa3c0311e, /192.168.159.1:61201 => /192.168.159.128:34343] BOUND: /192.168.159.128:34343
15/05/03 08:23:07 INFO ipc.NettyServer: [id: 0xa3c0311e, /192.168.159.1:61201 => /192.168.159.128:34343] CONNECTED: /192.168.159.1:61201
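
To confirm the Avro source is listening on port 34343, an optional check (assuming netstat is available on the server):

 $ netstat -an | grep 34343
 tcp        0      0 0.0.0.0:34343               0.0.0.0:*                   LISTEN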


Time to place some files in the Windows folder. I created a small test.txt file and placed it in the Windows folder (logs_input).

 c:\>type c:\test.txt
1,One
2,Two
3,Three
4,Four
5,Five
6,Six
7,Seven
8,Eight
9,Nine
10,Ten

 c:\>copy c:\test.txt c:\cygwin64\home\vineet.kumar\apache-flume-1.5.2-bin\logs_input

        1 file(s) copied.

As soon as I placed the file in the logs_input folder, its extension changed to .COMPLETED, which means the file was picked up by the source agent.
 
c:\>dir c:\cygwin64\home\vineet.kumar\apache-flume-1.5.2-bin\logs_input

test.txt.COMPLETED

Now, verify the file and its contents on HDFS:

$ hdfs dfs -ls -R /user/cloudera/in/2015/
 -rw-r--r--   1 cloudera cloudera         80 2015-05-03 08:29 /user/cloudera/in/2015/05/test.txt.1430666773025

[cloudera@quickstart conf]$ hdfs dfs -cat /user/cloudera/in/2015/05/test.txt.1430666773025
1,One
2,Two
3,Three
4,Four
5,Five
6,Six
7,Seven
8,Eight
9,Nine
10,Ten

I have tested this with multiple 200 MB files. With the configuration above, the HDFS sink rolls to a new file after 100 MB (hdfs.rollSize) and closes each file, removing the .tmp extension, after 15 minutes (hdfs.rollInterval).
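If you would rather roll purely on size and never on time, a sketch of the relevant sink settings (rollInterval = 0 disables time-based rolling, rollCount = 0 disables event-count rolling):

collector.sinks.k1.hdfs.rollInterval = 0
collector.sinks.k1.hdfs.rollCount = 0
collector.sinks.k1.hdfs.rollSize = 104857600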

I have skipped the explanation of the various configuration settings; please see the Flume User Guide for details.