There are several ways to upload data into a Hadoop cluster (HDFS).
In this post I will focus on uploading files from Windows to HDFS using
Flume. We will drop a file into a Windows folder, and Flume will
automatically pick it up and create it in HDFS.
You can follow the same steps if you want to capture IIS logs into HDFS for processing.
Step 1: Download and Install Cygwin:
Here is a link to download Cygwin
Unzip the downloaded file into the c:\cygwin64 location.
Step 2: Download JRE:
Download Java 7 Update 80 and copy the downloaded file "jre-7u80-windows-x64.tar.gz" to the c:\cygwin64\home\vineet.kumar\java folder.
Double-click the Cygwin icon on your desktop and extract the JRE:
$ cd /home/vineet.kumar/java
$ gunzip jre-7u80-windows-x64.tar.gz
$ tar -xvf jre-7u80-windows-x64.tar
Step 3: Download and Install Flume from the Flume download link.
Copy the downloaded file "apache-flume-1.5.2-bin.tar.gz" to the c:\cygwin64\home\vineet.kumar\ folder:
$ gunzip apache-flume-1.5.2-bin.tar.gz
$ tar -xvf apache-flume-1.5.2-bin.tar
Verify the directory structure:
vineet.kumar@WindowsPC ~/apache-flume-1.5.2-bin
$ ls -ltr
total 197
-rw-r--r-- 1 vineet.kumar Domain Users 1779 Nov 12 14:41 README
-rw-r--r-- 1 vineet.kumar Domain Users 22517 Nov 12 14:41 LICENSE
-rw-r--r-- 1 vineet.kumar Domain Users 6172 Nov 12 14:41 DEVNOTES
-rw-r--r-- 1 vineet.kumar Domain Users 1586 Nov 12 15:02 RELEASE-NOTES
-rw-r--r-- 1 vineet.kumar Domain Users 62228 Nov 12 15:02 CHANGELOG
-rw-r--r-- 1 vineet.kumar Domain Users 249 Nov 12 15:13 NOTICE
drwxr-xr-x+ 1 vineet.kumar Domain Users 0 Nov 12 15:49 docs
drwxr-xr-x+ 1 vineet.kumar Domain Users 0 Apr 27 12:13 tools
drwxr-xr-x+ 1 vineet.kumar Domain Users 0 Apr 27 12:13 lib
drwxr-xr-x+ 1 vineet.kumar Domain Users 0 Apr 27 12:32 bin
drwxr-xr-x+ 1 vineet.kumar Domain Users 0 Apr 27 13:53 logs
drwxr-xr-x+ 1 vineet.kumar Domain Users 0 May 1 15:56 conf
Now, create a folder on the Windows machine for input files.
$ mkdir logs_input
Windows Explorer location: C:\cygwin64\home\vineet.kumar\apache-flume-1.5.2-bin\logs_input\
Open .profile or .bash_profile and add the following environment variables:
JAVA_HOME=/home/vineet.kumar/jre1.7.0_80; export JAVA_HOME
FLUME_HOME=/home/vineet.kumar/apache-flume-1.5.2-bin; export FLUME_HOME
PATH=$JAVA_HOME/bin:$JAVA_HOME/lib:$PATH; export PATH
Go to the FLUME_HOME/bin folder (/home/vineet.kumar/apache-flume-1.5.2-bin/bin), open the flume-ng file, and make the following change.
From :
$EXEC $JAVA_HOME/bin/java $JAVA_OPTS $FLUME_JAVA_OPTS "${arr_java_props[@]}" -cp "$FLUME_CLASSPATH" \
-Djava.library.path=$FLUME_JAVA_LIBRARY_PATH "$FLUME_APPLICATION_CLASS" $*
To:
$EXEC $JAVA_HOME/bin/java $JAVA_OPTS $FLUME_JAVA_OPTS "${arr_java_props[@]}" -cp "`cygpath -wp $FLUME_CLASSPATH`" \
-Djava.library.path=$FLUME_JAVA_LIBRARY_PATH "$FLUME_APPLICATION_CLASS" $*
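This change is needed because the java.exe that Cygwin launches is a native Windows program: it expects a ';'-separated Windows-style classpath, not the ':'-separated Unix-style one that flume-ng builds, and `cygpath -wp` performs that conversion. Here is a rough, hypothetical sketch of the idea (not the real cygpath, which also handles mount points, /cygdrive paths, and many edge cases):

```python
# Rough illustration of what `cygpath -wp` does for a Cygwin install rooted
# at C:\cygwin64 -- NOT the real tool, which handles many more cases.
def cygpath_wp(unix_path_list, root=r"C:\cygwin64"):
    """Convert a ':'-separated Unix path list to a ';'-separated Windows one."""
    windows_parts = []
    for part in unix_path_list.split(":"):
        # Map the Cygwin virtual root onto the Windows install directory
        windows_parts.append(root + part.replace("/", "\\"))
    return ";".join(windows_parts)

flume_home = "/home/vineet.kumar/apache-flume-1.5.2-bin"
print(cygpath_wp(flume_home + "/lib:" + flume_home + "/conf"))
# -> C:\cygwin64\home\vineet.kumar\apache-flume-1.5.2-bin\lib;C:\cygwin64\home\vineet.kumar\apache-flume-1.5.2-bin\conf
```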
If you are not familiar with the Flume configuration settings used below, please read the Flume User Guide.
There are some other ways you can upload files to HDFS:
- HUE web page
- WebHDFS REST API
- NFS mount on a Linux box, then run the hdfs dfs -put command
- FTP files to a Linux machine, then run the hdfs dfs -put command
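For the WebHDFS option, the REST API exposes HDFS paths under http://&lt;namenode&gt;:&lt;port&gt;/webhdfs/v1/&lt;path&gt;, and a file create is a two-step operation: a PUT to the NameNode returns a 307 redirect to a DataNode, which then accepts the data. A minimal sketch of building the first-step URL (the hostname and user are placeholders, and 50070 assumes the Hadoop 2.x default NameNode HTTP port):

```python
def webhdfs_create_url(host, hdfs_path, user, port=50070):
    """Build step 1 of a WebHDFS CREATE: the NameNode URL that responds
    with a 307 redirect to the DataNode that will accept the file data."""
    return "http://%s:%d/webhdfs/v1%s?op=CREATE&user.name=%s" % (
        host, port, hdfs_path, user)

# e.g. curl -i -X PUT "<url>", then PUT the file body to the redirect target
print(webhdfs_create_url("namenode.example.com",
                         "/user/cloudera/in/test.txt", "cloudera"))
# -> http://namenode.example.com:50070/webhdfs/v1/user/cloudera/in/test.txt?op=CREATE&user.name=cloudera
```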
FLUME Architecture for this Presentation.
Step 4: Configure the agent configuration file on the Windows PC:
For directory monitoring we use the spooldir source type, and for this source an Avro sink is used. The Avro sink on the Windows machine points to the Avro source on the Linux server. Here are the contents of the configuration file on the Windows machine.
Configuration file name: hdfs_client.conf, placed under the FLUME_HOME/conf directory (C:\cygwin64\home\vineet.kumar\apache-flume-1.5.2-bin\conf).
Agent name on Windows: a1
a1.channels = c1
a1.sources = r1
a1.sinks = k1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = logs_input
a1.sources.r1.fileHeader = true
a1.sources.r1.fileHeaderKey = file
a1.sources.r1.basenameHeader = true
a1.sources.r1.basenameHeaderKey = basename
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type=timestamp
# Describe the sink
a1.sinks.k1.type = avro
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname=192.168.159.128
a1.sinks.k1.port=34343
The timestamp interceptor (i1) will be used to create folders on HDFS based on year and month.
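To see why the interceptor matters: the HDFS sink on the Linux side expands escapes such as %Y and %m in hdfs.path using the event's "timestamp" header, which is exactly what the timestamp interceptor adds. A simplified sketch of that expansion (not Flume's actual code):

```python
import time

def expand_hdfs_path(path_template, event_headers):
    """Expand strftime-style escapes (%Y, %m, ...) in an hdfs.path template
    using the event's 'timestamp' header (epoch milliseconds)."""
    ts_seconds = int(event_headers["timestamp"]) // 1000
    return time.strftime(path_template, time.gmtime(ts_seconds))

headers = {"timestamp": "1430666773025"}  # added by interceptor i1 (May 2015)
print(expand_hdfs_path("/user/cloudera/in/%Y/%m", headers))
# -> /user/cloudera/in/2015/05
```

This is why the test file later lands under /user/cloudera/in/2015/05 without the path ever being hard-coded.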
Step 5: Configure the agent on the Linux server (a node in the HDFS cluster):
Configuration file name on the Linux server: collector.conf, location: /etc/flume-ng/conf
Agent name: collector
collector.sources=av1
collector.sources.av1.type=avro
collector.sources.av1.bind=0.0.0.0
collector.sources.av1.port=34343
collector.sources.av1.channels=ch1
collector.channels=ch1
collector.channels.ch1.type=memory
collector.channels.ch1.capacity=10000
collector.channels.ch1.transactionCapacity=1000
collector.sinks=k1
collector.sinks.k1.type=hdfs
collector.sinks.k1.channel=ch1
collector.sinks.k1.hdfs.path=/user/cloudera/in/%Y/%m
collector.sinks.k1.hdfs.fileType = DataStream
collector.sinks.k1.hdfs.writeFormat = Text
collector.sinks.k1.hdfs.rollSize = 104217728
collector.sinks.k1.hdfs.rollCount = 0
collector.sinks.k1.hdfs.rollInterval = 900
collector.sinks.k1.hdfs.batchSize = 1000
collector.sinks.k1.hdfs.filePrefix = %{basename}
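Putting the roll settings together: the sink closes a file (dropping the .tmp suffix) once it holds roughly 100 MB or has been open for 15 minutes, whichever comes first, and rollCount = 0 disables event-count-based rolling. A simplified sketch of that decision, not Flume's actual implementation:

```python
def should_roll(bytes_written, seconds_open, events_written,
                roll_size=104217728, roll_interval=900, roll_count=0):
    """Return True when the HDFS sink should close the current file.
    A setting of 0 disables that particular trigger."""
    if roll_size and bytes_written >= roll_size:
        return True          # ~100 MB written
    if roll_interval and seconds_open >= roll_interval:
        return True          # open for 15 minutes
    if roll_count and events_written >= roll_count:
        return True          # disabled here (roll_count = 0)
    return False

print(should_roll(105_000_000, 60, 10))  # size limit hit -> True
print(should_roll(1_000, 901, 10))       # time limit hit -> True
print(should_roll(1_000, 60, 10))        # keep writing   -> False
```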
Step 6: Start the agents on the Windows and Linux servers.
Agent on Windows:
$FLUME_HOME/bin/flume-ng agent --conf conf --conf-file $FLUME_HOME/conf/hdfs_client.conf --name a1 -Dflume.root.logger=INFO,console
You will see output like the following after agent startup:
2015-05-03 11:18:53,580 (pool-4-thread-1) [INFO - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:254)] Spooling Directory Source runner has shutdown.
2015-05-03 11:18:54,082 (pool-4-thread-1) [INFO - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:254)] Spooling Directory Source runner has shutdown.
Agent on Linux:
flume-ng agent --conf conf --conf-file /etc/flume-ng/conf/collector.conf --name collector -Dflume.root.logger=INFO,console
You will see output like the following after agent startup:
15/05/03 08:23:04 INFO source.AvroSource: Avro source av1 started.
15/05/03 08:23:07 INFO ipc.NettyServer: [id: 0xa3c0311e, /192.168.159.1:61201 => /192.168.159.128:34343] OPEN
15/05/03 08:23:07 INFO ipc.NettyServer: [id: 0xa3c0311e, /192.168.159.1:61201 => /192.168.159.128:34343] BOUND: /192.168.159.128:34343
15/05/03 08:23:07 INFO ipc.NettyServer: [id: 0xa3c0311e, /192.168.159.1:61201 => /192.168.159.128:34343] CONNECTED: /192.168.159.1:61201
Time to place some files in the Windows folder. I created a small test.txt file and placed it in the Windows folder (logs_input):
c:\>type c:\test.txt
1,One
2,Two
3,Three
4,Four
5,Five
6,Six
7,Seven
8,Eight
9,Nine
10,Ten
c:\>copy c:\test.txt c:\cygwin64\home\vineet.kumar\apache-flume-1.5.2-bin\logs_input
1 file copied
As soon as I placed the file in the logs_input folder, its extension changed to .COMPLETED, which means the file was picked up by the source agent.
c:\>dir c:\cygwin64\home\vineet.kumar\apache-flume-1.5.2-bin\logs_input
test.txt.COMPLETED
Now, verify the file and its contents on HDFS:
$ hdfs dfs -ls /user/cloudera/in/2015/
-rw-r--r-- 1 cloudera cloudera 80 2015-05-03 08:29 /user/cloudera/in/2015/05/test.txt.1430666773025
[cloudera@quickstart conf]$ hdfs dfs -cat /user/cloudera/in/2015/05/test.txt.1430666773025
1,One
2,Two
3,Three
4,Four
5,Five
6,Six
7,Seven
8,Eight
9,Nine
10,Ten
I have tested this with multiple 200 MB files. The way I have set up the configuration files, the sink creates a new file after 100 MB, and after 15 minutes it closes the file (removes the .tmp extension).
I have skipped the explanation of the various configuration file settings; I would suggest reading the Flume User Guide for details.
