Hadoop is made for Linux. On Windows it can be run through Cygwin. You need to install Cygwin with ssh and sshd included. The Cygwin installer is a little weird: you have to select the OpenSSH package explicitly while installing. If you already have Cygwin installed, you can check whether it includes OpenSSH by typing the command ssh.
Then you will need the JDK from Sun. Put the Java folder directly on your C drive (or any other drive you like). I am specific about this because if the path to the JDK folder contains any folder whose name has a space in it, you need to wrap the path in double quotes or escape each space with a backslash.
Get a stable release of Hadoop and, inside conf/hadoop-env.sh, uncomment the JAVA_HOME line and set it to the path of your JDK:
export JAVA_HOME=[path to jdk]
For example, if everything is installed in the default location:
export JAVA_HOME=/cygdrive/c/Program\ Files/Java/jdk[version]
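The two ways of handling a space in the path can be compared with a small sketch. The folder name jdk-example is a made-up placeholder, not a real JDK version; substitute your own folder.

```shell
# Two equivalent ways to write a path that contains a space.
# "jdk-example" is a hypothetical folder name used only for illustration.
escaped=/cygdrive/c/Program\ Files/Java/jdk-example    # backslash before the space
quoted="/cygdrive/c/Program Files/Java/jdk-example"    # whole path in double quotes

# Both assignments store exactly the same string.
if [ "$escaped" = "$quoted" ]; then
  echo "both forms give: $quoted"
fi
```

Whichever form you pick, keep the double quotes around the variable whenever you use it later (as in "$quoted"), otherwise the shell splits the value at the space again.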
The conf/hadoop-site.xml file is basically a properties file that lets you configure all sorts of HDFS and MapReduce parameters on a per-machine basis. You can just copy the XML below into your conf/hadoop-site.xml file.
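For a single machine, a minimal hadoop-site.xml might look like the sketch below, written out from the shell. The property names fs.default.name, mapred.job.tracker and dfs.replication are the ones used by Hadoop releases that read hadoop-site.xml; the host names, the ports 9000 and 9001, and the replication factor of 1 are assumptions for a one-node test, so adjust them for your own cluster. CONF_DIR is a helper variable for this sketch only.

```shell
# Where to write the file. CONF_DIR is a helper variable for this sketch;
# by default it points at ./conf relative to the current directory.
CONF_DIR="${CONF_DIR:-./conf}"
mkdir -p "$CONF_DIR"

# NOTE: this overwrites any existing hadoop-site.xml in CONF_DIR.
# Host names, ports and replication factor below are assumed values.
cat > "$CONF_DIR/hadoop-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```

On a real cluster you would replace localhost with the name of the master machine in the first two properties.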
After these changes you need to run the dos2unix command on hadoop-env.sh, otherwise you get an error saying that it doesn't recognise the sequence '\r', or something of that sort. This happens because the file has been saved with Windows line endings. If you get an error of this kind at any time, just do
$> dos2unix conf/hadoop-env.sh
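If you want to check a file before running Hadoop, the sketch below detects Windows line endings and strips them. It uses tr instead of dos2unix so that it works even where dos2unix is not installed; has_crlf and strip_cr are helper names made up for this example, and the sample file stands in for hadoop-env.sh.

```shell
# Build a small sample file with Windows (CRLF) line endings.
sample="$(mktemp)"
printf 'export JAVA_HOME=/cygdrive/c/java\r\n' > "$sample"

# has_crlf FILE: succeeds if FILE contains at least one carriage return.
has_crlf() {
  grep -q "$(printf '\r')" "$1"
}

# strip_cr FILE: rewrite FILE without carriage returns (a stand-in for dos2unix).
# The content is copied back with cat so the file keeps its permissions.
strip_cr() {
  tmp="$(mktemp)"
  tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
  rm -f "$tmp"
}

if has_crlf "$sample"; then
  strip_cr "$sample"
  echo "converted $sample to Unix line endings"
fi
```

To use it on the real file, call has_crlf conf/hadoop-env.sh and, if it succeeds, strip_cr conf/hadoop-env.sh.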
Now you need to configure SSH in Cygwin.
Hadoop uses SSH to allow the master computer(s) in a cluster to start and stop processes on the slave computers. So you will need to generate a public-private key pair for your user on each cluster machine and exchange each machine's public key with the other machines in the cluster.
To generate a key pair, open Cygwin and issue the following commands ($> is the command prompt):
$> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
The first time you ssh to another machine, answer yes when asked whether to add its key permanently. To authorize your own public key for passwordless login, append it to the authorized keys file:
$> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now, you should be able to SSH into your local machine using the following command:
$> ssh localhost
To quit the SSH session and go back to your regular terminal, use:
$> exit
Now that you have public and private key pairs on each machine in your cluster, you need to share your public keys around to permit passwordless login from one machine to the other. Once a machine has a public key, it can safely authenticate a request from a remote machine that is encrypted using the private key that matches that public key.
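Sharing keys comes down to appending each machine's public key to the authorized_keys file of the others. A minimal sketch of that step follows, assuming the other machine's public key file has already been copied over (with scp, a USB stick, or any other means). add_key is a helper name made up for this example, and the key text is a placeholder, not a real key.

```shell
# add_key KEYFILE AUTHFILE: append the public key in KEYFILE to AUTHFILE
# unless that exact line is already there.
add_key() {
  keyfile="$1"
  authfile="$2"
  mkdir -p "$(dirname "$authfile")"
  touch "$authfile"
  # -x: match whole lines, -F: fixed strings, -f: read the pattern from KEYFILE
  if ! grep -qxFf "$keyfile" "$authfile"; then
    cat "$keyfile" >> "$authfile"
  fi
  # sshd usually rejects an authorized_keys file that other users can write to.
  chmod 600 "$authfile"
}

# Demo in a scratch directory with a placeholder key (not a real key).
demo="$(mktemp -d)"
printf 'ssh-dss AAAAB3PLACEHOLDER user@slave\n' > "$demo/slave_id_dsa.pub"
add_key "$demo/slave_id_dsa.pub" "$demo/.ssh/authorized_keys"
add_key "$demo/slave_id_dsa.pub" "$demo/.ssh/authorized_keys"   # second call adds nothing
```

On a real machine the call would be add_key [copied key file] ~/.ssh/authorized_keys, run once for every other machine in the cluster.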
If you have problems connecting with the ssh localhost command, the reason may be that a firewall is blocking port 22, which ssh uses.
Configure your hosts file: this is not strictly required if the systems that run Hadoop are static.
Open your Windows hosts file, located at c:\windows\system32\drivers\etc\hosts, and add the following lines (in a hosts file the IP address comes first, then the name):
[master IP address] master
[slave IP address] slave
The file already contains a line that assigns the IP address 127.0.0.1 to localhost.
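The same edit can be scripted from Cygwin. The sketch below works on a scratch copy so it is safe to try; add_host is a helper name made up for this example, and the addresses 192.0.2.10 and 192.0.2.11 are documentation placeholders, not real cluster addresses.

```shell
# add_host HOSTS_FILE IP NAME: append "IP NAME" unless NAME is already mapped.
add_host() {
  hosts_file="$1"; ip="$2"; name="$3"
  # Look for NAME as a whole field (after the address) on a non-comment line.
  if ! awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts_file"; then
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
  fi
}

# Demo on a scratch file that starts out like a fresh hosts file.
hosts="$(mktemp)"
printf '127.0.0.1 localhost\n' > "$hosts"
add_host "$hosts" 192.0.2.10 master
add_host "$hosts" 192.0.2.11 slave
add_host "$hosts" 192.0.2.10 master    # already present, nothing is added
cat "$hosts"
```

In Cygwin the real file is normally reachable as /cygdrive/c/windows/system32/drivers/etc/hosts, and changing it usually requires administrator rights.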
You are done! To check it, just create a Java project in Eclipse (Eclipse because you need to export the project as a jar file with all classes and dependencies). The Hadoop release contains a program named WordCount.java; use this as your example project. Add the Hadoop jar file to the Eclipse project:
In Eclipse, right-click on your project, go to Build Path, then Add External Archives. Browse to the hadoop folder and select the file hadoop-[version]-core.jar.
Now export the project as a jar from Eclipse (let us call the project jar WordCount.jar).
To start your cluster, make sure you are in Cygwin on the master and have changed to your Hadoop installation directory. To fully start your cluster, you need to start DFS first and then MapReduce.
Issue the following commands:
$> bin/start-dfs.sh
$> bin/start-mapred.sh
Running the job
Launch Cygwin and go to the hadoop[version] directory.
Put WordCount.jar on your C drive. Then create a folder named input, and inside it create a long text file named input.txt.
Copy input files into HDFS
Make a directory in the Hadoop Distributed File System (dfs) for your input files.
To copy input data files into dfs from your home directory, do the following:
bin/hadoop dfs -copyFromLocal ../input .
Here . means the current folder; you may use some other location.
Finally, to run the job, execute the following command:
./bin/hadoop jar ../WordCount.jar WordCount ../input ../output
If you have not started Hadoop, or a firewall is blocking the port that Hadoop uses, you get an error such as "already tried [num] time(s)" to connect, or "not able to connect". In that case, make sure Hadoop is started and run the job again.
If the problem persists, it means the port is not open.
This will place the result files in a directory called "output" in the dfs. You can then copy these files back to your local directory by executing the following (the same as what you did for the input files):
bin/hadoop dfs -copyToLocal output output
Note that one output file is produced for each reduce task you run. The WordCount example uses the system-configured limit on the number of reduce tasks, so do not be surprised to see 10-20 output files (the exact number depends on the number of cluster nodes running and their configuration). You can control this limit programmatically via the setNumReduceTasks() method of the JobConf class in the Hadoop API. Refer to the MapReduce tutorial for more details on running MapReduce jobs.
When you are finished with the output files, you should delete the output directory. Hadoop will not automatically do this for you, and it will throw an error if you run it while there is an old output directory. To do this, execute:
bin/hadoop dfs -rmr output
Retrieve the results
The results have been written to a new folder called output on your C drive. There should be one file, named part-00000, which lists all the words in the input file along with their occurrence counts. Note that before running Hadoop again you will need to delete the entire output folder, since Hadoop will not do this for you.
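Once the results are on the local disk, the most frequent words can be pulled out with standard tools. The sketch below assumes the usual WordCount output layout of one word, a tab, and its count per line; the two sample part files are made-up data standing in for real job output, and top_words is a helper name invented for this example.

```shell
# Sample data standing in for real job output (word<TAB>count on each line).
out="$(mktemp -d)"
printf 'hadoop\t3\nthe\t7\n' > "$out/part-00000"
printf 'cygwin\t2\nssh\t5\n' > "$out/part-00001"

# top_words DIR N: print the N most frequent words across all part files in DIR.
top_words() {
  # Sort on the second tab-separated field, numerically, largest first.
  cat "$1"/part-* | sort -t "$(printf '\t')" -k2,2nr | head -n "$2"
}

top_words "$out" 3
```

With the sample data this prints the, ssh and hadoop with their counts. Against a real run you would call top_words output 10 from the folder that contains the copied output directory.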
To stop MapReduce, issue the following command on the master:
$> bin/stop-mapred.sh
To stop DFS, issue the following command on the master:
$> bin/stop-dfs.sh
To stop all services related to Hadoop, do
$> bin/stop-all.sh
Problems I faced
The first is related to files that have been converted to DOS format, so you need to use the dos2unix command. You get an error like: '\r': command not found.
The second problem is: can't connect to localhost:[port_num], already tried [num] time(s).
After running Hadoop successfully the first time, I got this problem again the next day. It again consumed a lot of time, because ssh localhost was working fine. I also tried putting the exact IP address in conf/hadoop-site.xml so that the loopback address was not used, but that did not solve the problem. I don't know exactly how I got past it, because I was just playing around when it started working again. In between, I had reformatted the dfs:
$ bin/hadoop namenode -format