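I have been tinkering with Hadoop in my spare time lately and am recording the process here. I followed this tutorial: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/.
Operating system: Ubuntu 12.04 x64
Software dependencies:
- jdk6;
Hadoop officially recommends sun-jdk rather than open-jdk. Because of licensing issues, Ubuntu cannot install sun-jdk directly through apt-get, so this step takes a bit more work;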
- ssh
- rsync
Install the dependencies:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
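Before moving on, it can help to confirm that both commands actually ended up on the PATH. The following is a minimal sketch, not part of the original tutorial; the function name missing_deps is made up for illustration.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# (missing_deps is a hypothetical helper name, not something Hadoop provides.)
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Report which of the prerequisites from this guide are still missing.
missing=$(missing_deps ssh rsync)
if [ -n "$missing" ]; then
    echo "Missing commands:" $missing
else
    echo "ssh and rsync are both installed"
fi
```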
Install sun-jdk-6:
- Download the matching version from the official site, http://www.oracle.com/technetwork/java/javase/downloads/jdk6u37-downloads-1859587.html: x86 for 32-bit, x64 for 64-bit. I downloaded jdk-6u37-linux-x64.bin;
- sudo mkdir /usr/java
- cd /usr/java
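- Copy the downloaded JDK into the /usr/java directory;
- Install (make the installer executable, then run it):
sudo chmod +x jdk-6u37-linux-x64.bin
sudo ./jdk-6u37-linux-x64.bin
When the installation finishes, a new jdk1.6.0_37 directory appears under /usr/java;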
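- Configure the environment variables by editing /etc/bash.bashrc (Ubuntu's system-wide bashrc) and adding the following:
JAVA_HOME=/usr/java/jdk1.6.0_37
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$CLASSPATH
export JAVA_HOME PATH CLASSPATH
- Apply the environment variables:
source /etc/bash.bashrc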
- Ubuntu's default JDK is open-jdk, so it has to be replaced:
sudo update-alternatives --install /usr/bin/java java /usr/java/jdk1.6.0_37/bin/java 999
sudo update-alternatives --install /usr/bin/javac javac /usr/java/jdk1.6.0_37/bin/javac 999
sudo update-alternatives --install /usr/bin/javadoc javadoc /usr/java/jdk1.6.0_37/bin/javadoc 999
sudo update-alternatives --config java # you will be prompted; choose the version you just installed
sudo update-alternatives --config javac # same as above
ls -lh /etc/alternatives/java* # check the result
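The JDK setup above can be double-checked with a small script. This is only a sketch under the assumption that readlink -f is available (it is on Ubuntu); the helper names is_jdk_home and resolves_under are invented for this example.

```shell
#!/bin/sh
# Return 0 if directory $1 looks like a JDK home (executable bin/java and bin/javac).
is_jdk_home() {
    [ -x "$1/bin/java" ] && [ -x "$1/bin/javac" ]
}

# Return 0 if the link or file $1 resolves to a path underneath directory $2.
resolves_under() {
    target=$(readlink -f "$1") || return 1
    case "$target" in
        "$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the JDK location used in this guide (or JAVA_HOME if it is already set).
jdk="${JAVA_HOME:-/usr/java/jdk1.6.0_37}"
if is_jdk_home "$jdk"; then
    echo "JDK found at $jdk"
else
    echo "No JDK at $jdk (check JAVA_HOME)"
fi
```

To confirm that the alternatives switch took effect, one could then run resolves_under /usr/bin/java /usr/java/jdk1.6.0_37 and inspect its exit status.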
That completes the installation of Hadoop's dependencies. Next comes Hadoop itself.
Add a system user for Hadoop:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Fill in the password and other details when prompted.
$ su - hduser
Switch to the hduser user and continue with the steps below.
Install Hadoop:
Configure SSH:
- First switch to the hduser user and make sure hduser can ssh to localhost:
su hduser
ssh localhost
- If that fails, check whether you have a ~/.ssh directory and a ~/.ssh/id_rsa.pub file. If not, run:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- If they already exist:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Once ssh localhost works, this step is done;
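Running the cat ... >> command more than once leaves duplicate lines in authorized_keys. A minimal sketch of an idempotent alternative follows; the function name authorize_key is hypothetical, and the chmod 600 reflects the permissions sshd usually expects on this file rather than anything stated in the original notes.

```shell
#!/bin/sh
# Append the public key in file $1 to the authorized_keys file $2,
# unless that exact line is already present.
# (authorize_key is a made-up helper name for this example.)
authorize_key() {
    pub=$1
    auth=$2
    [ -r "$pub" ] || { echo "no public key at $pub" >&2; return 1; }
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    chmod 600 "$auth"
    # -F: fixed strings, -x: whole-line match, -f: read the pattern from the key file
    if ! grep -qxF -f "$pub" "$auth"; then
        cat "$pub" >> "$auth"
    fi
}

# Usage as hduser (commented out so that sourcing this file changes nothing):
# authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```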
Configure environment variables:
Edit /home/hduser/.bashrc and append the following at the end:
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
After adding these lines, run the following command to apply the configuration:
$ source /home/hduser/.bashrc
In conf/hadoop-env.sh under the Hadoop directory, set:
export JAVA_HOME=/usr/java/jdk1.6.0_37
Create a temporary directory for Hadoop:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 755 /app/hadoop/tmp
Add the following properties to conf/core-site.xml under the Hadoop directory, between the <configuration></configuration> tags:
<configuration>
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.
</description>
</property>
</configuration>
As in the previous step, edit conf/mapred-site.xml:
<configuration>
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
Edit conf/hdfs-site.xml:
<configuration>
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
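To confirm that the three files contain the values intended, a property can be read back from the shell. The sketch below is a rough line-based lookup, not a real XML parser: it assumes each <value> sits on the same line as, or on a line after, its <name>, as in the snippets above. The function name get_prop is invented for illustration.

```shell
#!/bin/sh
# Print the <value> that follows <name>$2</name> in the Hadoop config file $1.
# Line-based lookup only; it relies on the layout used in this guide.
get_prop() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1 }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}

# Usage (commented out because the path depends on your installation):
# get_prop "$HADOOP_HOME/conf/core-site.xml" fs.default.name
```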
Running it:
Format HDFS:
$ hadoop namenode -format
Start Hadoop:
$ start-all.sh # the command to stop Hadoop is stop-all.sh
Use the jps command to list the running Hadoop processes:
hduser@alex-sina:/usr/local/hadoop$ jps
15751 NameNode
16775 Jps
16046 DataNode
16669 TaskTracker
16399 JobTracker
16313 SecondaryNameNode
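Checking the jps listing by eye is easy to get wrong, so the comparison can be scripted. This is a minimal sketch that reads jps output on standard input and prints any of the five daemons shown above that are absent; the function name missing_daemons is hypothetical.

```shell
#!/bin/sh
# Read `jps` output on stdin and print each expected Hadoop daemon that is not listed.
# The second column is compared exactly, so NameNode does not match SecondaryNameNode.
missing_daemons() {
    out=$(cat)
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        printf '%s\n' "$out" |
            awk -v d="$d" '$2 == d { ok = 1 } END { exit !ok }' || echo "$d"
    done
}

# Usage (commented out because it needs a JDK on PATH):
# jps | missing_daemons
```

An empty result means all five daemons are up; any names printed point at the log file to read first.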
Hadoop's web management interface:
OK, that is the end of the installation notes. I will write up more of my Hadoop learning later.