Hadoop 2.7.3 Fully Distributed Cluster (on Virtual Machines)

jdk1.8 + Hadoop2.7.3 + Spark2.2.0 + Scala2.11.8

The official tar.gz packages from Hadoop 2.7 onward are 64-bit builds.

1 Before cloning

1.1 Install VMware and CentOS 7

Choose host-only as the network connection.
For CentOS 7, choose the Infrastructure Server installation profile.

1.2 Set the hostname and network configuration (must be redone on each machine after cloning)

On the host machine, run ifconfig (/sbin/ifconfig) and look at the vmnet1 virtual NIC (the one backing VMware's host-only mode) to find the gateway address (inet addr); here it is 192.168.176.1.

The plan is one master:
192.168.176.100 master
and two slaves:
192.168.176.101 slave1
192.168.176.102 slave2

hostnamectl set-hostname master

systemctl stop firewalld
systemctl disable firewalld

vi /etc/sysconfig/network-scripts/ifcfg-ens33  # the interface name ("ens33") may differ
TYPE=Ethernet
BOOTPROTO=static   # static address, no DHCP
ONBOOT=yes         # bring the interface up at boot
IPADDR=192.168.176.100
NETMASK=255.255.255.0
GATEWAY=192.168.176.1
PEERDNS=no

vi /etc/sysconfig/network

NETWORKING=yes
GATEWAY=192.168.176.1

vi /etc/resolv.conf
nameserver 192.168.1.1

service network restart

The guest can now ping the host; use sftp to upload the installation files and ssh to work on master, slave1, and slave2.

1.3 Edit hosts

vi /etc/hosts
192.168.176.100 master
192.168.176.101 slave1
192.168.176.102 slave2

1.4 Unpack the JDK, Hadoop, Spark, Scala, ...

cd /usr/local
tar -zxvf ...  # archive name depends on the package

Edit the profile:

vim /etc/profile

JAVA_HOME=/usr/java/jdk1.8.0_144
JRE_HOME=$JAVA_HOME/jre
DERBY_HOME=$JAVA_HOME/db
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME DERBY_HOME PATH CLASSPATH

export HADOOP_HOME=/usr/local/hadoop-2.7.3
export SCALA_HOME=/usr/local/scala-2.11.8
export SPARK_HOME=/usr/local/spark-2.2.0-bin-hadoop2.7
export HIVE_HOME=/usr/local/apache-hive-2.3.0-bin
export HBASE_HOME=/usr/local/hbase-2.0.0-alpha-1
export ZOOKEEPER_HOME=/usr/local/zookeeper-3.4.10

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin:$HIVE_HOME/bin:$HBASE_HOME/bin:$ZOOKEEPER_HOME/bin
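After editing /etc/profile, reload it in the current shell and spot-check that the tools resolve (a quick sanity check; assumes the packages have been unpacked to the paths above):

```shell
source /etc/profile
java -version      # should report 1.8.0_144
hadoop version     # should report 2.7.3
scala -version     # should report 2.11.8
```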

1.5 Configure Hadoop

cd /usr/local/hadoop-2.7.3
mkdir -p tmp hdfs/data hdfs/name

Edit hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and slaves in turn.
The default configuration can be consulted in:

core-default.xml
hdfs-default.xml
mapred-default.xml
yarn-default.xml

cd /usr/local/hadoop-2.7.3/etc/hadoop

vi hadoop-env.sh
# set JAVA_HOME=/usr/java/jdk1.8.0_144


vi core-site.xml
<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://master:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/usr/local/hadoop-2.7.3/tmp</value>
        </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
</configuration>


vi hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop-2.7.3/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop-2.7.3/hdfs/data</value>
    </property>
</configuration>


vi yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>


# cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

vi slaves
slave1
slave2

1.6 Create a non-root user hadoop

useradd hadoop
passwd hadoop

# give the hadoop user ownership of the installation (run in /usr/local)
chown -R hadoop:hadoop ./hadoop-2.7.3

1.7 Switch the timezone from CST to UTC:

cp -af /usr/share/zoneinfo/UTC /etc/localtime
date
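On CentOS 7 the same switch can also be made with the systemd tool timedatectl, which avoids copying zoneinfo files by hand:

```shell
timedatectl set-timezone UTC
timedatectl status   # confirm the time zone line reads UTC
```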

1.8 Spark configuration

cd /usr/local/spark-2.2.0-bin-hadoop2.7/conf
vi slaves
slave1
slave2

# cp spark-env.sh.template spark-env.sh first if the file does not yet exist
vi spark-env.sh
# spark setting
export JAVA_HOME=/usr/java/jdk1.8.0_144
export SCALA_HOME=/usr/local/scala-2.11.8
export SPARK_MASTER_IP=master
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=4
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop

2 After cloning

Cloning yields slave1 and slave2; change the hostname and network configuration on each of them.

2.1 Passwordless ssh between master, slave1, and slave2 for the root (or hadoop) user

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

Under root (or hadoop), copy each machine's id_rsa.pub into the authorized_keys of the other two machines, then test the connections in every direction: ssh slave1, ssh slave2, ssh master.
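The per-node steps above can be rehearsed locally; this sketch generates a key pair and authorizes the public key in a scratch directory (on the real nodes the directory is ~/.ssh, and the public key is additionally appended on the other two machines):

```shell
# Scratch directory standing in for ~/.ssh on a node
SSH_DIR=$(mktemp -d)

# Generate an RSA key pair non-interactively (empty passphrase)
ssh-keygen -t rsa -N '' -f "$SSH_DIR/id_rsa" -q

# Authorize the public key; sshd requires restrictive permissions
cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
chmod 600 "$SSH_DIR/authorized_keys"

ls "$SSH_DIR"
```

In practice, ssh-copy-id root@slave1 performs the append-and-chmod step on the remote side in one command.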

3 Miscellaneous

3.1 The three VMware networking modes with a Linux or Windows host

http://linuxme.blog.51cto.com/1850814/389691

3.2 Common commands

jps
start-dfs.sh
start-yarn.sh
start-all.sh
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode

netstat -ntlp
hadoop dfsadmin -report | more
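One step is easy to miss on a fresh cluster: HDFS must be formatted once before the very first start. A typical first-start sequence on master (a sketch; run as the user that owns the installation):

```shell
hdfs namenode -format   # once only -- reformatting later breaks existing DataNodes (see section 5)
start-dfs.sh
start-yarn.sh
jps                     # master should show NameNode, SecondaryNameNode, ResourceManager
```

jps on slave1/slave2 should show DataNode and NodeManager.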


# Web UI

http://192.168.176.100:50070

3.3 Hadoop FileSystem

http://hadoop.apache.org/docs/r1.0.4/cn/hdfs_shell.html

hadoop fs -cat URI [URI …]
hadoop fs -cp URI [URI …] <dest>
hadoop fs -copyFromLocal <localsrc> URI # like put, except the source must be a local file
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst> # like get, except the destination must be a local file
hadoop fs -du URI [URI …]
hadoop fs -dus <args>
hadoop fs -get <from> <to>
hadoop fs -put
hadoop fs -ls <args>
hadoop fs -lsr <args> # recursive ls
hadoop fs -mkdir <paths> # without -p, parent directories must already exist
hadoop fs -mv URI [URI …] <dest> # move files from source to destination
hadoop fs -rm
hadoop fs -rmr   # recursive rm


Differences between hadoop dfs, hadoop fs, and hdfs dfs:
hadoop fs: the most general; it can operate on any file system.
hadoop dfs and hdfs dfs: limited to HDFS-related operations (including transfers to and from the local FS); the former is deprecated, so the latter is generally used.

4. Hive deployment

Before installing Hive, install MySQL to act as the metastore database. Hive's default metastore is the embedded Derby, but Derby is limited to a single session, hence MySQL. MySQL is deployed on the master node, and the Hive server side is installed on master as well.

Metadata, also called data about data, describes the properties of data and supports functions such as indicating storage locations, tracking history, resource discovery, and record keeping. It amounts to an electronic catalog: by describing the content and characteristics of the data it covers, it aids data retrieval.

4.1 Hive environment variables (see 1.4)

4.2 Hive configuration

http://www.itdecent.cn/p/978a77a1d6a2

Rename the two template files under $HIVE_HOME/conf:

mv hive-default.xml.template hive-site.xml
mv hive-env.sh.template hive-env.sh

vim hive-env.sh
Set HADOOP_HOME in it by removing the leading # from the HADOOP_HOME line.

vim hive-site.xml (widely covered elsewhere)
hive.metastore.schema.verification  # set to false
# create a tmp directory under the hive directory
replace ${system:java.io.tmpdir} with that tmp directory
replace ${system:user.name} with the user name, here root
configure the JDBC connection to MySQL
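For the last step, the usual hive-site.xml settings are the four javax.jdo.option.* connection properties. A sketch, assuming the database name hive and the hive/12345 account created in 4.3 below (the MySQL JDBC driver jar must also be placed in $HIVE_HOME/lib):

```xml
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>12345</value>
</property>
```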

Start the Hive Metastore Server process.
nohup makes the process immune to the hangup signal, so it keeps running after the session ends; its output is written to a nohup.out file in the current directory.
& runs the command in the background.

hive --service metastore &
# recommended: start with nohup so the process survives the end of the session
nohup hive --service metastore &

On first use, Hive needs its schema initialized (*):

schematool -dbType mysql -initSchema

4.3 MySQL installation; the first method is recommended

4.3.1 Linux-Generic

Download MySQL Community Server from the official site, choosing Linux-Generic as the operating system.

https://www.bilibili.com/video/av6147498/?from=search&seid=673467972510968006
http://blog.csdn.net/u013980127/article/details/52261400

This is the method used here.

  1. Install
    Check whether MySQL libraries already exist; remove them if found.
    rpm -qa | grep mysql

官網(wǎng)下載Linux - Generic (glibc 2.12) (x86, 64-bit), Compressed TAR Archive

https://dev.mysql.com/downloads/mysql/

Extracting the archive is all that is needed:

tar -xvf mysql-5.7.19-linux-glibc2.12-x86_64.tar.gz
  2. Check whether the mysql group and user exist; if not, create mysql:mysql
cat /etc/group | grep mysql
cat /etc/passwd | grep mysql
groupadd mysql
useradd -r -g mysql mysql
  3. Edit the resource limits configuration
sudo vim /etc/security/limits.conf

mysql hard nofile 65535
mysql soft nofile 65535

  4. Initialize and start an instance
vim /etc/my.cnf
[mysqld]
port=3306
socket=/tmp/mysql.sock
user=mysql
datadir=...
...

Initialize the data directory, adjusting the paths to your installation:

cd to the top directory of the MySQL installation
bin/mysql_install_db --user=mysql --basedir=/usr/local/mysql/ --datadir=/usr/local/mysql/data/  # deprecated in 5.7; mysqld --initialize is the modern equivalent
  5. Initialize the root password to 12345
    For the first login, use the generated initial password:
cat /root/.mysql_secret

Start the mysql client and enter the password from /root/.mysql_secret:

mysql -uroot -p    # assumes mysql has been added to the PATH

Add mysql to the PATH:

export PATH=$PATH:/usr/local/mysql/bin

Once logged in, change the password:

SET PASSWORD = PASSWORD('12345');
flush privileges;

Use the new password for subsequent logins:

mysql -uroot -p   # password 12345
  6. Next, grant remote access
use mysql;
update user set host = '%' where user = 'root';

Restart the service for this to take effect:

/etc/init.d/mysqld restart
  7. Create a hive user on master with password 12345, used by Hive to connect
mysql>CREATE USER 'hive' IDENTIFIED BY '12345';
mysql>GRANT ALL PRIVILEGES ON *.* TO 'hive'@'master' WITH GRANT OPTION;
mysql>flush privileges;

Connect with:

mysql -h master -uhive -p
  8. Enable start at boot
sudo chkconfig mysql on

4.3.2 Yum Repository

wget http://repo.mysql.com/mysql57-community-release-el7-11.noarch.rpm  
# or download the rpm from https://dev.mysql.com/downloads/repo/yum/

rpm -ivh mysql57-community-release-el7-11.noarch.rpm 

yum install mysql-server

4.4 Spark SQL support for Hive

According to the official docs, it suffices to copy hive-site.xml (from $HIVE_HOME/conf) and core-site.xml (from Hadoop's etc/hadoop) into $SPARK_HOME/conf, and to push the same files to the slave machines with scp.

Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse. You may need to grant write privilege to the user who starts the Spark application.
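The copy-and-distribute step can be sketched as a short script (paths are the install locations used in this guide):

```shell
SPARK_CONF=/usr/local/spark-2.2.0-bin-hadoop2.7/conf
cp /usr/local/apache-hive-2.3.0-bin/conf/hive-site.xml  "$SPARK_CONF/"
cp /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml     "$SPARK_CONF/"
# push the same files to the workers
for host in slave1 slave2; do
    scp "$SPARK_CONF/hive-site.xml" "$SPARK_CONF/core-site.xml" "$host:$SPARK_CONF/"
done
```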

4.5 Hive operations

Four ways to import data into Hive

http://blog.csdn.net/lifuxiangcaohui/article/details/40588929

Partitions: in Hive, each partition of a table corresponds to a directory under the table's directory, and all of a partition's data is stored there. For example, if table wyp has two partition columns dt and city, the partition dt=20131218, city=BJ corresponds to the directory /user/hive/warehouse/wyp/dt=20131218/city=BJ, and all data belonging to that partition lives in that directory.
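In HiveQL the example above corresponds to the following sketch (the table and file names are hypothetical):

```sql
-- Table with two partition columns, dt and city
CREATE TABLE wyp (id INT, name STRING)
PARTITIONED BY (dt STRING, city STRING);

-- Loading into a partition creates and fills the matching directory,
-- here /user/hive/warehouse/wyp/dt=20131218/city=BJ
LOAD DATA LOCAL INPATH '/home/hadoop/wyp.txt'
INTO TABLE wyp PARTITION (dt='20131218', city='BJ');
```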

UDFs (User-Defined Functions) process data with user-written functions. A UDF can be applied directly in a select statement to format query results before output. A custom UDF extends org.apache.hadoop.hive.ql.exec.UDF and implements an evaluate function; evaluate supports overloading.

http://blog.csdn.net/dajuezhao/article/details/5753001

Spark user-defined aggregate functions: org.apache.spark.sql.expressions.UserDefinedAggregateFunction

http://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-user-defined-aggregate-functions

Create a test table named dual in Hive:

create table dual (dummy string);
# exit hive, back to bash
echo 'X' > dual.txt
# back in hive
load data local inpath '/home/hadoop/dual.txt' overwrite into table dual;

Hive regular expressions

http://blog.csdn.net/bitcarmanlee/article/details/51106726

Processing JSON data in Hive

http://www.cnblogs.com/casicyuan/p/4375080.html

5. Common problems

After formatting HDFS multiple times, the DataNode fails to start

http://blog.csdn.net/longzilong216/article/details/20648387

MapReduce jobs hang at "running job"

http://blog.csdn.net/yang398835/article/details/52205487
