Writing a simple wordcount with shark-shell

Category: Articles • 2025-01-02 10:28:58

CM 5.14 ships Spark 1.6 by default; Spark 2.2 has to be installed manually.

Versions 1.6 and 2.2 can coexist.

1. Preparation:

CSD package: http://archive.cloudera.com/spark2/csd/

SPARK2_ON_YARN-2.2.0.cloudera3.jar

Parcel package: http://archive.cloudera.com/spark2/parcels/2.2.0.cloudera3/

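Note:

The cloudera3 here matches the 3 in the name of the jar downloaded above; the two versions must be the same.

el5 corresponds to CentOS 5

el6 corresponds to CentOS 6

el7 corresponds to CentOS 7

Download all three files: the .parcel, the .parcel.sha1, and manifest.json.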
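
The three downloads can be scripted. The following is a minimal sketch, assuming an el6 host and the parcel file name that is renamed later in this article; the function name spark2_parcel_urls is made up for illustration, and the assumption that only the distro suffix differs for el5 and el7 should be checked against the directory listing.

```shell
# Base URL from the article; the parcel name matches the el6 file used later on.
BASE_URL="http://archive.cloudera.com/spark2/parcels/2.2.0.cloudera3"
PARCEL_PREFIX="SPARK2-2.2.0.cloudera3-1.cdh5.13.3.p0.556753"

# Print the three URLs to download for a distro tag (el5, el6 or el7).
# Assumption: only the distro suffix changes between file names.
spark2_parcel_urls() {
    distro="${1:-el6}"
    parcel="${PARCEL_PREFIX}-${distro}.parcel"
    printf '%s\n' \
        "${BASE_URL}/${parcel}" \
        "${BASE_URL}/${parcel}.sha1" \
        "${BASE_URL}/manifest.json"
}

# Usage (needs network access): spark2_parcel_urls el6 | xargs -n 1 wget
```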

2. Installation

Stop the CM server and the agents
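
On a standard CM 5 installation the server and agents run as the cloudera-scm-server and cloudera-scm-agent services. The sketch below assumes those service names and init-script management (systemd hosts would use systemctl instead). The helper cm_service and the DRY_RUN switch are hypothetical; by default the command is only printed.

```shell
# Print (or run, when DRY_RUN=0) a service command for a CM daemon.
# cm_service is a hypothetical helper; service names assume a standard CM 5 install.
cm_service() {
    role="$1"      # "server" or "agent"
    action="$2"    # "stop" or "start"
    cmd="service cloudera-scm-${role} ${action}"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        $cmd           # needs root privileges
    else
        echo "$cmd"
    fi
}

# On the CM server host:  DRY_RUN=0 cm_service server stop
# On every agent host:    DRY_RUN=0 cm_service agent stop
```

The same helper with the start action brings the services back up once the parcel files are in place.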

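Upload the parcel files to /opt/cloudera/parcel-repo (preferably on the CM server host; if they are placed on an agent host they may not be picked up).

If a manifest.json already exists in that directory, back it up first, then put the newly downloaded one in its place.

Rename

SPARK2-2.2.0.cloudera3-1.cdh5.13.3.p0.556753-el6.parcel.sha1

to

SPARK2-2.2.0.cloudera3-1.cdh5.13.3.p0.556753-el6.parcel.sha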
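
The rename can be done in one pass over the repository directory. This is a minimal sketch, assuming the files were uploaded to /opt/cloudera/parcel-repo; rename_parcel_sha is a hypothetical helper name.

```shell
# Rename every *.parcel.sha1 in a directory to *.parcel.sha,
# the checksum file name this procedure requires.
rename_parcel_sha() {
    repo="${1:-/opt/cloudera/parcel-repo}"
    for f in "$repo"/*.parcel.sha1; do
        [ -e "$f" ] || continue        # no match: the glob stays literal, skip it
        mv "$f" "${f%.sha1}.sha"       # strip ".sha1", append ".sha"
    done
}

# Usage: rename_parcel_sha /opt/cloudera/parcel-repo
```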

Start the CM cluster

Hosts -- Parcels

SPARK2 is listed there; it is the parcel that was just downloaded.

Click Distribute here

The process may hang at the ** step here; if it does, go back to the previous step and click ** again.

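Once ** has completed, return to the home page and click Add Service.

Find Spark 2 -- Continue -- select the hosts, then wait for the installation to finish.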

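3. Testing that Spark works

Test with the pi-estimation example jar that ships with Spark (this may fail at first).

Jar location:

/opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-2.2.0.cloudera3.jar

Note:

The Spark 2 directory is /opt/cloudera/parcels/SPARK2/lib/spark2

The old Spark 1.6 directory is /opt/cloudera/parcels/CDH/lib/spark

Make sure you are in the right directory, otherwise you will still be running Spark 1.6.

Because the installation adds the commands to the PATH automatically, spark2-shell and spark2-submit can be started from any directory.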

Switch to the hdfs user

su - hdfs

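Submit the job in YARN mode (replace the path with your own and double-check it)

spark2-submit --master yarn --class org.apache.spark.examples.SparkPi --executor-memory 1G /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-2.2.0.cloudera3.jar 10

My virtual machine is low-spec, so the memory allocation had been set fairly small; for this test I raised it a little.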

In the YARN configuration, search for yarn.scheduler.maximum-allocation-mb, set it to 1500 MB, restart Spark and YARN, and run the test again.
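
The 1500 MB value can be sanity-checked against the submit command. In Spark 2.2 on YARN, each executor container requests the executor memory plus an overhead that defaults to the larger of 384 MB and 10% of the executor memory, and that total must not exceed yarn.scheduler.maximum-allocation-mb. A minimal sketch of the arithmetic, assuming the default overhead setting is unchanged; required_container_mb is a hypothetical helper.

```shell
# Memory (MB) one executor container requests from YARN, assuming the
# default overhead of max(384, 10% of executor memory).
required_container_mb() {
    executor_mb="$1"
    overhead=$(( executor_mb / 10 ))
    if [ "$overhead" -lt 384 ]; then
        overhead=384
    fi
    echo $(( executor_mb + overhead ))
}

# Usage: required_container_mb 1024    (for --executor-memory 1G)
```

For --executor-memory 1G this gives 1408 MB, which fits under the 1500 MB limit set above.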

After submitting the job again, the computed result is printed.

Open the YARN web UI to confirm that the job finished successfully.

Testing spark-shell

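Running the Spark 1.6 shell fails with:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

This error occurs because this Spark build does not include the Hadoop classes on its classpath, so the Hadoop jars have to be supplied through spark-env.sh.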

cd /opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/spark/conf

In spark-env.sh, add export SPARK_DIST_CLASSPATH=$(hadoop classpath)
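
The line can also be appended from the shell without duplicating it on repeated runs. A minimal sketch, assuming spark-env.sh lives in the Spark 1.6 conf directory used in this article; add_dist_classpath is a hypothetical helper name.

```shell
# Append the SPARK_DIST_CLASSPATH export to a spark-env.sh file, once.
add_dist_classpath() {
    env_file="$1"
    # Single quotes keep $(hadoop classpath) literal, so it is evaluated
    # when Spark sources the file, not when this function runs.
    line='export SPARK_DIST_CLASSPATH=$(hadoop classpath)'
    grep -qxF "$line" "$env_file" 2>/dev/null || printf '%s\n' "$line" >> "$env_file"
}

# Usage:
# add_dist_classpath /opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/spark/conf/spark-env.sh
```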

Save and exit, then submit the job again (remember to switch to the hdfs user, otherwise it fails with a permission error).