Spark Source Code
1. Environment
- java 1.8
- scala 2.11.1
- spark 2.0.2
- maven 3.5.4
2. Steps
2.1 Modify the pom file
2.2 Build
Importing the project directly does not work, because some files are only generated during a build. Setting up the build is not hard, though, since Maven was already configured earlier. Unpack the downloaded spark-2.0.2 archive, change into its directory, open cmd, and run:
mvn -T 4 -DskipTests clean package
- Error: Cannot run program "bash"
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project spark-core_2.11: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "bash" (in directory "F:\yuanma\spark-2.0.2\core"): CreateProcess error=2, The system cannot find the file specified.
Some build steps invoke bash, which cmd cannot provide on Windows. Open Git Bash in the spark-2.0.2 source directory and run:
mvn -T 4 -DskipTests clean package
You can of course also specify the Hadoop dependency and related profiles, for example:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
- Error: OutOfMemoryError
[ERROR] Java heap space -> [Help 1]
[ERROR] java.lang.OutOfMemoryError: Java heap space
First run: export MAVEN_OPTS='-Xms1024m -Xmx1024m'
Then run: mvn -T 4 -DskipTests clean package
After that the build succeeds.
2.3 Import the source
Open => ${spark-2.0.2}_path/pom.xml => Open as Project
2.4 Run
Go to examples -> src -> scala -> LogQuery -> right-click Run
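Once the project is imported, a minimal smoke test can confirm that the built Spark classes actually run inside the IDE. This is only a sketch of my own (the object name and the local[*] master are illustrative choices, not part of the Spark examples), and it assumes spark-core is on the run classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative smoke test, not part of the Spark source tree.
object SmokeTest {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process, so no cluster is needed from the IDE
    val conf = new SparkConf().setAppName("SmokeTest").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Sum 1..100 on a small RDD; prints 5050 if the build works
    println(sc.parallelize(1 to 100).reduce(_ + _))
    sc.stop()
  }
}
```

The bundled examples such as LogQuery do not set a master themselves, so running them directly from the IDE may require supplying one (e.g. via -Dspark.master=local[*] in the run configuration).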
2.5 Problems encountered while running
2.5.1 Error:(45, 66) not found: type SparkFlumeProtocol
Error:(45, 66) not found: type SparkFlumeProtocol
val transactionTimeout: Int, val backOffInterval: Int) extends SparkFlumeProtocol with Logging {
Error:(70, 39) not found: type EventBatch
override def getEventBatch(n: Int): EventBatch = {
Fix:
- Add the Maven flume-sink dependency
- In the Spark Project External Flume Sink module, mark the generated sources under target as Sources, and mark the other directories under target (everything except the scala one) as Excluded
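SparkFlumeProtocol and EventBatch are generated during the build, so the module that compiles against them needs the flume-sink artifact on its classpath. A sketch of the dependency to add, assuming the artifact coordinates match the 2.0.2 build (verify the version against your checkout's parent pom):

```xml
<!-- Add to the module that fails to resolve SparkFlumeProtocol / EventBatch;
     coordinates assumed from the Spark 2.0.2 build -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-flume-sink_2.11</artifactId>
  <version>2.0.2</version>
</dependency>
```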
2.5.2 java.lang.NoClassDefFoundError: scala/collection/immutable/List
java.lang.NoClassDefFoundError: scala/collection/immutable/List
2.5.3 Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
2.5.4 NoClassDefFoundError: com/google/common/cache/CacheLoader
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader
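These NoClassDefFoundError failures (Guava's CacheLoader here, and scala-library and spark-core in the two errors above) typically stem from the same cause: the Spark build marks these dependencies as provided scope, so they are excluded from the IDE run classpath. Two workarounds I have seen used, both hedged: newer IntelliJ IDEA versions offer an "Include dependencies with 'Provided' scope" checkbox in the run configuration; alternatively, add the missing artifact with compile scope to the module you are running. A sketch for Guava (the version is an assumption, check the parent pom.xml for the one your checkout actually uses):

```xml
<!-- In examples/pom.xml, for illustration only; Guava version assumed -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>14.0.1</version>
  <scope>compile</scope>
</dependency>
```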