Spark Source Code

1. Environment

  1. Java 1.8
  2. Scala 2.11.1
  3. Spark 2.0.2
  4. Maven 3.5.4

2. Steps

2.1 Modify the pom file

(screenshots: the pom.xml modifications)
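The exact edits are only visible in the screenshots above. As a purely illustrative sketch (an assumption, not necessarily the changes the author made), a common adjustment is to pin the Hadoop and Scala versions in the top-level pom.xml properties so they match the environment listed in section 1:

    <properties>
      <!-- hypothetical example; align build versions with the local environment -->
      <hadoop.version>2.6.0</hadoop.version>
      <scala.version>2.11.1</scala.version>
      <scala.binary.version>2.11</scala.binary.version>
    </properties>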

2.2 Build

Importing the project directly does not work, because some source files are only generated during the build. Building is not hard, though, since Maven was already set up earlier. Unpack the downloaded Spark 2.0.2 source, change into its directory, open cmd, and run the following command:

mvn -T 4 -DskipTests clean package

  1. Error: Cannot run program "bash"

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project spark-core_2.11: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "bash" (in directory "F:\yuanma\spark-2.0.2\core"): CreateProcess error=2, The system cannot find the file specified.

Solution: open git bash in the spark-2.0.2 source directory, then run
mvn -T 4 -DskipTests clean package

You can of course also build against a specific Hadoop version and the like, see below:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package

  2. Error: OutOfMemoryError

[ERROR] Java heap space -> [Help 1]
[ERROR] java.lang.OutOfMemoryError: Java heap space

First run: export MAVEN_OPTS='-Xms1024m -Xmx1024m'
Then run: mvn -T 4 -DskipTests clean package
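If you are building in plain cmd rather than git bash, export is not available; the cmd equivalent (a note not in the original) is:

set MAVEN_OPTS=-Xms1024m -Xmx1024m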

After that the build succeeds.
(screenshot: successful build output)

2.3 Import the source

open => ${spark-2.0.2}_path/pom.xml => Open as Project

2.4 Run

Go to examples -> src -> scala -> LogQuery -> right-click -> Run
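Note that when an example is launched from the IDE there is no spark-submit to supply a master URL, so (an assumption, not covered in the original) you usually also add a VM option such as the following to the Run Configuration:

-Dspark.master=local[2]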

2.5 Problems encountered while running

2.5.1 Error:(45, 66) not found: type SparkFlumeProtocol

Error:(45, 66) not found: type SparkFlumeProtocol
val transactionTimeout: Int, val backOffInterval: Int) extends SparkFlumeProtocol with Logging {
Error:(70, 39) not found: type EventBatch
override def getEventBatch(n: Int): EventBatch = {

Solution:

  1. Add the Flume sink dependency in Maven (see the hedged sketch after this list).
    (screenshot: Maven dependency settings)
  2. In the Spark Project External Flume Sink module, mark the target directory as a Sources root, and mark every directory under target except scala as Excluded.
    (screenshot: module source settings)
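The exact dependency is only shown in the screenshot. As a hedged sketch of step 1 (the coordinates are an assumption based on the Spark/Scala versions in section 1), the published flume sink artifact would be added roughly like this:

    <dependency>
      <!-- hypothetical illustration; contains the Avro-generated SparkFlumeProtocol/EventBatch classes -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-flume-sink_2.11</artifactId>
      <version>2.0.2</version>
    </dependency>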

2.5.2 java.lang.NoClassDefFoundError: scala/collection/immutable/List

java.lang.NoClassDefFoundError: scala/collection/immutable/List

(screenshot: the fix)
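The concrete fix is only recorded in the screenshot. This error usually means scala-library is missing from the runtime classpath; one possible fix (an assumption, not necessarily the author's) is to make sure the module being run has a compile-scope scala-library dependency, or equivalently to add the Scala SDK to the module in the IDE:

    <dependency>
      <!-- hypothetical sketch: put the Scala runtime on the classpath -->
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
      <scope>compile</scope>
    </dependency>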

2.5.3 Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf

Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf

(screenshot: the fix)
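Again, the fix itself is in the screenshot. org.apache.spark.SparkConf lives in spark-core, which the examples module declares with provided scope, so one hedged possibility (an assumption) is to switch that dependency to compile scope while running examples from the IDE, or to include provided-scope dependencies in the Run Configuration:

    <dependency>
      <!-- hypothetical sketch: make spark-core visible at runtime in the IDE -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${project.version}</version>
      <scope>compile</scope>
    </dependency>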

2.5.4 NoClassDefFoundError: com/google/common/cache/CacheLoader

Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader

(screenshot: the fix)
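As with the previous two errors, only the screenshot records the actual fix. CacheLoader comes from Guava, which Spark's build marks as provided, so a hedged sketch (the version is an assumption; Spark 2.0.2 builds against Guava 14.0.1) is to put Guava on the runtime classpath:

    <dependency>
      <!-- hypothetical sketch: provide Guava at runtime when launching from the IDE -->
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>14.0.1</version>
      <scope>compile</scope>
    </dependency>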