Packaging a Spark Program in IDEA and Running It on a Remote Hadoop HA Cluster
1. Install IDEA and create the project (omitted)
2. Create a Scala Maven project (omitted)
3. Import the Maven dependencies (important)
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>SparkHbase</groupId>
  <artifactId>SparkHbase</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>

  <properties>
    <hadoop.version>2.6.0</hadoop.version>
    <hbase.version>1.2.0</hbase.version>
    <spark.version>1.6.0</spark.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>${hbase.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>${hbase.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>${hbase.version}</version>
      <type>pom</type>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.1.0</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <createDependencyReducedPom>false</createDependencyReducedPom>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <!--<transformers>-->
                <!--<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">-->
                  <!--<mainClass>com.test.SparkCount</mainClass>-->
                <!--</transformer>-->
              <!--</transformers>-->
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
A note on this choice (I am not very fluent with Maven, so this is simply what worked for me):
Adding this configuration resolves dependency conflicts after packaging. Without it, duplicate signed dependencies can end up in the packaged jar, and at runtime you may see an error like:
java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
If you hit this with an already-built jar, strip the signature files from it (run this command):
zip -d SparkHBase.jar META-INF/*.SF META-INF/*.DSA META-INF/*.RSA
To avoid the problem at build time, the shade plugin filter shown in the pom above excludes the signature files during packaging:
<excludes>
  <exclude>META-INF/*.SF</exclude>
  <exclude>META-INF/*.DSA</exclude>
  <exclude>META-INF/*.RSA</exclude>
</excludes>
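Either way, a quick sanity check (a minimal sketch; the jar name is assumed to match the one submitted below) is to list the jar contents and confirm that no signature files remain:

unzip -l SparkHBase.jar | grep -E 'META-INF/.*\.(SF|DSA|RSA)'

If this prints nothing, the signature files are gone.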
4. Packaging:
Test code:
package SparkTest

import org.apache.spark.{SparkConf, SparkContext}

object TestStreaming {
  def main(args: Array[String]): Unit = {
    // Point the driver at the standalone master; the jar path could also be set here via setJars
    val conf = new SparkConf().setMaster("spark://slaver3:7077").setAppName("2018_3_19")
    // setJars(List("D:\\project\\SparkHbase\\out\\artifacts\\SparkHBase_jar"))
    val ssc = new SparkContext(conf)
    // "Machenmaster" is the HDFS HA nameservice, not a single NameNode host
    val input = ssc.textFile("hdfs://Machenmaster/hbase/data.txt")
    // Classic word count: split lines into words, map to (word, 1), sum the counts per word
    val words = input.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((x, y) => x + y)
    // Note: this prints the RDD's toString, not its contents
    println(words)
    words.saveAsTextFile("hdfs://Machenmaster/hbase/OUT")
  }
}
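The input and output paths go through the HA nameservice Machenmaster rather than a single NameNode address. For that URI to resolve, the driver and executors need the cluster's HA settings, which are normally picked up from core-site.xml/hdfs-site.xml on the classpath (for example via HADOOP_CONF_DIR). If those files are not visible to the driver, the settings can also be supplied in code. The following is a minimal sketch that assumes the ssc SparkContext from the code above; the NameNode hostnames nn1host/nn2host are placeholders, not values from the original post:

// Hypothetical HA client configuration for the "Machenmaster" nameservice;
// replace the two NameNode hosts with the real cluster addresses.
val hc = ssc.hadoopConfiguration
hc.set("fs.defaultFS", "hdfs://Machenmaster")
hc.set("dfs.nameservices", "Machenmaster")
hc.set("dfs.ha.namenodes.Machenmaster", "nn1,nn2")
hc.set("dfs.namenode.rpc-address.Machenmaster.nn1", "nn1host:8020")
hc.set("dfs.namenode.rpc-address.Machenmaster.nn2", "nn2host:8020")
hc.set("dfs.client.failover.proxy.provider.Machenmaster",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")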
You can package either with Maven or with IDEA's own artifact build (a Maven-based sketch follows after the spark-submit command below):
【1】I used IDEA's own artifact packaging;
【2】As in the earlier article, I removed all dependency jars from the artifact (the Linux cluster environment already provides all of them);
【3】Upload the jar to the cluster and run:
spark-submit --class SparkTest.TestStreaming SparkHBase.jar
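If you take the Maven route instead, a minimal sketch of the equivalent flow looks like this (the jar name follows from the artifactId and version in the pom above, and the output path matches the test code; adjust both to your setup):

# Build the shaded jar (maven-scala-plugin compiles, maven-shade-plugin runs at the package phase)
mvn clean package

# Submit the shaded jar; the class name matches the test code above
spark-submit --class SparkTest.TestStreaming target/SparkHbase-1.0-SNAPSHOT.jar

# Inspect the word-count output written by saveAsTextFile
hdfs dfs -cat /hbase/OUT/part-*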