Spark 2.0.1 + Scala 2.11.8 + Hadoop 2.7 + IDEA Development Environment Setup

Note: matching the versions of Spark, Scala, and Hadoop to each other is very important!

1. Installing Spark 2.0.1

Download a pre-built Spark package from https://archive.apache.org/dist/spark/spark-2.0.1/, choosing the build for Hadoop 2.7 (spark-2.0.1-bin-hadoop2.7).


Since the package is already built, simply extract it. Note: the extraction path must not contain spaces, e.g., extract to the D drive: D:\spark-2.0.1-bin-hadoop2.7


After extraction, configure the environment variables and add the Spark bin directory to PATH.


Then run spark-shell in cmd.


It will warn that the Hadoop environment is missing; setting up Hadoop is covered below.

2. Installing Scala 2.11.8

Download Scala from https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.msi.


After downloading, run the installer; it automatically adds Scala to the PATH environment variable.


Run the scala -version command in cmd:


If it prints the version, the installation succeeded.

3. Setting up Scala development in IDEA

Reference: http://dblab.xmu.edu.cn/blog/1327/

First install the Scala plugin for IDEA from the plugin marketplace.


I already have it installed here. If the download is slow, you can also fetch the plugin package from https://confluence.jetbrains.com/display/SCA/Scala+Plugin+for+IntelliJ+IDEA and install it locally.

After installation, restart IDEA and create a Scala project.


Choose the IDEA-based project type (suitable for beginners), click Next, and select the SDK. If no SDK is listed, click Create, and the previously installed Scala 2.11.8 will appear.


Then create a Scala object, because only an object can carry the main method.
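
For example, a minimal object to verify the setup (the name HelloScala is arbitrary, not part of the original screenshots):

object HelloScala {
  def main(args: Array[String]): Unit = {
    // Prints the Scala version the project compiles against, e.g. "version 2.11.8".
    println("Hello from Scala " + scala.util.Properties.versionString)
  }
}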


At this point the Scala environment is complete; the last piece is the Hadoop environment.

4. Setting up Hadoop 2.7.7

Download Hadoop from https://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.7/.


On Windows, extract the archive with administrator privileges, otherwise extraction fails with "A required privilege is not held by the client".


Then, in the environment variables dialog, set HADOOP_HOME to the Hadoop extraction directory.


Then add the bin directory under it to the system PATH, here D:\hadoop-2.7.7\bin. Since HADOOP_HOME is already defined, %HADOOP_HOME%\bin can also be used to refer to the bin folder.


After setting these two system variables, open a new cmd window and run the spark-shell command again. It now reports:


java.io.IOException: Could not locate executable D:\hadoop-2.7.7\bin\winutils.exe in the Hadoop binaries.

Following the hint, go to https://github.com/steveloughran/winutils, pick the directory matching your Hadoop version, open its bin directory, click the winutils.exe file, and download it with the Download button in the upper-right part of the page.


Copy the downloaded bin directory over Hadoop's bin directory.

 

However, it may then complain at runtime that winutils cannot run in a 64-bit environment. In that case, download the matching 64-bit bin from http://www.pc6.com/softview/SoftView_578664.html and copy it over Hadoop's bin directory.


Create a new classpath variable and set it to D:\hadoop-2.7.7\bin\winutils.exe.


Also copy bin\hadoop.dll to C:\Windows\System32.
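
Separately, when you later run Spark code from an IDEA run configuration (section 5) rather than from spark-shell, winutils.exe still has to be located. A minimal sketch of one common workaround, assuming the D:\hadoop-2.7.7 layout used here, is to set hadoop.home.dir before building the SparkContext:

object LocalHadoopHome {
  def main(args: Array[String]): Unit = {
    // Hadoop's Windows shims check the hadoop.home.dir system property (then
    // the HADOOP_HOME variable) to find bin\winutils.exe; the path is an assumption.
    System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.7")
    // ... create the SparkSession / SparkContext after this line
  }
}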


Make sure the Spark installation directory is neither hidden nor read-only.


Because Hadoop does not pick up the system-wide JAVA_HOME, configure the Java home in Hadoop itself: open D:\hadoop-2.7.7\etc\hadoop\hadoop-env.cmd and set

set JAVA_HOME="D:\hadoop-2.7.7\jdk1.8.0_181"


Note that the JAVA_HOME path must not contain spaces, otherwise it will not be recognized.

Edit D:\hadoop-2.7.7\etc\hadoop\core-site.xml:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
 


Edit D:\hadoop-2.7.7\etc\hadoop\hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>    
        <name>dfs.namenode.name.dir</name>    
        <value>file:/hadoop/data/dfs/namenode</value>    
    </property>    
    <property>    
        <name>dfs.datanode.data.dir</name>    
        <value>file:/hadoop/data/dfs/datanode</value>  
    </property>
</configuration>


With these defaults, a hadoop folder is created at the same level as the Hadoop installation directory to hold the namenode and datanode data.


After configuring, run start-dfs.cmd from the sbin directory.


This starts the namenode and datanode, but the datanode startup reports:

java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program "D:\hadoop-2.7.7\bin\winutils.exe": CreateProcess error=740, 请求的操作需要提升。 (error 740: the requested operation requires elevation)

Error 740 means the command needs elevated privileges, so open the cmd window as administrator and run start-dfs.cmd again.


 

You can check the Hadoop version with hadoop version.


Open http://localhost:50070/dfshealth.html#tab-overview in a browser to view the HDFS overview page.
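
Once the daemons are up, you can also exercise HDFS from spark-shell (which accepts Scala). A small sketch, assuming a test file has already been uploaded (the name /input.txt is an assumption):

// In spark-shell, sc is already created. Upload any small text file first,
// for example with: hadoop fs -put somefile.txt /input.txt
val lines = sc.textFile("hdfs://localhost:9000/input.txt")
println("lines in HDFS file: " + lines.count())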


 

5. Creating a Spark project with Maven

Click Create New Project on the IDEA welcome screen and create a Maven project.


After the project is created, right-click the project and choose Add Framework Support... to add Scala support.


Add the Spark dependencies the project needs to pom.xml:

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>spark</groupId>
  <artifactId>picc-spark</artifactId>
  <version>1.0-SNAPSHOT</version>

  <name>picc-spark</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <hadoopVersion>2.7.7</hadoopVersion>
    <sparkVersion>2.0.1</sparkVersion>
    <scala.version>2.11</scala.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>

    <!-- Hadoop start -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <!-- Hadoop -->

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
  </dependencies>

  <build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.1.0</version>
        </plugin>
        <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.8.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.22.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>
        <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
        <plugin>
          <artifactId>maven-site-plugin</artifactId>
          <version>3.7.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-project-info-reports-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

Write a WordCount program.
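
The original code was shown only as a screenshot; the following is a minimal sketch of such a WordCount, with a placeholder input path (D:\data\words.txt is an assumption; an HDFS URI such as hdfs://localhost:9000/input.txt works as well):

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")        // run locally inside IDEA
      .getOrCreate()

    // Split each line on whitespace and count how often each word occurs.
    val counts = spark.sparkContext
      .textFile("D:\\data\\words.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}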


It runs successfully!