CDH集群上部署Python3环境及运行Pyspark作业

Anaconda与Python版本对应关系表

Anaconda2/3 Python23 Python2
5.2.0 3.6.5 2.7.14
5.1.0 3.6.4 2.7.14
5.0.1 3.6.3 2.7.14
5.0.0 3.6.2 2.7.13
4.4.0 3.6.1 2.7.13
4.3.1 3.6.0 2.7.13
4.3.0 3.6.0 2.7.13
4.2.0 3.5.2 2.7.12
4.1.1 3.5.2 2.7.12
4.1.0 3.5.1 2.7.11
4.0.0 3.5.1 2.7.11
  1. 下载anaconda安装包

    wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
    
  2. 安装anaconda
    bash Anaconda3-4.4.0-Linux-x86_64.sh

    按回车键

    [email protected]:/home/hd_user# bash Anaconda3-4.4.0-Linux-x86_64.sh 
    
    Welcome to Anaconda3 4.4.0 (by Continuum Analytics, Inc.)
    
    In order to continue the installation process, please review the license
    agreement.
    Please, press ENTER to continue
    >>>                                                                                            # (按回车键)
    ===================================
    Anaconda End User License Agreement
    ===================================
    

    输入yes

    Copyright 2017, Continuum Analytics, Inc.
    ...      																						# 省略
    kerberos (krb5, non-Windows platforms)
    A network authentication protocol designed to provide strong authentication
    for client/server applications by using secret-key cryptography.
    
    cryptography
    A Python library which exposes cryptographic recipes and primitives.
    
    Do you approve the license terms? [yes|no]
    >>> yes 																					  # 输入 yes
    Anaconda3 will now be installed into this location:
    /root/anaconda3
    

    输入安装路径 /opt/cloudera/anaconda3
    如果提示“tar (child): bzip2: Cannot exec: No such file or directory”,需要先安装bzip2。sudo yum -y install bzip2

      - Press ENTER to confirm the location
      - Press CTRL-C to abort the installation
      - Or specify a different location below
    
    [/root/anaconda3] >>> /opt/cloudera/anaconda3         # 输入安装路径 /opt/cloudera/anaconda3
    PREFIX=/opt/cloudera/anaconda3
    installing: python-3.6.1-2 ...
    installing: _license-1.1-py36_1 ...
    

    设置anaconda的PATH路径:
    为了确保pyspark任务提交后使用python3,故输入no,重新设置PATH

    installing: alabaster-0.7.10-py36_0 ...
    ...       																			# 省略
    installing: zlib-1.2.8-3 ...
    installing: anaconda-4.4.0-np112py36_0 ...
    installing: conda-4.3.21-py36_0 ...
    installing: conda-env-2.6.0-0 ...
    Python 3.6.1 :: Continuum Analytics, Inc.
    creating default environment...
    installation finished.
    Do you wish the installer to prepend the Anaconda3 install location
    to PATH in your /root/.bashrc ? [yes|no]
    [no] >>> no															  # 输入 no
    
    You may wish to edit your .bashrc or prepend the Anaconda3 install location:
    
    $ export PATH=/opt/cloudera/anaconda3/bin:$PATH
    
    Thank you for installing Anaconda3!
    
    Share your notebooks and packages on Anaconda Cloud!
    Sign up for free: https://anaconda.org
    
    
  3. 设置anaconda3的环境变量

    [[email protected] ~]# echo "export PATH=/opt/cloudera/anaconda3/bin:$PATH" >> /etc/profile
    [[email protected] ~]# source /etc/profile
    [[email protected] ~]# env |grep PATH
    PATH=/opt/cloudera/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
    
  4. 验证Python版本

    [email protected]:/home/hd_user# python
    Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 
    

    [email protected]:/home/hd_user# python -V
    Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
    
  5. 在CM配置Spark的Python环境

    export PYSPARK_PYTHON=/opt/cloudera/anaconda3/bin/python
    export PYSPARK_DRIVER_PYTHON=/opt/cloudera/anaconda3/bin/python
    

    CDH集群上部署Python3环境及运行Pyspark作业
    重启相关服务。

  6. 使用Pyspark命令测试

    x = sc.parallelize([1,2,3])
    y = x.flatMap(lambda x: (x, 100*x, x**2))
    print(x.collect())
    print(y.collect())
    
    [email protected]-dev-41:/home/charles# pyspark
    Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
          /_/
    
    Using Python version 3.6.1 (default, May 11 2017 13:09:58)
    SparkContext available as sc, HiveContext available as sqlContext.
    >>> x = sc.parallelize([1,2,3])
    >>> y = x.flatMap(lambda x: (x, 100*x, x**2))
    >>> print(x.collect())
    [1, 2, 3]                                                                       
    >>> print(y.collect())
    [1, 100, 1, 2, 200, 4, 3, 300, 9]                                               
    >>> 
    
    

    参考:
    https://mp.weixin.qq.com/s?__biz=MzI4OTY3MTUyNg==&mid=2247496668&idx=1&sn=4461854378270ea0741e91047a541b9b&chksm=ec2923d5db5eaac30108f19e44ea763ea6e06f26089437ef9b6a1f44204ed1e259efd2fc59ef&scene=21#wechat_redirect