Deploying a Python 3 Environment on a CDH Cluster and Running PySpark Jobs
Anaconda to Python version mapping
Anaconda 2/3 | Python 3 | Python 2 |
---|---|---|
5.2.0 | 3.6.5 | 2.7.14 |
5.1.0 | 3.6.4 | 2.7.14 |
5.0.1 | 3.6.3 | 2.7.14 |
5.0.0 | 3.6.2 | 2.7.13 |
4.4.0 | 3.6.1 | 2.7.13 |
4.3.1 | 3.6.0 | 2.7.13 |
4.3.0 | 3.6.0 | 2.7.13 |
4.2.0 | 3.5.2 | 2.7.12 |
4.1.1 | 3.5.2 | 2.7.12 |
4.1.0 | 3.5.1 | 2.7.11 |
4.0.0 | 3.5.1 | 2.7.11 |
-
Download the Anaconda installer
wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh
-
Install Anaconda
bash Anaconda3-4.4.0-Linux-x86_64.sh
Press Enter
[email protected]:/home/hd_user# bash Anaconda3-4.4.0-Linux-x86_64.sh

Welcome to Anaconda3 4.4.0 (by Continuum Analytics, Inc.)

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>                                       # (press Enter)
===================================
Anaconda End User License Agreement
===================================
Type yes
Copyright 2017, Continuum Analytics, Inc.
...                                       # (output omitted)
kerberos (krb5, non-Windows platforms)
    A network authentication protocol designed to provide strong
    authentication for client/server applications by using secret-key
    cryptography.

cryptography
    A Python library which exposes cryptographic recipes and primitives.

Do you approve the license terms? [yes|no]
>>> yes                                   # type yes

Anaconda3 will now be installed into this location:
/root/anaconda3
Enter the installation path
/opt/cloudera/anaconda3
If the installer reports "tar (child): bzip2: Cannot exec: No such file or directory", install bzip2 first: sudo yum -y install bzip2
  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/root/anaconda3] >>> /opt/cloudera/anaconda3   # enter the installation path /opt/cloudera/anaconda3
PREFIX=/opt/cloudera/anaconda3
installing: python-3.6.1-2 ...
installing: _license-1.1-py36_1 ...
Set the Anaconda PATH:
To make sure submitted PySpark jobs use Python 3, answer no here and set the PATH manually later.

installing: alabaster-0.7.10-py36_0 ...
...                                       # (output omitted)
installing: zlib-1.2.8-3 ...
installing: anaconda-4.4.0-np112py36_0 ...
installing: conda-4.3.21-py36_0 ...
installing: conda-env-2.6.0-0 ...
Python 3.6.1 :: Continuum Analytics, Inc.
creating default environment...
installation finished.
Do you wish the installer to prepend the Anaconda3 install location
to PATH in your /root/.bashrc ? [yes|no]
[no] >>> no                               # type no

You may wish to edit your .bashrc or prepend the Anaconda3 install location:

$ export PATH=/opt/cloudera/anaconda3/bin:$PATH

Thank you for installing Anaconda3!

Share your notebooks and packages on Anaconda Cloud!
Sign up for free: https://anaconda.org
-
Set the anaconda3 environment variable
[[email protected] ~]# echo "export PATH=/opt/cloudera/anaconda3/bin:$PATH" >> /etc/profile
[[email protected] ~]# source /etc/profile
[[email protected] ~]# env |grep PATH
PATH=/opt/cloudera/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
-
Verify the Python version
[email protected]:/home/hd_user# python
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
or
[email protected]:/home/hd_user# python -V
Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
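As an extra sanity check, the interpreter path and a couple of bundled packages can be verified from the new python. This is only a sketch; it assumes the default Anaconda 4.4.0 package set, which includes numpy and pandas.

```python
# Quick sanity check of the Anaconda installation.
import sys

print(sys.executable)   # expected: /opt/cloudera/anaconda3/bin/python
print(sys.version)      # expected: Python 3.6.1 from the Anaconda 4.4.0 build

# numpy and pandas ship with the full Anaconda distribution, so both
# imports should succeed without installing anything extra.
import numpy
import pandas
print(numpy.__version__, pandas.__version__)
```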
-
Configure the Python environment for Spark in CM
In CM, add the following to Spark's spark-env.sh configuration (typically the spark-env.sh advanced configuration snippet), then redeploy the client configuration:
export PYSPARK_PYTHON=/opt/cloudera/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/anaconda3/bin/python
Restart the affected services.
-
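After the restart, jobs submitted to the cluster should pick up the Anaconda interpreter for both the driver and the executors. As a minimal sketch (the file name pyspark_test.py and the app name are made up for illustration), the same smoke test used interactively in the next step can also be packaged as a standalone script and run with spark-submit pyspark_test.py:

```python
# pyspark_test.py -- hypothetical file name; a batch version of the
# interactive smoke test shown in the next step (Spark 1.6-style API).
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("python3-smoke-test")  # app name is illustrative
    sc = SparkContext(conf=conf)

    x = sc.parallelize([1, 2, 3])
    y = x.flatMap(lambda n: (n, 100 * n, n ** 2))

    print(x.collect())   # expected: [1, 2, 3]
    print(y.collect())   # expected: [1, 100, 1, 2, 200, 4, 3, 300, 9]

    sc.stop()
```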
Test with the pyspark shell
x = sc.parallelize([1,2,3])
y = x.flatMap(lambda x: (x, 100*x, x**2))
print(x.collect())
print(y.collect())
[email protected]-dev-41:/home/charles# pyspark
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 3.6.1 (default, May 11 2017 13:09:58)
SparkContext available as sc, HiveContext available as sqlContext.
>>> x = sc.parallelize([1,2,3])
>>> y = x.flatMap(lambda x: (x, 100*x, x**2))
>>> print(x.collect())
[1, 2, 3]
>>> print(y.collect())
[1, 100, 1, 2, 200, 4, 3, 300, 9]
>>>
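To confirm that the executors, not just the driver, are running the Anaconda interpreter, a quick check like the sketch below can be run in the same pyspark session. It assumes sc is the SparkContext created by the shell and that the expected path is the install location chosen above.

```python
# Run inside the pyspark shell started above (sc already exists).
import sys

# Driver-side interpreter: should report the Anaconda Python 3.6.1 build.
print(sys.version)

# Executor-side interpreter: with PYSPARK_PYTHON set in CM, the only
# distinct value should be /opt/cloudera/anaconda3/bin/python.
print(sc.parallelize([1, 2, 3, 4], 4).map(lambda _: sys.executable).distinct().collect())
```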