
Using a custom Python with PySpark

For various reasons, the environments on Spark cluster nodes can be inconsistent, which leads to baffling errors. Without root access on the nodes, you can package a Python environment you have configured yourself and distribute it to the cluster nodes for the job to use.
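To build such a package, one option is to set up the interpreter (or a virtualenv/conda env made relocatable) under a local directory and tar it up. A minimal sketch, assuming the environment lives under /data/python27_ml as in this post:

import tarfile

# Keep 'python27_ml' as the top-level entry so the archive unpacks to
# <unpack-dir>/python27_ml/bin/python on the executors.
with tarfile.open('/data/python27_ml.tgz', 'w:gz') as tar:
    tar.add('/data/python27_ml', arcname='python27_ml')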

For example, the error when the local machine runs Python 2.7 but the nodes run Python 2.6:

Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)

Point PYTHONSTARTUP at a modified copy of /spark2.3/python/pyspark/shell.py (saved as /data/src/shell.py, see below). These exports belong in /spark2.3/conf/spark-env.sh: the stock launcher scripts source that file, and it would otherwise override any environment variables you set yourself.

export PYTHONSTARTUP="/data/src/shell.py"

The Python environment invoked on the local machine:

export PYSPARK_PYTHON=/data/python27_ml/bin/python2.7

Use Jupyter as the interactive environment instead of the plain Python shell:

export PYSPARK_DRIVER_PYTHON=/data/python27_ml/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip 0.0.0.0 --port 8888 --notebook-dir /data/notebook --config /data/src/notebook_config.json"

Run it (this is the final exec in bin/pyspark, with the archive conf added). /data/python27_ml.tgz is a local path: the Python package to be distributed to the nodes.

exec "${SPARK_HOME}"/bin/spark-submit \
pyspark-shell-main \
--name "PySparkShell" \
--conf "spark.yarn.dist.archives=/data/python27_ml.tgz" \
"$@"

/data/src/shell.py
The environment variable PYSPARK_PYTHON must be set before the SparkContext is initialized:

import os
os.environ['PYSPARK_PYTHON'] = './python27_ml.tgz/python27_ml/bin/python'
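For context, a rough sketch of where that line sits in the modified shell.py (the stock Spark 2.3 file also prints the banner, sets up SQLContext, and so on; this keeps only the relevant order of operations):

import os

# Must run before the SparkContext is created below; executors exec
# this interpreter from the unpacked archive.
os.environ['PYSPARK_PYTHON'] = './python27_ml.tgz/python27_ml/bin/python'

from pyspark.context import SparkContext
from pyspark.sql import SparkSession

SparkContext._ensure_initialized()
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext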

The error when the Python executable cannot be found in the distributed package (here the configured path, ./python27/bin/python, does not match the layout inside the archive):

19/02/15 14:55:43 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.160.108.154, executor 9): java.io.IOException: Cannot run program "./python27/bin/python": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:168)
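Once the configured path matches the layout inside the archive, a quick sanity check that the executors really run the shipped interpreter (a sketch, assuming the sc from the shell above):

import sys

def interpreter_info(_):
    # Runs on the executors, not the driver.
    import sys
    return (sys.executable, sys.version_info[:3])

print(sys.executable, sys.version_info[:3])  # driver
print(sc.parallelize(range(2), 2).map(interpreter_info).distinct().collect())  # workers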
