python - How to run code on the AWS cluster using Apache-Spark?
I've written Python code for summing up the numbers in the first column of each csv file, as follows:
import os, sys, inspect, csv

### Current directory path.
curr_dir = os.path.split(inspect.getfile(inspect.currentframe()))[0]

### Setup the environment variables
spark_home_dir = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "../spark")))
python_dir = os.path.realpath(os.path.abspath(os.path.join(spark_home_dir, "./python")))
os.environ["SPARK_HOME"] = spark_home_dir
os.environ["PYTHONPATH"] = python_dir

### Setup the pyspark directory path
pyspark_dir = python_dir
sys.path.append(pyspark_dir)

### Import pyspark
from pyspark import SparkConf, SparkContext

### Specify the data file directory, and load the data files
data_path = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "./test_dir")))

### myfunc adds up the numbers in the first column.
def myfunc(s):
    total = 0
    if s.endswith(".csv"):
        cr = csv.reader(open(s, "rb"))
        for row in cr:
            total += int(row[0])
    return total

def main():
    ### Initialize SparkConf and SparkContext
    conf = SparkConf().setAppName("ruofan").setMaster("spark://ec2-52-26-177-197.us-west-2.compute.amazonaws.com:7077")
    sc = SparkContext(conf = conf)
    datafile = sc.wholeTextFiles(data_path)

    ### Send the application to each of the slave nodes
    temp = datafile.map(lambda (path, content): myfunc(str(path).strip('file:')))

    ### Collect the results and print them out.
    for x in temp.collect():
        print x

if __name__ == "__main__":
    main()
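To make the intent concrete: with a hypothetical input file test_dir/data1.csv such as

3,a
4,b
5,c

myfunc("test_dir/data1.csv") would return 12 (3 + 4 + 5), and the Spark job is just meant to run that same summation over every csv file under test_dir.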
I want to use Apache Spark to parallelize the summation process over several csv files using the same Python code. I've done the following steps:
- I've created 1 master and 2 slave nodes on AWS.
- I've used the bash command
$ scp -r -i my-key-pair.pem my_dir root@ec2-52-27-82-124.us-west-2.compute.amazonaws.com
to upload the directory my_dir
including my Python code and the csv files onto the cluster master node. - I've logged into the master node, and from there used the bash command
$ ./spark/copy-dir my_dir
to send my Python code and csv files to the slave nodes. - I've set up the environment variables on the master node:
$ export SPARK_HOME=~/spark
$ export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
However, when I run the Python code on the master node:
$ python sum.py
it shows the following error:
Traceback (most recent call last):
  File "sum.py", line 18, in <module>
    from pyspark import SparkConf, SparkContext
  File "/root/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/root/spark/python/pyspark/context.py", line 31, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/root/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway
I have no idea about this error. Also, I am wondering if the master node automatically calls the slave nodes to run in parallel. I'd really appreciate it if anyone can help me.
Here is how I would debug this particular import error.
- ssh into the master node
- run the Python REPL:
$ python
- try the failing import line:
>>> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
- if that fails, try running:
>>> import py4j
- if that also fails, it means your system either does not have py4j installed or cannot find it.
- exit the REPL:
>>> exit()
- try installing py4j:
$ pip install py4j
(you'll need to have pip installed; if pip is not available, see the note after these steps) - open the REPL again:
$ python
- try importing again:
>>> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
- if that works, exit the REPL:
>>> exit()
and try running
$ python sum.py
again.
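If pip is not available on the node, note that the Spark distribution itself normally bundles py4j as a zip file under $SPARK_HOME/python/lib, so another option is to add that zip to PYTHONPATH. A sketch, assuming that standard layout (adjust the py4j version to match the file actually present in $SPARK_HOME/python/lib):

$ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH

Once the import works, you can also launch the script through Spark's own submit script instead of plain python, since spark-submit sets up the Python path for you:

$ ~/spark/bin/spark-submit sum.py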