python - How to run code on the AWS cluster using Apache-Spark?


I've written Python code that sums the numbers in the first column of each CSV file, as follows:

import os, sys, inspect, csv

### Current directory path.
curr_dir = os.path.split(inspect.getfile(inspect.currentframe()))[0]

### Set up environment variables.
spark_home_dir = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "../spark")))
python_dir = os.path.realpath(os.path.abspath(os.path.join(spark_home_dir, "./python")))
os.environ["SPARK_HOME"] = spark_home_dir
os.environ["PYTHONPATH"] = python_dir

### Set up the PySpark directory path.
pyspark_dir = python_dir
sys.path.append(pyspark_dir)

### Import PySpark.
from pyspark import SparkConf, SparkContext

### Specify the data file directory, and load the data files.
data_path = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "./test_dir")))

### myfunc adds up the numbers in the first column.
def myfunc(s):
    total = 0
    if s.endswith(".csv"):
        cr = csv.reader(open(s, "rb"))
        for row in cr:
            total += int(row[0])
        return total

def main():
    ### Initialize SparkConf and SparkContext.
    conf = SparkConf().setAppName("ruofan").setMaster("spark://ec2-52-26-177-197.us-west-2.compute.amazonaws.com:7077")
    sc = SparkContext(conf = conf)
    datafile = sc.wholeTextFiles(data_path)

    ### Send the application to each of the slave nodes.
    temp = datafile.map(lambda (path, content): myfunc(str(path).strip('file:')))

    ### Collect the results and print them out.
    for x in temp.collect():
        print x

if __name__ == "__main__":
    main()
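For reference, sc.wholeTextFiles already hands each task a (path, content) pair, so a variant of myfunc could parse the content string directly instead of reopening the file on each worker. A minimal Python 2 sketch matching the code above; the helper name sum_first_column is mine, not from the original:

import csv

### Sum the first column of one CSV file, parsing the contents we already have.
def sum_first_column(content):
    total = 0
    for row in csv.reader(content.splitlines()):
        total += int(row[0])
    return total

### Drop-in replacement for the map step in main():
###   temp = datafile.map(lambda (path, content): sum_first_column(content))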

I use Apache Spark to parallelize the summation process over several CSV files using the same Python code. I've done the following steps:

  1. I've created 1 master and 2 slave nodes on AWS.
  2. I've used the bash command $ scp -r -i my-key-pair.pem my_dir root@ec2-52-27-82-124.us-west-2.compute.amazonaws.com to upload the directory my_dir, including my Python code and the CSV files, onto the cluster's master node.
  3. I've logged into the master node, and from there used the bash command $ ./spark/copy-dir my_dir to send my Python code and the CSV files to the slave nodes.
  4. I've set up the environment variables on the master node (a quick check for this step is sketched after the list):

    $ export SPARK_HOME=~/spark

    $ export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
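For reference, here is a quick way to confirm step 4 took effect from inside Python, and to pick up the py4j that Spark distributions of this era bundle under $SPARK_HOME/python/lib/ (the glob pattern and directory layout are assumptions; verify them on your cluster):

import glob, os, sys

### Confirm the exported variables are visible to Python.
print os.environ.get("SPARK_HOME")
print os.environ.get("PYTHONPATH")

### Spark typically ships py4j as a zip under python/lib/ (assumed layout);
### appending it to sys.path is one way around a missing py4j.
for zip_path in glob.glob(os.path.join(os.environ["SPARK_HOME"], "python/lib/py4j-*.zip")):
    sys.path.append(zip_path)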

However, when I run the Python code on the master node with $ python sum.py, it shows the following error:

Traceback (most recent call last):
  File "sum.py", line 18, in <module>
    from pyspark import SparkConf, SparkContext
  File "/root/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/root/spark/python/pyspark/context.py", line 31, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/root/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway

I have no idea about this error. Also, I am wondering if the master node automatically calls the slave nodes to run in parallel. I'd appreciate it if anyone can help me.

Here is how I would debug this particular import error.

  1. ssh into the master node
  2. run the Python REPL: $ python
  3. try the failing import line: >>> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  4. if it fails, try running: >>> import py4j
  5. if that fails, it means your system either does not have py4j installed or cannot find it.
  6. exit the REPL: >>> exit()
  7. try installing py4j: $ pip install py4j (you'll need to have pip installed)
  8. open the REPL again: $ python
  9. try importing again: >>> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  10. if it works, >>> exit() and try running $ python sum.py again (the same checks are condensed into a script below)
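If you'd rather script it, steps 3 through 5 collapse into a few lines of Python (a sketch that only reports what it finds):

### Probe the failing import chain in one go (steps 3-5 above).
try:
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
    print "py4j.java_gateway imports cleanly"
except ImportError:
    try:
        import py4j
        print "py4j is installed, but py4j.java_gateway could not be imported"
    except ImportError:
        print "py4j is missing; try $ pip install py4j"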
