Local Spark installation not working as expected

Problem description:

I have a local Hadoop/Spark installation on my machine. I tried to connect to AWS S3 and succeeded, using hadoop-aws-2.8.0.jar. However, I have been trying to connect to DynamoDB using the jar provided by EMR, emr-ddb-hadoop.jar. I have installed all the AWS dependencies and they are available locally, but I keep getting the following exception.

java.lang.ClassCastException: org.apache.hadoop.dynamodb.read.DynamoDBInputFormat cannot be cast to org.apache.hadoop.mapreduce.InputFormat 

Here is my code snippet.

import sys 
import os 

if 'SPARK_HOME' not in os.environ: 
    os.environ['SPARK_HOME'] = "/usr/local/Cellar/spark" 
    # Ship the EMR DynamoDB connector and the AWS Java SDK with the job.
    os.environ['PYSPARK_SUBMIT_ARGS'] = (
        '--jars /usr/local/Cellar/hadoop/2.8.0/libexec/share/hadoop/tools/lib/emr-ddb-hadoop.jar,'
        '/home/aws-java-sdk/1.11.201/lib/aws-java-sdk-1.11.201.jar pyspark-shell')
    sys.path.append("/usr/local/Cellar/spark/python") 
    sys.path.append("/usr/local/Cellar/spark/python/lib/py4j-0.10.4-src.zip") 

try: 
    from pyspark.sql import SparkSession, SQLContext, Row 
    from pyspark import SparkConf, SparkContext 
    from pyspark.sql.window import Window 
    import pyspark.sql.functions as func 
    from pyspark.sql.functions import lit, lag, col, udf 
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, DoubleType, TimestampType, LongType 
except ImportError as e: 
    print("error importing spark modules", e) 
    sys.exit(1) 

spark = SparkSession \ 
    .builder \ 
    .master("spark://xxx.local:7077") \ 
    .appName("Sample") \ 
    .getOrCreate() 
sc = spark.sparkContext 
conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "test-table",
    "dynamodb.endpoint": "http://dynamodb.us-east-1.amazonaws.com/",
    "dynamodb.regionid": "us-east-1",
    "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
}
# This call raises the ClassCastException shown above.
dynamo_rdd = sc.newAPIHadoopRDD(
    'org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
    'org.apache.hadoop.io.Text',
    'org.apache.hadoop.dynamodb.DynamoDBItemWritable',
    conf=conf)
dynamo_rdd.collect() 

I have not been able to get newAPIHadoopRDD to work. Using the old API, it works without issues.
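
That is consistent with the ClassCastException: it is what you would expect if DynamoDBInputFormat implements the old org.apache.hadoop.mapred.InputFormat interface rather than org.apache.hadoop.mapreduce.InputFormat, which is what newAPIHadoopRDD requires. As a minimal sketch, the old-API equivalent (assuming the same sc and conf as above) would look like this:

# hadoopRDD is the entry point for old (mapred) API input formats,
# which is why this variant does not hit the ClassCastException.
dynamo_rdd = sc.hadoopRDD(
    'org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
    'org.apache.hadoop.io.Text',
    'org.apache.hadoop.dynamodb.DynamoDBItemWritable',
    conf=conf)
print(dynamo_rdd.count())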

Here is the working example I followed:

https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/


Tried this. Got the same error trace. Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.dynamodbv2.model.AttributeValue – ZZzzZZzz


It looks like your emr-ddb-hadoop.jar does not contain those class files. I checked inside my jar, and it is there. – Kannaiyan


I have downloaded it from the Maven repo. Any link to get your jar file? – ZZzzZZzz
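
Since a jar is just a zip archive, a quick way to check whether a given copy of emr-ddb-hadoop.jar bundles the missing class is to list its entries. This is a minimal sketch; the path below is taken from the question's --jars argument and is an assumption about your local layout:

import zipfile

# Assumed local path; adjust to wherever your connector jar lives.
jar_path = "/usr/local/Cellar/hadoop/2.8.0/libexec/share/hadoop/tools/lib/emr-ddb-hadoop.jar"

with zipfile.ZipFile(jar_path) as jar:
    # Look for the AWS SDK model class named in the ClassNotFoundException.
    hits = [n for n in jar.namelist()
            if n.endswith("dynamodbv2/model/AttributeValue.class")]
    print(hits or "AttributeValue.class not found in this jar")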