14/10/24 14:51:40 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
14/10/24 14:51:40 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1738)
	at java.lang.Runtime.loadLibrary0(Runtime.java:823)
	at java.lang.System.loadLibrary(System.java:1028)
	at com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
	at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:249)
	at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1836)
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
	at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
For a fix, refer to the article 《Spark连接Hadoop读取HDFS问题小结》 (a summary of issues with Spark reading HDFS through Hadoop): run the following commands, then restart the services:
cp /usr/lib/hadoop/lib/native/libgplcompression.so $JAVA_HOME/jre/lib/amd64/
cp /usr/lib/hadoop/lib/native/libhadoop.so $JAVA_HOME/jre/lib/amd64/
cp /usr/lib/hadoop/lib/native/libsnappy.so $JAVA_HOME/jre/lib/amd64/
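To verify that the native libraries are now loadable, you can optionally run Hadoop's built-in check; the hadoop, zlib, and snappy entries should report true (the GPL-licensed LZO codec is loaded separately and is not listed by this tool):

# Report which native libraries the Hadoop client can load
$ hadoop checknative -a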
The command to run the SparkPi example with spark-submit in Standalone mode is as follows:
$ spark-submit --class org.apache.spark.examples.SparkPi --master spark://cdh1:7077 /usr/lib/spark/lib/spark-examples-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar 10
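If the job completes successfully, the driver prints an approximation of Pi to stdout, along these lines (the exact value varies from run to run):

Pi is roughly 3.14...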
Note that Standalone mode does not support talking to a kerberized HDFS: if you use spark-shell --master spark://cdh1:7077 to read data from a Kerberos-enabled HDFS cluster, the following exception is thrown:
15/04/02 11:58:32 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, bj03-bi-pro-hdpnamenn): java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "cdh1/192.168.56.121"; destination host is: "192.168.56.121":8020;
	org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
	org.apache.hadoop.ipc.Client.call(Client.java:1415)
	org.apache.hadoop.ipc.Client.call(Client.java:1364)
	org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
4.2 Spark on Mesos Mode

4.3 Spark on YARN Mode

Spark on YARN likewise supports two ways of launching Spark on YARN. In cluster mode, the Spark driver runs inside YARN's ApplicationMaster process, and the client exits as soon as the application has finished initializing; in client mode, the Spark driver runs in the client process. Unlike Standalone mode, Spark on YARN can access HDFS files secured with Kerberos.
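Since this mode does work against a kerberized HDFS, the submitting user first needs a valid Kerberos ticket on the client machine. A minimal sketch, using a hypothetical principal:

# Obtain a Kerberos TGT for the submitting user (the principal is an example)
$ kinit spark@EXAMPLE.COM
# Confirm that the ticket cache now holds a valid TGT
$ klist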
With CDH Spark, launch in cluster mode with the following command:
$ spark-submit --class path.to.your.Class --deploy-mode cluster --master yarn [options] <app jar> [app options]
With CDH Spark, launch in client mode with the following command:
$ spark-submit --class path.to.your.Class --deploy-mode client --master yarn [options] <app jar> [app options]
Taking the SparkPi program as an example:
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --deploy-mode cluster \
    --master yarn \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    /usr/lib/spark/lib/spark-examples-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar \
    10
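Note that in cluster mode the driver runs inside the ApplicationMaster, so the "Pi is roughly ..." output lands in the YARN container logs rather than on the client console. Assuming log aggregation is enabled, it can be retrieved like this (the application ID is a placeholder):

# Find the application ID of the finished job
$ yarn application -list -appStates FINISHED
# Fetch the aggregated logs and pick out the result line
$ yarn logs -applicationId <applicationId> | grep "Pi is roughly"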
Additionally, when running on a YARN cluster, you can manually copy the spark-assembly jar to HDFS and then set the SPARK_JAR environment variable, so the jar does not have to be uploaded from the client on every submission:
$ hdfs dfs -mkdir -p /user/spark/share/lib
$ hdfs dfs -put $SPARK_HOME/lib/spark-assembly.jar /user/spark/share/lib/spark-assembly.jar
$ SPARK_JAR=hdfs://<nn>:<port>/user/spark/share/lib/spark-assembly.jar
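As an aside, newer Spark 1.x releases deprecate the SPARK_JAR environment variable in favor of the spark.yarn.jar configuration property; if that applies to your version, the same effect can be achieved per submission (path as uploaded above, remaining arguments elided):

# Point Spark at the assembly jar on HDFS via a config property instead
$ spark-submit --conf spark.yarn.jar=hdfs://<nn>:<port>/user/spark/share/lib/spark-assembly.jar ...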
5. Spark-SQL

The Spark package includes Spark-SQL. Running the spark-sql command on CDH 5.2 raises the following exception: