Spark Resource Scheduling and Configuration Guide

Date: 2020-06-17

Spark schedules resources at two levels (this article mainly covers the case where YARN is the master):

1. Scheduling managed by the master

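When you run Spark applications on a Hadoop cluster, each application obtains its own independent set of executor JVMs, which run tasks only for that application. By default the allocation is static, which is also the simplest approach: each application is given the maximum amount of resources it is allowed to use and holds on to them until it finishes.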

On YARN, the spark-submit options --num-executors, --executor-memory, and --executor-cores control how many executors an application gets and how much memory and how many cores each executor may use.
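As a rough illustration of these static options, the Python sketch below assembles a spark-submit command line. The helper name build_submit_command, the application file app.py, and the resource values are hypothetical; only the flag names are real spark-submit options.

```python
import shlex


def build_submit_command(app, num_executors, executor_memory, executor_cores):
    """Return a spark-submit argv list for a statically sized YARN application.

    build_submit_command is a hypothetical helper used for illustration;
    the flags themselves are standard spark-submit options.
    """
    if num_executors < 1 or executor_cores < 1:
        raise ValueError("executor count and cores must be positive")
    return [
        "spark-submit",
        "--master", "yarn",
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,  # for example "4g"
        "--executor-cores", str(executor_cores),
        app,
    ]


# Example values only; size these for your own cluster.
cmd = build_submit_command("app.py", num_executors=4,
                           executor_memory="4g", executor_cores=2)
print(shlex.join(cmd))
```

With these settings the application holds four executors for its whole lifetime, whether or not it is busy, which is what dynamic allocation is meant to avoid.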

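On YARN, Spark also offers dynamic resource allocation: the application requests more executors when its workload grows and gives them back when they are no longer needed. In early Spark releases this feature was available only on YARN; later releases extended it to other cluster managers.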

Configuration and setup:

All relevant properties live under the spark.dynamicAllocation.* namespace.

To enable dynamic allocation, set spark.dynamicAllocation.enabled to true.

To bound the number of executors, set spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors.

Dynamic allocation also requires an external shuffle service, so that shuffle files written by an executor remain readable after that executor exits.

To enable the service, set spark.shuffle.service.enabled to true. On YARN it is implemented by org.apache.spark.network.yarn.YarnShuffleService.
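The properties above can be collected in one place. The following Python sketch builds them as a dictionary and renders them as spark-submit --conf arguments. The function names and the executor bounds (2 and 20) are illustrative assumptions; the property names are the ones listed in this article.

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Build the Spark properties that turn on dynamic allocation."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= minExecutors <= maxExecutors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # The external shuffle service keeps shuffle files available
        # after an executor has been removed.
        "spark.shuffle.service.enabled": "true",
    }


def to_conf_flags(props):
    """Render properties as spark-submit --conf key=value arguments."""
    flags = []
    for key, value in props.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Example bounds only; choose values that fit your workload.
flags = to_conf_flags(dynamic_allocation_conf(min_executors=2, max_executors=20))
print(" ".join(flags))
```

The same key/value pairs could equally be placed in spark-defaults.conf or set on a SparkConf object; the command-line form is just the easiest to show in isolation.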

Steps to enable the shuffle service on YARN:

Build Spark with the YARN profile. Skip this step if you are using a pre-packaged distribution.
Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/network/yarn/target/scala-<version> if you are building Spark yourself, and under lib if you are using a distribution.
Add this jar to the classpath of all NodeManagers in your cluster.
In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService. Additionally, set all relevant spark.shuffle.service.* configurations.
Restart all NodeManagers in your cluster.
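To make the yarn-site.xml step concrete, here is a small Python sketch that renders the two property entries. It assumes the NodeManagers already list mapreduce_shuffle as an auxiliary service, which is common but not guaranteed; the helper name shuffle_service_properties is hypothetical.

```python
import xml.etree.ElementTree as ET


def shuffle_service_properties(existing_aux_services):
    """Return <property> elements that register the Spark shuffle service.

    existing_aux_services is the current value of
    yarn.nodemanager.aux-services, e.g. "mapreduce_shuffle" (assumed example).
    """
    services = [s.strip() for s in existing_aux_services.split(",") if s.strip()]
    if "spark_shuffle" not in services:
        services.append("spark_shuffle")
    settings = {
        "yarn.nodemanager.aux-services": ",".join(services),
        "yarn.nodemanager.aux-services.spark_shuffle.class":
            "org.apache.spark.network.yarn.YarnShuffleService",
    }
    elements = []
    for name, value in settings.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        elements.append(prop)
    return elements


# Print the entries to paste inside <configuration> in yarn-site.xml.
for prop in shuffle_service_properties("mapreduce_shuffle"):
    print(ET.tostring(prop, encoding="unicode"))
```

The sketch only produces the XML fragments; copying the jar onto the NodeManager classpath and restarting the NodeManagers still has to be done by hand or by your deployment tooling.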
Parameters that govern the dynamic allocation policy:

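spark.dynamicAllocation.schedulerBacklogTimeout: how long tasks must have been pending before Spark first requests additional executors.

spark.dynamicAllocation.sustainedSchedulerBacklogTimeout: the interval at which further requests are made while the backlog persists.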
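The interaction of the two timeouts is easier to see in a small simulation. The Python sketch below is a simplified model, not Spark's actual implementation: it assumes the first request fires once the backlog has lasted schedulerBacklogTimeout seconds, later requests fire every sustainedSchedulerBacklogTimeout seconds, and each round asks for twice as many executors as the previous one (1, 2, 4, ...), capped at maxExecutors. It ignores the fact that Spark also limits requests to what the pending tasks need. All numeric values are illustrative.

```python
def request_schedule(backlog_timeout, sustained_timeout, backlog_duration,
                     start_executors, max_executors):
    """Simulate executor requests while tasks stay backlogged.

    Returns a list of (time_in_seconds, total_executors_requested) pairs
    under the simplified doubling model described in the text.
    """
    schedule = []
    total = start_executors
    to_add = 1                      # the first round adds one executor
    t = backlog_timeout             # the first request fires at this time
    while t <= backlog_duration and total < max_executors:
        total = min(total + to_add, max_executors)
        schedule.append((t, total))
        to_add *= 2                 # each round doubles the request size
        t += sustained_timeout      # later rounds use the sustained timeout
    return schedule


# Illustrative numbers: a backlog lasting 5 seconds, starting from 2 executors.
for when, total in request_schedule(backlog_timeout=1, sustained_timeout=1,
                                    backlog_duration=5, start_executors=2,
                                    max_executors=20):
    print(f"t={when}s -> {total} executors")
```

The slow start keeps short bursts from grabbing many executors, while the doubling lets a genuinely large job reach its ceiling within a few rounds.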

2. Scheduling managed by the Spark application

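Within a single Spark application (one SparkContext), multiple jobs can run in parallel when they are submitted from separate threads, and they compete with each other for the application's resources. Spark's scheduler is thread-safe and supports this pattern, so one application can serve multiple requests (for example, queries from several users). By default Spark schedules jobs in FIFO order; it also supports FAIR scheduling.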

To enable FAIR scheduling:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")  // switch from the default FIFO mode
val sc = new SparkContext(conf)
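By default, under fair scheduling all jobs have the same priority.
With fair scheduling you can define scheduler pools so that different users' jobs run at different priorities.
Jobs go into the pool named default unless told otherwise.
The pool a job is submitted to can be changed as follows: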
sc.setLocalProperty("spark.scheduler.pool", "pool1")  // jobs submitted from this thread go to pool1
sc.setLocalProperty("spark.scheduler.pool", null)     // clear the pool association for this thread
Configuring pool properties
Point Spark at an allocation file:
conf.set("spark.scheduler.allocation.file", "/path/to/file")
Allocation file format:
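<?xml version="1.0"?>
<allocations>
  <!-- name: pool name (pool1 and pool2 are example names) -->
  <pool name="pool1">
    <schedulingMode>FAIR</schedulingMode> <!-- scheduling algorithm within the pool -->
    <weight>1</weight>                    <!-- share of resources relative to other pools -->
    <minShare>2</minShare>                <!-- minimum resources guaranteed to the pool -->
  </pool>
  <pool name="pool2">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>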
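To check what an allocation file of this shape actually says, it can be parsed with a few lines of Python. The sketch below is only an approximation of how the pools relate: it computes each pool's share from its weight alone, ignoring minShare and assuming every pool is active. The pool names pool1 and pool2 and the helper names are illustrative assumptions; the fallback values (FIFO, weight 1, minShare 0) are Spark's documented defaults for unspecified pool settings.

```python
import xml.etree.ElementTree as ET

# Example allocation file; pool names are assumed for illustration.
ALLOCATIONS = """<?xml version="1.0"?>
<allocations>
  <pool name="pool1">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="pool2">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>
"""


def load_pools(xml_text):
    """Parse a fair scheduler allocation file into {pool name: settings}."""
    pools = {}
    for pool in ET.fromstring(xml_text).findall("pool"):
        pools[pool.get("name")] = {
            "schedulingMode": pool.findtext("schedulingMode", default="FIFO"),
            "weight": int(pool.findtext("weight", default="1")),
            "minShare": int(pool.findtext("minShare", default="0")),
        }
    return pools


def weight_shares(pools):
    """Fraction of resources per pool by weight alone (minShare ignored)."""
    total = sum(p["weight"] for p in pools.values())
    return {name: p["weight"] / total for name, p in pools.items()}


pools = load_pools(ALLOCATIONS)
for name, share in weight_shares(pools).items():
    print(f"{name}: mode={pools[name]['schedulingMode']}, share={share:.2f}")
```

Under this simplification pool2, with weight 2, would receive about twice the resources of pool1 when both have runnable jobs.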


When reposting, please credit the source: https://www.heiqu.com/5efdf20d4769fa87dcb1e166263050fd.html
