How Do I Enable Fair Scheduler in PySpark?
According to the docs:

Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them

And:

The pool properties can be set by creating an XML file, similar to conf/fairscheduler.xml.template, and either putting a file named fairscheduler.xml on the classpath, or setting spark.scheduler.allocation.file property in your SparkConf

So I can do the first part easily enough:

__sp_conf = SparkConf()
__sp_conf.set("spark.scheduler.mode", "FAIR")  # turn on the FAIR scheduler
sc = SparkContext(conf=__sp_conf)
sc.setLocalProperty("spark.scheduler.pool", "default")  # pool for jobs submitted from this thread
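
For reference, the other option from the quoted docs, setting spark.scheduler.allocation.file on the SparkConf, would look something like this minimal sketch; the file path is a hypothetical placeholder:

from pyspark import SparkConf, SparkContext

__sp_conf = SparkConf()
__sp_conf.set("spark.scheduler.mode", "FAIR")
# Hypothetical path: the driver reads this file when the SparkContext is
# created, so it needs to exist wherever the driver process runs.
__sp_conf.set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
sc = SparkContext(conf=__sp_conf)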


But how do I get an XML file called fairscheduler.xml onto the classpath? Also, the classpath of what? Just the driver? Every executor?

I've tried using the addFile() function on SparkContext, but that's really for making files accessible to your jobs; I don't think it adds anything to the classpath.

My other thought was modifying the PYSPARK_SUBMIT_ARGS environment variable to alter the command sent to spark-submit, but I'm not sure there's a way to change the classpath that way. Additionally, that would only alter the classpath of the driver, not of every executor, and I'm not sure whether that would be enough; a sketch of the idea is below.
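
A minimal sketch of that idea, assuming the variable is set before pyspark is imported; the classpath directory is a hypothetical placeholder, and --driver-class-path by definition only touches the driver:

import os

# Hypothetical: prepend a directory containing fairscheduler.xml to the
# driver's classpath. The trailing "pyspark-shell" token is required when
# launching PySpark through PYSPARK_SUBMIT_ARGS.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-class-path /path/to/conf pyspark-shell"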





To be clear, if I don't provide the fairscheduler.xml file, Spark complains:

WARN FairSchedulableBuilder:66 - Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
java apache-spark pyspark

asked Nov 10 at 2:26, edited Nov 10 at 3:24 – FGreg
  • spark-submit --help; there are several useful options. I think the one you really want is --properties-file, which allows you to provide an entire properties file - but you could do it with --conf spark.scheduler.pool=default
    – Elliott Frisch, Nov 10 at 2:32

  • @ElliottFrisch I've set the pool to default already using the spark context: sc.setLocalProperty("spark.scheduler.pool", "default"). I need to also somehow provide the fairscheduler.xml file, or else Spark complains and defaults back to FIFO order.
    – FGreg, Nov 10 at 3:20

  • Run the command I provided. You'll see (also) --files; you can use that to add "fairscheduler.xml" to each container.
    – Elliott Frisch, Nov 10 at 3:29

  • @ElliottFrisch --files does the same thing as sc.addFile, which I've tried and which does not work. According to the docs, that option is a "Comma-separated list of files to be placed in the working directory of each executor. Globs are allowed." It does not say it puts the files on the root of the classpath, which is what needs to happen according to github.com/apache/spark/blob/…
    – FGreg, Nov 10 at 3:34

  • Did you try --jars? These are comments. Maybe someone will answer you.
    – Elliott Frisch, Nov 10 at 3:46
2 Answers
Question: But how do I get an XML file called fairscheduler.xml onto the classpath? Also, the classpath of what? Just the driver? Every executor?

The points below, especially #4, can help here, depending on the mode in which you are submitting the job. Here I am trying to list them all out...

  1. To use the Fair Scheduler, first assign the appropriate scheduler class
    in yarn-site.xml:

    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>

  2. Your way of __sp_conf.set works, or simply the following in spark-defaults.conf:

    sudo vim /etc/spark/conf/spark-defaults.conf

    spark.master yarn
    ...
    spark.yarn.dist.files /etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml
    spark.scheduler.mode FAIR
    spark.scheduler.allocation.file fairscheduler.xml



  3. Copy fairscheduler.xml to /home/hadoop/fairscheduler.xml:

    <?xml version="1.0"?>
    <!-- Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements. See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. -->

    <allocations>
      <pool name="sparkmodule1">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1</weight>
        <minShare>2</minShare>
      </pool>
      <pool name="sparkmodule2">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1</weight>
        <minShare>2</minShare>
      </pool>
      <pool name="test">
        <schedulingMode>FIFO</schedulingMode>
        <weight>1</weight>
        <minShare>2</minShare>
      </pool>
      <pool name="default">
        <schedulingMode>FAIR</schedulingMode>
        <weight>3</weight>
        <minShare>3</minShare>
      </pool>
    </allocations>

    where sparkmodule1, etc. are the modules for which you want to create a dedicated pool of resources.

    Note: you don't need to mention the default pool with sc.setLocalProperty("spark.scheduler.pool", "default"); if there is no matching pool in your fairscheduler.xml, jobs go into the default pool naturally.




  4. A sample spark-submit like the one below when you are in cluster mode:

    # <main-class> and <application-jar> are placeholders for your job's
    # main class and artifact.
    spark-submit --name "jobname" --class <main-class> \
      --master yarn --deploy-mode cluster \
      --files /etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml \
      <application-jar>

Note: in client mode, if you want to run spark-submit from a directory other than the home directory, create a symlink to fairscheduler.xml in the directory you are executing spark-submit from, for example a scripts folder:

    ln -s /home/hadoop/fairscheduler.xml fairscheduler.xml

Note: if you don't want to copy fairscheduler.xml to the /home/hadoop folder, you can create it as /etc/spark/conf/fairscheduler.xml and symlink it into the directory where you are executing spark-submit, as described above.




Reference: Spark Fair scheduler example

To cross-verify: the Environment tab of the Spark UI displays the values for the different environment and configuration variables, including Java™, Spark, and system properties. The fair scheduler allocation file path will be listed there.
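
Besides the UI, a quick sketch of checking the same properties from the driver, assuming sc is the live SparkContext:

# get() returns None if a key was never set, so this doubles as a check
# that the configuration actually reached the driver.
print(sc.getConf().get("spark.scheduler.mode"))
print(sc.getConf().get("spark.scheduler.allocation.file"))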






answered Nov 10 at 6:28 – Ram Ghadiyaram; edited Nov 14 at 20:50 – marc_s
  • So it does need to be available to every executor?
    – FGreg, Nov 12 at 17:41

  • AFAIK yes, if --deploy-mode cluster.
    – Ram Ghadiyaram, Nov 12 at 17:46

  • I don't think it's needed for client mode.
    – Ram Ghadiyaram, Nov 12 at 17:46

  • In my experience, --files was working with deploy-mode cluster.
    – Ram Ghadiyaram, Nov 12 at 17:47

  • Also, it's nothing specific to PySpark; fair scheduling is at the YARN level. Could you remove pyspark from the question title?
    – Ram Ghadiyaram, Nov 12 at 17:49
These are the steps to take:

  • Run a simple Spark application and review the Spark UI History Server.

  • Create a new Spark FAIR Scheduler pool in an external XML file.

  • Set spark.scheduler.pool to the pool created in the external XML file.

  • Update the code to use threads to trigger use of FAIR pools (see the sketch after this list) and rebuild.

  • Re-deploy the Spark application with:

    • the spark.scheduler.mode configuration variable set to FAIR.

    • the spark.scheduler.allocation.file configuration set.

  • Run and review the Spark UI History Server again.
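
A minimal PySpark sketch of the threading step; the pool names and the allocation file path are hypothetical and must match whatever the external XML file defines:

import threading
from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .set("spark.scheduler.mode", "FAIR") \
    .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
sc = SparkContext(conf=conf)

def run_job(pool):
    # spark.scheduler.pool is a per-thread local property, so each thread
    # can direct its jobs into its own FAIR pool.
    sc.setLocalProperty("spark.scheduler.pool", pool)
    print(pool, sc.parallelize(range(100000)).sum())

# Jobs only compete under FAIR scheduling when submitted concurrently,
# hence one thread per pool.
threads = [threading.Thread(target=run_job, args=(p,)) for p in ("pool1", "pool2")]
for t in threads:
    t.start()
for t in threads:
    t.join()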


REFERENCE

Spark Continuous Application with FAIR Scheduler presentation: https://www.youtube.com/watch?v=oXwOQKXo9VE

answered Nov 11 at 20:16 – Nagilla Venkatesh; edited Nov 11 at 20:30 – Charlie