How Do I Enable Fair Scheduler in PySpark?
According to the docs:

Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them.

And:

The pool properties can be set by creating an XML file, similar to conf/fairscheduler.xml.template, and either putting a file named fairscheduler.xml on the classpath, or setting the spark.scheduler.allocation.file property in your SparkConf.

So I can do the first part easily enough:

from pyspark import SparkConf, SparkContext

__sp_conf = SparkConf()
__sp_conf.set("spark.scheduler.mode", "FAIR")
sc = SparkContext(conf=__sp_conf)
sc.setLocalProperty("spark.scheduler.pool", "default")

But how do I get an XML file called fairscheduler.xml onto the classpath? Also, the classpath of what? Just the driver? Every executor?

I've tried using the addFile() function on SparkContext, but that's really for making files accessible from your jobs; I don't think it adds anything to the classpath.

My other thought was modifying the PYSPARK_SUBMIT_ARGS environment variable to alter the command sent to spark-submit, but I'm not sure there's a way to change the classpath that way. Additionally, that would only alter the classpath of the driver, not of every executor, and I'm not sure whether that would be enough.

To be clear, if I don't provide the fairscheduler.xml file, Spark complains:

WARN FairSchedulableBuilder:66 - Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
Tags: java, apache-spark, pyspark
Asked Nov 10 at 2:26 by FGreg; edited Nov 10 at 3:24.
See spark-submit --help; there are several useful options. I think the one you really want is --properties-file, which allows you to provide an entire properties file - but you could do it with --conf spark.scheduler.pool=default.
– Elliott Frisch, Nov 10 at 2:32

@ElliottFrisch I've set the pool to default already using the Spark context: sc.setLocalProperty("spark.scheduler.pool", "default"). I also need to somehow provide the fairscheduler.xml file, or else Spark complains and falls back to FIFO order.
– FGreg, Nov 10 at 3:20

Run the command I provided. You'll also see --files; you can use that to add "fairscheduler.xml" to each container.
– Elliott Frisch, Nov 10 at 3:29

@ElliottFrisch --files does the same thing as sc.addFile, which I've tried and it does not work. According to the docs, that option is a "Comma-separated list of files to be placed in the working directory of each executor. Globs are allowed." It does not say it puts the files on the root of the classpath, which is what needs to happen according to github.com/apache/spark/blob/…
– FGreg, Nov 10 at 3:34

Did you try --jars? These are comments. Maybe someone will answer you.
– Elliott Frisch, Nov 10 at 3:46
2 Answers
Answer (score 2) – Ram Ghadiyaram, answered Nov 10 at 6:28; edited Nov 14 at 20:50 by marc_s
Question: But how do I get an XML file called fairscheduler.xml onto the classpath? Also, the classpath of what? Just the driver? Every executor?

The points below, especially point 4, can help here, depending on the mode in which you submit the job. I am trying to list them all out.

1. To use the Fair Scheduler, first assign the appropriate scheduler class in yarn-site.xml:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
2. Your way of setting the properties with __sp_conf.set(...) works, or you can simply put them in spark-defaults.conf:

sudo vim /etc/spark/conf/spark-defaults.conf

spark.master yarn
...
spark.yarn.dist.files /etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml
spark.scheduler.mode FAIR
spark.scheduler.allocation.file fairscheduler.xml
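Equivalently, here is a minimal PySpark sketch of the same settings (untested; it assumes the file locations shown in the spark-defaults.conf above):

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set("spark.scheduler.mode", "FAIR")
# Ship fairscheduler.xml to the YARN containers, same as spark.yarn.dist.files above.
conf.set("spark.yarn.dist.files", "/etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml")
# Tell the scheduler where to find the pool definitions (relative name resolves
# in the container working directory, mirroring the conf file above).
conf.set("spark.scheduler.allocation.file", "fairscheduler.xml")
sc = SparkContext(conf=conf)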
3. Copy fairscheduler.xml to /home/hadoop/fairscheduler.xml:
<?xml version="1.0"?>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0 Unless required by
applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for
the specific language governing permissions and limitations under the
License. -->
<allocations>
<pool name="sparkmodule1">
<schedulingMode>FAIR</schedulingMode>
<weight>1</weight>
<minShare>2</minShare>
</pool>
<pool name="sparkmodule2">
<schedulingMode>FAIR</schedulingMode>
<weight>1</weight>
<minShare>2</minShare>
</pool>
<pool name="test">
<schedulingMode>FIFO</schedulingMode>
<weight>1</weight>
<minShare>2</minShare>
</pool>
<pool name="default">
<schedulingMode>FAIR</schedulingMode>
<weight>3</weight>
<minShare>3</minShare>
</pool>
</allocations>
Here sparkmodule1, sparkmodule2, ... are the modules for which you want to create dedicated resource pools.

Note: you don't need to select the default pool explicitly with sc.setLocalProperty("spark.scheduler.pool", "default"); if no matching pool is found in your fairscheduler.xml, jobs go into the default pool naturally.
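For example, a short sketch of routing a job into one of the pools above (it assumes a SparkContext sc configured for FAIR scheduling as shown earlier; the job itself is illustrative):

# Select the pool for jobs submitted from the current thread.
sc.setLocalProperty("spark.scheduler.pool", "sparkmodule1")
sc.parallelize(range(1000)).sum()  # this job is scheduled in the sparkmodule1 pool

# Clear the property to fall back to the default pool.
sc.setLocalProperty("spark.scheduler.pool", None)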
4. A sample spark-submit in cluster mode looks like this:

spark-submit --name "jobname" --class <your main class> \
  --master yarn --deploy-mode cluster \
  --files /etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml \
  ...
Note: in client mode, if you want to submit a Spark job from a directory other than your home directory, create a symlink to fairscheduler.xml in that directory, for example in the scripts folder you are executing spark-submit from:

ln -s /home/hadoop/fairscheduler.xml fairscheduler.xml
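Alternatively (an untested sketch; my_app.py is a placeholder), in client mode you can avoid the symlink by pointing spark.scheduler.allocation.file at an absolute path:

spark-submit \
  --master yarn --deploy-mode client \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/home/hadoop/fairscheduler.xml \
  my_app.py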
Note: if you don't want to copy fairscheduler.xml to the /home/hadoop folder, you can create it as /etc/spark/conf/fairscheduler.xml and symlink it into the directory you are executing spark-submit from, as described above.
Reference: Spark Fair Scheduler example
To cross-verify: the Environment tab of the Spark UI displays the values of the various environment and configuration variables, including Java, Spark, and system properties; the fair-scheduler allocation file path will show up there.

(Screenshot: Spark UI Environment tab.)
So it does need to be available to every executor?
– FGreg, Nov 12 at 17:41

AFAIK yes, if --deploy-mode cluster.
– Ram Ghadiyaram, Nov 12 at 17:46

I don't think it's needed for client mode.
– Ram Ghadiyaram, Nov 12 at 17:46

In my experience, --files was working with --deploy-mode cluster.
– Ram Ghadiyaram, Nov 12 at 17:47

Also, it's nothing specific to PySpark; fair scheduling here is at the YARN level. Could you remove pyspark from the question title?
– Ram Ghadiyaram, Nov 12 at 17:49
Answer (score 0) – Nagilla Venkatesh, answered Nov 11 at 20:16; edited Nov 11 at 20:30 by Charlie
We will take the following steps:

- Run a simple Spark application and review the Spark UI History Server.
- Create a new Spark FAIR Scheduler pool in an external XML file.
- Set spark.scheduler.pool to the pool created in the external XML file.
- Update the code to use threads to trigger use of the FAIR pools, and rebuild (see the sketch after this list).
- Re-deploy the Spark application with the spark.scheduler.mode configuration variable set to FAIR and the spark.scheduler.allocation.file configuration pointing at the XML file.
- Run again and review the Spark UI History Server.
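A minimal sketch of the threading step (the pool names, the allocation-file path, and the job bodies are illustrative assumptions):

import threading
from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .set("spark.scheduler.mode", "FAIR") \
    .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
sc = SparkContext(conf=conf)

def run_job(pool_name, data):
    # spark.scheduler.pool is a thread-local property, so each thread
    # can submit its jobs into a different FAIR pool.
    sc.setLocalProperty("spark.scheduler.pool", pool_name)
    print(pool_name, sc.parallelize(data).sum())

threads = [threading.Thread(target=run_job, args=("pool_a", range(1000))),
           threading.Thread(target=run_job, args=("pool_b", range(1000)))]
for t in threads:
    t.start()
for t in threads:
    t.join()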
REFERENCE
Spark Continuous Application with FAIR Scheduler presentation https://www.youtube.com/watch?v=oXwOQKXo9VE