How Do I Enable Fair Scheduler in PySpark?
According to the docs:

Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them

And:

The pool properties can be set by creating an XML file, similar to conf/fairscheduler.xml.template, and either putting a file named fairscheduler.xml on the classpath, or setting spark.scheduler.allocation.file property in your SparkConf

So I can do the first part easily enough:

__sp_conf = SparkConf()
__sp_conf.set("spark.scheduler.mode", "FAIR")  # turn on the FAIR scheduler
sc = SparkContext(conf=__sp_conf)
sc.setLocalProperty("spark.scheduler.pool", "default")  # pool for jobs submitted from this thread
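
For reference, the other option from the quoted docs, setting spark.scheduler.allocation.file on the SparkConf, would look something like this minimal sketch; the file path is a hypothetical placeholder:

from pyspark import SparkConf, SparkContext

__sp_conf = SparkConf()
__sp_conf.set("spark.scheduler.mode", "FAIR")
# Hypothetical path: the driver reads this file when the SparkContext is
# created, so it needs to exist wherever the driver process runs.
__sp_conf.set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
sc = SparkContext(conf=__sp_conf)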


But how do I get an XML file called fairscheduler.xml onto the classpath? Also, the classpath of what? Just the driver? Every executor?

I've tried using the addFile() function on SparkContext, but that's really for making files accessible to your jobs; I don't think it adds anything to the classpath.

My other thought was modifying the PYSPARK_SUBMIT_ARGS environment variable to alter the command sent to spark-submit, but I'm not sure there's a way to change the classpath that way. Additionally, that would only alter the classpath of the driver, not of every executor, and I'm not sure whether that would be enough; a sketch of the idea is below.
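
A minimal sketch of that idea, assuming the variable is set before pyspark is imported; the classpath directory is a hypothetical placeholder, and --driver-class-path by definition only touches the driver:

import os

# Hypothetical: prepend a directory containing fairscheduler.xml to the
# driver's classpath. The trailing "pyspark-shell" token is required when
# launching PySpark through PYSPARK_SUBMIT_ARGS.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-class-path /path/to/conf pyspark-shell"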





To be clear, if I don't provide the fairscheduler.xml file, Spark complains:

WARN FairSchedulableBuilder:66 - Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
java apache-spark pyspark

asked Nov 10 at 2:26, edited Nov 10 at 3:24 – FGreg
  • spark-submit --help; there are several useful options. I think the one you really want is --properties-file, which allows you to provide an entire properties file - but you could do it with --conf spark.scheduler.pool=default
    – Elliott Frisch, Nov 10 at 2:32

  • @ElliottFrisch I've set the pool to default already using the spark context: sc.setLocalProperty("spark.scheduler.pool", "default"). I need to also somehow provide the fairscheduler.xml file, or else Spark complains and defaults back to FIFO order.
    – FGreg, Nov 10 at 3:20

  • Run the command I provided. You'll see (also) --files; you can use that to add "fairscheduler.xml" to each container.
    – Elliott Frisch, Nov 10 at 3:29

  • @ElliottFrisch --files does the same thing as sc.addFile, which I've tried and which does not work. According to the docs, that option is a "Comma-separated list of files to be placed in the working directory of each executor. Globs are allowed." It does not say it puts the files on the root of the classpath, which is what needs to happen according to github.com/apache/spark/blob/…
    – FGreg, Nov 10 at 3:34

  • Did you try --jars? These are comments. Maybe someone will answer you.
    – Elliott Frisch, Nov 10 at 3:46
2 Answers
Question: But how do I get an XML file called fairscheduler.xml onto the classpath? Also, the classpath of what? Just the driver? Every executor?

The points below, especially #4, can help here, depending on the mode in which you are submitting the job. Here I am trying to list them all out...

  1. To use the Fair Scheduler, first assign the appropriate scheduler class
    in yarn-site.xml:

    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>

  2. Your way of __sp_conf.set works, or simply the following in spark-defaults.conf:

    sudo vim /etc/spark/conf/spark-defaults.conf

    spark.master yarn
    ...
    spark.yarn.dist.files /etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml
    spark.scheduler.mode FAIR
    spark.scheduler.allocation.file fairscheduler.xml



  3. Copy fairscheduler.xml to /home/hadoop/fairscheduler.xml:

    <?xml version="1.0"?>
    <!-- Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements. See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. -->

    <allocations>
      <pool name="sparkmodule1">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1</weight>
        <minShare>2</minShare>
      </pool>
      <pool name="sparkmodule2">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1</weight>
        <minShare>2</minShare>
      </pool>
      <pool name="test">
        <schedulingMode>FIFO</schedulingMode>
        <weight>1</weight>
        <minShare>2</minShare>
      </pool>
      <pool name="default">
        <schedulingMode>FAIR</schedulingMode>
        <weight>3</weight>
        <minShare>3</minShare>
      </pool>
    </allocations>

    where sparkmodule1, etc. are the modules for which you want to create a dedicated pool of resources.

    Note: you don't need to mention the default pool with sc.setLocalProperty("spark.scheduler.pool", "default"); if there is no matching pool in your fairscheduler.xml, jobs go into the default pool naturally.




  4. A sample spark-submit like the one below when you are in cluster mode:

    # <main-class> and <application-jar> are placeholders for your job's
    # main class and artifact.
    spark-submit --name "jobname" --class <main-class> \
      --master yarn --deploy-mode cluster \
      --files /etc/spark/conf/hive-site.xml,/home/hadoop/fairscheduler.xml \
      <application-jar>

Note: in client mode, if you want to run spark-submit from a directory other than the home directory, create a symlink to fairscheduler.xml in the directory you are executing spark-submit from, for example a scripts folder:

    ln -s /home/hadoop/fairscheduler.xml fairscheduler.xml

Note: if you don't want to copy fairscheduler.xml to the /home/hadoop folder, you can create it as /etc/spark/conf/fairscheduler.xml and symlink it into the directory where you are executing spark-submit, as described above.




Reference: Spark Fair scheduler example

To cross-verify: the Environment tab of the Spark UI displays the values for the different environment and configuration variables, including Java™, Spark, and system properties. The fair scheduler allocation file path will be listed there.
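
Besides the UI, a quick sketch of checking the same properties from the driver, assuming sc is the live SparkContext:

# get() returns None if a key was never set, so this doubles as a check
# that the configuration actually reached the driver.
print(sc.getConf().get("spark.scheduler.mode"))
print(sc.getConf().get("spark.scheduler.allocation.file"))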






answered Nov 10 at 6:28 – Ram Ghadiyaram; edited Nov 14 at 20:50 – marc_s
  • So it does need to be available to every executor?
    – FGreg, Nov 12 at 17:41

  • AFAIK yes, if --deploy-mode cluster.
    – Ram Ghadiyaram, Nov 12 at 17:46

  • I don't think it's needed for client mode.
    – Ram Ghadiyaram, Nov 12 at 17:46

  • In my experience, --files was working with deploy-mode cluster.
    – Ram Ghadiyaram, Nov 12 at 17:47

  • Also, it's nothing specific to PySpark; fair scheduling is at the YARN level. Could you remove pyspark from the question title?
    – Ram Ghadiyaram, Nov 12 at 17:49
These are the steps to take:

  • Run a simple Spark application and review the Spark UI History Server.

  • Create a new Spark FAIR Scheduler pool in an external XML file.

  • Set spark.scheduler.pool to the pool created in the external XML file.

  • Update the code to use threads to trigger use of FAIR pools (see the sketch after this list) and rebuild.

  • Re-deploy the Spark application with:

    • the spark.scheduler.mode configuration variable set to FAIR.

    • the spark.scheduler.allocation.file configuration set.

  • Run and review the Spark UI History Server again.
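
A minimal PySpark sketch of the threading step; the pool names and the allocation file path are hypothetical and must match whatever the external XML file defines:

import threading
from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .set("spark.scheduler.mode", "FAIR") \
    .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
sc = SparkContext(conf=conf)

def run_job(pool):
    # spark.scheduler.pool is a per-thread local property, so each thread
    # can direct its jobs into its own FAIR pool.
    sc.setLocalProperty("spark.scheduler.pool", pool)
    print(pool, sc.parallelize(range(100000)).sum())

# Jobs only compete under FAIR scheduling when submitted concurrently,
# hence one thread per pool.
threads = [threading.Thread(target=run_job, args=(p,)) for p in ("pool1", "pool2")]
for t in threads:
    t.start()
for t in threads:
    t.join()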


REFERENCE

Spark Continuous Application with FAIR Scheduler presentation: https://www.youtube.com/watch?v=oXwOQKXo9VE

answered Nov 11 at 20:16 – Nagilla Venkatesh; edited Nov 11 at 20:30 – Charlie