Run PySpark job from .egg instead of .py











I am trying to run a PySpark job on Dataproc. The only difference from the examples I have found is that I want to submit the job from an .egg file instead of a .py file.



On a regular commodity cluster, I would submit the PySpark job with something like:



spark2-submit --master yarn \
  --driver-memory 20g \
  --deploy-mode client \
  --conf parquet.compression=SNAPPY \
  --jars spark-avro_2.11-3.2.0.jar \
  --py-files dummyproject-1_spark-py2.7.egg \
  dummyproject-1_spark-py2.7.egg#__main__.py "param1" "param2"


Now, I want to submit exactly the same job but using Dataproc.
In order to accomplish this I am using the following command:



gcloud dataproc jobs submit pyspark \
  file:///dummyproject-1_spark-py2.7.egg#__main__.py \
  --cluster=my-cluster-001 \
  --py-files=file:///dummyproject-1_spark-py2.7.egg


The error I am getting is:




Error: Cannot load main class from JAR
file:/dummyproject-1_spark-py2.7.egg




Note that a simple PySpark job submitted from a .py file runs correctly.



How can I run a PySpark job from an .egg file instead of a .py file?










      pyspark google-cloud-platform google-cloud-dataproc dataproc






asked Nov 9 at 3:22 by dbustosp, edited Nov 9 at 3:41 by Igor Dvorzhak
























1 Answer (accepted)

It looks like there is a bug in how gcloud dataproc parses the arguments, which makes Spark try to execute your file as a Java JAR file. A workaround is to copy your __main__.py file out of the egg and submit it as the main file, while still passing the egg in --py-files, like this:



gcloud dataproc jobs submit pyspark \
  --cluster=my-cluster-001 \
  --py-files=file:///dummyproject-1_spark-py2.7.egg \
  file:///__main__.py
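
The copy step in this workaround can be scripted. The following is only a minimal sketch, assuming the egg was built as a zipped egg (an ordinary zip archive) with __main__.py at its root; extract_entry_point is a hypothetical helper name used for illustration, not part of any library.

```python
import zipfile


def extract_entry_point(egg_path, dest_dir, member="__main__.py"):
    """Copy a single file out of a zipped egg and return the path written.

    Assumes the egg is a plain zip archive and that `member` is stored
    at the archive root under exactly that name.
    """
    if not zipfile.is_zipfile(egg_path):
        raise ValueError("%s is not a zipped egg" % egg_path)
    with zipfile.ZipFile(egg_path) as egg:
        if member not in egg.namelist():
            raise KeyError("%s not found in %s" % (member, egg_path))
        # ZipFile.extract returns the path of the file it created.
        return egg.extract(member, path=dest_dir)
```

For example, extract_entry_point("dummyproject-1_spark-py2.7.egg", ".") would write __main__.py into the current directory; the file can then be placed wherever the submit command expects it. The egg itself is left unchanged, so it can still be passed through --py-files.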





answered Nov 9 at 14:53 by hlagos, edited Nov 12 at 11:38 by Igor Dvorzhak






























                     
