Run PySpark job from .egg instead of .py

I am trying to run a PySpark job on Dataproc. The only difference from the examples I have found is that I want to submit the job from an .egg file instead of a .py file.

On a regular commodity cluster, I would submit the PySpark job with something like:

spark2-submit --master yarn \
    --driver-memory 20g \
    --deploy-mode client \
    --conf parquet.compression=SNAPPY \
    --jars spark-avro_2.11-3.2.0.jar \
    --py-files dummyproject-1_spark-py2.7.egg \
    dummyproject-1_spark-py2.7.egg#__main__.py "param1" "param2"

Now I want to submit exactly the same job using Dataproc. To do so, I am using the following command:

gcloud dataproc jobs submit pyspark \
    file:///dummyproject-1_spark-py2.7.egg#__main__.py \
    --cluster=my-cluster-001 \
    --py-files=file:///dummyproject-1_spark-py2.7.egg

The error I am getting is:

Error: Cannot load main class from JAR
file:/dummyproject-1_spark-py2.7.egg

It is worth mentioning that a simple PySpark job submitted from a .py file runs correctly.

Can somebody tell me how I can run a PySpark job from an .egg file instead of a .py file?

edited Nov 9 at 3:41 by Igor Dvorzhak

asked Nov 9 at 3:22 by dbustosp

pyspark google-cloud-platform google-cloud-dataproc dataproc

1 Answer (accepted)

It looks like there is a bug in how gcloud dataproc parses the arguments, which makes Spark try to execute your file as a Java JAR file. A workaround is to copy your __main__.py file out of the egg file and submit it separately, like this:

gcloud dataproc jobs submit pyspark \
    --cluster=my-cluster-001 \
    --py-files=file:///dummyproject-1_spark-py2.7.egg \
    file:///__main__.py
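
If it helps, the "copy __main__.py out of the egg" step can be scripted. The sketch below is only an illustration and rests on two assumptions: that the egg is a zipped egg (which is a regular zip archive), and that __main__.py sits at the root of the archive. The helper name extract_main and the throwaway egg built in the demo are hypothetical, not part of the original setup.

```python
import os
import tempfile
import zipfile


def extract_main(egg_path, dest_dir, member="__main__.py"):
    """Copy `member` out of a zipped egg into dest_dir and return its path.

    Assumes the egg is a zip archive with `member` at its root.
    Raises KeyError if the archive has no such member.
    """
    with zipfile.ZipFile(egg_path) as egg:
        # extract() creates dest_dir if needed and returns the written path
        return egg.extract(member, path=dest_dir)


if __name__ == "__main__":
    # Demo with a throwaway egg; point egg_path at your real egg instead.
    work_dir = tempfile.mkdtemp()
    egg_path = os.path.join(work_dir, "dummyproject-1_spark-py2.7.egg")
    with zipfile.ZipFile(egg_path, "w") as egg:
        egg.writestr("__main__.py", "print('hello from the egg')\n")
    print(extract_main(egg_path, work_dir))
```

The extracted file can then be passed as the main Python file, with the egg still supplied through --py-files as in the command above. If your __main__.py relies on relative imports, it may need to be adjusted to absolute imports before it runs as a standalone script.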

edited Nov 12 at 11:38 by Igor Dvorzhak

answered Nov 9 at 14:53 by hlagos
