Run PySpark job from .egg instead of .py
I am trying to run a PySpark job on Dataproc. The only difference from the examples I have found is that I want to submit the job from an .egg file instead of a .py file.
On a regular commodity cluster, I would submit the PySpark job like this:
    spark2-submit --master yarn \
        --driver-memory 20g \
        --deploy-mode client \
        --conf parquet.compression=SNAPPY \
        --jars spark-avro_2.11-3.2.0.jar \
        --py-files dummyproject-1_spark-py2.7.egg \
        dummyproject-1_spark-py2.7.egg#__main__.py "param1" "param2"
Now I want to submit exactly the same job using Dataproc.
To do so, I am using the following command:
    gcloud dataproc jobs submit pyspark \
        file:///dummyproject-1_spark-py2.7.egg#__main__.py \
        --cluster=my-cluster-001 \
        --py-files=file:///dummyproject-1_spark-py2.7.egg
The error I am getting is:
    Error: Cannot load main class from JAR file:/dummyproject-1_spark-py2.7.egg
It is worth mentioning that a simple PySpark job submitted from a .py file runs correctly.
How can I run a PySpark job from an .egg file instead of a .py file?
pyspark
edited Nov 9 at 3:41
Igor Dvorzhak
asked Nov 9 at 3:22
dbustosp
1 Answer
accepted
It looks like there is a bug in how gcloud dataproc parses the arguments, which makes Spark try to execute your file as a Java JAR file. A workaround is to copy your __main__.py file out of the egg file and execute it independently, like this:
    gcloud dataproc jobs submit pyspark \
        --cluster=my-cluster-001 \
        --py-files=file:///dummyproject-1_spark-py2.7.egg \
        file:///__main__.py
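For the "copy __main__.py out of the egg" step, here is a minimal sketch, assuming the egg is a zipped egg (an ordinary zip archive) with __main__.py at its top level and that Python 3 is available where you prepare the job. The helper name extract_entry_point and the output directory are made up for illustration; they are not part of gcloud or Spark.

```python
import os
import zipfile


def extract_entry_point(egg_path, dest_dir, member="__main__.py"):
    """Copy one file out of a zipped egg into dest_dir and return its path.

    Hypothetical helper: assumes egg_path is a zip archive that
    contains `member` (by default a top-level __main__.py).
    """
    with zipfile.ZipFile(egg_path) as egg:
        # Read the member's bytes directly; raises KeyError if it is missing.
        data = egg.read(member)
    os.makedirs(dest_dir, exist_ok=True)
    # Write under the bare file name so no archive sub-directories are recreated.
    target = os.path.join(dest_dir, os.path.basename(member))
    with open(target, "wb") as out:
        out.write(data)
    return target


if __name__ == "__main__":
    # Example paths only; adjust to wherever the egg actually lives.
    print(extract_entry_point("dummyproject-1_spark-py2.7.egg", "extracted"))
```

The extracted file is then the one to pass as the main Python file in the command above, with the egg still supplied through --py-files so the rest of the package stays importable. If the job needs arguments such as "param1" and "param2", gcloud expects them after a `--` separator at the end of the command.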
edited Nov 12 at 11:38
Igor Dvorzhak
answered Nov 9 at 14:53
hlagos