Dataflow workers get stuck with a custom setup.py when reading data from BQ
I've been trying to get the Dataflow runner to work all day, without success. The job gets submitted to Dataflow, but the workers then sit idle for an hour.
Everything runs as expected locally. The pipeline is:
Data from a BQ source -> some data manipulation -> writing TFRecords
I think something goes wrong when reading the data from BQ:
Job Type State Start Time Duration User Email Bytes Processed Bytes Billed Billing Tier Labels
---------- --------- ----------------- ---------- ---------------------------------------------------- ----------------- -------------- -------------- --------
extract SUCCESS 08 Nov 11:06:10 0:00:02 27xxxxxxx7565-compute@developer.gserviceaccount.com
Looks like nothing has been processed.
Basic Pipeline:
import apache_beam as beam
import datetime
import tensorflow_transform.beam.impl as beam_impl
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions

@beam.ptransform_fn
def ReadDataFromBQ(pcoll, project, dataset, table):
    # Table-based source; validate=True checks that the table exists
    # at pipeline-construction time.
    bq = beam.io.BigQuerySource(dataset=dataset,
                                table=table,
                                project=project,
                                validate=True,
                                use_standard_sql=True)
    return pcoll | "ReadFromBQ" >> beam.io.Read(bq)

with beam.Pipeline(options=options) as pipeline:
    with beam_impl.Context(temp_dir=google_cloud_options.temp_location):
        train_data = pipeline | 'LoadTrainData' >> ReadDataFromBQ(dataset='d_name',
                                                                  project='project-name',
                                                                  table='table_name')
Even with this basic pipeline, it still doesn't work.
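For reference, the options object used above is constructed roughly like this (a minimal sketch; the project, bucket, and job names are placeholders, not the real values):

from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions)

options = PipelineOptions()

# Placeholder project/bucket/job names -- not the real values.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'project-name'
google_cloud_options.job_name = 'bq-to-tfrecords'
google_cloud_options.staging_location = 'gs://bucket/staging'
google_cloud_options.temp_location = 'gs://bucket/temp'

options.view_as(StandardOptions).runner = 'DataflowRunner'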
I'm using version 2.7.0 of the SDK.
import apache_beam as beam
beam.__version__
'2.7.0' # local
My setup.py file is:
import setuptools
from setuptools import find_packages

REQUIRES = ['tensorflow_transform']

setuptools.setup(
    name='Beam',
    version='0.0.1',
    install_requires=REQUIRES,
    packages=find_packages(),
)
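The setup.py is handed to the Dataflow workers via the setup_file pipeline option, so that the declared dependency (tensorflow_transform) gets installed on them at startup. Roughly:

from apache_beam.options.pipeline_options import SetupOptions

# Point Dataflow at the local setup.py so workers pip-install the
# dependencies declared in it before running the job.
options.view_as(SetupOptions).setup_file = './setup.py'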
Workflow failed. Causes: The Dataflow job appears to be stuck because
no worker activity has been seen in the last 1h. You can get help with
Cloud Dataflow at https://cloud.google.com/dataflow/support.
Job id 2018-11-07_12_27_39-17873629436928290134 for the full pipeline.
Job id 2018-11-08_04_30_38-16805982576734763423 for the reduced pipeline (just read from BQ and write text to GCS, roughly as sketched below).
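The reduced pipeline is essentially the following (a sketch; the output path is a placeholder):

# Reduced pipeline: read rows from BQ, write them as text to GCS.
with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(
           project='project-name', dataset='d_name', table='table_name'))
     | 'FormatRows' >> beam.Map(str)  # each row is a dict; stringify it for text output
     | 'WriteToGCS' >> beam.io.WriteToText('gs://bucket/output/rows'))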
Prior to this, everything seemed to be working correctly:
2018-11-07 (20:29:44) BigQuery export job "dataflow_job_5858975562210600855" started. You can check its status with the bq...
2018-11-07 (20:29:44) BigQuery export job "dataflow_job_5509154328514239323" started. You can check its status with the bq...
2018-11-07 (20:30:15) BigQuery export job finished: "dataflow_job_5858975562210600855"
2018-11-07 (20:30:15) BigQuery export job finished: "dataflow_job_5509154328514239323"
2018-11-07 (21:30:15) Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been se...
python google-cloud-dataflow apache-beam
asked Nov 8 at 10:32 by GRS, edited Nov 8 at 12:31