Dataflow workers get stuck with a custom setup.py when reading data from BQ























I've been trying to get the Dataflow runner to work all day, without success. The job is submitted to Dataflow, but the workers then do nothing for an hour.



Everything runs as expected locally. The process is:



Data from BQ Source -> Some data manipulation -> Writing TF Records



I think something goes wrong when reading data from BQ:



  Job Type   State     Start Time        Duration   User Email                                            Bytes Processed   Bytes Billed   Billing Tier   Labels
 ---------- --------- ----------------- ---------- ----------------------------------------------------- ----------------- -------------- -------------- --------
  extract    SUCCESS   08 Nov 11:06:10   0:00:02    27xxxxxxx7565-compute@developer.gserviceaccount.com


Looks like nothing has been processed.
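For what it's worth, the export job can also be inspected with the BigQuery Python client. This is just a sketch; the project, location and job ID below are placeholders taken from the output above:

from google.cloud import bigquery

# Placeholders: substitute your own project and the "dataflow_job_..." ID
# from the Dataflow logs; the 'US' location is an assumption.
client = bigquery.Client(project='project-name')
job = client.get_job('dataflow_job_5858975562210600855', location='US')

print(job.job_type, job.state)    # expect 'extract', 'DONE'
print(job.destination_uris)       # GCS files the table was exported to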



Basic Pipeline:



import apache_beam as beam
import datetime
import tensorflow_transform.beam.impl as beam_impl
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions


@beam.ptransform_fn
def ReadDataFromBQ(pcoll, project, dataset, table):
    bq = beam.io.BigQuerySource(dataset=dataset,
                                table=table,
                                project=project,
                                validate=True,
                                use_standard_sql=True)

    return pcoll | "ReadFromBQ" >> beam.io.Read(bq)


with beam.Pipeline(options=options) as pipeline:
    with beam_impl.Context(temp_dir=google_cloud_options.temp_location):

        train_data = pipeline | 'LoadTrainData' >> ReadDataFromBQ(dataset='d_name',
                                                                  project='project-name',
                                                                  table='table_name')


Even this basic pipeline doesn't work.
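For context, the options and google_cloud_options objects referenced above are built separately; a minimal sketch (project, bucket and job name are placeholders) looks like this, including the custom setup.py so tensorflow_transform gets installed on the workers:

from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, SetupOptions, StandardOptions)

# Placeholder project / bucket / job name.
options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'project-name'
google_cloud_options.job_name = 'bq-to-tfrecords'
google_cloud_options.temp_location = 'gs://my-bucket/temp'
google_cloud_options.staging_location = 'gs://my-bucket/staging'
options.view_as(StandardOptions).runner = 'DataflowRunner'
# Ship the custom setup.py so the workers install tensorflow_transform.
options.view_as(SetupOptions).setup_file = './setup.py'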



I'm using version 2.7.0 of the SDK:



import apache_beam as beam
beam.__version__
'2.7.0' # local


My setup.py file is:



import setuptools
from setuptools import find_packages

REQUIRES = ['tensorflow_transform']

setuptools.setup(
    name='Beam',
    version='0.0.1',
    install_requires=REQUIRES,
    packages=find_packages(),
)
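As a quick local sanity check (run from the project root), find_packages() can be printed to confirm it actually discovers the package the workers need; an empty list would mean no user code gets shipped:

from setuptools import find_packages

# Should list the local package(s); an empty list means find_packages()
# found nothing to install on the Dataflow workers.
print(find_packages())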



After about an hour with no worker activity, the workflow fails with:

Workflow failed. Causes: The Dataflow job appears to be stuck because
no worker activity has been seen in the last 1h. You can get help with
Cloud Dataflow at https://cloud.google.com/dataflow/support.





1. Job ID 2018-11-07_12_27_39-17873629436928290134 for the full pipeline.

2. Job ID 2018-11-08_04_30_38-16805982576734763423 for a reduced pipeline (just reading from BQ and writing text to GCS).



Prior to this, everything seemed to be working correctly:



 2018-11-07 (20:29:44) BigQuery export job "dataflow_job_5858975562210600855" started. You can check its status with the bq...

2018-11-07 (20:29:44) BigQuery export job "dataflow_job_5509154328514239323" started. You can check its status with the bq...

2018-11-07 (20:30:15) BigQuery export job finished: "dataflow_job_5858975562210600855"

2018-11-07 (20:30:15) BigQuery export job finished: "dataflow_job_5509154328514239323"

2018-11-07 (21:30:15) Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been se...









      python google-cloud-dataflow apache-beam






asked Nov 8 at 10:32 by GRS (edited Nov 8 at 12:31)




























