pyspark: insert into dataframe if key not present or row.timestamp is more recent











I have a Kudu database with a table in it. Every day, I launch a batch job which receives new data to ingest (an ETL pipeline).



I would like to insert the new data if:




  • the key is not present

  • if the key is present, update the row only if the timestamp column of the new row is more recent










  • Did you try anything in PySpark? If yes, what did you try?
    – karma4917
    Nov 8 at 18:51










  • Hello @karma4917, I haven't tried anything yet because I'm not sure how to proceed. In a more conventional programming environment, I would have sorted the dataset and then, with a single scan, found the rows with the same key and applied the logic. This is not very efficient (in either space or time), but it works. I'm not sure how to proceed with the Spark way of doing things in mind, or whether there are more efficient ways.
    – Federico Ponzi
    Nov 9 at 12:18
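
The sort-and-scan idea from the comment above can be sketched in plain Python. This is only an illustration of the logic, not Spark code: the function name `merge_latest` and the `(key, timestamp, value)` row layout are assumptions made for the example. On equal timestamps the existing row wins, which matches the "only if more recent" rule from the question.

```python
from itertools import groupby
from operator import itemgetter


def merge_latest(existing, incoming):
    """Merge two lists of (key, timestamp, value) rows, keeping one row per key.

    The row with the most recent timestamp wins; on a tie the existing row is kept.
    """
    # First pass: newest rows first. The sort is stable (also with reverse=True),
    # so on equal timestamps rows from `existing` stay ahead of `incoming`.
    rows = sorted(existing + incoming, key=itemgetter(1), reverse=True)
    # Second pass: group rows by key without disturbing the timestamp order.
    rows.sort(key=itemgetter(0))
    # Scan: the first row of every key group is the one to keep.
    return [next(group) for _, group in groupby(rows, key=itemgetter(0))]


existing = [("a", 10, "old-a"), ("b", 20, "old-b")]
incoming = [("a", 15, "new-a"), ("b", 5, "stale-b"), ("c", 1, "new-c")]
print(merge_latest(existing, incoming))
```

Here `a` is updated, `b` keeps its stored row because the incoming one is older, and `c` is inserted as a new key.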

















apache-spark pyspark apache-spark-sql apache-kudu






edited Nov 9 at 22:17 by tk421
asked Nov 8 at 16:57 by Federico Ponzi












1 Answer
I think what you need is a left outer join of the new data with the existing table: keep the new rows that either have no match in the existing table or have a more recent timestamp than the matching row. Save that result into a temporary table first, and then move it into the original table with SaveMode.Append.



You might also be interested in using Spark Structured Streaming or Kafka instead of batch jobs. I even found an example on GitHub (I haven't checked how well it works, though, or whether it takes existing data into account).






edited Nov 9 at 15:12
answered Nov 9 at 13:35 by Bernhard Stadler






























             
