pyspark: insert into dataframe if key not present or row.timestamp is more recent

I have a Kudu database with a table in it. Every day, I launch a batch job which receives new data to ingest (an ETL pipeline).



I would like to insert the new data if:




  • the key is not present, or

  • the key is present but the new row's timestamp column is more recent (in which case the existing row should be updated)

apache-spark pyspark apache-spark-sql apache-kudu

edited Nov 9 at 22:17 by tk421
asked Nov 8 at 16:57 by Federico Ponzi

  • Did you try anything in PySpark? If yes, what did you try?
    – karma4917
    Nov 8 at 18:51










  • Hello @karma4917, I haven't tried anything yet because I'm not sure how to proceed. In a more imperative setting, I would sort the dataset and scan it for rows with the same key, then apply the logic. That isn't very efficient (in either space or time), but it works. I'm just not sure how to translate it into Spark's model, or whether there are more efficient ways.
    – Federico Ponzi
    Nov 9 at 12:18

1 Answer

I think what you need is a left outer join of the new data with the existing table. Save the result of the join into a temporary table first, then move it into the original table with SaveMode.Append.
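
A minimal PySpark sketch of that approach, assuming a key column id, a timestamp column ts, and the kudu-spark connector. The master address, table name, and input path are hypothetical, and the temporary-table staging step is omitted for brevity:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("kudu-upsert").getOrCreate()

    kudu_options = {
        "kudu.master": "kudu-master:7051",         # hypothetical master address
        "kudu.table": "impala::default.my_table",  # hypothetical table name
    }

    # Current contents of the Kudu table and the new daily batch.
    existing = (spark.read.format("org.apache.kudu.spark.kudu")
                .options(**kudu_options).load())
    new_data = spark.read.parquet("/staging/new_batch")  # hypothetical source

    # Left outer join the new rows against the existing keys and timestamps.
    existing_keys = existing.select("id", F.col("ts").alias("existing_ts"))
    joined = new_data.join(existing_keys, on="id", how="left_outer")

    # Keep a new row if its key is absent, or if its timestamp is more recent.
    to_write = (joined
                .where(F.col("existing_ts").isNull()
                       | (F.col("ts") > F.col("existing_ts")))
                .drop("existing_ts"))

    # mode("append") is PySpark's counterpart of SaveMode.Append; whether rows
    # with existing keys are upserted or rejected depends on the connector version.
    (to_write.write.format("org.apache.kudu.spark.kudu")
        .options(**kudu_options).mode("append").save())

After the filter, the surviving rows are exactly the new keys plus the keys whose incoming timestamp beats the stored one, so writing them back matches the two conditions in the question.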



You might also be interested in using Spark Structured Streaming or Kafka instead of batch jobs. I even found an example on GitHub (I didn't check how well it works, though, or whether it takes existing data into account).






edited Nov 9 at 15:12
answered Nov 9 at 13:35 by Bernhard Stadler