pyspark: insert into dataframe if key not present or row.timestamp is more recent
I have a Kudu database with a table in it. Every day, I launch a batch job which receives new data to ingest (an ETL pipeline).
I would like to handle the new data as follows:
- if the key is not present, insert the row
- if the key is present, update the existing row only if the new row's timestamp column is more recent
apache-spark pyspark apache-spark-sql apache-kudu
asked Nov 8 at 16:57 by Federico Ponzi
edited Nov 9 at 22:17 by tk421
Did you try anything in PySpark? If yes, what did you try?
– karma4917
Nov 8 at 18:51
Hello @karma4917, I haven't tried anything yet because I'm not sure how to proceed. In a more imperative setting, I would sort the dataset and scan it to find rows sharing the same key, then apply the logic to each group. That works, but it isn't very efficient in either space or time, and I'm not sure how to translate it into Spark's model, or whether there are more efficient approaches.
– Federico Ponzi
Nov 9 at 12:18
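A minimal PySpark sketch of the sort-and-scan idea from the comment above, expressed as a window function so Spark distributes the work per key; the column names key, ts and payload below are hypothetical stand-ins for the real schema:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

    # Toy data standing in for the union of existing and new rows;
    # 'key', 'ts' and 'payload' are hypothetical column names.
    rows = [("a", 1, "old"), ("a", 2, "new"), ("b", 1, "only")]
    df = spark.createDataFrame(rows, ["key", "ts", "payload"])

    # Sort within each key by timestamp descending and keep the first row:
    # the same sorted scan as in the comment, but done per key by Spark.
    w = Window.partitionBy("key").orderBy(F.col("ts").desc())
    latest = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))
    latest.show()  # keeps ("a", 2, "new") and ("b", 1, "only")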
1 Answer
I think what you need is a left outer join of the new data with the existing table. Save the result of the join into a temporary table first, then move it into the original table with SaveMode.Append.
You might also be interested in using Spark Structured Streaming or Kafka instead of batch jobs. I even found an example on GitHub (I didn't check how well it works, though, or whether it takes existing data into account).
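A minimal sketch of that join, assuming the kudu-spark connector and hypothetical names (the master address kudu-master:7051, the table my_table, and the columns key and ts); note that whether mode("append") inserts or upserts depends on the connector version, so verify against your Kudu setup:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("upsert-sketch").getOrCreate()

    # Existing Kudu table, read through the kudu-spark connector
    # (master address and table name are placeholders).
    existing = (spark.read.format("org.apache.kudu.spark.kudu")
                .option("kudu.master", "kudu-master:7051")
                .option("kudu.table", "my_table")
                .load())
    incoming = spark.read.parquet("/path/to/new_batch")  # the day's new data

    # Left outer join: keep new rows whose key is absent from the table
    # or whose timestamp is strictly more recent than the stored one.
    to_write = (incoming.alias("n")
                .join(existing.select("key", "ts").alias("o"),
                      on=F.col("n.key") == F.col("o.key"), how="left_outer")
                .filter(F.col("o.ts").isNull() | (F.col("n.ts") > F.col("o.ts")))
                .select("n.*"))

    # Stage the result, then append it to the original table.
    to_write.createOrReplaceTempView("staged_upserts")
    (spark.table("staged_upserts").write
          .format("org.apache.kudu.spark.kudu")
          .option("kudu.master", "kudu-master:7051")
          .option("kudu.table", "my_table")
          .mode("append")  # SaveMode.Append, as described above
          .save())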
answered Nov 9 at 13:35 by Bernhard Stadler
edited Nov 9 at 15:12