Ingesting unique records in Kafka-Spark Streaming
I have a Kafka topic receiving 10K events per minute and a Spark Streaming 2.3 consumer written in Scala that receives them and ingests them into Cassandra. Incoming events are JSON with a 'userid' field among others. If an event with the same userid arrives again (even with a different message body), I don't want it ingested into Cassandra. The Cassandra table grows every minute and every day, so looking up all userids seen so far by pulling the table into an in-memory Spark DataFrame is not feasible, as the table becomes huge. What is the best way to ingest only unique records?
Can updateStateByKey work here? And how long can that state be maintained? Even if the same userid arrives after one year, I don't want to ingest it into Cassandra.
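For reference, this is roughly the stateful approach I was wondering about. It is only a minimal sketch using mapWithState (the newer API that replaces updateStateByKey and supports timeouts); the (userid, body) pairs and the retention duration are placeholders, not my actual code:

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

object DedupSketch {
  // Sketch only: `events` is assumed to be a DStream[(String, String)] of
  // (userid, jsonBody) pairs already extracted from the Kafka direct stream,
  // and the StreamingContext has a checkpoint directory set (mapWithState needs it).
  def dedupByUserId(events: DStream[(String, String)]): DStream[(String, String)] = {
    val spec = StateSpec.function(
      (userId: String, body: Option[String], state: State[Boolean]) => {
        if (state.exists()) None                                // userid already seen: drop
        else { state.update(true); body.map(b => (userId, b)) } // first occurrence: emit
      }
    ).timeout(Minutes(60L * 24 * 365))                          // placeholder retention (~1 year)

    events.mapWithState(spec).flatMap(_.toList)                 // keep only first-seen events
  }
}
```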
scala cassandra apache-kafka spark-streaming
edited Nov 11 at 23:45
asked Nov 8 at 19:26
Steven Park
1 Answer
Use an external low-latency DB like Aerospike, or, if the rate of duplicates is low, an in-memory Bloom/cuckoo filter (roughly 4GB for 1 year at a 10K per minute rate), rechecking matches against Cassandra so that events are not discarded on false positives.
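A minimal sketch of the filter-plus-recheck idea, assuming Guava's BloomFilter as one possible implementation; existsInCassandra is a hypothetical stand-in for whatever single-key read you already do:

```scala
import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

object BloomDedup {
  // Hypothetical placeholder: swap in a real single-key read, e.g. a prepared
  // SELECT via the DataStax driver or the spark-cassandra-connector.
  def existsInCassandra(userId: String): Boolean = ???

  // Capacity for ~1 year at 10K events/min; tune the false-positive rate vs. memory.
  val expectedIds: Long = 10000L * 60 * 24 * 365
  val seenIds = BloomFilter.create(
    Funnels.stringFunnel(StandardCharsets.UTF_8), expectedIds, 0.01)

  def shouldIngest(userId: String): Boolean =
    if (!seenIds.mightContain(userId)) {
      seenIds.put(userId)        // definitely new: remember it and ingest
      true
    } else {
      !existsInCassandra(userId) // possible false positive: confirm before discarding
    }
}
```

Note that this sketch keeps the filter in a single JVM; in a streaming job it would typically live on the driver (or be rebuilt per executor), with Cassandra remaining the source of truth on a positive hit.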
answered Nov 8 at 21:23
Andriy Plokhotnyuk
Will the Bloom filter then contain all the userids seen to date? So with every streaming interval the Bloom filter keeps growing as the newly encountered ids are appended to it?
– Steven Park
Nov 11 at 22:17