Hi,
I have implemented a Flink job with MapStates; the functionality is as in the snapshot of the implementation I shared.
I need help with how to store this state data in RocksDB, and with the setup, configuration, and code for it, which I am not understanding. Also, is it possible to run a batch job on the RocksDB state data?
Help will be highly appreciated.
Thanks,
Jaswin
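(The code snapshot itself is not preserved in the thread. Judging from the replies below, the job connects two streams in a KeyedCoProcessFunction and buffers unmatched records in MapState until the counterpart arrives. A minimal sketch of that shape, with hypothetical Order and Payment types, might look like this:)

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical reconciliation function; Order and Payment are stand-in POJOs.
public class MatchFunction extends KeyedCoProcessFunction<String, Order, Payment, String> {

    private transient MapState<String, Order> orderState;
    private transient MapState<String, Payment> paymentState;

    @Override
    public void open(Configuration parameters) {
        // MapState is scoped to the current key and lives in the configured state backend.
        orderState = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("orders", String.class, Order.class));
        paymentState = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("payments", String.class, Payment.class));
    }

    @Override
    public void processElement1(Order order, Context ctx, Collector<String> out) throws Exception {
        Payment match = paymentState.get(order.getId());
        if (match != null) {
            out.collect("matched: " + order.getId());
            paymentState.remove(order.getId());
        } else {
            orderState.put(order.getId(), order);  // buffer until the payment arrives
        }
    }

    @Override
    public void processElement2(Payment payment, Context ctx, Collector<String> out) throws Exception {
        Order match = orderState.get(payment.getOrderId());
        if (match != null) {
            out.collect("matched: " + payment.getOrderId());
            orderState.remove(payment.getOrderId());
        } else {
            paymentState.put(payment.getOrderId(), payment);  // buffer until the order arrives
        }
    }
}
```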
Hi Jaswin,
I'd discourage using RocksDB directly; it's more of an implementation detail of Flink. I'd also discourage writing to Kafka directly without using our Kafka sink, as you will receive duplicates upon recovery.
If you run the KeyedCoProcessFunction continuously anyway, I'd register a timer (2 days?) [1] for all unmatched records and, when the timer fires, output the record through a side output [2], where you do your batch logic. Then you don't need a separate batch job to clean that up. If you actually want to output to Kafka for some other application, you just need to stream the side output to a KafkaProducer.
-- Arvid Heise | Senior Java Developer, Ververica GmbH
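(As a rough illustration of that approach: the tag name and the two-day constant below are illustrative, not from the thread. The sketch above could be extended with a per-key cleanup timer and a side output:)

```java
import org.apache.flink.util.OutputTag;

// Hypothetical tag for records that never found a match.
public static final OutputTag<Order> UNMATCHED = new OutputTag<Order>("unmatched") {};

private static final long TWO_DAYS_MS = 2L * 24 * 60 * 60 * 1000;

// Inside processElement1/processElement2, next to the MapState.put(...):
ctx.timerService().registerProcessingTimeTimer(
        ctx.timerService().currentProcessingTime() + TWO_DAYS_MS);

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
    // Everything still buffered for this key after two days goes to the side output.
    for (Order order : orderState.values()) {
        ctx.output(UNMATCHED, order);
    }
    orderState.clear();
    paymentState.clear();
}
```

(The main pipeline then pulls the side output out of the operator, and could hand it to Flink's Kafka sink rather than a hand-rolled producer:)

```java
SingleOutputStreamOperator<String> matched = orders
        .connect(payments)
        .keyBy(Order::getId, Payment::getOrderId)
        .process(new MatchFunction());

// Feed unmatched records to the batch logic, or to a Kafka sink.
DataStream<Order> unmatched = matched.getSideOutput(MatchFunction.UNMATCHED);
```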
Hi Jaswin,
As Arvid suggested, it's not encouraged to query the internal RocksDB directly. Apart from Arvid's solution, I think queryable state [1] might also help you: if you just want to know which entries are left in both map states after several days, querying the state should meet that need. Please refer to the official docs and this example [2] for more details.
Best,
Yun Tang
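(For what that could look like: the external state name, host, port, and job id below are placeholders, and queryable state additionally requires the flink-queryable-state-runtime jar in Flink's lib/ directory.)

```java
import java.util.concurrent.CompletableFuture;
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.queryablestate.client.QueryableStateClient;

// Job side: expose the map state under an external name.
MapStateDescriptor<String, Order> descriptor =
        new MapStateDescriptor<>("orders", String.class, Order.class);
descriptor.setQueryable("pending-orders");

// Client side: ask a running job for the state of one key
// (9069 is the default queryable-state proxy port).
QueryableStateClient client = new QueryableStateClient("<taskmanager-host>", 9069);
CompletableFuture<MapState<String, Order>> future = client.getKvState(
        JobID.fromHexString("<job-id>"),
        "pending-orders",
        "some-key",                       // key whose leftover entries you want to inspect
        BasicTypeInfo.STRING_TYPE_INFO,
        descriptor);
```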
Hi,
Flink stores state in a StateBackend. There are two kinds of StateBackend: HeapStateBackend, which stores state on the JVM heap, and RocksDBStateBackend, which stores state in RocksDB. You can enable RocksDB in either of the following ways [1]:
1. add `env.setStateBackend(...);` in your code, or
2. add the configuration `state.backend: rocksdb` in `flink-conf.yaml`.
Best,
Congxian
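(A minimal sketch of the first option; the checkpoint URI and job name are placeholders, and `RocksDBStateBackend` comes from the `flink-statebackend-rocksdb` dependency:)

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep working state in local RocksDB instances and snapshot it to the
        // (placeholder) checkpoint path; `true` enables incremental checkpoints.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // ... define the pipeline, then:
        env.execute("reconciliation-job");
    }
}
```

(The config-file route is equivalent: set `state.backend: rocksdb` and point `state.checkpoints.dir` at the same kind of path in `flink-conf.yaml`.)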
Thanks Yun and Arvid.
Just a question: is it possible to have batch execution inside the same streaming job? You mean that I should collect the missing messages from both streams in a side output on timer expiry, and then execute a batch job on that side output, since the side output is shared with the same streaming job I already have. Basically, I need the missing-message info outside the job.
Hi Jaswin,
You cannot run a DataSet program inside a DataStream program. However, you can perform the same query on a windowed stream. So if you would execute the batchy part every day, you can just create a tumbling window of 24 hours and then perform your batchy analysis on that time window.
Alternatively, you can dump the data into Kafka or a file system and then run the batchy part as a separate program.
-- Arvid Heise
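(A sketch of that daily window over the hypothetical `unmatched` stream from earlier; the analysis inside the window function is a stand-in:)

```java
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

unmatched
        .keyBy(Order::getId)
        .window(TumblingProcessingTimeWindows.of(Time.days(1)))
        .process(new ProcessWindowFunction<Order, String, String, TimeWindow>() {
            @Override
            public void process(String key, Context ctx,
                                Iterable<Order> records, Collector<String> out) {
                // A full day's unmatched records arrive here together, so the
                // "batchy" analysis runs over them like a small batch job.
                int count = 0;
                for (Order ignored : records) {
                    count++;
                }
                out.collect(key + ": " + count + " unmatched records in this window");
            }
        });
```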
If I create such a large tumbling window, the data will stay in memory for a long time until the window is triggered, right? So won't there be a possibility of data loss, or would Flink recover it in case of an outage?
If you enabled checkpointing (which is strongly recommended) [1], no data is lost.
-- Arvid Heise
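(Turning it on is a one-liner; the 60-second interval here is an illustrative choice, not from the thread:)

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Snapshot all operator state (window contents, timers, MapState) every
// 60 seconds; on failure the job restores from the latest completed checkpoint.
env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);
```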
Okay, so on checkpointing, the window's data would also be persisted.