execution.runtime-mode=BATCH when reading from Hive

execution.runtime-mode=BATCH when reading from Hive

Dongwon Kim
Hi,

Recently I've been working on a real-time stream processing pipeline with the DataStream API while preparing for a new service to launch.
Now it's time to develop a back-fill job that produces the same result by reading data stored on Hive, which we use for long-term storage.

Meanwhile, I watched Aljoscha's talk [1] and just wondered if I could reuse major components of the pipeline written in DataStream API.
The pipeline conceptually looks as follows: 
(A) reads input from Kafka
(B) performs AsyncIO to Redis in order to enrich the input data
(C) appends timestamps and emits watermarks before time-based window
(D) keyBy followed by a session window with a custom trigger for early firing
(E) writes output to Kafka
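
In code, the skeleton looks roughly like this (just a sketch to show the shape; Event, Enriched, Result, EnrichFromRedis, EarlyFiringTrigger, SessionizeFn, and the Kafka source/sink construction are placeholders for our own code):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// (A) read input from Kafka (consumer construction elided)
DataStream<Event> events = env.addSource(kafkaConsumer);

// (B) enrich each record against Redis via AsyncIO
DataStream<Enriched> enriched = AsyncDataStream.unorderedWait(
    events, new EnrichFromRedis(), 1, TimeUnit.SECONDS, 100);

// (C) assign timestamps and emit watermarks before the time-based window
DataStream<Enriched> withTs = enriched.assignTimestampsAndWatermarks(
    WatermarkStrategy.<Enriched>forBoundedOutOfOrderness(Duration.ofSeconds(10))
        .withTimestampAssigner((e, ts) -> e.eventTime));

// (D) keyBy followed by a session window with a custom early-firing trigger
DataStream<Result> results = withTs
    .keyBy(e -> e.key)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .trigger(new EarlyFiringTrigger())
    .process(new SessionizeFn());

// (E) write output to Kafka (producer construction elided)
results.addSink(kafkaProducer);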

I have some simple (maybe stupid) questions about reusing components of a pipeline written with the DataStream API.
(1) By replacing (A) with a bounded source, can I execute the pipeline in the new BATCH execution mode without modifying (B)~(E)?
(2) Is there a bounded source for Hive available to the DataStream API?
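
For context, this is how I understand the new mode gets switched on, based on my reading of the 1.12 snapshot docs (the jar name is just a placeholder):

bin/flink run -Dexecution.runtime-mode=BATCH backfill-job.jar

or, equivalently, in the job itself:

env.setRuntimeMode(RuntimeExecutionMode.BATCH);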

Best,

Dongwon

[1] https://www.youtube.com/watch?v=z9ye4jzp4DQ

Re: execution.runtime-mode=BATCH when reading from Hive

Aljoscha Krettek
Hi Dongwon,

Unfortunately, it's not that easy right now: normal sinks that rely on
checkpointing to write out data, such as Kafka, don't work in BATCH
execution mode because we don't have checkpoints there. It will work,
however, if you use a sink that doesn't rely on checkpointing. The
FlinkKafkaProducer with Semantic.NONE should work, for example.
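
A minimal sketch of that variant (the broker address and topic are
made up here):

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092");

KafkaSerializationSchema<String> schema = (element, timestamp) ->
    new ProducerRecord<>("output-topic",
        element.getBytes(StandardCharsets.UTF_8));

// Semantic.NONE skips the transactional machinery that would need
// checkpoints, so the sink also works under BATCH execution (at the
// cost of weaker delivery guarantees on failure).
FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
    "output-topic", schema, props, FlinkKafkaProducer.Semantic.NONE);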

There is HiveSource, which is built on the new Source API and works
well with both BATCH and STREAMING. It's quite new and was only added
to be used by a Table/SQL connector, but you might have some success
with it.
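
If constructing HiveSource directly turns out to be awkward (it is
meant to be used by the Table/SQL connector internally), one route I'd
expect to work is reading the table through the Table API and bridging
into the DataStream API. A hedged sketch, with a made-up catalog name,
conf dir, and table:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;
import org.apache.flink.types.Row;

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// Register and activate a Hive catalog (names and paths are assumptions).
HiveCatalog hive = new HiveCatalog("myhive", "default", "/etc/hive/conf");
tEnv.registerCatalog("myhive", hive);
tEnv.useCatalog("myhive");

// Read the long-term-storage table and bridge it into the DataStream API.
Table input = tEnv.sqlQuery("SELECT * FROM mydb.events");
DataStream<Row> rows = tEnv.toAppendStream(input, Row.class);

The resulting DataStream<Row> could then feed the rest of your pipeline.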

Best,
Aljoscha

Re: execution.runtime-mode=BATCH when reading from Hive

Dongwon Kim
Hi Aljoscha,

> Unfortunately, it's not that easy right now: normal sinks that rely on
> checkpointing to write out data, such as Kafka, don't work in BATCH
> execution mode because we don't have checkpoints there. It will work,
> however, if you use a sink that doesn't rely on checkpointing. The
> FlinkKafkaProducer with Semantic.NONE should work, for example.
As the output produced to Kafka is eventually stored in Cassandra, I might use a different sink for BATCH execution and write the output directly to Cassandra.
In that case, I'd have to replace both the (A) source and the (E) sink.
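
Something like this is what I have in mind for the Cassandra side (a
sketch; the host, keyspace, table, and tuple layout are placeholders):

import org.apache.flink.streaming.connectors.cassandra.CassandraSink;

// 'results' stands for the output of stage (D), e.g. a
// DataStream<Tuple2<String, Long>> of (key, count) pairs.
CassandraSink.addSink(results)
    .setHost("cassandra-host")
    .setQuery("INSERT INTO my_ks.session_counts (k, cnt) VALUES (?, ?);")
    .build();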

> There is HiveSource, which is built on the new Source API and works
> well with both BATCH and STREAMING. It's quite new and was only added
> to be used by a Table/SQL connector, but you might have some success
> with it.
Oh, this is a new one that will be introduced in the upcoming 1.12 release.
I don't know how I missed it.
I'm going to give it a try :-)

BTW, thanks a lot for the input and the nice presentation - they're very helpful and insightful.

Best,

Dongwon


Re: execution.runtime-mode=BATCH when reading from Hive

Aljoscha Krettek
Thanks! It's good to see that it is helpful to you.

Best,
Aljoscha
