Recommended way to split datastream in PyFlink


Recommended way to split datastream in PyFlink

Sumeet Malhotra
Hi,

I have a use case where I need to process incoming records from a Kafka topic based on a certain record field that defines the record type.

What I'm thinking of doing is splitting the incoming datastream into record-type-specific streams and then applying record-type-specific processing to each. What's the recommended way to achieve this in PyFlink? Are side outputs supported in PyFlink (I couldn't find any reference in the codebase)? Or do I have to replicate the input datastream and then apply record-specific filters?

Thanks,
Sumeet

Re: Recommended way to split datastream in PyFlink

Dian Fu
Hi Sumeet,

Side outputs are not yet supported in PyFlink.

>> Or do I have to replicate the input datastream and then apply record-specific filters?
I'm afraid so, yes.

Regards,
Dian
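
A minimal sketch of this filter-based approach, assuming records arrive as `(record_type, payload)` tuples; the type names and stream names below are illustrative, and the Kafka source is stubbed with `from_collection`:

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Stand-in for the Kafka source: each record carries its type in field 0.
ds = env.from_collection(
    [('type_a', 'payload-1'), ('type_b', 'payload-2'), ('type_a', 'payload-3')],
    type_info=Types.TUPLE([Types.STRING(), Types.STRING()]))

# "Split" the stream by applying one filter per record type; each branch
# reads the same input and keeps only the records it cares about.
stream_a = ds.filter(lambda record: record[0] == 'type_a')
stream_b = ds.filter(lambda record: record[0] == 'type_b')

# Record-type-specific processing on each branch.
stream_a.map(lambda r: 'A: ' + r[1], output_type=Types.STRING()).print()
stream_b.map(lambda r: 'B: ' + r[1], output_type=Types.STRING()).print()

env.execute('split_by_filter')
```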



Re: Recommended way to split datastream in PyFlink

Sumeet Malhotra
Thanks Dian.

Another question I have: since the PyFlink DataStream API still doesn't have native window support, what's the recommended way to introduce windows? Use the PyFlink Table API for windowing in conjunction with the DataStream API? For example, read input records from Kafka into a table and then use `StreamTableEnvironment.to_apend_table(...)` to push records into a data stream?

Thanks,
Sumeet



Re: Recommended way to split datastream in PyFlink

Sumeet Malhotra
Apologies. I meant `StreamTableEnvironment.to_append_stream` in my last message.


Re: Recommended way to split datastream in PyFlink

Dian Fu
Hi Sumeet,

Yes, you are right. Mixing the PyFlink DataStream API and the PyFlink Table API is supported. For now, it's recommended to use the PyFlink Table API for window operations until they are supported in the PyFlink DataStream API.

Regards,
Dian
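
A minimal sketch of this combination: a tumbling-window aggregation in the Table API whose (append-only) result is converted back to a DataStream with `to_append_stream`. The `datagen` source, schema, and field names below stand in for a real Kafka-backed table and are assumptions for illustration:

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment
from pyflink.table.expressions import col, lit
from pyflink.table.window import Tumble

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Hypothetical source table; in practice this would be Kafka-backed,
# i.e. WITH ('connector' = 'kafka', ...) instead of datagen.
t_env.execute_sql("""
    CREATE TABLE input_events (
        record_type STRING,
        val DOUBLE,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '10'
    )
""")

# Windowed aggregation in the Table API: tumbling 1-minute event-time windows.
windowed = (
    t_env.from_path('input_events')
         .window(Tumble.over(lit(1).minutes).on(col('ts')).alias('w'))
         .group_by(col('w'), col('record_type'))
         .select(col('record_type'), col('val').sum.alias('total')))

# Tumbling-window aggregates are append-only, so the result can be
# converted to a DataStream for further DataStream API processing.
ds = t_env.to_append_stream(
    windowed,
    Types.ROW([Types.STRING(), Types.DOUBLE()]))
ds.print()

env.execute('table_windows_to_datastream')
```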
