Recommended way to split datastream in PyFlink


Recommended way to split datastream in PyFlink

Sumeet Malhotra
Hi,

I have a use case where I need to process incoming records from a Kafka topic based on a certain record field that defines the record type.

What I'm thinking of doing is splitting the incoming datastream into record-type-specific streams and then applying record-type-specific processing to each. What's the recommended way to achieve this in PyFlink? Are side outputs supported in PyFlink (I couldn't find any reference in the codebase)? Or do I have to replicate the input datastream and then apply record-specific filters?

Thanks,
Sumeet

Re: Recommended way to split datastream in PyFlink

Dian Fu
Hi Sumeet,

Side outputs are not yet supported in PyFlink.

>> Or do I have to replicate the input datastream and then apply record-specific filters?
I'm afraid so, yes.

Regards,
Dian
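
A minimal sketch of this filter-based approach, assuming records arrive as `(record_type, payload)` tuples; the type names and stream names below are illustrative, and the Kafka source is stubbed with `from_collection`:

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Stand-in for the Kafka source: each record carries its type in field 0.
ds = env.from_collection(
    [('type_a', 'payload-1'), ('type_b', 'payload-2'), ('type_a', 'payload-3')],
    type_info=Types.TUPLE([Types.STRING(), Types.STRING()]))

# "Split" the stream by applying one filter per record type; each branch
# reads the same input and keeps only the records it cares about.
stream_a = ds.filter(lambda record: record[0] == 'type_a')
stream_b = ds.filter(lambda record: record[0] == 'type_b')

# Record-type-specific processing on each branch.
stream_a.map(lambda r: 'A: ' + r[1], output_type=Types.STRING()).print()
stream_b.map(lambda r: 'B: ' + r[1], output_type=Types.STRING()).print()

env.execute('split_by_filter')
```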



Re: Recommended way to split datastream in PyFlink

Sumeet Malhotra
Thanks Dian.

Another question I have: since the PyFlink DataStream API still doesn't have native window support, what's the recommended way to introduce windows? Use the PyFlink Table API for windowing in conjunction with the DataStream API? For example, read input records from Kafka into a table and then use `StreamTableEnvironment.to_apend_table(...)` to push records into a data stream?

Thanks,
Sumeet



Re: Recommended way to split datastream in PyFlink

Sumeet Malhotra
Apologies. I meant `StreamTableEnvironment.to_append_stream` in my last message.


Re: Recommended way to split datastream in PyFlink

Dian Fu
Hi Sumeet,

Yes, you are right. Mixing the PyFlink DataStream API and the PyFlink Table API is supported. For now, it's recommended to use the PyFlink Table API for window operations until they are supported in the PyFlink DataStream API.

Regards,
Dian
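
A minimal sketch of this combination: a tumbling-window aggregation in the Table API whose (append-only) result is converted back to a DataStream with `to_append_stream`. The `datagen` source, schema, and field names below stand in for a real Kafka-backed table and are assumptions for illustration:

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment
from pyflink.table.expressions import col, lit
from pyflink.table.window import Tumble

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Hypothetical source table; in practice this would be Kafka-backed,
# i.e. WITH ('connector' = 'kafka', ...) instead of datagen.
t_env.execute_sql("""
    CREATE TABLE input_events (
        record_type STRING,
        val DOUBLE,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '10'
    )
""")

# Windowed aggregation in the Table API: tumbling 1-minute event-time windows.
windowed = (
    t_env.from_path('input_events')
         .window(Tumble.over(lit(1).minutes).on(col('ts')).alias('w'))
         .group_by(col('w'), col('record_type'))
         .select(col('record_type'), col('val').sum.alias('total')))

# Tumbling-window aggregates are append-only, so the result can be
# converted to a DataStream for further DataStream API processing.
ds = t_env.to_append_stream(
    windowed,
    Types.ROW([Types.STRING(), Types.DOUBLE()]))
ds.print()

env.execute('table_windows_to_datastream')
```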
