Dynamically merge multiple upstream sources


Dynamically merge multiple upstream sources

uuuuuu
Hi all,

I have a use case where I want to merge several upstream data sources (Kafka topics). The data are essentially the same,
but they have different field names.

I guess my problem is not so much about Flink itself; it is about how to deserialize and merge the different data effectively with Flink.
I can define a separate schema per source, deserialize each one, and merge them manually, but I wonder whether there is a more dynamic way to do this. Ideally,
changing field names would work like renaming pandas DataFrame columns. I see this is already possible with pandas,
but resorting to pandas implies I need to work with Python, which is something I prefer not to do.
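(For reference, this is roughly the manual approach I mean, sketched with made-up topic and field names: each topic gets its own field mapping and the streams are unioned afterwards.)

import java.util.Properties;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class ManualMergeSketch {

    /** Common record type that every source is mapped onto. */
    public static class Event {
        public String userId;
        public double amount;
        public Event() {}
        public Event(String userId, double amount) {
            this.userId = userId;
            this.amount = amount;
        }
    }

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");  // placeholder
        props.setProperty("group.id", "manual-merge");

        // Source A names the fields "user_id"/"value", source B uses "uid"/"amt";
        // each topic is parsed with its own mapping and converted to the common type.
        DataStream<Event> streamA = env
            .addSource(new FlinkKafkaConsumer<>("topic_a", new SimpleStringSchema(), props))
            .map(json -> parse(json, "user_id", "value")).returns(Event.class);

        DataStream<Event> streamB = env
            .addSource(new FlinkKafkaConsumer<>("topic_b", new SimpleStringSchema(), props))
            .map(json -> parse(json, "uid", "amt")).returns(Event.class);

        streamA.union(streamB).print();   // downstream logic sees a single unified stream
        env.execute("manual-merge");
    }

    // Per-source field mapping done by hand; this is exactly the boilerplate I'd like to avoid.
    private static Event parse(String json, String userField, String amountField) throws Exception {
        JsonNode node = MAPPER.readTree(json);
        return new Event(node.get(userField).asText(), node.get(amountField).asDouble());
    }
}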

What is your practice for handling dynamically changing sources and merging them? I'd love to hear your opinions.

Best,
Yi

Re: Dynamically merge multiple upstream sources

Arvid Heise-3
Hi Yi,

One option is to use Avro: you define one global Avro schema as the source of truth, then add aliases [1] to this schema for each source whose fields are named differently. You use the same schema to read the Avro messages from Kafka, and Avro automatically converts the data from its writer schema (stored in a schema registry) to your global schema.
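A minimal sketch of how this could look (assuming a Confluent schema registry; the topic names, field names, and URLs are made up for illustration): the aliases live in the global reader schema, and the same deserialization schema is used for every topic, so all records arrive with the unified field names.

import java.util.Arrays;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class UnifiedAvroIngestion {

    public static void main(String[] args) throws Exception {
        // Global reader schema; the "aliases" entries map the field names used by
        // the other sources onto the canonical names during deserialization.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"userId\",\"type\":\"string\",\"aliases\":[\"user_id\",\"uid\"]},"
            + "{\"name\":\"amount\",\"type\":\"double\",\"aliases\":[\"value\",\"amt\"]}"
            + "]}");

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");       // placeholder
        props.setProperty("group.id", "unified-ingestion");

        // Each topic's writer schema is fetched from the schema registry and resolved
        // against the reader schema (aliases, defaults, type promotions).
        FlinkKafkaConsumer<GenericRecord> consumer = new FlinkKafkaConsumer<>(
            Arrays.asList("topic_a", "topic_b", "topic_c"),          // the differently-named sources
            ConfluentRegistryAvroDeserializationSchema.forGeneric(
                readerSchema, "http://schema-registry:8081"),        // placeholder URL
            props);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(consumer)
           .map(record -> record.get("userId").toString())           // field names already unified
           .print();

        env.execute("merge-upstream-sources");
    }
}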

If you pay close attention to the requirements of Avro schema evolution [2], you can even add and remove fields in the individual sources without changing anything programmatically in your Flink ingestion job.
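For example (again with made-up field names), a newly added field just needs a default value in the global schema; sources that do not write it yet still resolve cleanly, and the ingestion job stays untouched:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class EvolvedSchema {

    // Evolved global reader schema: "currency" is new and carries a default, so records
    // from sources that don't write it yet still deserialize. Extra fields that only
    // exist in a source's writer schema are ignored on the reader side.
    public static Schema readerSchema() {
        return SchemaBuilder.record("Event").fields()
            .name("userId").aliases("user_id", "uid").type().stringType().noDefault()
            .name("amount").aliases("value", "amt").type().doubleType().noDefault()
            .name("currency").type().stringType().stringDefault("EUR")
            .endRecord();
    }
}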




--

Arvid Heise | Senior Java Developer


Ververica GmbH