Dynamically merge multiple upstream sources


Dynamically merge multiple upstream sources

uuuuuu
Hi all,

I have a use case where I want to merge several upstream data sources (Kafka topics). The data are essentially the same,
but they have different field names.

I guess my problem is not so much about Flink itself; it is about how to deserialize and merge the different data effectively with Flink.
I can define a separate schema per source, deserialize each one, and merge them manually, but I wonder whether there is a more dynamic way to do this. Ideally,
changing field names would work like renaming pandas DataFrame columns. I see this is already possible with pandas,
but resorting to pandas implies I need to work with Python, which is something I prefer not to do.
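(For reference, this is roughly the manual approach I mean, sketched with made-up topic and field names: each topic gets its own field mapping and the streams are unioned afterwards.)

import java.util.Properties;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class ManualMergeSketch {

    /** Common record type that every source is mapped onto. */
    public static class Event {
        public String userId;
        public double amount;
        public Event() {}
        public Event(String userId, double amount) {
            this.userId = userId;
            this.amount = amount;
        }
    }

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");  // placeholder
        props.setProperty("group.id", "manual-merge");

        // Source A names the fields "user_id"/"value", source B uses "uid"/"amt";
        // each topic is parsed with its own mapping and converted to the common type.
        DataStream<Event> streamA = env
            .addSource(new FlinkKafkaConsumer<>("topic_a", new SimpleStringSchema(), props))
            .map(json -> parse(json, "user_id", "value")).returns(Event.class);

        DataStream<Event> streamB = env
            .addSource(new FlinkKafkaConsumer<>("topic_b", new SimpleStringSchema(), props))
            .map(json -> parse(json, "uid", "amt")).returns(Event.class);

        streamA.union(streamB).print();   // downstream logic sees a single unified stream
        env.execute("manual-merge");
    }

    // Per-source field mapping done by hand; this is exactly the boilerplate I'd like to avoid.
    private static Event parse(String json, String userField, String amountField) throws Exception {
        JsonNode node = MAPPER.readTree(json);
        return new Event(node.get(userField).asText(), node.get(amountField).asDouble());
    }
}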

What is your practice for handling dynamically changing sources and merging them? I'd love to hear your opinions.

Best,
Yi

Re: Dynamically merge multiple upstream sources

Arvid Heise-3
Hi Yi,

One option is to use Avro: you define one global Avro schema as the source of truth, then add aliases [1] to this schema for each source whose fields are named differently. You use the same schema to read the Avro messages from Kafka, and Avro automatically converts the data from its writer schema (stored in a schema registry) to your global schema.
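A minimal sketch of how this could look (assuming a Confluent schema registry; the topic names, field names, and URLs are made up for illustration): the aliases live in the global reader schema, and the same deserialization schema is used for every topic, so all records arrive with the unified field names.

import java.util.Arrays;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class UnifiedAvroIngestion {

    public static void main(String[] args) throws Exception {
        // Global reader schema; the "aliases" entries map the field names used by
        // the other sources onto the canonical names during deserialization.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"userId\",\"type\":\"string\",\"aliases\":[\"user_id\",\"uid\"]},"
            + "{\"name\":\"amount\",\"type\":\"double\",\"aliases\":[\"value\",\"amt\"]}"
            + "]}");

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");       // placeholder
        props.setProperty("group.id", "unified-ingestion");

        // Each topic's writer schema is fetched from the schema registry and resolved
        // against the reader schema (aliases, defaults, type promotions).
        FlinkKafkaConsumer<GenericRecord> consumer = new FlinkKafkaConsumer<>(
            Arrays.asList("topic_a", "topic_b", "topic_c"),          // the differently-named sources
            ConfluentRegistryAvroDeserializationSchema.forGeneric(
                readerSchema, "http://schema-registry:8081"),        // placeholder URL
            props);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(consumer)
           .map(record -> record.get("userId").toString())           // field names already unified
           .print();

        env.execute("merge-upstream-sources");
    }
}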

If you pay close attention to the requirements of Avro schema evolution [2], you can even add and remove fields in the individual sources without changing anything programmatically in your Flink ingestion job.
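For example (again with made-up field names), a newly added field just needs a default value in the global schema; sources that do not write it yet still resolve cleanly, and the ingestion job stays untouched:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class EvolvedSchema {

    // Evolved global reader schema: "currency" is new and carries a default, so records
    // from sources that don't write it yet still deserialize. Extra fields that only
    // exist in a source's writer schema are ignored on the reader side.
    public static Schema readerSchema() {
        return SchemaBuilder.record("Event").fields()
            .name("userId").aliases("user_id", "uid").type().stringType().noDefault()
            .name("amount").aliases("value", "amt").type().doubleType().noDefault()
            .name("currency").type().stringType().stringDefault("EUR")
            .endRecord();
    }
}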




--

Arvid Heise | Senior Java Developer


Ververica GmbH