Hi all,

I have a use case where I want to merge several upstream data sources (Kafka topics). The data are essentially the same, but the sources use different field names. I guess my problem is not so much about Flink itself; it is about how to deserialize and merge data from these sources effectively with Flink. I could define a separate schema for each source, deserialize them, and merge everything manually. I wonder if there is a more dynamic way to do this, where renaming fields works like renaming pandas DataFrame columns. Resorting to pandas, though, implies working in Python, which is something I prefer not to do. What is your practice for dynamically adapting sources and merging them? I'd love to hear your opinions.

Best,
Yi
Hi Yi,

One option is to use Avro, where you define one global Avro schema as the source of truth. Then you add aliases [1] to this schema for each source whose fields are named differently. You use the same schema to read the Avro messages from Kafka, and Avro automatically converts data written with each source's writer schema (stored in some schema registry) to your global schema. If you pay close attention to Avro's schema evolution rules [2], you can easily add and remove fields in the different sources without changing anything programmatically in your Flink ingestion job.
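To make the alias mechanism concrete, here is a minimal standalone sketch (plain Avro, outside Flink) showing how a reader schema with a field alias resolves a record written under a differently named writer schema. The record and field names ("Event", "userId", "user_id") are made up for illustration; in practice the writer schema would come from your schema registry rather than being inlined.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroAliasSketch {

    // Hypothetical writer schema of one upstream topic: the field is named "userId".
    private static final String WRITER_JSON =
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"userId\",\"type\":\"string\"},"
      + "{\"name\":\"ts\",\"type\":\"long\"}]}";

    // Global reader schema: the canonical field "user_id" lists "userId" as an
    // alias, so Avro schema resolution maps the writer's field onto it.
    private static final String READER_JSON =
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"user_id\",\"type\":\"string\",\"aliases\":[\"userId\"]},"
      + "{\"name\":\"ts\",\"type\":\"long\"}]}";

    public static void main(String[] args) throws Exception {
        Schema writerSchema = new Schema.Parser().parse(WRITER_JSON);
        Schema readerSchema = new Schema.Parser().parse(READER_JSON);

        // Encode one record with the writer schema, as a producer on that topic would.
        GenericRecord in = new GenericData.Record(writerSchema);
        in.put("userId", "yi");
        in.put("ts", 42L);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(bos, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(in, enc);
        enc.flush();

        // Decode with (writerSchema, readerSchema): the alias resolves the rename,
        // so all sources come out under the global field names.
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<>(writerSchema, readerSchema);
        GenericRecord out =
            reader.read(null, DecoderFactory.get().binaryDecoder(bos.toByteArray(), null));
        System.out.println(out.get("user_id")); // prints "yi"
    }
}

In a Flink job you would put the same (writerSchema, readerSchema) resolution inside your Kafka DeserializationSchema, one reader schema shared by all topics, so the merged stream downstream only ever sees the global field names.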
--
Arvid Heise | Senior Java Developer

Follow us @VervericaData

--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time

--
Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng