Flink Serialization as stable (kafka) output format?


Flink Serialization as stable (kafka) output format?

Theo
Hi,

Without knowing too much about Flink serialization, I know that Flink states that it serializes POJO types much faster than even the fast Kryo for Java. I further know that it supports schema evolution in much the same way as Avro.

In our project, we have a star architecture where one Flink job produces results into a Kafka topic and multiple downstream consumers read from that topic (mostly other Flink jobs).
For fast development cycles, we currently use JSON as the output format for the Kafka topic because of its easy debugging and good migration properties. However, when scaling up, we need to switch to a more efficient format. Most often, Avro is mentioned in combination with a schema registry, as it's much more efficient than JSON, where essentially each message carries the schema as well. However, in most benchmarks, Avro turns out to be rather slow in terms of CPU cycles (e.g. [1]).

My question(s) now:
1. Is it reasonable to use Flink's serializers as the message format in Kafka?
2. Are there any downsides to using Flink's serialization output as the message format in Kafka?
3. Can downstream consumers written in Java, but not as Flink components, also easily deserialize Flink-serialized POJOs? Or would they need a dependency on at least the full flink-core?
4. Do you have benchmarks comparing Flink (de-)serialization performance to, e.g., Kryo and Avro?

The only argument against Flink serialization I can come up with is that we wouldn't have a schema registry. In our case, however, we share all our POJOs in a jar that is used by all components, so that already acts as a kind of schema registry. And if we only make Avro-compatible changes, which Flink also handles well, that shouldn't be a limitation compared to Avro plus a registry, should it?
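For illustration, here is a sketch of the kind of change I have in mind in our shared POJO jar (ClickEvent is just a made-up example class; as far as I understand, Flink's POJO serializer can evolve the schema when public fields are added or removed, similar to an Avro field with a default):

public class ClickEvent {
    public String userId;
    public long timestamp;

    // v2: newly added field. If I read the docs correctly, old records
    // deserialize with this field left at its Java default (null here).
    public String sessionId;

    // public no-arg constructor so the class still counts as a Flink POJO
    public ClickEvent() {}
}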

Best regards
Theo


Re: Flink Serialization as stable (kafka) output format?

rmetzger0
Hi Theo,

> However, in most benchmarks, avro turns out to be rather slow in terms of CPU cycles (e.g. [1])

Avro is slower compared to what?
You should not only benchmark the CPU cycles for serializing the data. If you are sending JSON strings, you'll probably have a lot more bytes to push across the network, making everything slower (usually the network is slower than the CPU).

One of the reasons people use Avro is that it supports schema evolution.

Regarding your questions:
1. For this use case, you can use the Flink data format as an internal message format (between the jobs of your star architecture).
2. Generally speaking, no.
3. You will at least have a dependency on flink-core. And this is a somewhat custom setup, so you might be facing breaking API changes (see the sketch below for roughly what a consumer would need).
4. I'm not aware of any benchmarks. The Flink serializers are mostly for internal use (between our operators), Kryo is our fallback (to not suffer too much from the not-invented-here syndrome), while Avro is meant for cross-system serialization.
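To make points 1 and 3 a bit more concrete, here is a rough, untested sketch of what a plain Java consumer would need in order to read records written with the POJO serializer (MyPojo is a placeholder for one of your shared classes; the classes used here all come from flink-core):

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.core.memory.DataInputDeserializer;
import org.apache.flink.core.memory.DataOutputSerializer;

TypeSerializer<MyPojo> serializer =
        TypeInformation.of(MyPojo.class).createSerializer(new ExecutionConfig());

// producer side (e.g. inside a Kafka SerializationSchema)
DataOutputSerializer out = new DataOutputSerializer(64);
serializer.serialize(pojo, out);
byte[] bytes = out.getCopyOfBuffer();

// consumer side (any Java process with flink-core and the shared POJO jar on the classpath)
DataInputDeserializer in = new DataInputDeserializer(bytes);
MyPojo restored = serializer.deserialize(in);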

I have the feeling that you can move ahead with using Flink's Pojo serializer everywhere :)

Best,
Robert

Re: Flink Serialization as stable (kafka) output format?

Arvid Heise-3
Hi Theo,

I strongly discourage the use of Flink serialization for persistent storage of data. It was never intended to be used this way and does not offer Avro's benefits of lazy schema evolution and maturity.

Unless you can explicitly measure that Avro is a bottleneck in your setup, stick with it. It's the preferred way to store data in Kafka for a reason. It's mature, supports plenty of languages, and the schema evolution feature will save you so many headaches in the future.

If it turns out to be a bottleneck, the most logical alternative is Protobuf. Kryo is even worse than the Flink serializer for Kafka. In general, realistically speaking, it's much more cost-effective to just add another node to your Flink cluster and use Avro than to come up with any clever solution (just assume that you need at least one man-month to implement it and do the math).

And btw, for Avro you should always use generated Java/Scala classes if possible. They are faster and offer a much nicer development experience.
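As a rough sketch (ClickEvent standing in for a class generated by the Avro compiler from your .avsc schema), a specific record serializes like this and avoids the reflection overhead entirely:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

byte[] toAvroBytes(ClickEvent event) throws IOException {
    // writer built from the compiled schema, no runtime reflection involved
    SpecificDatumWriter<ClickEvent> writer = new SpecificDatumWriter<>(ClickEvent.class);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(event, encoder);
    encoder.flush();
    return out.toByteArray();
}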


Re: Flink Serialization as stable (kafka) output format?

rmetzger0
Hey Theo,

we recently published a blog post that answers your request for a comparison between Kryo and Avro in Flink: https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html


Re: Flink Serialization as stable (kafka) output format?

Theo
Hi Robert,

Thank you very much for pointing me to the nice blog post.

It aligns with my reading that the Flink serializer is fast, outperforms Avro (especially Avro reflect) and still supports schema evolution well. Nicely done, Flink :)

But as Arvid says, Avro is compatible with many more languages than just the JVM ones. Who knows how my project will grow in the future and whether people will want to connect to Kafka with Python/Go/whatever... As long as you don't recommend it, I won't jump into using Flink as an external serializer but will keep using the PojoSerializer internally everywhere :)

I'll definitely also keep in mind that Avro reflect performs much worse compared to Avro specific/generic than I expected.
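For anyone finding this thread later, here is a small untested sketch of the ExecutionConfig switches I plan to use to reproduce the comparison on our own POJOs (if I read the blog post correctly, these are the relevant knobs):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// fail fast if a type silently falls back to the Kryo generic serializer
env.getConfig().disableGenericTypes();

// for comparison runs only:
// env.getConfig().enableForceAvro();  // force Avro (reflect) instead of the POJO serializer
// env.getConfig().enableForceKryo();  // force Kryo everywhere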

Best regards
Theo



Von: "Robert Metzger" <[hidden email]>
An: "Arvid Heise" <[hidden email]>
CC: "Theo Diefenthal" <[hidden email]>, "user" <[hidden email]>
Gesendet: Sonntag, 19. April 2020 08:23:42
Betreff: Re: Flink Serialization as stable (kafka) output format?

Hey Theo,
we recently published a blog post that answers your request for a comparison between Kryo and Avro in Flink: https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html

On Tue, Mar 10, 2020 at 9:27 AM Arvid Heise <[hidden email]> wrote:
Hi Theo,

I strongly discourage the use of flink serialization for persistent storage of data. It was never intended to work in this way and does not offer the benefits of Avro of lazy schema evolution and maturity.

Unless you can explicitly measure that Avro is a bottleneck in your setup, stick with it. It's the preferred way to store data in Kafka for a reason. It's mature, supports plenty of languages, and the schema evolution feature will save you so many headaches in the future.

If it turns out to be a bottleneck, the most logical alternative is protobuf. Kryo is even worse than Flink serializer for Kafka. In general, realistically speaking, it's so much more cost-effective to just add another node to your Flink cluster and use Avro than coming up with any clever solution (just assume that you need at least one man month to implement and do the math).

And btw, you should always use generated Java/scala classes if possible for Avro. It's faster and offers a much nicer development experience.

On Mon, Mar 9, 2020 at 3:57 PM Robert Metzger <[hidden email]> wrote:
Hi Theo,

However, in most benchmarks, avro turns out to be rather slow in terms of CPU cycles ( e.g. [1] )

Avro is slower compared to what?
You should not only benchmark the CPU cycles for serializing the data. If you are sending JSON strings across the network, you'll probably have a lot more bytes to send across the network, making everything slower (usually network is slower than CPU)

One of the reasons why people use Avro it supports schema evolution.

Regarding your questions:
1. For this use case, you can use the Flink data format as an internal message format (between the star architecture jobs)
2. Generally speaking no
3. You will at leave have a dependency to flink-core. And this is a somewhat custom setup, so you might be facing breaking API changes.
4. I'm not aware of any benchmarks. The Flink serializers are mostly for internal use (between our operators), Kryo is our fallback (to not suffer to much from the not invented here syndrome), while Avro is meant for cross-system serialization.

I have the feeling that you can move ahead with using Flink's Pojo serializer everywhere :)

Best,
Robert


 

On Wed, Mar 4, 2020 at 1:04 PM Theo Diefenthal <[hidden email]> wrote:
Hi,

Without knowing too much about flink serialization, I know that Flinks states that it serializes POJOtypes much faster than even the fast Kryo for Java. I further know that it supports schema evolution in the same way as avro.

In our project, we have a star architecture, where one flink job produces results into a kafka topic and where we have multiple downstream consumers from that kafka topic (Mostly other flink jobs).
For fast development cycles, we currently use JSON as output format for the kafka topic due to easy debugging capabilities and best migration possibilities. However, when scaling up, we need to switch to a more efficient format. Most often, Avro is mentioned in combination with a schema registry, as its much more efficient then JSON where essentially, each message contains the schema as well. However, in most benchmarks, avro turns out to be rather slow in terms of CPU cycles ( e.g. [1] )

My question(s) now:
1. Is it reasonable to use flink serializers as message format in Kafka?
2. Are there any downsides in using flinks serialization result as output format to kafka?
3. Can downstream consumers, written in Java, but not flink components, also easily deserialize flink serialized POJOs? Or do they have a dependency to at least full flink-core?
4. Do you have benchmarks comparing flink (de-)serialization performance to e.g. kryo and avro?

The only thing I come up with why I wouldn't use flink serialization is that we wouldn't have a schema registry, but in our case, we share all our POJOs in a jar which is used by all components, so that is kind of a schema registry already and if we only make avro compatible changes, which are also well treated by flink, that shouldn't be any limitation compared to like avro+registry?

Best regards
Theo