Stream collector serialization performance

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Stream collector serialization performance

祁明良
Hi all,

I’m currently using the keyed process function, I see there’s serialization happening when I collect the object / update the object to rocksdb. For me the performance of serialization seems to be the bottleneck.
By default, POJO serializer is used, and the timecost of collect / update to rocksdb is roughly 1:1, Then I switch to kryo by setting getConfig.enableForceKryo(). Now the timecost of update to rocksdb decreases significantly to roughly 0.3, but the collect method seems not improving. Can someone help to explain this?

 My Object looks somehow like this:

Class A {
String f1 // 20 * string fields
List<B> f2. // 20 * list of another POJO object
Int f3 // 20 * ints fields
}
Class B {
String f // 5 * string fields
}

Best,
Mingliang

本邮件及其附件含有小红书公司的保密信息,仅限于发送给以上收件人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件!
This communication may contain privileged or other confidential information of Red. If you have received it in error, please advise the sender by reply e-mail and immediately delete the message and any attachments without copying or disclosing the contents. Thank you.
Reply | Threaded
Open this post in threaded view
|

Re: Stream collector serialization performance

Timo Walther
Hi Mingliang,

first of all the POJO serializer is not very performant. Tuple or Row
are better. If you want to improve the performance of a collect()
between operators, you could also enable object reuse. You can read more
about this here [1] (section "Issue 2: Object Reuse"), but make sure
your implementation is correct because an operator could modify the
objects of follwing operators.

I hope this helps.

Regards,
Timo

[1]
https://data-artisans.com/blog/curious-case-broken-benchmark-revisiting-apache-flink-vs-databricks-runtime


Am 15.08.18 um 09:06 schrieb 祁明良:

> Hi all,
>
> I’m currently using the keyed process function, I see there’s serialization happening when I collect the object / update the object to rocksdb. For me the performance of serialization seems to be the bottleneck.
> By default, POJO serializer is used, and the timecost of collect / update to rocksdb is roughly 1:1, Then I switch to kryo by setting getConfig.enableForceKryo(). Now the timecost of update to rocksdb decreases significantly to roughly 0.3, but the collect method seems not improving. Can someone help to explain this?
>
>   My Object looks somehow like this:
>
> Class A {
> String f1 // 20 * string fields
> List<B> f2. // 20 * list of another POJO object
> Int f3 // 20 * ints fields
> }
> Class B {
> String f // 5 * string fields
> }
>
> Best,
> Mingliang
>
> 本邮件及其附件含有小红书公司的保密信息,仅限于发送给以上收件人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件!
> This communication may contain privileged or other confidential information of Red. If you have received it in error, please advise the sender by reply e-mail and immediately delete the message and any attachments without copying or disclosing the contents. Thank you.