Very low-latency - is it possible?

Marchant, Hayden
We're about to get started on a 9-person-month PoC using Flink Streaming. Before we start, I'd like to know how low an end-to-end latency I can expect for a single event (from source to sink).

Here is a very high-level description of our Flink design:
We need at-least-once semantics. Our main application flow parses a message from Kafka (< 50 microseconds), does a keyBy on the parsed event (< 1 KB), updates a very small user state in the KeyedStream, and then does a second keyBy followed by an operator on that KeyedStream. Each operator is very simple: very little calculation and no I/O.


** Our requirement is to get close to 1 ms (for 99% of events) or lower for end-to-end processing (the timer starts once we get the message from Kafka). Is this at all realistic if our flow contains 2 aggregations? If so, what optimizations might we need to get there in terms of cluster configuration (both Flink and hardware)? Our throughput is possibly small enough (40,000 events per second) that we could run on one node, which might eliminate some network latency.

I did read in the "Exactly Once vs. At Least Once" section of https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html that a few milliseconds is considered super low latency, so I'm wondering whether we can get lower.

Any advice or 'war stories' are very welcome.

Thanks,
Hayden Marchant


Re: Very low-latency - is it possible?

Jörn Franke
If you really need to get that low, something else might be more suitable. Given those latency targets, a custom solution might be necessary. Flink is a powerful general-purpose framework, so it is not built to address latencies this low.

> On 31. Aug 2017, at 14:50, Marchant, Hayden <[hidden email]> wrote:
>
> [...]
Re: Very low-latency - is it possible?

Piotr Nowojski
Achieving 1 ms in any distributed system might be problematic, because even the simplest ping messages between worker nodes take ~0.2 ms.
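
To make the budget concrete, here is a rough back-of-the-envelope sketch in Java. It uses only figures quoted in this thread (1 ms target, 50 microsecond parse bound, ~0.2 ms ping, 40,000 events per second). The assumption that each of the two keyBy steps costs one network hop of about the ping time is mine and is deliberately pessimistic; the class and method names are hypothetical.

```java
// Rough latency-budget arithmetic; inputs are the figures quoted in this thread.
final class LatencyBudget {

    /** Microseconds left after subtracting parse time and network hops from the target. */
    static long remainingMicros(long targetMicros, long parseMicros,
                                int networkHops, long perHopMicros) {
        return targetMicros - parseMicros - (long) networkHops * perHopMicros;
    }

    /** Average gap between consecutive events at a given rate, in microseconds. */
    static double interArrivalMicros(long eventsPerSecond) {
        return 1_000_000.0 / eventsPerSecond;
    }

    public static void main(String[] args) {
        // Assumption: each keyBy is one network hop costing about one ping (~200 us).
        System.out.println(remainingMicros(1_000, 50, 2, 200)); // prints 550
        System.out.println(interArrivalMicros(40_000));         // prints 25.0
    }
}
```

Under these assumptions roughly half of the 1 ms budget remains for everything else, and events arrive on average every 25 microseconds, which is less than the 50 microsecond parse bound, so a single parsing thread could fall behind in the worst case.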

However, since your stated throughput (40k records/s) and state are both small, maybe there is no need for a distributed system at all? You could try running a single-node Flink instance (or a 2-node instance with parallelism set to 1, just for automatic failure recovery).

As Jörn wrote earlier, it might be simpler to write a custom standalone Java application for that. As long as your state fits into the memory of a single node, you should easily be able to process millions of records per second on a single machine.
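
A minimal sketch of what such a standalone application could look like, assuming a single thread and plain in-memory maps. The message format ("userId,amount"), the choice of second key, and all names are hypothetical stand-ins for the flow described in the original post. Kafka consumption, at-least-once handling, and failure recovery are left out, so this only illustrates the two keyed steps and a way to measure per-event latency.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-threaded flow: parse -> key by user -> update state
// -> key by bucket -> aggregate. No Kafka and no fault tolerance.
final class StandalonePipeline {
    // Stage 1 state: per-user event count (stands in for the small user state).
    private final Map<String, Long> perUser = new HashMap<>();
    // Stage 2 state: per-bucket running sum of amounts.
    private final Map<Long, Long> perBucket = new HashMap<>();

    /** Processes one raw message of the assumed form "userId,amount". */
    long process(String raw) {
        int comma = raw.indexOf(',');
        String userId = raw.substring(0, comma);
        long amount = Long.parseLong(raw.substring(comma + 1));
        long count = perUser.merge(userId, 1L, Long::sum);  // first keyed step
        long bucket = count % 10;                           // second key (assumed)
        return perBucket.merge(bucket, amount, Long::sum);  // second keyed step
    }

    public static void main(String[] args) {
        StandalonePipeline pipeline = new StandalonePipeline();
        int n = 200_000;
        long[] nanos = new long[n];
        for (int i = 0; i < n; i++) {
            String raw = "user" + (i % 1_000) + "," + i;
            long start = System.nanoTime();
            pipeline.process(raw);
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        // The result depends on the machine and on JIT warm-up.
        System.out.println("p99 per-event latency (ns): " + nanos[(int) (n * 0.99)]);
    }
}
```

Measuring the 99th percentile this way only covers in-process work. The time to fetch from Kafka and to write to the sink would have to be added on top.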

Piotrek

> On Aug 31, 2017, at 3:01 PM, Jörn Franke <[hidden email]> wrote:
>
> [...]