Performance issues - is my topology not set up properly?


Performance issues - is my topology not set up properly?

Jon Yeargers
Flink 1.1.1 is running on AWS / EMR: 3 boxes, 24 cores and 90 GB of RAM in total.

The job is submitted via YARN.

Topology:

read CSV files from SQS -> parse each file by line and create an object per line -> pass through a 'KeySelector' to pair entries (by hash) over a 60-second window -> write original and matched sets to BigQuery.

Each file contains ~15K lines, and there are ~10 files/second.

My topology can't keep up with this stream. What am I doing wrong?

Articles like this (http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/) speak of > 1 million events / sec / core. I'm not clear what constitutes an 'event', but given the number of cores I'm throwing at this problem I would expect higher throughput.
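A back-of-envelope calculation from the numbers above (~15K lines per file, ~10 files per second, 24 cores) shows the required ingest rate; this is just arithmetic on the figures stated in the post, assuming records spread evenly across cores:

```java
// Back-of-envelope throughput estimate from the figures in the post:
// ~15K lines per file, ~10 files per second, 24 cores total.
public class ThroughputEstimate {
    static final int LINES_PER_FILE = 15_000;
    static final int FILES_PER_SEC = 10;
    static final int CORES = 24;

    // Total records the topology must ingest per second.
    static int totalPerSec() {
        return LINES_PER_FILE * FILES_PER_SEC;
    }

    // Records each core must handle per second, assuming an even spread.
    static double perCore() {
        return (double) totalPerSec() / CORES;
    }

    public static void main(String[] args) {
        System.out.println(totalPerSec() + " records/sec total, "
                + perCore() + " records/sec/core");
    }
}
```

So the stream is roughly 150K records/sec, or about 6,250 records/sec/core - well below the per-core figures in the blog post, which suggests the bottleneck is elsewhere (source, window state, or sink) rather than raw per-core capacity.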

I run the job as:

HADOOP_CONF_DIR=/etc/hadoop/conf bin/flink run -m yarn-cluster -yn 3 -ys 8 -yst -ytm 4096 ../flink_all.jar


Re: Performance issues - is my topology not set up properly?

Ufuk Celebi
Hey Jon! Thanks for sharing this. The blog post refers to each record
in the stream as an event.

The YARN command you've shared looks good. You could give the machines
more memory, but I would not expect that to be the problem here. More
likely the sources are the bottleneck (though of course I can't be
sure).

As a first sanity check, look at the back pressure monitoring tab in
the web frontend (see [1]). What status do you see there? If the
sources all say "OK", they are producing as fast as they can, and the
part of the topology after them is not back pressuring (slowing down)
them.

Can you be sure that SQS can be read at the throughput you need?

You could also replace the sources with a custom source function
producing data inside of Flink to check how much data the rest of the
topology can handle.

[1] https://ci.apache.org/projects/flink/flink-docs-master/internals/back_pressure_monitoring.html



On Mon, Aug 15, 2016 at 10:22 PM, Jon Yeargers <[hidden email]> wrote:

> Flink 1.1.1 is running on AWS / EMR. 3 boxes - total 24 cores and 90Gb of
> RAM.
>
> Job is submitted via yarn.
>
> Topology:
>
> read csv files from SQS -> parse files by line  and create object for each
> line -> pass through 'KeySelector' to pair entries (by hash) over 60 second
> window -> write original and matched sets to BigQuery.
>
> Each file contains ~ 15K lines and there are ~10 files / second.
>
> My topology can't keep up with this stream. What am I doing wrong?
>
> Articles like this
> (http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/)
> speak of > 1 million events / sec / core. Im not clear what constitutes an
> 'event' but given the number of cores Im throwing at this problem I would
> expect higher throughput.
>
> I run the job as :
>
> HADOOP_CONF_DIR=/etc/hadoop/conf bin/flink run -m yarn-cluster -yn 3 -ys 8
> -yst -ytm 4096 ../flink_all.jar