Flink 1.1.1 is running on AWS / EMR: 3 boxes with a total of 24 cores and 90 GB of RAM.

The job is submitted via YARN.

Topology: read CSV files from SQS -> parse each file by line and create an object per line -> pass through a 'KeySelector' to pair entries (by hash) over a 60-second window -> write the original and matched sets to BigQuery.

Each file contains ~15K lines and there are ~10 files / second.

My topology can't keep up with this stream. What am I doing wrong? Articles like this one (http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/) speak of > 1 million events / sec / core. I'm not clear what constitutes an 'event', but given the number of cores I'm throwing at this problem I would expect higher throughput.

I run the job as:

HADOOP_CONF_DIR=/etc/hadoop/conf bin/flink run -m yarn-cluster -yn 3 -ys 8 -yst -ytm 4096 ../flink_all.jar
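For illustration only, a minimal sketch of the keying and windowing step described above could look like the following. env.fromElements stands in for the SQS source and print() for the BigQuery sink; all class and variable names are placeholders, not taken from the actual job.

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class PairingSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the SQS/CSV source: each Tuple2 is (hash, original line).
        // Note: with this tiny bounded input the 60s processing-time window may not
        // fire before the job ends; the real job reads continuously from SQS.
        DataStream<Tuple2<String, String>> lines = env.fromElements(
                new Tuple2<>("h1", "line-a"),
                new Tuple2<>("h1", "line-b"),
                new Tuple2<>("h2", "line-c"));

        lines
            // Pair entries by hash via a KeySelector, as in the question.
            .keyBy(new KeySelector<Tuple2<String, String>, String>() {
                @Override
                public String getKey(Tuple2<String, String> value) {
                    return value.f0;
                }
            })
            // 60-second tumbling window.
            .timeWindow(Time.seconds(60))
            // Collect all lines that share a hash within the window.
            .apply(new WindowFunction<Tuple2<String, String>, String, String, TimeWindow>() {
                @Override
                public void apply(String hash, TimeWindow window,
                                  Iterable<Tuple2<String, String>> input,
                                  Collector<String> out) {
                    StringBuilder matched = new StringBuilder(hash).append(":");
                    for (Tuple2<String, String> line : input) {
                        matched.append(" ").append(line.f1);
                    }
                    out.collect(matched.toString());
                }
            })
            // The real job writes to BigQuery; print() keeps the sketch self-contained.
            .print();

        env.execute("pairing sketch");
    }
}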
Hey Jon! Thanks for sharing this. The blog post refers to each record in the stream as an event.

The YARN command you've shared looks good. You could give the machines more memory, but I would not expect that to be the problem here. I would rather suspect that the sources are the bottleneck (though of course I'm not sure). As a first sanity check, look at the back pressure monitoring tab in the web frontend (see [1]). What status do you see there? If the sources all say "OK", they are producing as fast as they can and the part of the topology after them is not back pressuring or slowing them down.

Can we be sure that SQS can be read with the throughput you need? You could also replace the sources with a custom source function that produces data inside Flink, to check how much data the rest of the topology can handle.

[1] https://ci.apache.org/projects/flink/flink-docs-master/internals/back_pressure_monitoring.html

On Mon, Aug 15, 2016 at 10:22 PM, Jon Yeargers <[hidden email]> wrote:
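To illustrate that last suggestion, a minimal sketch of a data-generating source might look like the one below. It uses Flink's RichParallelSourceFunction; the record shape and all names are illustrative only and not from this thread.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class GeneratorSourceSketch {

    /** Emits synthetic (hash, line) records as fast as the downstream operators accept them. */
    public static class SyntheticLineSource extends RichParallelSourceFunction<Tuple2<String, String>> {

        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Tuple2<String, String>> ctx) throws Exception {
            long i = 0;
            while (running) {
                // Roughly mimic the real data: ~15K distinct "lines" per synthetic "file".
                String hash = "hash-" + (i % 15000);
                // Emit under the checkpoint lock so emission and checkpointing do not interleave.
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(new Tuple2<>(hash, "synthetic-line-" + i));
                }
                i++;
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Plug the generator in where the SQS source normally sits and keep the rest of
        // the topology (keyBy -> window -> sink) unchanged to measure how much data the
        // downstream operators can absorb without the external source in the picture.
        DataStream<Tuple2<String, String>> synthetic = env.addSource(new SyntheticLineSource());

        synthetic.print(); // replace with the real keyBy/window/sink chain

        env.execute("source throughput sanity check");
    }
}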