(no subject)

Posted by Robert Waury on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/no-subject-tp165.html

Hi,

I performed the YARN setup on a cluster running Apache Hadoop 2.3.0-cdh5.1.3, as described on the website.

I could see the allocated containers in the YARN ResourceManager, and after I started a Flink job via the CLI client, it showed up on the Flink Dashboard.

The problem is that the job, which runs in about 17 minutes in my local VM (3 cores, 4 GB RAM, input from local files), now takes about 25 minutes on the cluster (18 containers with 4 GB and 8 cores each, input from HDFS with a replication factor of 5).

From the Flink log, it seems that all data was shuffled to a single machine, even for FlatMap operations.

log excerpt:
10:54:08,832 INFO  org.apache.flink.runtime.jobmanager.splitassigner.file.FileInputSplitList  - nceorihad06 (ipcPort=56158, dataPort=55744) receives remote file input split (distance 2147483647)
10:54:08,832 INFO  org.apache.flink.runtime.jobmanager.splitassigner.InputSplitManager  - CHAIN DataSource (TextInputFormat (hdfs:/user/rwaury/input/all_catalog_140410.txt) - UTF-8) -> FlatMap (com.amadeus.pcb.join.FlightConnectionJoiner$FilteringUTCExtractor) (1/1) receives input split 5
10:54:09,589 INFO  org.apache.flink.runtime.jobmanager.splitassigner.file.FileInputSplitList  - nceorihad06 (ipcPort=56158, dataPort=55744) receives remote file input split (distance 2147483647)
10:54:09,590 INFO  org.apache.flink.runtime.jobmanager.splitassigner.InputSplitManager  - CHAIN DataSource (TextInputFormat (hdfs:/user/rwaury/input/all_catalog_140410.txt) - UTF-8) -> FlatMap (com.amadeus.pcb.join.FlightConnectionJoiner$FilteringUTCExtractor) (1/1) receives input split 128
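
Two details in the excerpt can be checked mechanically; the following is only a rough sketch, not a diagnosis. It assumes that the "(1/1)" marker denotes subtask index / number of parallel subtasks, and it relies on the fact that 2147483647 is Java's Integer.MAX_VALUE. The class and method names (SplitLogCheck, distance, subtaskCount) are hypothetical and not part of Flink.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitLogCheck {
    // Matches the "(distance N)" suffix of a FileInputSplitList log line.
    private static final Pattern DISTANCE =
            Pattern.compile("\\(distance (\\d+)\\)");
    // Matches the "(i/n)" marker right before "receives input split".
    // Assumption: i is the subtask index and n the number of parallel subtasks.
    private static final Pattern SUBTASK =
            Pattern.compile("\\((\\d+)/(\\d+)\\) receives input split");

    /** Returns the distance reported in the line, or -1 if there is none. */
    public static long distance(String line) {
        Matcher m = DISTANCE.matcher(line);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    /** Returns n from the "(i/n)" marker, or -1 if the marker is absent. */
    public static int subtaskCount(String line) {
        Matcher m = SUBTASK.matcher(line);
        return m.find() ? Integer.parseInt(m.group(2)) : -1;
    }

    public static void main(String[] args) {
        // Message parts copied from the excerpt; operator names are shortened to "(...)".
        String splitLine = "nceorihad06 (ipcPort=56158, dataPort=55744) "
                + "receives remote file input split (distance 2147483647)";
        String taskLine = "CHAIN DataSource (...) -> FlatMap (...) (1/1) receives input split 5";

        // True when the reported distance is the largest possible int value.
        System.out.println(distance(splitLine) == Integer.MAX_VALUE);
        // Number of subtasks named by the marker.
        System.out.println(subtaskCount(taskLine));
    }
}
```

If that reading of the marker is right, both split assignments in the excerpt go to the only subtask of the DataSource -> FlatMap chain, and both splits are reported as remote with the maximum possible distance value.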

The job takes two large input files (~9 GB). After filtering and converting them with a FlatMap (selectivity is below 1%), it joins each of them twice with a small data set (< 1 MB). The join results are then joined with each other. The result is about 2.7 GB.

Any idea what causes this?

Cheers,
Robert