Hello all,
I am trying to run some relational queries on Flink over YARN. I found two repositories (https://github.com/stratosphere/stratosphere-tpch, https://github.com/project-flink/flink-perf) with Java and Scala implementations of some of the benchmark queries. When I run some of them with scale factor 64, reading the dataset seems to be the bottleneck. Since I am new to the Flink community: is there any way to implement these queries more efficiently? Also, are there any results of this benchmark for Flink on YARN?

Thanks in advance,
Alex
Hi Alex,

these jobs are implemented in a way that they read text data from HDFS. There are several formats which are much better suited to read relational data, such as Hive's ORC or Parquet (also in Apache Incubation).

The performance problems with text files are manifold:
- Data representation is not native but must be parsed (CPU intensive)
- Data representation is inefficient (an integer might need several characters where 4 bytes would suffice)
- All data must be read, even columns that are not used by the query
- No support to push filters down for early filtering

Having said that, it is not uncommon that I/O is the bottleneck in data processing systems.

Let us know if you need any help.

Cheers,
Fabian

2014-09-22 12:12 GMT+02:00 Alexandros Papadopoulos <[hidden email]>:
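The first two points in the list above (parsing cost and encoding size) can be illustrated with a small, self-contained Java sketch. It is not taken from either repository and does not use the Flink, ORC, or Parquet APIs; the class name `TextVsBinary`, its helper methods, and the sample row are all hypothetical and chosen only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: compares a pipe-delimited text row
// with a fixed-width binary row holding the same integer columns.
final class TextVsBinary {

    // Number of bytes needed to store one int as decimal ASCII text
    // (the field delimiter is not counted).
    static int textBytes(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.US_ASCII).length;
    }

    // Number of bytes needed to store one int in fixed-width binary form.
    static int binaryBytes() {
        return Integer.BYTES;
    }

    // Text path: the whole line is split into fields and the requested
    // field is parsed from characters into an int.
    static int parseColumn(String line, int column) {
        String[] fields = line.split("\\|");
        return Integer.parseInt(fields[column]);
    }

    // Encodes the given values as consecutive 4-byte big-endian ints.
    static byte[] encode(int[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buffer.putInt(v);
        }
        return buffer.array();
    }

    // Binary path: the requested column is read directly at a known
    // offset, without scanning or parsing the other columns.
    static int readBinary(byte[] record, int column) {
        return ByteBuffer.wrap(record).getInt(column * Integer.BYTES);
    }

    public static void main(String[] args) {
        String line = "1|155190|7706|1|17";           // made-up sample row
        byte[] record = encode(new int[] {1, 155190, 7706, 1, 17});

        System.out.println("text bytes for 1234567890: " + textBytes(1234567890));
        System.out.println("binary bytes for any int:  " + binaryBytes());
        System.out.println("column 1 from text:        " + parseColumn(line, 1));
        System.out.println("column 1 from binary:      " + readBinary(record, 1));
    }
}
```

Small values can be shorter as text than as a 4-byte int, so the size argument mainly concerns large keys and wide rows. Skipping unused columns and pushing filters down are features of columnar formats such as ORC and Parquet and are not modeled in this sketch.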
Hi Alex,

the "stratosphere-tpch" programs are written against our old Scala API and we haven't really fine-tuned them, so they may not be optimally implemented.

We haven't benchmarked Flink explicitly on YARN, but I don't expect the results to be different from non-YARN setups. We use YARN just for deploying our JobManager and TaskManagers and then run everything like we do with direct installations. The execution is exactly the same for YARN and non-YARN setups.

On Mon, Sep 22, 2014 at 12:25 PM, Fabian Hueske <[hidden email]> wrote: