TPC-H Benchmark


TPC-H Benchmark

Alexandros Papadopoulos
Hello all,

   I am trying to run some relational queries on Flink over YARN.
I found two repositories (https://github.com/stratosphere/stratosphere-tpch,
https://github.com/project-flink/flink-perf) with Java and Scala
implementations of some of the benchmark queries.
Running some of them at scale factor 64, reading the dataset
seems to be the bottleneck.
Since I'm new to the Flink community: is there a way to implement those
queries more efficiently?
Also, are there any results of this benchmark for Flink on YARN?

Thanks in advance,

Alex

Re: TPC-H Benchmark

Fabian Hueske
Hi Alex,

these jobs are implemented so that they read text data from HDFS.
Text is a very inefficient (yet very portable and easy-to-use) format for relational data.
There are several formats that are much better suited for reading relational data, such as Hive's ORC or Parquet (also in the Apache Incubator).

The performance problems with text files are manifold:
- Data representation is not native but must be parsed (CPU intensive).
- Data representation is inefficient (an integer might need several characters where 4 bytes would suffice).
- All data must be read, even columns that are not used by the query (a partial mitigation is sketched below).
- No support for pushing filters down for early filtering.
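
Within plain text you can at least cut the parsing cost: Flink's CSV reader can be told to convert only the projected fields. A minimal sketch in the Java API; the field mask assumes the standard 16-column lineitem layout, and the HDFS paths are placeholders:

  // Sketch: parse only l_orderkey and l_extendedprice from lineitem, skip the rest.
  import org.apache.flink.api.java.DataSet;
  import org.apache.flink.api.java.ExecutionEnvironment;
  import org.apache.flink.api.java.tuple.Tuple2;

  public class LineitemProjection {
    public static void main(String[] args) throws Exception {
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

      // Mask over the 16 lineitem columns: '1' = parse, '0' = skip.
      // Positions 1 (l_orderkey) and 6 (l_extendedprice) are kept; the other
      // 14 fields are scanned but never converted into Java objects.
      DataSet<Tuple2<Long, Double>> lineitem = env
          .readCsvFile("hdfs:///tpch/sf64/lineitem.tbl")   // placeholder path
          .fieldDelimiter('|')    // note: newer Flink versions expect a String here
          .includeFields("1000010000000000")
          .types(Long.class, Double.class);

      lineitem.writeAsCsv("hdfs:///tmp/lineitem-projected"); // placeholder sink
      env.execute("csv-projection-sketch");
    }
  }

This still reads every byte from HDFS, so it only helps with the first and third points, not with the I/O volume itself.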

You could port the jobs to use an ORC or Parquet format. Either use Hadoop's InputFormats (Flink supports those; sketched below) or port them to Flink InputFormats (which are very similar to Hadoop's). Using Hadoop's formats might add a little overhead but will be easier...
Having said that, it is not uncommon for I/O to be the bottleneck in data processing systems.
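
Here is a minimal sketch of the Hadoop InputFormat bridge in the Java API. It uses Hadoop's plain TextInputFormat only as a stand-in; a Parquet or ORC input format would plug into the same spot, but the exact Parquet/ORC classes and record types depend on the library version, so they are not shown here. The HadoopInputFormat package name also differs between Flink versions:

  import org.apache.flink.api.java.DataSet;
  import org.apache.flink.api.java.ExecutionEnvironment;
  import org.apache.flink.api.java.tuple.Tuple2;
  // in other Flink versions: org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat
  import org.apache.flink.hadoopcompatibility.mapreduce.HadoopInputFormat;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

  public class HadoopFormatBridge {
    public static void main(String[] args) throws Exception {
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

      // The Hadoop Job object carries the input path and format configuration.
      Job job = Job.getInstance();
      FileInputFormat.addInputPath(job, new Path("hdfs:///tpch/sf64/lineitem.tbl")); // placeholder

      // Wrap the Hadoop InputFormat so Flink can use it as a data source.
      // Swap TextInputFormat for a columnar format to get projection/filter push-down.
      HadoopInputFormat<LongWritable, Text> hadoopIF =
          new HadoopInputFormat<LongWritable, Text>(
              new TextInputFormat(), LongWritable.class, Text.class, job);

      // Records arrive as (byte offset, line) pairs, exactly as in Hadoop MapReduce.
      DataSet<Tuple2<LongWritable, Text>> lines = env.createInput(hadoopIF);

      lines.writeAsText("hdfs:///tmp/bridge-out"); // placeholder sink
      env.execute("hadoop-inputformat-sketch");
    }
  }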

Let us know if you need any help.

Cheers, Fabian


Re: TPC-H Benchmark

rmetzger0
Hi Alex,

"stratosphere-tpch" programs are written against our old Scala API and we haven't really fine-tuned them, so maybe they are not optimally implemented.

We haven't benchmarked Flink explicitly on YARN, but I don't expect the results to be different from non-yarn setups. We use YARN just for deploying our JobManager and TaskManagers and then run everything like we do with direct installations.
The execution is exactly the same for YARN and non-YARN setups.



