(DEPRECATED) Apache Flink User Mailing List archive.

Flink performance pre-packaged vs. self-compiled


Robert Schmidtke

Flink performance pre-packaged vs. self-compiled

Hi everyone,

I'm using Flink 0.10.2 for some benchmarks and had to make some small changes to Flink, which led me to compile and run it myself. That is when I noticed a performance difference between the pre-packaged Flink version that I downloaded from the web (http://archive.apache.org/dist/flink/flink-0.10.2/flink-0.10.2-bin-hadoop27.tgz) and the release-0.10 branch I built myself (mvn -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests -Drat.skip=true clean install // mvn version 3.0.4).

I ran a version of TeraSort (https://github.com/eastcirclek/terasort) and noticed that the pre-packaged version of Flink performs 10-20% better than the one I built myself (the only tweaks I made are in the CliFrontend after the job has finished running, so I would rule out bad programming on my side).

Has anyone come across this before? Or could you provide me with clearer build instructions in order to reproduce the downloadable archive as closely as possible? Thanks in advance!

Robert

My GPG Key ID: 336E2680

rmetzger0

Re: Flink performance pre-packaged vs. self-compiled

Hi Robert,

check out the tools/create_release_files.sh file in the source tree. There you can see how we are building the release binaries.

It would be quite interesting to find out what caused the performance difference.

On Wed, Apr 13, 2016 at 5:03 PM, Robert Schmidtke <[hidden email]> wrote:

Has anyone come across this before? Or could you provide me with clearer build instructions in order to reproduce the downloadable archive as closely as possible? Thanks in advance!


Robert Schmidtke

Re: Flink performance pre-packaged vs. self-compiled

Hi Robert,

thanks for the hint! Looks like something I could have figured out myself -.-" I'll let you know if I find something.

Robert

On Thu, Apr 14, 2016 at 1:06 PM, Robert Metzger <[hidden email]> wrote:

Hi Robert,

check out the tools/create_release_files.sh file in the source tree. There you can see how we are building the release binaries.
It would be quite interesting to find out what caused the performance difference.


Robert Schmidtke

Re: Flink performance pre-packaged vs. self-compiled

I have tried multiple Maven and Scala versions, but to no avail. I can't seem to match the performance of the downloaded archive. I am stumped by this and will need to do more experiments when I have more time.

Robert

On Thu, Apr 14, 2016 at 1:13 PM, Robert Schmidtke <[hidden email]> wrote:

Hi Robert,

thanks for the hint! Looks like something I could have figured out myself -.-" I'll let you know if I find something.

Robert


Ovidiu-Cristian MARCU

Re: Flink performance pre-packaged vs. self-compiled

Hi,

Your assumption may not hold for the TeraSort use case with eastcirclek's implementation.

How many times did you run your program?

It would help if you gave more details about your experiment, such as the configuration and the dataset size.
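
One way to answer the question about repeated runs is to time each build several times and compare the means against the run-to-run spread. The sketch below does that in Python. The timing values are placeholders chosen for illustration, not measurements from this thread, and the function names are hypothetical.

```python
import statistics


def summarize(runtimes_s):
    """Return (mean, sample standard deviation) of runtimes in seconds."""
    mean = statistics.mean(runtimes_s)
    # A single run has no spread to estimate, so report 0.0 in that case.
    stdev = statistics.stdev(runtimes_s) if len(runtimes_s) > 1 else 0.0
    return mean, stdev


def relative_difference(baseline_s, candidate_s):
    """Relative slowdown of candidate vs. baseline (0.15 means 15% slower)."""
    base_mean, _ = summarize(baseline_s)
    cand_mean, _ = summarize(candidate_s)
    return (cand_mean - base_mean) / base_mean


# Placeholder timings in seconds (assumed values, NOT from this thread).
prepackaged = [30.0, 31.0, 29.0, 30.0]
self_built = [33.0, 36.0, 34.0, 35.0]

for name, runs in (("pre-packaged", prepackaged), ("self-built", self_built)):
    mean, stdev = summarize(runs)
    print(f"{name}: mean={mean:.2f}s stdev={stdev:.2f}s over {len(runs)} runs")

print(f"relative difference: {relative_difference(prepackaged, self_built):.1%}")
```

If the difference between the means is not clearly larger than the standard deviations, more runs are needed before blaming the build.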

Best,

Ovidiu

On 14 Apr 2016, at 17:14, Robert Schmidtke <[hidden email]> wrote:

I have tried multiple Maven and Scala versions, but to no avail. I can't seem to match the performance of the downloaded archive. I am stumped by this and will need to do more experiments when I have more time.

Robert


Robert Schmidtke

Re: Flink performance pre-packaged vs. self-compiled

You're obviously right: the configs were different. In the downloaded version I had enabled off-heap memory, whereas in the version I compiled myself this one-time change to flink-conf.yaml was overwritten when I recompiled. I have fixed it now and the performance is the same.
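
A quick way to catch this kind of drift is to diff the two flink-conf.yaml files before benchmarking. The following is a minimal sketch, assuming the file contains only flat "key: value" lines. The key names and values in the sample strings are assumptions for illustration (in particular, taskmanager.memory.off-heap is believed to be the off-heap switch in 0.10 and should be checked against the actual file), and parse_flat_yaml and diff_configs are hypothetical helpers.

```python
def parse_flat_yaml(text):
    """Parse flat 'key: value' lines, ignoring blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        # Drop anything after '#'; a sketch-level rule that would also
        # truncate values that legitimately contain '#'.
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def diff_configs(left, right):
    """Return {key: (left_value, right_value)} for keys whose values differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# Sample contents (assumed keys and values, for illustration only).
DOWNLOADED_CONF = """\
jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
taskmanager.memory.off-heap: true
"""

REBUILT_CONF = """\
# fresh file produced by the rebuild, manual edit lost
jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
"""

differences = diff_configs(parse_flat_yaml(DOWNLOADED_CONF), parse_flat_yaml(REBUILT_CONF))
for key, (old, new) in differences.items():
    print(f"{key}: downloaded={old!r} rebuilt={new!r}")
```

In practice the two strings would be read from the conf directories of the two installations. A value of None marks a key that is missing from that file.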

For the record, I had 30 GiB of TeraGen'd data and ran with these options:

-m yarn-cluster \
-yn 10 \
-ys 4 \
-p 40 \
-yjm 3072 \
-ytm 4096

Each of the nodes has 64 GiB of RAM, and the job repeatedly ran in 27 s.
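
As a sanity check on these numbers, the sketch below works through the arithmetic in Python. It assumes the usual meaning of the Flink YARN client options (-yn: number of TaskManager containers, -ys: slots per TaskManager, -p: parallelism, -ytm: memory per TaskManager in MB). The helper names are hypothetical.

```python
def total_slots(containers, slots_per_taskmanager):
    """Slots available across all TaskManagers."""
    return containers * slots_per_taskmanager


def input_per_subtask_mib(input_gib, parallelism):
    """Input share per parallel subtask in MiB, assuming an even split."""
    return input_gib * 1024 / parallelism


def throughput_mib_per_s(input_gib, runtime_s):
    """End-to-end throughput in MiB/s for the whole job."""
    return input_gib * 1024 / runtime_s


# Values taken from the options and figures quoted in this message.
containers = 10      # -yn
slots_per_tm = 4     # -ys
parallelism = 40     # -p
tm_memory_mb = 4096  # -ytm
input_gib = 30       # TeraGen'd input
runtime_s = 27       # reported job runtime

print("total slots:", total_slots(containers, slots_per_tm))
print("parallelism fits:", parallelism <= total_slots(containers, slots_per_tm))
print("TaskManager memory in total (MB):", containers * tm_memory_mb)
print("input per subtask (MiB):", input_per_subtask_mib(input_gib, parallelism))
print("throughput (MiB/s):", round(throughput_mib_per_s(input_gib, runtime_s), 2))
```

With 10 containers of 4 slots each, the parallelism of 40 uses every slot exactly once, so each subtask handles roughly 768 MiB of input if the data is split evenly.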

Thanks and sorry for not having checked the obvious ...

Robert

On Thu, Apr 14, 2016 at 10:23 PM, Ovidiu-Cristian MARCU <[hidden email]> wrote:

Hi,

Your assumption may not hold for the TeraSort use case with eastcirclek's implementation.
How many times did you run your program?
It would help if you gave more details about your experiment, such as the configuration and the dataset size.

Best,
Ovidiu
