Flink performance pre-packaged vs. self-compiled

Robert Schmidtke
Hi everyone,

I'm using Flink 0.10.2 for some benchmarks and had to make some small changes to Flink, so I compiled and ran it myself. That is when I noticed a performance difference between the pre-packaged Flink version that I downloaded from the web (http://archive.apache.org/dist/flink/flink-0.10.2/flink-0.10.2-bin-hadoop27.tgz) and the release-0.10 branch I built myself (mvn -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests -Drat.skip=true clean install // mvn version 3.0.4).

I ran a version of TeraSort (https://github.com/eastcirclek/terasort) and noticed that the pre-packaged version of Flink performs 10-20% better than the one I built myself (the only tweaks I made are in the CliFrontend after the job has finished running, so I would rule out bad programming on my side).

Has anyone come across this before? Or could you provide me with clearer build instructions in order to reproduce the downloadable archive as closely as possible? Thanks in advance!

Robert

--
My GPG Key ID: 336E2680

Re: Flink performance pre-packaged vs. self-compiled

rmetzger0
Hi Robert,

Check out the tools/create_release_files.sh file in the source tree. There you can see how we build the release binaries.
It would be quite interesting to find out what caused the performance difference.


Re: Flink performance pre-packaged vs. self-compiled

Robert Schmidtke
Hi Robert,

Thanks for the hint! Looks like something I could have figured out myself -.-" I'll let you know if I find something.

Robert


Re: Flink performance pre-packaged vs. self-compiled

Robert Schmidtke
I have tried multiple Maven and Scala versions, but to no avail. I can't seem to match the performance of the downloaded archive. I am stumped by this and will need to do more experiments when I have more time.

Robert


Re: Flink performance pre-packaged vs. self-compiled

Ovidiu-Cristian MARCU
Hi,

Your assumption regarding the TeraSort use case with eastcirclek's implementation may be incorrect.
How many times did you run your program?
It would be helpful to give more details about your experiment, such as the configuration and the dataset size.

Best,
Ovidiu
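
The question about the number of runs matters because a 10-20% gap only means something if it is clearly larger than the run-to-run spread. A minimal sketch of that check follows; the runtimes are placeholder values for illustration, not measurements from this thread, and the helper names are hypothetical.

```python
from statistics import mean, stdev


def summarize(runtimes_s):
    """Return (mean, sample standard deviation) of a list of runtimes in seconds."""
    return mean(runtimes_s), stdev(runtimes_s)


def relative_gap(baseline_s, candidate_s):
    """Relative slowdown of candidate vs. baseline, based on mean runtimes."""
    base = mean(baseline_s)
    return (mean(candidate_s) - base) / base


# Placeholder runtimes in seconds (hypothetical, NOT taken from the thread).
prepackaged = [100.0, 102.0, 98.0, 100.0]
self_built = [115.0, 117.0, 113.0, 115.0]

for name, runs in (("pre-packaged", prepackaged), ("self-built", self_built)):
    avg, spread = summarize(runs)
    print(f"{name}: mean={avg:.1f}s stdev={spread:.2f}s over {len(runs)} runs")

print(f"relative gap: {relative_gap(prepackaged, self_built):.1%}")
```

If the gap is of the same order as the standard deviation, more runs are needed before looking for a cause in the build.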


Re: Flink performance pre-packaged vs. self-compiled

Robert Schmidtke
You're obviously right; the configs were different. In the downloaded version I had set off-heap memory to true, whereas in the version I compiled myself this one-time change to flink-conf.yaml was overwritten by recompiling. I have fixed it now and performance is the same.
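
A quick way to catch this kind of mismatch is to compare the two flink-conf.yaml files key by key before benchmarking. The sketch below assumes the flat "key: value" layout of the file and uses made-up file contents; the key name taskmanager.memory.off-heap is an assumption about which setting is meant, since the message only says "off heap memory".

```python
def parse_flat_conf(text):
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments.

    Simplification: everything after a '#' is treated as a comment, so
    values that contain '#' are not handled.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)  # split on the first colon only
        settings[key.strip()] = value.strip()
    return settings


def diff_conf(left, right):
    """Return {key: (left_value, right_value)} for keys whose values differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# Hypothetical file contents for illustration only.
downloaded = """
jobmanager.rpc.port: 6123
taskmanager.memory.off-heap: true
"""
self_built = """
jobmanager.rpc.port: 6123
# taskmanager.memory.off-heap: false
"""

changes = diff_conf(parse_flat_conf(downloaded), parse_flat_conf(self_built))
for key, (a, b) in changes.items():
    print(f"{key}: downloaded={a!r} self-built={b!r}")
```

In practice the two strings would be read from the conf/ directories of the two installations; a value of None means the key is absent or commented out on that side.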

For the record, I had 30 GiB of TeraGen'd data and ran the job with these options:

-m yarn-cluster \
  -yn 10 \
  -ys 4 \
  -p 40 \
  -yjm 3072 \
  -ytm 4096

Each of the nodes has 64 GiB of RAM; the job ran in 27 s, repeatedly.
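
As a sanity check on these numbers, the arithmetic can be written out as below. It only uses the values quoted above and reads the flags in their usual Flink-on-YARN sense (-yn containers, -ys slots per TaskManager, -p parallelism, -ytm TaskManager memory in MB); the variable names are illustrative.

```python
containers = 10        # -yn: number of YARN containers (TaskManagers)
slots_per_tm = 4       # -ys: slots per TaskManager
parallelism = 40       # -p: job parallelism
tm_memory_mb = 4096    # -ytm: memory per TaskManager container, in MB
data_gib = 30          # TeraGen'd input size
runtime_s = 27         # reported job runtime

total_slots = containers * slots_per_tm                 # 40, matches -p
total_tm_memory_gib = containers * tm_memory_mb / 1024  # 40.0 GiB across all TaskManagers
throughput_gib_s = data_gib / runtime_s                 # about 1.11 GiB/s overall
data_per_slot_mib = data_gib * 1024 / parallelism       # 768.0 MiB of input per parallel task

print(f"slots: {total_slots} (parallelism {parallelism})")
print(f"TaskManager memory in total: {total_tm_memory_gib} GiB")
print(f"throughput: {throughput_gib_s:.2f} GiB/s")
print(f"input per parallel task: {data_per_slot_mib} MiB")
```

So the requested parallelism exactly fills the available slots, and the TaskManagers together hold more memory than the 30 GiB input.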

Thanks and sorry for not having checked the obvious ...

Robert
