(DEPRECATED) Apache Flink User Mailing List archive.

Batch Flink Job S3 write performance vs Spark

sri hari kali charan Tummala

Batch Flink Job S3 write performance vs Spark

Hi All,

I have a question: has anyone compared the performance of a Flink batch job writing to S3 with that of Spark writing to S3?

Thanks & Regards

Sri Tummala

Arvid Heise-3

Re: Batch Flink Job S3 write performance vs Spark

Fair benchmarks are notoriously difficult to set up.

Usually, it's easy to find a workload where one system shines, and as its vendor you report that. Then a competitor benchmarks a different use case where their system comes out ahead. In the end, customers are more confused than before.

You should run your own benchmarks on your own workloads. That is the only reliable way.

Both systems use similar setups, and improvements in one are often incorporated into the other after some delay. Since both run on Java and use the same set of libraries, there should be no ground-breaking differences between them.

Of course, if one system has a very specific optimization for your use case, it could be much faster.
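
The advice above to benchmark your own workload could be sketched roughly as follows. This is only a minimal timing harness, not a definitive methodology: `time_runs`, `summarize`, and `placeholder_job` are hypothetical names introduced for illustration, and the placeholder would need to be replaced by code that submits your actual Flink or Spark job and waits for it to finish writing to S3.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(job: Callable[[], None], warmup: int = 1, runs: int = 5) -> List[float]:
    """Call `job` `warmup` times untimed, then `runs` times timed.

    Returns the wall-clock duration of each timed run in seconds.
    """
    for _ in range(warmup):
        job()  # warm-up runs are executed but not measured
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce per-run durations to a few summary numbers."""
    return {
        "runs": len(durations),
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


def placeholder_job() -> None:
    # Hypothetical stand-in (assumption): replace with a call that launches
    # your real batch job and blocks until the S3 write has completed.
    sum(range(100_000))


if __name__ == "__main__":
    print(summarize(time_runs(placeholder_job)))
```

Running the same harness against both systems, with the same input data, cluster size, and output format, keeps the comparison closer to like for like. The median is reported because a single slow run can distort the mean.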

On Mon, Feb 24, 2020 at 11:26 PM sri hari kali charan Tummala <[hidden email]> wrote:

Hi All,

I have a question: has anyone compared the performance of a Flink batch job writing to S3 with that of Spark writing to S3?

--
Thanks & Regards
Sri Tummala

sri hari kali charan Tummala

Re: Batch Flink Job S3 write performance vs Spark

Thank you. Regarding "the two systems running on Java and using the same set of libraries": as I understand it, Flink uses the AWS SDK behind the scenes, the same as Spark.

On Wed, Feb 26, 2020 at 8:49 AM Arvid Heise <[hidden email]> wrote:

Fair benchmarks are notoriously difficult to set up.

Usually, it's easy to find a workload where one system shines, and as its vendor you report that. Then a competitor benchmarks a different use case where their system comes out ahead. In the end, customers are more confused than before.

You should run your own benchmarks on your own workloads. That is the only reliable way.

Both systems use similar setups, and improvements in one are often incorporated into the other after some delay. Since both run on Java and use the same set of libraries, there should be no ground-breaking differences between them.
Of course, if one system has a very specific optimization for your use case, it could be much faster.

Thanks & Regards

Sri Tummala

Arvid Heise-3

Re: Batch Flink Job S3 write performance vs Spark

Exactly. We use the hadoop-fs as an indirection on top of that, but Spark probably does the same.

On Wed, Feb 26, 2020 at 3:52 PM sri hari kali charan Tummala <[hidden email]> wrote:

Thank you. Regarding "the two systems running on Java and using the same set of libraries": as I understand it, Flink uses the AWS SDK behind the scenes, the same as Spark.

sri hari kali charan Tummala

Re: Batch Flink Job S3 write performance vs Spark

Ok, thanks for the clarification.

On Wed, Feb 26, 2020 at 9:22 AM Arvid Heise <[hidden email]> wrote:

Exactly. We use the hadoop-fs as an indirection on top of that, but Spark probably does the same.

Thanks & Regards

Sri Tummala

sri hari kali charan Tummala

Re: Batch Flink Job S3 write performance vs Spark

Sorry for being lazy; I should have gone through the Flink source code.

On Wed, Feb 26, 2020 at 9:35 AM sri hari kali charan Tummala <[hidden email]> wrote:

Ok, thanks for the clarification.

Thanks & Regards

Sri Tummala