Is there a Flink DataSet equivalent to Spark's RDD.persist?


Frank Grimes
Hi,

I'm trying to port an existing Spark job to Flink and have gotten stuck on the same issue brought up here:


Is there some way to accomplish the same thing in Flink?
That is, to avoid recomputing a particular DataSet when multiple different downstream transformations are required on it.
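For context, the Spark pattern I'm porting looks roughly like this (a sketch only; the parse function, field names, and paths are illustrative, not my actual job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("example").getOrCreate()

// Hypothetical record type and parser standing in for the real job's logic.
case class Event(user: String, day: String)
def parse(line: String): Event = { val f = line.split(","); Event(f(0), f(1)) }

// The expensive intermediate result is computed once and cached.
val events = spark.sparkContext
  .textFile("hdfs:///data/events")
  .map(parse)
  .persist(StorageLevel.MEMORY_AND_DISK)

// Both downstream computations reuse the persisted RDD instead of
// re-running the textFile + map lineage from scratch.
val countsByUser = events.map(e => (e.user, 1L)).reduceByKey(_ + _)
val countsByDay  = events.map(e => (e.day, 1L)).reduceByKey(_ + _)
```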

I've even tried explicitly writing out the DataSet to avoid the recomputation, but that still takes an I/O hit for the initial write to HDFS and the subsequent re-reads in the following stages.
While this does yield a performance improvement over no caching at all, it doesn't match the performance I get with RDD.persist in Spark.
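Roughly, the workaround I tried looks like this in the DataSet API (again a sketch; paths and the parsing helpers are illustrative):

```scala
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

val env = ExecutionEnvironment.getExecutionEnvironment

// Hypothetical parser standing in for the real job's transformation.
def parse(line: String): (String, Long) = (line.split(",")(0), 1L)

// Materialize the intermediate result once...
env.readTextFile("hdfs:///data/events")
  .writeAsText("hdfs:///tmp/events-cached", WriteMode.OVERWRITE)
env.execute("materialize intermediate result")

// ...then re-read it for each downstream transformation, paying the
// HDFS write + read round trip instead of recomputing the lineage.
val cached   = env.readTextFile("hdfs:///tmp/events-cached")
val countsA  = cached.map(parse _).groupBy(0).sum(1)
```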

Thanks,

Frank Grimes

Re: Is there a Flink DataSet equivalent to Spark's RDD.persist?

Andrey Zagrebin
Hi Frank,

This feature is currently under discussion. You can follow it in this issue:
https://issues.apache.org/jira/browse/FLINK-11199

Best,
Andrey

On Thu, Feb 21, 2019 at 7:41 PM Frank Grimes <[hidden email]> wrote: