Hi,
I'm trying to port an existing Spark job to Flink and have gotten stuck on the same issue brought up here:
Is there some way to accomplish the same thing in Flink, i.e. to avoid re-computing a particular DataSet when multiple different downstream transformations are required on it?
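
To make it concrete, here's roughly the shape of the job I mean (names, paths, and the parsing logic are just placeholders, not my actual code). Since collect() and count() each trigger their own execution, the whole pipeline gets re-run from the source once per action:

  import org.apache.flink.api.scala._

  object BranchingJob {
    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment

      // Stand-in for the real (expensive) pipeline.
      val expensive: DataSet[(String, Int)] =
        env.readTextFile("hdfs:///input/events")
          .map { line => val f = line.split(','); (f(0), f(1).toInt) }

      // Each of these actions triggers a separate job, so the pipeline
      // above is executed from the source once per action.
      val total = expensive.map(_._2).reduce(_ + _).collect().head
      val users = expensive.map(_._1).distinct().count()
      println(s"total=$total, distinct users=$users")
    }
  }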
I've even tried explicitly writing out the DataSet to avoid the re-computation, but that still incurs an I/O hit for the initial write to HDFS and for re-reading it in the following stages.
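
The workaround looks something like this (again a sketch with placeholder paths and types): materialize the intermediate result once, then have each downstream stage read the materialized copy instead of re-running the pipeline.

  import org.apache.flink.api.scala._

  object MaterializeWorkaround {
    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment

      // The expensive pipeline, written out once to HDFS.
      val expensive: DataSet[(String, Int)] =
        env.readTextFile("hdfs:///input/events")
          .map { line => val f = line.split(','); (f(0), f(1).toInt) }

      expensive.writeAsCsv("hdfs:///tmp/expensive-cache")
      env.execute("materialize intermediate result")

      // Downstream stages read the materialized copy rather than
      // re-computing it, at the cost of the extra write/read I/O.
      val cached = env.readCsvFile[(String, Int)]("hdfs:///tmp/expensive-cache")
      val total  = cached.map(_._2).reduce(_ + _).collect().head
      val users  = cached.map(_._1).distinct().count()
      println(s"total=$total, distinct users=$users")
    }
  }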
While it does yield a performance improvement over no caching at all, it doesn't match the performance I get with RDD.persist in Spark.
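
For reference, the Spark version is essentially the following (same placeholder pipeline): persist() keeps the computed partitions around, so both actions reuse them instead of re-running the lineage.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  object PersistJob {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("persist-example"))

      // Same pipeline, persisted after the first computation.
      val expensive = sc.textFile("hdfs:///input/events")
        .map { line => val f = line.split(','); (f(0), f(1).toInt) }
        .persist(StorageLevel.MEMORY_AND_DISK)

      // Both actions reuse the cached partitions instead of
      // re-reading and re-parsing the input.
      val total = expensive.map(_._2).reduce(_ + _)
      val users = expensive.map(_._1).distinct().count()
      println(s"total=$total, distinct users=$users")
      sc.stop()
    }
  }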
Thanks,
Frank Grimes