Writing Intermediates to disk


Paschek, Robert

Hi Mailing List,

 

I want to write and read intermediates to/from disk.

The following pseudocode snippet may illustrate my intention:

 

public void mapPartition(Iterable<T> tuples, Collector<T> out) {
    for (T tuple : tuples) {
        if (condition)
            out.collect(tuple);
        else
            writeTupleToDisk(tuple);
    }

    while (tupleOnDisk())
        out.collect(readNextTupleFromDisk());
}

I'm wondering if Flink provides an integrated class for this purpose. I also have to precisely identify the files containing the intermediates, due to the parallelism of mapPartition.
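Outside of Flink, the spill-then-replay pattern sketched above can be written in plain Java. This is a minimal sketch, not a Flink facility: the tuples are plain Strings, the spill file is a temp file written with java.nio, and the condition is an ordinary Predicate.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

public class SpillExample {
    // Emit tuples matching the condition immediately; spill the rest
    // to a temp file, then replay the spilled tuples at the end.
    static void mapPartition(Iterable<String> tuples,
                             Predicate<String> condition,
                             Consumer<String> out) throws IOException {
        Path spill = Files.createTempFile("spill-", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(spill)) {
            for (String tuple : tuples) {
                if (condition.test(tuple)) {
                    out.accept(tuple);       // emit directly
                } else {
                    w.write(tuple);          // spill to disk
                    w.newLine();
                }
            }
        }
        try (BufferedReader r = Files.newBufferedReader(spill)) {
            String line;
            while ((line = r.readLine()) != null) {
                out.accept(line);            // replay spilled tuples
            }
        } finally {
            Files.deleteIfExists(spill);
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> result = new ArrayList<>();
        mapPartition(Arrays.asList("a", "bb", "c", "dd"),
                     s -> s.length() == 1, result::add);
        System.out.println(result);  // short tuples first, spilled ones after
    }
}
```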

 

 

Thank you in advance!

Robert


Re: Writing Intermediates to disk

Ufuk Celebi
Flink has support for spillable intermediate results. Currently they
are only used if necessary to avoid pipeline deadlocks.

You can force this via

env.getConfig().setExecutionMode(ExecutionMode.BATCH);

This will write shuffles to disk, but you don't get the fine-grained
control you probably need for your use case.

– Ufuk

On Thu, May 5, 2016 at 3:29 PM, Paschek, Robert
<[hidden email]> wrote:


Re: Writing Intermediates to disk

Vikram Saxena
I am not sure I understand completely, but I would create a new DataSet by filtering on the condition and then persist that DataSet.

So:

DataSet<T> ds2 = dataSet1.filter(condition);
ds2.output(...);




On Mon, May 9, 2016 at 11:09 AM, Ufuk Celebi <[hidden email]> wrote:


Re: Writing Intermediates to disk

Paschek, Robert

Hey,

 

thank you for your answers, and sorry for my late response :-(

 

The intention was to store some of the data on disk when main memory gets full, i.e. when my temporary ArrayList<Tuple> reaches a pre-defined size.

 

I used com.opencsv.CSVReader and com.opencsv.CSVWriter for this task, and getRuntimeContext().getIndexOfThisSubtask() to distinguish my filenames from those of other tasks running on the same machine.

Fortunately, that is no longer necessary for my work.
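For reference, the per-subtask file naming can be sketched like this. A minimal sketch under stated assumptions: the subtask index is a plain parameter standing in for Flink's getRuntimeContext().getIndexOfThisSubtask(), the "spill-%d.csv" pattern is made up for illustration, and plain java.nio is used in place of opencsv.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SubtaskFiles {
    // Build a spill-file path that is unique per parallel subtask,
    // so concurrent tasks on the same machine never collide.
    static Path spillFileFor(Path dir, int subtaskIndex) {
        return dir.resolve(String.format("spill-%d.csv", subtaskIndex));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("spills");
        // Two "subtasks" write to distinct files.
        Files.write(spillFileFor(dir, 0), List.of("a,1"));
        Files.write(spillFileFor(dir, 1), List.of("b,2"));
        System.out.println(Files.readAllLines(spillFileFor(dir, 0)));
        System.out.println(Files.readAllLines(spillFileFor(dir, 1)));
    }
}
```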

 

Best

Robert





From: Vikram Saxena <[hidden email]>
Sent: Monday, May 9, 2016 12:15
To: [hidden email]
Subject: Re: Writing Intermediates to disk
 