Iterative queries on Flink

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Iterative queries on Flink

Flavio Pompermaier
Hi to all, 
I was wondering if Flink could fit a use case where a user load a dataset in memory and then he/she wants to explore it interactively. Let's say I want to load a csv, then filter out the rows where the column value match some criteria, then apply another criteria after seeing the results of the first filter.
Is there a way to keep the dataset in memory and modify it interactively without re-reading all the dataset every time I want to chain another operation to my dataset?

Best,
Flavio
Reply | Threaded
Open this post in threaded view
|

Re: Iterative queries on Flink

Fabian Hueske-2
Hi Flavio,

Flink does not support caching of data sets in memory yet.

Best, Fabian

2015-11-30 16:45 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Hi to all, 
I was wondering if Flink could fit a use case where a user load a dataset in memory and then he/she wants to explore it interactively. Let's say I want to load a csv, then filter out the rows where the column value match some criteria, then apply another criteria after seeing the results of the first filter.
Is there a way to keep the dataset in memory and modify it interactively without re-reading all the dataset every time I want to chain another operation to my dataset?

Best,
Flavio

Reply | Threaded
Open this post in threaded view
|

Re: Iterative queries on Flink

Flavio Pompermaier
Is there any effort in this direction? maybe I could achieve something like that using Tachyon in some way...?

On Mon, Nov 30, 2015 at 4:52 PM, Fabian Hueske <[hidden email]> wrote:
Hi Flavio,

Flink does not support caching of data sets in memory yet.

Best, Fabian

2015-11-30 16:45 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Hi to all, 
I was wondering if Flink could fit a use case where a user load a dataset in memory and then he/she wants to explore it interactively. Let's say I want to load a csv, then filter out the rows where the column value match some criteria, then apply another criteria after seeing the results of the first filter.
Is there a way to keep the dataset in memory and modify it interactively without re-reading all the dataset every time I want to chain another operation to my dataset?

Best,
Flavio



Reply | Threaded
Open this post in threaded view
|

Re: Iterative queries on Flink

Fabian Hueske-2
The basic building blocks are there but I am not aware of any efforts to implement caching and add it to the API.

2015-11-30 16:55 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Is there any effort in this direction? maybe I could achieve something like that using Tachyon in some way...?

On Mon, Nov 30, 2015 at 4:52 PM, Fabian Hueske <[hidden email]> wrote:
Hi Flavio,

Flink does not support caching of data sets in memory yet.

Best, Fabian

2015-11-30 16:45 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Hi to all, 
I was wondering if Flink could fit a use case where a user load a dataset in memory and then he/she wants to explore it interactively. Let's say I want to load a csv, then filter out the rows where the column value match some criteria, then apply another criteria after seeing the results of the first filter.
Is there a way to keep the dataset in memory and modify it interactively without re-reading all the dataset every time I want to chain another operation to my dataset?

Best,
Flavio




Reply | Threaded
Open this post in threaded view
|

Re: Iterative queries on Flink

Flavio Pompermaier
I think that with some support I could try to implement it...actually I just need to add a persist(StorageLevel.OFF_HEAP) method to the Dataset APIs (similar to what Spark does..) and output it to a tachyon directory configured in the flink-conf.yml and then re-read that dataset using its generated name on tachyon. Do you have other suggestions?

On Mon, Nov 30, 2015 at 4:58 PM, Fabian Hueske <[hidden email]> wrote:
The basic building blocks are there but I am not aware of any efforts to implement caching and add it to the API.

2015-11-30 16:55 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Is there any effort in this direction? maybe I could achieve something like that using Tachyon in some way...?

On Mon, Nov 30, 2015 at 4:52 PM, Fabian Hueske <[hidden email]> wrote:
Hi Flavio,

Flink does not support caching of data sets in memory yet.

Best, Fabian

2015-11-30 16:45 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Hi to all, 
I was wondering if Flink could fit a use case where a user load a dataset in memory and then he/she wants to explore it interactively. Let's say I want to load a csv, then filter out the rows where the column value match some criteria, then apply another criteria after seeing the results of the first filter.
Is there a way to keep the dataset in memory and modify it interactively without re-reading all the dataset every time I want to chain another operation to my dataset?

Best,
Flavio






Reply | Threaded
Open this post in threaded view
|

Re: Iterative queries on Flink

Maximilian Michels
Hi Flavio,

I was working on this some time ago but it didn't make it in yet and
priorities shifted a bit. The pull request is here:
https://github.com/apache/flink/pull/640

The basic idea is to remove Flink's ResultPartition buffers in memory
lazily, i.e. keep them as long as enough memory is available. When a
new job is resumed, it picks up the old results again. The pull
request needs some overhaul now and the API integration is not there
yet.

Cheers,
Max

On Mon, Nov 30, 2015 at 5:35 PM, Flavio Pompermaier
<[hidden email]> wrote:

> I think that with some support I could try to implement it...actually I just
> need to add a persist(StorageLevel.OFF_HEAP) method to the Dataset APIs
> (similar to what Spark does..) and output it to a tachyon directory
> configured in the flink-conf.yml and then re-read that dataset using its
> generated name on tachyon. Do you have other suggestions?
>
>
> On Mon, Nov 30, 2015 at 4:58 PM, Fabian Hueske <[hidden email]> wrote:
>>
>> The basic building blocks are there but I am not aware of any efforts to
>> implement caching and add it to the API.
>>
>> 2015-11-30 16:55 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>>>
>>> Is there any effort in this direction? maybe I could achieve something
>>> like that using Tachyon in some way...?
>>>
>>> On Mon, Nov 30, 2015 at 4:52 PM, Fabian Hueske <[hidden email]> wrote:
>>>>
>>>> Hi Flavio,
>>>>
>>>> Flink does not support caching of data sets in memory yet.
>>>>
>>>> Best, Fabian
>>>>
>>>> 2015-11-30 16:45 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>>>>>
>>>>> Hi to all,
>>>>> I was wondering if Flink could fit a use case where a user load a
>>>>> dataset in memory and then he/she wants to explore it interactively. Let's
>>>>> say I want to load a csv, then filter out the rows where the column value
>>>>> match some criteria, then apply another criteria after seeing the results of
>>>>> the first filter.
>>>>> Is there a way to keep the dataset in memory and modify it
>>>>> interactively without re-reading all the dataset every time I want to chain
>>>>> another operation to my dataset?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>
>>>>
>>>
>>>
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Iterative queries on Flink

Flavio Pompermaier
Do you think it is possible to push ahead this thing? I need to implement this interactive feature of Datasets. Do you think it is possible to implement the persist() method in Flink (similar to Spark)? If you want I can work on it with some instructions..

On Wed, Dec 2, 2015 at 3:05 PM, Maximilian Michels <[hidden email]> wrote:
Hi Flavio,

I was working on this some time ago but it didn't make it in yet and
priorities shifted a bit. The pull request is here:
https://github.com/apache/flink/pull/640

The basic idea is to remove Flink's ResultPartition buffers in memory
lazily, i.e. keep them as long as enough memory is available. When a
new job is resumed, it picks up the old results again. The pull
request needs some overhaul now and the API integration is not there
yet.

Cheers,
Max

On Mon, Nov 30, 2015 at 5:35 PM, Flavio Pompermaier
<[hidden email]> wrote:
> I think that with some support I could try to implement it...actually I just
> need to add a persist(StorageLevel.OFF_HEAP) method to the Dataset APIs
> (similar to what Spark does..) and output it to a tachyon directory
> configured in the flink-conf.yml and then re-read that dataset using its
> generated name on tachyon. Do you have other suggestions?
>
>
> On Mon, Nov 30, 2015 at 4:58 PM, Fabian Hueske <[hidden email]> wrote:
>>
>> The basic building blocks are there but I am not aware of any efforts to
>> implement caching and add it to the API.
>>
>> 2015-11-30 16:55 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>>>
>>> Is there any effort in this direction? maybe I could achieve something
>>> like that using Tachyon in some way...?
>>>
>>> On Mon, Nov 30, 2015 at 4:52 PM, Fabian Hueske <[hidden email]> wrote:
>>>>
>>>> Hi Flavio,
>>>>
>>>> Flink does not support caching of data sets in memory yet.
>>>>
>>>> Best, Fabian
>>>>
>>>> 2015-11-30 16:45 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>>>>>
>>>>> Hi to all,
>>>>> I was wondering if Flink could fit a use case where a user load a
>>>>> dataset in memory and then he/she wants to explore it interactively. Let's
>>>>> say I want to load a csv, then filter out the rows where the column value
>>>>> match some criteria, then apply another criteria after seeing the results of
>>>>> the first filter.
>>>>> Is there a way to keep the dataset in memory and modify it
>>>>> interactively without re-reading all the dataset every time I want to chain
>>>>> another operation to my dataset?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>
>>>>
>>>
>>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: Iterative queries on Flink

Flavio Pompermaier

Any progress in this direction?how mich effort do you think it's required in order to implement this feature?


On 2 Dec 2015 16:29, "Flavio Pompermaier" <[hidden email]> wrote:
Do you think it is possible to push ahead this thing? I need to implement this interactive feature of Datasets. Do you think it is possible to implement the persist() method in Flink (similar to Spark)? If you want I can work on it with some instructions..

On Wed, Dec 2, 2015 at 3:05 PM, Maximilian Michels <[hidden email]> wrote:
Hi Flavio,

I was working on this some time ago but it didn't make it in yet and
priorities shifted a bit. The pull request is here:
https://github.com/apache/flink/pull/640

The basic idea is to remove Flink's ResultPartition buffers in memory
lazily, i.e. keep them as long as enough memory is available. When a
new job is resumed, it picks up the old results again. The pull
request needs some overhaul now and the API integration is not there
yet.

Cheers,
Max

On Mon, Nov 30, 2015 at 5:35 PM, Flavio Pompermaier
<[hidden email]> wrote:
> I think that with some support I could try to implement it...actually I just
> need to add a persist(StorageLevel.OFF_HEAP) method to the Dataset APIs
> (similar to what Spark does..) and output it to a tachyon directory
> configured in the flink-conf.yml and then re-read that dataset using its
> generated name on tachyon. Do you have other suggestions?
>
>
> On Mon, Nov 30, 2015 at 4:58 PM, Fabian Hueske <[hidden email]> wrote:
>>
>> The basic building blocks are there but I am not aware of any efforts to
>> implement caching and add it to the API.
>>
>> 2015-11-30 16:55 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>>>
>>> Is there any effort in this direction? maybe I could achieve something
>>> like that using Tachyon in some way...?
>>>
>>> On Mon, Nov 30, 2015 at 4:52 PM, Fabian Hueske <[hidden email]> wrote:
>>>>
>>>> Hi Flavio,
>>>>
>>>> Flink does not support caching of data sets in memory yet.
>>>>
>>>> Best, Fabian
>>>>
>>>> 2015-11-30 16:45 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>>>>>
>>>>> Hi to all,
>>>>> I was wondering if Flink could fit a use case where a user load a
>>>>> dataset in memory and then he/she wants to explore it interactively. Let's
>>>>> say I want to load a csv, then filter out the rows where the column value
>>>>> match some criteria, then apply another criteria after seeing the results of
>>>>> the first filter.
>>>>> Is there a way to keep the dataset in memory and modify it
>>>>> interactively without re-reading all the dataset every time I want to chain
>>>>> another operation to my dataset?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>
>>>>
>>>
>>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: Iterative queries on Flink

Stephan Ewen
There is still quite a bit needed to do this properly:
  (1) incremental recovery
  (2) network stack caching

(1) will probably happen quite soon, I am not aware of any committer having concrete plans for (2).

Best,
Stephan


On Sat, Oct 8, 2016 at 4:41 PM, Flavio Pompermaier <[hidden email]> wrote:

Any progress in this direction?how mich effort do you think it's required in order to implement this feature?


On 2 Dec 2015 16:29, "Flavio Pompermaier" <[hidden email]> wrote:
Do you think it is possible to push ahead this thing? I need to implement this interactive feature of Datasets. Do you think it is possible to implement the persist() method in Flink (similar to Spark)? If you want I can work on it with some instructions..

On Wed, Dec 2, 2015 at 3:05 PM, Maximilian Michels <[hidden email]> wrote:
Hi Flavio,

I was working on this some time ago but it didn't make it in yet and
priorities shifted a bit. The pull request is here:
https://github.com/apache/flink/pull/640

The basic idea is to remove Flink's ResultPartition buffers in memory
lazily, i.e. keep them as long as enough memory is available. When a
new job is resumed, it picks up the old results again. The pull
request needs some overhaul now and the API integration is not there
yet.

Cheers,
Max

On Mon, Nov 30, 2015 at 5:35 PM, Flavio Pompermaier
<[hidden email]> wrote:
> I think that with some support I could try to implement it...actually I just
> need to add a persist(StorageLevel.OFF_HEAP) method to the Dataset APIs
> (similar to what Spark does..) and output it to a tachyon directory
> configured in the flink-conf.yml and then re-read that dataset using its
> generated name on tachyon. Do you have other suggestions?
>
>
> On Mon, Nov 30, 2015 at 4:58 PM, Fabian Hueske <[hidden email]> wrote:
>>
>> The basic building blocks are there but I am not aware of any efforts to
>> implement caching and add it to the API.
>>
>> 2015-11-30 16:55 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>>>
>>> Is there any effort in this direction? maybe I could achieve something
>>> like that using Tachyon in some way...?
>>>
>>> On Mon, Nov 30, 2015 at 4:52 PM, Fabian Hueske <[hidden email]> wrote:
>>>>
>>>> Hi Flavio,
>>>>
>>>> Flink does not support caching of data sets in memory yet.
>>>>
>>>> Best, Fabian
>>>>
>>>> 2015-11-30 16:45 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>>>>>
>>>>> Hi to all,
>>>>> I was wondering if Flink could fit a use case where a user load a
>>>>> dataset in memory and then he/she wants to explore it interactively. Let's
>>>>> say I want to load a csv, then filter out the rows where the column value
>>>>> match some criteria, then apply another criteria after seeing the results of
>>>>> the first filter.
>>>>> Is there a way to keep the dataset in memory and modify it
>>>>> interactively without re-reading all the dataset every time I want to chain
>>>>> another operation to my dataset?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>
>>>>
>>>
>>>
>>
>
>


Reply | Threaded
Open this post in threaded view
|

Re: Iterative queries on Flink

Flavio Pompermaier
Thaks Stephan for the answer. 
As I told to Fabian we need to apply some transformation to datasets interactively.
For the moment I will use livy + spark[1] but I'll prefer to stick with Flink if possible.
So, if there's any effor in this direction just let me know and I'll be happy to contribute.

Best,
Flavio


On Mon, Oct 10, 2016 at 3:15 PM, Stephan Ewen <[hidden email]> wrote:
There is still quite a bit needed to do this properly:
  (1) incremental recovery
  (2) network stack caching

(1) will probably happen quite soon, I am not aware of any committer having concrete plans for (2).

Best,
Stephan


On Sat, Oct 8, 2016 at 4:41 PM, Flavio Pompermaier <[hidden email]> wrote:

Any progress in this direction?how mich effort do you think it's required in order to implement this feature?


On 2 Dec 2015 16:29, "Flavio Pompermaier" <[hidden email]> wrote:
Do you think it is possible to push ahead this thing? I need to implement this interactive feature of Datasets. Do you think it is possible to implement the persist() method in Flink (similar to Spark)? If you want I can work on it with some instructions..

On Wed, Dec 2, 2015 at 3:05 PM, Maximilian Michels <[hidden email]> wrote:
Hi Flavio,

I was working on this some time ago but it didn't make it in yet and
priorities shifted a bit. The pull request is here:
https://github.com/apache/flink/pull/640

The basic idea is to remove Flink's ResultPartition buffers in memory
lazily, i.e. keep them as long as enough memory is available. When a
new job is resumed, it picks up the old results again. The pull
request needs some overhaul now and the API integration is not there
yet.

Cheers,
Max

On Mon, Nov 30, 2015 at 5:35 PM, Flavio Pompermaier
<[hidden email]> wrote:
> I think that with some support I could try to implement it...actually I just
> need to add a persist(StorageLevel.OFF_HEAP) method to the Dataset APIs
> (similar to what Spark does..) and output it to a tachyon directory
> configured in the flink-conf.yml and then re-read that dataset using its
> generated name on tachyon. Do you have other suggestions?
>
>
> On Mon, Nov 30, 2015 at 4:58 PM, Fabian Hueske <[hidden email]> wrote:
>>
>> The basic building blocks are there but I am not aware of any efforts to
>> implement caching and add it to the API.
>>
>> 2015-11-30 16:55 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>>>
>>> Is there any effort in this direction? maybe I could achieve something
>>> like that using Tachyon in some way...?
>>>
>>> On Mon, Nov 30, 2015 at 4:52 PM, Fabian Hueske <[hidden email]> wrote:
>>>>
>>>> Hi Flavio,
>>>>
>>>> Flink does not support caching of data sets in memory yet.
>>>>
>>>> Best, Fabian
>>>>
>>>> 2015-11-30 16:45 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>>>>>
>>>>> Hi to all,
>>>>> I was wondering if Flink could fit a use case where a user load a
>>>>> dataset in memory and then he/she wants to explore it interactively. Let's
>>>>> say I want to load a csv, then filter out the rows where the column value
>>>>> match some criteria, then apply another criteria after seeing the results of
>>>>> the first filter.
>>>>> Is there a way to keep the dataset in memory and modify it
>>>>> interactively without re-reading all the dataset every time I want to chain
>>>>> another operation to my dataset?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>
>>>>
>>>
>>>
>>
>
>