Flink first() operator

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink first() operator

Biplob Biswas
Hi,

It might be a naive question but I was concerned as I am trying to read from a file.
My question is if I have a file with n lines and i want m lines out of that where n << m, would the first operator process only the first m lines or would it go through the entire file? 

If it does go through the entire file, is there a better way to just get the top m lines using readCsvFile function?

Thanks & Regards
Biplob Biswas
Reply | Threaded
Open this post in threaded view
|

Re: Flink first() operator

Ufuk Celebi
Hey Biplob,

Yes, the file source will read all input. The first operator will add
a combiner to the source for pre-aggregation and then shuffle
everything to a single reduce instance, which emits the N first
elements. Keep in mind that there is no strict order in which the
records will be emitted.

If you need to optimize this you could write a custom
File/TextInputFormat, which discards the lines at the sources. You can
have a look at these classes and then get back with questions on the
mailing list.

– Ufuk

On Sat, Apr 23, 2016 at 6:37 PM, Biplob Biswas <[hidden email]> wrote:

> Hi,
>
> It might be a naive question but I was concerned as I am trying to read from
> a file.
> My question is if I have a file with n lines and i want m lines out of that
> where n << m, would the first operator process only the first m lines or
> would it go through the entire file?
>
> If it does go through the entire file, is there a better way to just get the
> top m lines using readCsvFile function?
>
> Thanks & Regards
> Biplob Biswas
Reply | Threaded
Open this post in threaded view
|

Re: Flink first() operator

Fabian Hueske-2
Hi Biplop,

you can also implement a generic IF that wraps another IF (such as a CsvInputFormat).
The wrapping IF forwards all calls to the wrapped IF and in addition counts how many records were emitted (how often InputFormat.nextRecord() was called).
Once the count arrives at the threshold, it returns true for InputFormat.reachedEnd().

Cheers, Fabian

2016-04-25 11:06 GMT+02:00 Ufuk Celebi <[hidden email]>:
Hey Biplob,

Yes, the file source will read all input. The first operator will add
a combiner to the source for pre-aggregation and then shuffle
everything to a single reduce instance, which emits the N first
elements. Keep in mind that there is no strict order in which the
records will be emitted.

If you need to optimize this you could write a custom
File/TextInputFormat, which discards the lines at the sources. You can
have a look at these classes and then get back with questions on the
mailing list.

– Ufuk

On Sat, Apr 23, 2016 at 6:37 PM, Biplob Biswas <[hidden email]> wrote:
> Hi,
>
> It might be a naive question but I was concerned as I am trying to read from
> a file.
> My question is if I have a file with n lines and i want m lines out of that
> where n << m, would the first operator process only the first m lines or
> would it go through the entire file?
>
> If it does go through the entire file, is there a better way to just get the
> top m lines using readCsvFile function?
>
> Thanks & Regards
> Biplob Biswas

Reply | Threaded
Open this post in threaded view
|

Re: Flink first() operator

Biplob Biswas
In reply to this post by Ufuk Celebi
Thanks, I was looking into the Textinputformat you suggested, and would get back to it once I start working with huge files. I would assume there's no workaround or additonal parameters to the readscvfile function so as to restrict the number of lines read in one go as reading a big file would be a big problem in terms of memory.
Reply | Threaded
Open this post in threaded view
|

Re: Flink first() operator

Fabian Hueske-2
Actually, memory should not be a problem since the full data set would not be materialized in memory.
Flink has a streaming runtime so most of the data would be immediately filtered out.
However, reading the whole file causes of course a lot of unnecessary IO.

2016-04-26 17:09 GMT+02:00 Biplob Biswas <[hidden email]>:
Thanks, I was looking into the Textinputformat you suggested, and would get
back to it once I start working with huge files. I would assume there's no
workaround or additonal parameters to the readscvfile function so as to
restrict the number of lines read in one go as reading a big file would be a
big problem in terms of memory.



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-first-operator-tp6377p6451.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.