Hi,
It might be a naive question but I was concerned as I am trying to read from a file. My question is if I have a file with n lines and i want m lines out of that where n << m, would the first operator process only the first m lines or would it go through the entire file? If it does go through the entire file, is there a better way to just get the top m lines using readCsvFile function? Thanks & Regards Biplob Biswas |
Hey Biplob,
Yes, the file source will read all input. The first operator will add a combiner to the source for pre-aggregation and then shuffle everything to a single reduce instance, which emits the N first elements. Keep in mind that there is no strict order in which the records will be emitted. If you need to optimize this you could write a custom File/TextInputFormat, which discards the lines at the sources. You can have a look at these classes and then get back with questions on the mailing list. – Ufuk On Sat, Apr 23, 2016 at 6:37 PM, Biplob Biswas <[hidden email]> wrote: > Hi, > > It might be a naive question but I was concerned as I am trying to read from > a file. > My question is if I have a file with n lines and i want m lines out of that > where n << m, would the first operator process only the first m lines or > would it go through the entire file? > > If it does go through the entire file, is there a better way to just get the > top m lines using readCsvFile function? > > Thanks & Regards > Biplob Biswas |
Hi Biplop, you can also implement a generic IF that wraps another IF (such as a CsvInputFormat). The wrapping IF forwards all calls to the wrapped IF and in addition counts how many records were emitted (how often InputFormat.nextRecord() was called). Cheers, Fabian 2016-04-25 11:06 GMT+02:00 Ufuk Celebi <[hidden email]>: Hey Biplob, |
In reply to this post by Ufuk Celebi
Thanks, I was looking into the Textinputformat you suggested, and would get back to it once I start working with huge files. I would assume there's no workaround or additonal parameters to the readscvfile function so as to restrict the number of lines read in one go as reading a big file would be a big problem in terms of memory.
|
Actually, memory should not be a problem since the full data set would not be materialized in memory. Flink has a streaming runtime so most of the data would be immediately filtered out. 2016-04-26 17:09 GMT+02:00 Biplob Biswas <[hidden email]>: Thanks, I was looking into the Textinputformat you suggested, and would get |
Free forum by Nabble | Edit this page |