Is there a way to get the current line number (or, more generally, the index of the element currently being processed) inside a mapper?
The example is a matrix that you read line by line from a file, where you need both the row and the column numbers. The column number is easy to get, but how do you find out the row number? Thanks a lot in advance,
Anastasiia |
Hi Anastasiia, this is difficult because the input is usually read in parallel, i.e., an input file is split into several blocks which are read and processed independently by different threads (possibly on different machines). No single reader sees the whole file, so there is no sequential row number to hand out. If all rows have the same length (number of bytes), you could compute the row number from the byte offset. If that is not the case, you can only get row numbers by reading the input sequentially. Flink does not provide InputFormats for this, so you would need to implement a custom InputFormat. You can also keep track of the number of elements that you have processed in a Mapper, but that count is local to each parallel instance, so it is probably not what you are looking for.
2016-02-04 0:37 GMT+01:00 Анастасія Баша <[hidden email]>:
|
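The fixed-row-length idea from the reply above can be illustrated with a minimal sketch. It is plain Python, not Flink code, and it assumes every row (including its newline) occupies exactly `row_length` bytes and that each split starts on a row boundary. The names `row_number_from_offset` and `read_split` are hypothetical helpers made up for this illustration.

```python
def row_number_from_offset(byte_offset: int, row_length: int) -> int:
    """Return the 0-based row number of the row starting at byte_offset.

    Assumes every row, including its line terminator, is exactly
    row_length bytes long.
    """
    if row_length <= 0:
        raise ValueError("row_length must be positive")
    if byte_offset < 0 or byte_offset % row_length != 0:
        raise ValueError("byte_offset is not the start of a row")
    return byte_offset // row_length


def read_split(data: bytes, start: int, end: int, row_length: int):
    """Yield (row_number, row_text) for the rows stored in data[start:end].

    Each split can be processed independently, because the row number
    depends only on the byte offset and not on the other splits.
    """
    for offset in range(start, end, row_length):
        row = data[offset:offset + row_length]
        yield row_number_from_offset(offset, row_length), row.decode("ascii").rstrip("\n")


# Hypothetical 3x3 matrix; every row is 6 bytes including the newline.
matrix_bytes = b"1 2 3\n4 5 6\n7 8 9\n"
second_split = list(read_split(matrix_bytes, 6, 18, 6))
```

Here the second split covers bytes 6 to 18 and can label its rows as 1 and 2 without knowing anything about the first split.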
I had that problem/question some time ago, too.
The quick fix is to just put the line number in the line itself. Go for it. However, we worked out a solution for another distributed processing system that did the following: read each partition, count the lines, broadcast a map "partition->lineCount", re-read the data and attach the line numbers. This is basically how distributed zipWithIndex works, which is available in Flink too. But: that only works if both mapPartitions passes read the data in the same order and if the partitions have the same boundaries in both passes. I don't know if you can get that guarantee in Flink without a range-partition and sortPartition on the byte offset. Doing that would work (I think), but it would add significant overhead that can be avoided completely by adding the line numbers to the lines in the first place. I think it's just not worth it.
On 4 February 2016 at 00:56:43 CET, Fabian Hueske [hidden email] wrote:
|
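The two-pass scheme described in the reply above (count lines per partition, share the counts, then attach line numbers) can be sketched as follows. This is a single-process Python simulation, not Flink's implementation; it assumes the partitions are listed in file order and are re-read with identical boundaries and ordering, which is exactly the guarantee the reply says is hard to get. The function names are hypothetical.

```python
def count_per_partition(partitions):
    """First pass: each partition reports how many lines it holds."""
    return [len(part) for part in partitions]


def start_offsets(counts):
    """Turn per-partition counts into the global index of each partition's
    first line (an exclusive prefix sum). This plays the role of the
    broadcast "partition->lineCount" map."""
    offsets = []
    total = 0
    for count in counts:
        offsets.append(total)
        total += count
    return offsets


def zip_with_index(partitions):
    """Second pass: each partition attaches global line numbers using only
    its own start offset and a local counter."""
    offsets = start_offsets(count_per_partition(partitions))
    return [
        [(offsets[i] + j, line) for j, line in enumerate(part)]
        for i, part in enumerate(partitions)
    ]


# Hypothetical input: four partitions of a file, one of them empty.
parts = [["a", "b"], ["c"], [], ["d", "e"]]
indexed = zip_with_index(parts)
```

If the second pass saw different partition boundaries or a different order than the first, the offsets would no longer match the lines, which is why writing the line number into each line up front is the simpler option.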