Possibility to get the line numbers?

Анастасія Баша
Is there a way to get the current line number (or, more generally, the index of the element currently being processed) inside a mapper?
An example is a matrix that you read line by line from a file, where you need both the row and the column numbers. The column number is easy to get, but how do you find out the row number?
 
Thanks a lot in advance, 
Anastasiia

Re: Possibility to get the line numbers?

Fabian Hueske-2
Hi Anastasiia,

This is difficult because the input is usually read in parallel, i.e., an input file is split into several blocks which are read and processed independently by different threads (possibly on different machines). So it is difficult to obtain a sequential row number.

If all rows have the same length (number of bytes), you could compute the row number from the byte offset. Otherwise, the only way to get row numbers is to read the input sequentially.
Flink does not provide an InputFormat for this, so you would need to implement a custom InputFormat.
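
A minimal sketch of the byte-offset idea, in plain Python rather than Flink code. The function `row_number`, the sample matrix, and the split boundaries are all made up for illustration. It assumes every row, including its line terminator, occupies exactly the same number of bytes.

```python
def row_number(byte_offset: int, row_length: int) -> int:
    """Return the 0-based row index of the row starting at byte_offset.

    row_length is the fixed size of every row in bytes, including the
    line terminator. The result is only meaningful if all rows really
    have this size.
    """
    if row_length <= 0:
        raise ValueError("row_length must be positive")
    if byte_offset % row_length != 0:
        raise ValueError("offset does not fall on a row boundary")
    return byte_offset // row_length


# Hypothetical 3x3 matrix; every row is exactly 6 bytes ("a b c\n").
data = b"1 2 3\n4 5 6\n7 8 9\n"
ROW_LENGTH = 6

# Pretend the file was cut into two splits at byte 12. Each split is
# processed on its own and only knows its own starting offset.
splits = [(0, data[:12]), (12, data[12:])]

numbered = []
for split_start, chunk in splits:
    for i in range(0, len(chunk), ROW_LENGTH):
        row = chunk[i:i + ROW_LENGTH].decode().split()
        numbered.append((row_number(split_start + i, ROW_LENGTH), row))
```

Because each row number depends only on the absolute offset, the splits can be processed in any order or in parallel and still yield consistent numbers.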

You can also keep track of the number of elements processed so far inside a Mapper, but that count is local to each parallel instance, so this is probably not what you are looking for.
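
To illustrate why a counter inside a mapper does not give global row numbers, here is a small plain-Python simulation (not Flink code). The class `CountingMapper` and the way lines are divided between two instances are assumptions made for the example.

```python
class CountingMapper:
    """Stand-in for one parallel mapper instance with its own counter."""

    def __init__(self):
        self.count = 0

    def map(self, line):
        # Attach the number of elements this instance has seen so far.
        index = self.count
        self.count += 1
        return (index, line)


lines = ["row0", "row1", "row2", "row3"]

# Simulate parallelism 2: each instance receives half of the input.
partitions = [lines[:2], lines[2:]]
instances = [CountingMapper(), CountingMapper()]

result = [
    [mapper.map(line) for line in part]
    for mapper, part in zip(instances, partitions)
]
```

Both instances start counting at 0, so "row2" is numbered 0 instead of 2. The counter is only a true row number when the whole input goes through a single instance.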

Best,
Fabian



Re: Possibility to get the line numbers?

Fridtjof Sander
I had that problem/question some time ago, too.

The quick fix is to just put the line number in the line itself. Go for it.
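
A minimal sketch of that quick fix as a sequential preprocessing step, in plain Python. The function name `add_line_numbers` and the comma separator are arbitrary choices for illustration; in-memory streams stand in for the real files.

```python
import io


def add_line_numbers(src, dst, sep=","):
    """Copy src to dst, prefixing every line with its 0-based number."""
    for number, line in enumerate(src):
        dst.write(f"{number}{sep}{line}")


# In-memory stand-ins for the input and output files.
src = io.StringIO("1 2 3\n4 5 6\n")
dst = io.StringIO()
add_line_numbers(src, dst)
numbered_text = dst.getvalue()
```

After this step each line carries its own row number, so a parallel job can parse it from the record no matter which split the line ends up in.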

However, we worked out a solution for another distributed processing system that did the following:
Read each partition, count the lines, broadcast a map "partition->lineCount", then re-read the data and attach the line numbers.
This is basically how a distributed zipWithIndex works, which is available in Flink too.
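
The two-pass scheme can be sketched in plain Python as follows. This simulates the idea on in-memory lists and is not Flink's actual implementation; the function `zip_with_index` and the sample partitions are made up for the example.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Attach a global 0-based index to every element of every partition."""
    # Pass 1: count the elements of each partition independently.
    counts = [len(part) for part in partitions]

    # "Broadcast" step: the starting index of each partition is the
    # sum of the counts of all partitions before it.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Pass 2: re-read each partition and add its offset to the local index.
    return [
        [(offset + i, item) for i, item in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
indexed = zip_with_index(partitions)
```

The sketch takes for granted that the second pass sees the same partitions, in the same order, as the first pass, which is what the simulation guarantees by construction.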

But:

That only works if both mapPartition passes read the data in the same order and see the same partition boundaries.
I don't know whether you can get that guarantee in Flink without a range partitioning and a sortPartition on the byte offset.
Doing that would work (I think), but it would add significant overhead that can be avoided completely by putting the line numbers into the lines in the first place.
I think it's just not worth it.
