Dataset read csv file problem

ebru
Hello all,

We are trying to read CSV files in which some fields contain the \n character, and \n is also the line delimiter. We used the parseQuotedStrings('\"') method, but it only ignores field delimiters inside quoted strings, so we couldn't parse the fields that contain \n. How can we solve this problem?

-Ebru

Re: Dataset read csv file problem

Fabian Hueske-2
Hi Ebru,

This case is not supported by Flink's CsvInputFormat. The problem is that such a file cannot be read in parallel, because it is not possible to identify record boundaries if you start reading in the middle of the file.
We have a new CsvInputFormat under development that follows the RFC 4180 standard and will have a parameter to support row delimiters that are enclosed in a quoted String field.
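
To illustrate the record-boundary problem described above, here is a minimal Java sketch. It is not Flink code: the class and method names are hypothetical, and it assumes a reader that starts at an arbitrary offset and resynchronizes at the next \n, compared against a full scan from the start of the file that tracks quotes.

```java
import java.util.ArrayList;
import java.util.List;

class SplitBoundaryDemo {

    // Naive resynchronization: from an arbitrary offset, skip to just past
    // the next '\n' and assume a new record starts there.
    public static int nextRecordStartNaive(String data, int offset) {
        int nl = data.indexOf('\n', offset);
        return nl < 0 ? data.length() : nl + 1;
    }

    // Reference answer: scan from the very beginning and only treat '\n'
    // as a record delimiter when we are outside a quoted field.
    public static List<Integer> trueRecordStarts(String data) {
        List<Integer> starts = new ArrayList<>();
        boolean inQuotes = false;
        if (!data.isEmpty()) {
            starts.add(0);
        }
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // an escaped "" toggles twice, so state is unchanged
            } else if (c == '\n' && !inQuotes && i + 1 < data.length()) {
                starts.add(i + 1);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Two records; the first has a quoted field containing '\n'.
        String data = "1,\"first line\nsecond line\"\n2,\"plain\"\n";
        int guess = nextRecordStartNaive(data, 5); // offset 5 lies inside the quoted field
        System.out.println("naive start: " + guess);
        System.out.println("true starts: " + trueRecordStarts(data));
        System.out.println("naive start is a real boundary: "
                + trueRecordStarts(data).contains(guess));
    }
}
```

With this sample the naive reader resumes at offset 14, which is in the middle of the quoted field, while the real record starts are 0 and 27. Knowing whether a \n is inside quotes requires the quote state from everything before it, which a reader starting mid-file does not have.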

Until that is available, the only solution is to implement a custom InputFormat.
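
As a rough sketch of the parsing logic such a custom InputFormat would need, the following plain Java class splits CSV text into records while treating \n inside double quotes as field content. The class name QuotedCsvRecords is hypothetical, the Flink wiring is left out, and it assumes ',' as the field delimiter, '\n' as the record delimiter, and RFC 4180-style "" as an escaped quote.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedCsvRecords {

    // Parses the whole input sequentially; each record is a list of field values.
    public static List<List<String>> parse(String data) {
        List<List<String>> records = new ArrayList<>();
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < data.length(); i++) {
            char ch = data.charAt(i);
            if (inQuotes) {
                if (ch != '"') {
                    field.append(ch);      // ',' and '\n' are literal text inside quotes
                } else if (i + 1 < data.length() && data.charAt(i + 1) == '"') {
                    field.append('"');     // "" inside quotes is an escaped quote
                    i++;
                } else {
                    inQuotes = false;      // closing quote
                }
            } else if (ch == '"') {
                inQuotes = true;           // opening quote
            } else if (ch == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else if (ch == '\n') {
                fields.add(field.toString());
                field.setLength(0);
                records.add(fields);       // unquoted '\n' ends the record
                fields = new ArrayList<>();
            } else {
                field.append(ch);
            }
        }
        // Flush a last record that is not terminated by '\n'.
        if (field.length() > 0 || !fields.isEmpty()) {
            fields.add(field.toString());
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "1,\"first line\nsecond line\"\n2,\"say \"\"hi\"\"\"\n";
        for (List<String> record : parse(data)) {
            System.out.println(record.size() + " fields: " + record);
        }
    }
}
```

Because the quote state has to be carried from the start of the input, this logic only works if each file is read sequentially by a single reader. Inside a custom InputFormat that would mean treating each file as one unsplit unit, which gives up parallelism within a file but still allows different files to be read in parallel.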

Best, Fabian

In reply to the message from ebru <[hidden email]> of 2017-11-24 11:40 GMT+01:00.

Re: Dataset read csv file problem

ebru
Thank you Fabian, we’ve implemented a custom CsvInputFormat.


In reply to the message from Fabian Hueske <[hidden email]> of 24 Nov 2017, 15:35.