Quotes in fields of CsvInputFormat


Malte Schwarzer
Hi,

I'm trying to import a CSV file, but the parser seems to have problems with quotes at the beginning of a field. Is there a way to set or disable enclosures for the CSV input?

This is my code:

DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
                .fieldDelimiter('|')
                .types(String.class, String.class);

CSV:

A|ggg
B|"hhh" xx
C|xxx

As a result, I'm receiving a ParseException for line B:

org.apache.flink.api.common.io.ParseException: Line could not be parsed: 'B|"hhh" xx


Thanks,
Malte

Re: Quotes in fields of CsvInputFormat

Stephan Ewen
Hi!

The parser interprets the quotes as quotes for the field. That means the second field (the string) stops after the "hhh" and the xx is considered invalid trailing data.
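
The rule can be illustrated with a small, self-contained sketch. This is a simplified model of the behaviour described here, not Flink's actual parser; the class QuotedFieldSketch and the method parseField are hypothetical names, and the sketch assumes quoting is triggered by a leading double quote.

```java
// Simplified model of the described behaviour; NOT Flink's actual parser.
// QuotedFieldSketch and parseField are hypothetical names used for illustration.
class QuotedFieldSketch {

    /** Parses a single field; a leading double quote starts a quoted string. */
    static String parseField(String field) {
        String trimmed = field.trim();
        if (!trimmed.startsWith("\"")) {
            return field; // unquoted field: taken literally
        }
        int closing = trimmed.indexOf('"', 1);
        if (closing < 0) {
            throw new IllegalArgumentException("Unterminated quoted string: " + field);
        }
        if (closing != trimmed.length() - 1) {
            // Anything after the closing quote counts as invalid trailing data.
            throw new IllegalArgumentException("Invalid trailing data: " + field);
        }
        return trimmed.substring(1, closing);
    }

    public static void main(String[] args) {
        System.out.println(parseField("ggg"));      // prints ggg
        System.out.println(parseField("\"hhh\""));  // prints hhh
        try {
            parseField("\"hhh\" xx");               // the second field of line B
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model the second field of line B is rejected because the xx follows the closing quote.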

What do you expect as the result of parsing that line?

Stephan


In reply to Malte Schwarzer's message of Fri, Dec 5, 2014 at 4:16 PM.


Re: Quotes in fields of CsvInputFormat

Malte Schwarzer
Hi Stephan,

The result should be >"hhh" xx< as the field value. Enclosures should be disabled, but there seems to be no method to do that.
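
One possible workaround, sketched here only as an assumption and not as an official recommendation, is to read each line as plain text and split it at the delimiter by hand, so that quotes are kept literally. The class LiteralSplitSketch and the method splitLine are hypothetical; in a Flink job, logic like this might run in a map function over the lines returned by env.readTextFile(...).

```java
// Hypothetical helper for illustration: splits a two-column line at the first '|'
// and keeps both parts verbatim, including any double quotes.
class LiteralSplitSketch {

    static String[] splitLine(String line) {
        int pos = line.indexOf('|');
        if (pos < 0) {
            throw new IllegalArgumentException("No field delimiter in line: " + line);
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] fields = splitLine("B|\"hhh\" xx");
        System.out.println(fields[0]); // prints B
        System.out.println(fields[1]); // prints "hhh" xx
    }
}
```

The sketch only handles the two-column layout of the example file and does no quote handling at all, which is exactly the behaviour requested here.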


Malte

In reply to Stephan Ewen's message of Friday, December 5, 2014, 16:28.


Re: Quotes in fields of CsvInputFormat

Max Michels
Hi Malte,

Typically, double quotes are used to identify strings and thus are not interpreted literally. Any data in a field after a double-quoted string is regarded as invalid trailing data.

You could replace double quotes with single quotes:

A|ggg
B|'hhh' xx
C|xxx

This results in the expected >'hhh' xx< for the second line.
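
If the file cannot be edited by hand, the same replacement could be applied as a preprocessing step. The following is a minimal sketch; the class QuoteReplacementSketch and the method replaceDoubleQuotes are hypothetical names, and the approach assumes that changing the quote characters in the data is acceptable.

```java
// Hypothetical preprocessing helper: swaps every double quote for a single quote
// so that a line no longer triggers quoted-string parsing.
class QuoteReplacementSketch {

    static String replaceDoubleQuotes(String line) {
        return line.replace('"', '\'');
    }

    public static void main(String[] args) {
        System.out.println(replaceDoubleQuotes("B|\"hhh\" xx")); // prints B|'hhh' xx
    }
}
```

Note that this alters the field content: the value read is 'hhh' xx rather than "hhh" xx.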

Best regards,
Max

In reply to Malte Schwarzer's message of Fri, Dec 5, 2014 at 4:44 PM.


Re: Quotes in fields of CsvInputFormat

Fabian Hueske
With the current implementation, quoted string parsing kicks in if the first non-whitespace character of a field is a double quote (just as in Malte's case). I think this behaviour can be quite unexpected for users.
Wouldn't it be better to make the behaviour of the String parsing more explicit, i.e., to add a switch to enable or disable quoted string parsing? With the current implementation, the configuration would affect all String fields in a file, though...

Cheers, Fabian

In reply to Max Michels's message of 2014-12-09 12:17 GMT+01:00.


Re: Quotes in fields of CsvInputFormat

Max Michels
That sounds like a good idea. Just like setDelimiter("|"), one should be able to call setParseDoubleQuotes(false) to disable the special handling of double quotes.

You're right, Fabian, the current implementation treats all String fields alike. Maybe we can expect the user to provide a consistently formatted input file (i.e. with or without the use of double quotes as identifiers)?

In reply to Fabian Hueske's message of Tue, Dec 9, 2014 at 2:32 PM.


Re: Quotes in fields of CsvInputFormat

Fabian Hueske
I think that's a fair assumption to make.

I'll open a JIRA issue for making quoted string parsing optional and the quote character configurable.
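
A rough sketch of what such a configuration could look like is given below. It is an illustration only, not the actual Flink implementation or API: the class ConfigurableStringParserSketch and the methods enableQuotedStringParsing and quoteCharacter are hypothetical names, and the setting applies to all String fields alike, as discussed above.

```java
// Illustration only; NOT Flink's API. All names here are hypothetical.
class ConfigurableStringParserSketch {

    private boolean quotedStringParsing = true; // hypothetical switch, on by default
    private char quoteCharacter = '"';          // hypothetical configurable quote character

    ConfigurableStringParserSketch enableQuotedStringParsing(boolean enabled) {
        this.quotedStringParsing = enabled;
        return this;
    }

    ConfigurableStringParserSketch quoteCharacter(char quote) {
        this.quoteCharacter = quote;
        return this;
    }

    /** Parses one field; quoting applies only if enabled and the field starts with the quote character. */
    String parseField(String field) {
        String trimmed = field.trim();
        if (!quotedStringParsing || trimmed.isEmpty() || trimmed.charAt(0) != quoteCharacter) {
            return field; // taken literally
        }
        int closing = trimmed.indexOf(quoteCharacter, 1);
        if (closing != trimmed.length() - 1) {
            // Covers both a missing closing quote and trailing data after it.
            throw new IllegalArgumentException("Field could not be parsed: " + field);
        }
        return trimmed.substring(1, closing);
    }

    public static void main(String[] args) {
        ConfigurableStringParserSketch literal =
                new ConfigurableStringParserSketch().enableQuotedStringParsing(false);
        System.out.println(literal.parseField("\"hhh\" xx")); // prints "hhh" xx

        ConfigurableStringParserSketch singleQuoted =
                new ConfigurableStringParserSketch().quoteCharacter('\'');
        System.out.println(singleQuoted.parseField("'hhh'")); // prints hhh
    }
}
```

With quoted string parsing disabled, the second field of line B would be returned unchanged instead of causing a ParseException.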

In reply to Max Michels's message of 2014-12-09 18:51 GMT+01:00.