Hi,

I'm trying to import a CSV file, but the parser seems to have problems with quotes at the beginning of a field. Is there a way to set or disable enclosures for the CSV input?

This is my code:

DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
    .fieldDelimiter('|')
    .types(String.class, String.class);

CSV:

A|ggg
B|"hhh" xx
C|xxx

As a result, I'm receiving a ParseException for line B:

org.apache.flink.api.common.io.ParseException: Line could not be parsed: 'B|"hhh" xx'

Thanks,
Malte
|
Hi!

The parser interprets the quotes as quotes for the field. That means the second field (the string) stops after the "hhh", and the xx is considered invalid trailing data.

What do you expect as the result of parsing that line?

Stephan

On Fri, Dec 5, 2014 at 4:16 PM, Malte Schwarzer <[hidden email]> wrote:
|
Hi Stephan,

The result should be >"hhh" xx< as the field value. Enclosures should be disabled, but there seems to be no method to do that.

Malte

From: Stephan Ewen <[hidden email]>
Reply-To: <[hidden email]>
Date: Friday, December 5, 2014, 16:28
To: <[hidden email]>
Subject: Re: Quotes in fields of CsvInputFormat
|
Hi Malte,

Typically, double quotes are used to identify strings and thus are not interpreted literally. Any data in a field after a double-quoted string is regarded as invalid trailing data. You could replace the double quotes with single quotes:

A|ggg
B|'hhh' xx
C|xxx

This results in the expected >'hhh' xx< for the second line.

Best regards,
Max

On Fri, Dec 5, 2014 at 4:44 PM, Malte Schwarzer <[hidden email]> wrote:
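If rewriting the input is acceptable, the replacement suggested above could be done in a small preprocessing step before the file is handed to the CSV reader. The following is only a sketch in plain Java; the class `QuoteWorkaroundSketch` and the method `swapQuotes` are hypothetical names, not part of Flink. Note that it changes the data itself: every double quote in the line becomes a single quote.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteWorkaroundSketch {

    /** Replaces every double quote in each line with a single quote. */
    static List<String> swapQuotes(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            // String.replace(char, char) replaces all occurrences of the first char.
            result.add(line.replace('"', '\''));
        }
        return result;
    }
}
```

Lines without double quotes pass through unchanged, so the step can be applied to the whole file.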
|
With the current implementation, quoted string parsing kicks in if the first non-whitespace character of a field is a double quote (just as in Malte's case). I think this behaviour can be quite unexpected for users.

Wouldn't it be better to make the behaviour of String parsing more explicit, i.e., add a switch to enable or disable quoted string parsing? With the current implementation, the configuration would affect all String fields in a file, though...

Cheers,
Fabian

2014-12-09 12:17 GMT+01:00 Max Michels <[hidden email]>:
|
That sounds like a good idea. Just like setDelimiter("|"), one should be able to call setParseDoubleQuotes(false) to disable the special handling of double quotes.

You're right, Fabian, the current implementation treats all String fields alike. Maybe we can expect the user to provide a consistently formatted input file (i.e., with or without the use of double quotes as identifiers)?

On Tue, Dec 9, 2014 at 2:32 PM, Fabian Hueske <[hidden email]> wrote:
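To make the proposal concrete, here is a minimal, self-contained sketch in plain Java of a field splitter with such a switch. It is not Flink's parser: the class `QuotedFieldSketch`, the method `parseLine`, and the `parseQuotes` flag are hypothetical names chosen for illustration. The sketch also assumes that only spaces count as leading whitespace and that quotes inside a quoted field are never escaped.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedFieldSketch {

    /**
     * Splits one line into String fields.
     * If parseQuotes is true, a field whose first non-space character is a
     * double quote is read up to the closing quote, and any data between the
     * closing quote and the next delimiter is rejected as trailing data.
     * If parseQuotes is false, quotes are ordinary characters.
     */
    static List<String> parseLine(String line, char delimiter, boolean parseQuotes) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (true) {
            // Look past leading spaces to see whether the field opens with a quote.
            int start = pos;
            while (start < line.length() && line.charAt(start) == ' ') {
                start++;
            }
            int next; // index of the delimiter ending this field, or line.length()
            if (parseQuotes && start < line.length() && line.charAt(start) == '"') {
                int close = line.indexOf('"', start + 1);
                if (close < 0) {
                    throw new IllegalArgumentException("Unterminated quote: '" + line + "'");
                }
                next = close + 1;
                // Only a delimiter or the end of the line may follow the closing quote.
                if (next < line.length() && line.charAt(next) != delimiter) {
                    throw new IllegalArgumentException("Line could not be parsed: '" + line + "'");
                }
                fields.add(line.substring(start + 1, close));
            } else {
                next = line.indexOf(delimiter, pos);
                if (next < 0) {
                    next = line.length();
                }
                fields.add(line.substring(pos, next));
            }
            if (next >= line.length()) {
                return fields;
            }
            pos = next + 1;
        }
    }
}
```

With the flag enabled, `parseLine("B|\"hhh\" xx", '|', true)` throws because of the trailing ` xx`, which mirrors the error in the original report. With the flag disabled, the same line yields the two fields `B` and `"hhh" xx`, which is the result Malte asked for. As noted above, a single flag like this applies to every String field in the line.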
|
I think that's a fair assumption to make.

I'll open a JIRA for making quoted string parsing optional and the quote character configurable.

2014-12-09 18:51 GMT+01:00 Max Michels <[hidden email]>: