Re: Read multiline JSON/XML

Posted by Flavio Pompermaier on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Read-multiline-JSON-XML-tp31338p31346.html

Processing the files in parallel would be enough; inner-file parallelism would be awesome, but it's just a plus.

On Fri, Nov 29, 2019 at 3:46 PM Arvid Heise <[hidden email]> wrote:
A while ago, I implemented XML and JSON input formats. However, proper split support for structured formats without sync markers is not that easy. Any split that has a random start offset needs to figure out the start of the next record on its own, which is fragile by definition.
That's why supporting jsonl files is much easier: you just need to look for the next newline. For the same reason, supporting JSON or XML in Kafka is fairly straightforward: records are already split.
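
The newline resync described above can be sketched roughly as follows. This is only an illustration in plain Python, not Flink's input format code; the function name `read_jsonl_split` and the convention that a record belongs to the split containing its first byte are assumptions made for the example.

```python
import io
import json


def read_jsonl_split(stream, start, length):
    """Yield the JSON records whose first byte lies in [start, start + length).

    `stream` is a seekable binary stream of newline-delimited JSON.
    """
    end = start + length
    if start > 0:
        # Step back one byte and discard everything through the next newline.
        # If the byte before `start` is a newline, only that newline is
        # dropped; otherwise the tail of a record owned by the previous
        # split is skipped.
        stream.seek(start - 1)
        stream.readline()
    else:
        stream.seek(0)
    # A record that starts inside the split is read to its end, even if it
    # runs past the split boundary.
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break  # end of file
        if line.strip():
            yield json.loads(line)


if __name__ == "__main__":
    data = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
    for start in range(0, len(data), 10):
        print(start, list(read_jsonl_split(io.BytesIO(data), start, 10)))
```

With pretty-printed multiline JSON or XML there is no such cheap marker: a newline can sit anywhere inside a record, so a split starting at a random offset cannot resync this way.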

It would be easier to support XML and JSON if we can get rid of splits. @Flavio, would you expect to get inner-file parallelism, or would you be fine with processing only the files in parallel?
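
Without splits, each file is parsed as a whole by a single worker and only the set of files is spread across workers. A minimal sketch of that file-level parallelism in plain Python, where a thread pool merely stands in for parallel tasks and the names `parse_whole_file` and `read_json_files` are hypothetical:

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_whole_file(path):
    """Parse one (possibly multiline) JSON document; the file is never split."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def read_json_files(paths, max_workers=4):
    """Parse several files concurrently, one whole file per task.

    Results are returned in the same order as `paths`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_whole_file, paths))
```

The trade-off is that a single very large file is handled by one worker only, which is exactly the inner-file parallelism this approach gives up.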

Best,

Arvid

On Fri, Nov 29, 2019 at 3:26 PM Chesnay Schepler <[hidden email]> wrote:
I know that at least the Table API can read JSON, but I don't know how well this translates to the other APIs.

On 29/11/2019 12:09, Flavio Pompermaier wrote:
Hi to all,
is there any out-of-the-box option to read multiline JSON or XML like in Spark?
It would be awesome to have something like

spark.read.option("multiline", true).json("/path/to/user.json")

Best,
Flavio