(DEPRECATED) Apache Flink User Mailing List archive.

Read multiline JSON/XML


Flavio Pompermaier

Read multiline JSON/XML

Hi to all,

is there any out-of-the-box option to read multiline JSON or XML like in Spark?

It would be awesome to have something like

spark.read.option("multiline", true).json("/path/to/user.json")

Best,

Flavio

vino yang

Re: Read multiline JSON/XML

Hi Flavio,

IMO, it would be more effective to ask this question on the Spark user mailing list.

WDYT?

Best,

Vino


Chesnay Schepler

Re: Read multiline JSON/XML

Why, Vino?

He's specifically asking whether Flink offers something _like_ Spark.


Suneel Marthi

Re: Read multiline JSON/XML

For XML, you could look at Mahout's XMLInputFormat (if you are using a Hadoop InputFormat).


Chesnay Schepler

Re: Read multiline JSON/XML

In reply to this post by Flavio Pompermaier

I know that at least the Table API can read JSON, but I don't know how well this translates to the other APIs.


Arvid Heise-3

Re: Read multiline JSON/XML

A while ago, I implemented XML and JSON input formats. However, proper split support for structured formats without sync markers is not easy. Any split with an arbitrary start offset needs to find the start of the next record on its own, which is fragile by definition.

That's why supporting JSONL files is much easier: you just need to look for the next newline. For the same reason, supporting JSON or XML in Kafka is fairly straightforward: the records are already split.
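To illustrate the newline-sync idea, here is a minimal sketch in plain Python. It is not Flink code; the function name read_split and the in-memory byte buffer are assumptions made for the example. The convention assumed here is that a record belongs to the split in which its first byte lies, so a reader that starts mid-line skips ahead to the next newline and leaves the partial record to the previous split.

```python
import json


def read_split(data: bytes, start: int, end: int):
    """Yield the JSONL records whose first byte lies in [start, end)."""
    pos = start
    if start > 0 and data[start - 1:start] != b"\n":
        # We landed in the middle of a line: the previous split owns it,
        # so sync to the byte after the next newline.
        nl = data.find(b"\n", start)
        if nl == -1:
            return  # no further record starts in this split
        pos = nl + 1
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        line = data[pos:line_end]
        if line.strip():
            # The record may extend past `end`; it is still read in full.
            yield json.loads(line)
        pos = line_end + 1


data = b'{"id": 1}\n{"id": 2}\n{"id": 3}\n'
print(list(read_split(data, 0, 15)))   # [{'id': 1}, {'id': 2}]
print(list(read_split(data, 15, 30)))  # [{'id': 3}]
```

The same trick does not carry over to multiline JSON or XML, because a newline (or a brace, or an angle bracket) found at an arbitrary offset may sit inside a string or a nested element rather than at a record boundary.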

It would be easier to support XML and JSON if we could get rid of splits. @Flavio, would you expect intra-file parallelism, or would you be fine with processing only the files in parallel?

Best,

Arvid


Flavio Pompermaier

Re: Read multiline JSON/XML

Processing the files in parallel would be enough; intra-file parallelism would be awesome, but it's just a bonus.
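As a rough sketch of what file-level parallelism means here, the following plain Python example (not Flink code) parses each file as one whole JSON document and spreads the files across workers. The helper names read_whole_file and read_files_in_parallel, the worker count, and the use of a thread pool are all assumptions for illustration; in a real job the equivalent would be treating every file as a single unsplittable input.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def read_whole_file(path):
    """Parse one file as a single JSON document; line breaks do not matter."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def read_files_in_parallel(paths, max_workers=4):
    """Parse each file in its own task; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_whole_file, paths))
```

Because no file is ever cut at an arbitrary offset, there is no record boundary to guess. The price is that a single very large file is handled by one worker only.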


vino yang

Re: Read multiline JSON/XML

In reply to this post by Chesnay Schepler

Also, my apologies to Flavio!

Best,

Vino

On Mon, Dec 2, 2019 at 10:29 AM vino yang <[hidden email]> wrote:

Hi Chesnay,

Sorry, yes, I missed the word "like". I mistakenly thought he was asking how to use Spark to accomplish this job.

Best,
Vino
