Read multiline JSON/XML

Flavio Pompermaier
Hi to all,
is there any out-of-the-box option to read multiline JSON or XML like in Spark?
It would be awesome to have something like

spark.read.option("multiline", true).json("/path/to/user.json")

Best,
Flavio

Re: Read multiline JSON/XML

vino yang
Hi Flavio,

IMO, it would be more effective to ask this question on the Spark user mailing list.

WDYT?

Best,
Vino

On Fri, Nov 29, 2019 at 7:09 PM Flavio Pompermaier <[hidden email]> wrote:
Hi to all,
is there any out-of-the-box option to read multiline JSON or XML like in Spark?
It would be awesome to have something like

spark.read.option("multiline", true).json("/path/to/user.json")

Best,
Flavio

Re: Read multiline JSON/XML

Chesnay Schepler
Why vino?

He's specifically asking whether Flink offers something _like_ Spark.

On 29/11/2019 14:39, vino yang wrote:
Hi Flavio,

IMO, it would be more effective to ask this question on the Spark user mailing list.

WDYT?

Best,
Vino

Re: Read multiline JSON/XML

Suneel Marthi
For XML, you could look at Mahout's XMLInputFormat (if you are using a Hadoop InputFormat).

On Fri, Nov 29, 2019 at 9:01 AM Chesnay Schepler <[hidden email]> wrote:
Why vino?

He's specifically asking whether Flink offers something _like_ Spark.

Re: Read multiline JSON/XML

Chesnay Schepler
In reply to this post by Flavio Pompermaier
I know that at least the Table API can read JSON, but I don't know how well this translates to other APIs.

Re: Read multiline JSON/XML

Arvid Heise-3
A while ago, I implemented XML and JSON input formats. However, proper split support for structured formats without sync markers is not that easy. Any split with a random start offset needs to figure out the start of the next record on its own, which is fragile by definition.
That's why supporting jsonl files is much easier; you just need to look for the next newline. For the same reason, supporting json or xml in Kafka is fairly straightforward: records are already split.
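
To make the newline argument concrete, here is a minimal sketch in plain Python. It is not Flink code: the function name `read_jsonl_split` and the byte-offset convention are assumptions made purely for illustration. A split that starts at an arbitrary offset backs up one byte, discards everything up to and including the next newline, and then owns every record whose first byte falls inside its range, so adjacent splits should neither drop nor duplicate records.

```python
import io
import json


def read_jsonl_split(data: bytes, start: int, end: int):
    """Yield the JSONL records whose first byte lies in [start, end).

    Hypothetical helper for illustration; `data` stands in for a file.
    """
    stream = io.BytesIO(data)
    if start > 0:
        # Back up one byte and skip to the end of that line. If the byte
        # before `start` is a newline, only that newline is consumed and
        # the record beginning exactly at `start` stays in this split.
        stream.seek(start - 1)
        stream.readline()
    # A record belongs to this split if it begins before `end`, even when
    # its bytes run past `end`; the next split skips that tail.
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break  # end of data
        if line.strip():
            yield json.loads(line)
```

Multiline JSON or XML offers no comparable marker: a newline or brace found at a random offset may sit inside a nested object or a string value, so the same resynchronisation trick does not carry over.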

It would be easier to support XML and JSON if we could get rid of splits. @Flavio, would you expect inner-file parallelism, or would you be fine with processing only the files in parallel?

Best,

Arvid

On Fri, Nov 29, 2019 at 3:26 PM Chesnay Schepler <[hidden email]> wrote:
I know that at least the Table API can read JSON, but I don't know how well this translates to other APIs.

Re: Read multiline JSON/XML

Flavio Pompermaier
Processing the files in parallel would be enough; inner-file parallelism would be awesome, but it's just a plus.
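
As a rough illustration of what file-level parallelism means here, the sketch below (plain Python, not Flink API; `parse_whole_file` and `read_json_files` are hypothetical names) treats each file as one indivisible unit of work and parses it whole, so no record boundary ever has to be guessed.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_whole_file(path):
    """Parse one multiline JSON document; the file is never split."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def read_json_files(paths, max_workers=4):
    """Parse many files concurrently, one task per file.

    Results come back in the same order as `paths`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_whole_file, paths))
```

Threads only show the unit of work; in CPython the parsing itself is largely serialised, so real speed-ups would need processes or a distributed engine. The trade-off is that a single very large file is still handled by one worker.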

On Fri, Nov 29, 2019 at 3:46 PM Arvid Heise <[hidden email]> wrote:
A while ago, I implemented XML and JSON input formats. However, proper split support for structured formats without sync markers is not that easy. Any split with a random start offset needs to figure out the start of the next record on its own, which is fragile by definition.
That's why supporting jsonl files is much easier; you just need to look for the next newline. For the same reason, supporting json or xml in Kafka is fairly straightforward: records are already split.

It would be easier to support XML and JSON if we could get rid of splits. @Flavio, would you expect inner-file parallelism, or would you be fine with processing only the files in parallel?

Best,

Arvid

Re: Read multiline JSON/XML

vino yang
In reply to this post by Chesnay Schepler
Also, my apologies to Flavio!

Best,
Vino

On Mon, Dec 2, 2019 at 10:29 AM vino yang <[hidden email]> wrote:
Hi Chesnay,

Sorry, yes, I missed the word "like". I mistakenly thought he was asking how to use Spark to accomplish this job.

Best,
Vino
