Hi all,
is there any out-of-the-box option to read multiline JSON or XML, like in Spark? It would be awesome to have something like spark.read.option("multiline", true).json("/path/to/user.json").

Best,
Flavio
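For reference, the Spark behaviour being asked about looks roughly like this through Spark's Java API; this is only a sketch, with a placeholder session setup and the path taken from the example above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultilineJsonExample {
    public static void main(String[] args) {
        // Placeholder local session, just to make the snippet self-contained.
        SparkSession spark = SparkSession.builder()
                .appName("multiline-json")
                .master("local[*]")
                .getOrCreate();

        // "multiline" tells Spark that one JSON document may span several lines,
        // instead of the default one-record-per-line (jsonl) layout.
        Dataset<Row> users = spark.read()
                .option("multiline", true)
                .json("/path/to/user.json");

        users.show();
        spark.stop();
    }
}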
Hi Flavio, IMO it would be more effective to ask this question on the Spark user mailing list. WDYT?

Best,
Vino

Flavio Pompermaier <[hidden email]> wrote on Fri, Nov 29, 2019 at 7:09 PM:
Why vino? He's specifically asking whether Flink offers something _like_ Spark.
On 29/11/2019 14:39, vino yang wrote:
For XML, you could look at Mahout's XMLInputFormat (if you are using the HadoopInputFormat).

On Fri, Nov 29, 2019 at 9:01 AM Chesnay Schepler <[hidden email]> wrote:
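A minimal sketch of that Mahout + HadoopInputFormat wiring, assuming the flink-hadoop-compatibility module is on the classpath and that Mahout's mapreduce XmlInputFormat (configured through the xmlinput.start / xmlinput.end keys) is the one being used; the class location, tags, and path are illustrative:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.mahout.text.wikipedia.XmlInputFormat; // assumed location of Mahout's XmlInputFormat

public class XmlViaHadoopInputFormat {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Job job = Job.getInstance();
        // XmlInputFormat emits everything between these two tags as one record (a Text value).
        job.getConfiguration().set("xmlinput.start", "<user>");
        job.getConfiguration().set("xmlinput.end", "</user>");

        DataSet<Tuple2<LongWritable, Text>> xmlFragments = env.createInput(
                HadoopInputs.readHadoopFile(
                        new XmlInputFormat(), LongWritable.class, Text.class,
                        "/path/to/users.xml", job)); // illustrative path

        // Each value is one <user>...</user> fragment; parse it with any XML library downstream.
        xmlFragments.map(t -> t.f1.toString()).print();
    }
}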
In reply to this post by Flavio Pompermaier
I know that at least the Table API can read json, but I don't know how well this translates into other APIs.
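For context, a rough sketch of what reading JSON through the Table API looked like around the Flink 1.9 era, using the old descriptor API with the Kafka connector and the JSON format; the topic, broker address, schema fields, and exact descriptor methods are illustrative and have changed across versions:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.descriptors.Json;
import org.apache.flink.table.descriptors.Kafka;
import org.apache.flink.table.descriptors.Schema;
import org.apache.flink.types.Row;

public class TableApiJsonExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Register a table backed by a Kafka topic whose messages are JSON objects.
        tableEnv.connect(
                new Kafka()
                        .version("universal")
                        .topic("users")                                   // illustrative topic
                        .property("bootstrap.servers", "localhost:9092")) // illustrative broker
                .withFormat(new Json().failOnMissingField(false))
                .withSchema(new Schema()
                        .field("id", Types.LONG)
                        .field("name", Types.STRING))
                .inAppendMode()
                .registerTableSource("users");

        Table users = tableEnv.sqlQuery("SELECT id, name FROM users");
        tableEnv.toAppendStream(users, Row.class).print();

        env.execute("table-api-json");
    }
}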
On 29/11/2019 12:09, Flavio Pompermaier wrote:
A while ago, I implemented XML and Json input formats. However, having proper split support for structured formats without sync markers is not that easy. Any split that starts at a random offset needs to figure out the start of the next record on its own, which is fragile by definition. That's why supporting jsonl files is much easier: you just need to look for the next newline (see the sketch below). For the same reason, supporting json or xml in Kafka is fairly straightforward: records are already split. It would be easier to support XML and Json if we could get rid of splits.

@Flavio would you expect to get inner-file parallelism, or would you be fine with processing only the files in parallel?

Best,
Arvid

On Fri, Nov 29, 2019 at 3:26 PM Chesnay Schepler <[hidden email]> wrote:
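To illustrate the jsonl point: because every record ends at a newline, plain line-based reading already yields safe splits, and parsing becomes a simple map over lines. A sketch with the DataSet API and Jackson, where the path and the User fields are made up:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class JsonlExample {

    // Hypothetical POJO matching one JSON object per line.
    public static class User {
        public long id;
        public String name;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // readTextFile splits the file by byte ranges; each split simply skips ahead
        // to the next newline, so every parallel instance starts on a complete record.
        DataSet<User> users = env
                .readTextFile("/path/to/users.jsonl") // illustrative path
                .map(line -> new ObjectMapper().readValue(line, User.class));

        users.print();
    }
}

(In real code the ObjectMapper would be reused, e.g. in a RichMapFunction, rather than created per record.)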
Parallel file processing would be enough; inner-file parallelism would be awesome, but it's a plus.

On Fri, Nov 29, 2019 at 3:46 PM Arvid Heise <[hidden email]> wrote:
In reply to this post by Chesnay Schepler
Also, I want to say sorry to Flavio!

Best,
Vino

vino yang <[hidden email]> wrote on Mon, Dec 2, 2019 at 10:29 AM: