[Flink SQL] Leniency of JSON parsing


[Flink SQL] Leniency of JSON parsing

Magri, Sebastian
I'm trying to extract data from a Debezium CDC source in which one of the backing tables has an open-schema, nested JSON field like this:


"objectives": {
    "items": [
        {
            "id": 1,
            "label": "test 1"
            "size": 1000.0
        },
        {
            "id": 2,
            "label": "test 2"
            "size": 500.0
        }
    ],
    "threshold": 10.0,
    "threshold_period": "hourly",
    "max_ms": 30000.0
}


Any of these fields can be missing at any time, and there can also be additional, different fields. It is guaranteed that a field will have the same data type for all occurrences.

For now, I really only need the `threshold` and `threshold_period` fields, for which I'm declaring the column as follows:


CREATE TABLE probes (
  `objectives` ROW(`threshold` FLOAT, `threshold_period` STRING),
  ...
) WITH (
     ...
      'format' = 'debezium-json',
      'debezium-json.schema-include' = 'true',
      'debezium-json.ignore-parse-errors' = 'true'
)
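
For reference, a fuller sketch of the definition. The connector options and extra column below are placeholders for the actual setup; only the format options match the snippet above:

CREATE TABLE probes (
  -- placeholder key column; the real table has more fields
  `probe_id` BIGINT,
  `objectives` ROW(`threshold` FLOAT, `threshold_period` STRING)
) WITH (
  -- placeholder connector options
  'connector' = 'kafka',
  'topic' = 'dbserver.public.probes',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  -- format options as above
  'format' = 'debezium-json',
  'debezium-json.schema-include' = 'true',
  'debezium-json.ignore-parse-errors' = 'true'
)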


However, I keep getting `NULL` values in my `objectives` column, or corrupt JSON message exceptions when I disable the `ignore-parse-errors` option.

Does the JSON parsing need to match the schema of the field 100%, or is it lenient?

Is there any option or syntactic detail I'm missing?

Best Regards,

Re: [Flink SQL] Leniency of JSON parsing

Roman Khachatryan
Hi Sebastian,

Did you try setting `debezium-json.map-null-key.mode` to DROP [1]?
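
For reference, a sketch of how it would be passed in the WITH clause (the dotted key is how the option appears in the docs; the other options are kept as in your DDL):

CREATE TABLE probes (
  `objectives` ROW(`threshold` FLOAT, `threshold_period` STRING),
  ...
) WITH (
     ...
      'format' = 'debezium-json',
      'debezium-json.schema-include' = 'true',
      'debezium-json.ignore-parse-errors' = 'true',
      'debezium-json.map-null-key.mode' = 'DROP'
)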

I'm also pulling in Timo who might know better.

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/formats/debezium.html#debezium-json-map-null-key-mode

Regards,
Roman




Re: [Flink SQL] Leniency of JSON parsing

Magri, Sebastian
Hi Roman!

Seems like that option is no longer available.

Best Regards,
Sebastian


Re: [Flink SQL] Leniency of JSON parsing

Magri, Sebastian
I validated that it's still accepted by the connector, but it's not in the documentation anymore.

It doesn't seem to help in my case.

Thanks,
Sebastian


Re: [Flink SQL] Leniency of JSON parsing

Timo Walther
Hi Sebastian,

you can check out the logic yourself by looking into

https://github.com/apache/flink/blob/master/flink-formats/flink-json/src/main/java/org/apache/flink/formats/json/debezium/DebeziumJsonDeserializationSchema.java

and

https://github.com/apache/flink/blob/master/flink-formats/flink-json/src/main/java/org/apache/flink/formats/json/JsonRowDataDeserializationSchema.java

So your use case should actually work. Could you help investigate what is going wrong? In any case, we should open an issue for it. It seems to be a bug.
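
One way to narrow it down might be to take the Debezium envelope out of the picture and feed a few of the raw `objectives` objects (one JSON object per line) through the plain `json` format. A minimal sketch, with placeholder names and path:

CREATE TABLE objectives_repro (
  `objectives` ROW(`threshold` FLOAT, `threshold_period` STRING)
) WITH (
  -- placeholder: a local file with lines like {"objectives": {...}}
  'connector' = 'filesystem',
  'path' = 'file:///tmp/objectives.jsonl',
  'format' = 'json'
);

SELECT r.`objectives`.`threshold`, r.`objectives`.`threshold_period`
FROM objectives_repro AS r;

If that already returns NULLs for records that do contain the fields, the problem is likely in the nested ROW deserialization; otherwise it is more likely related to the Debezium envelope or the schema-include handling.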

Regards,
Timo


Re: [Flink SQL] Leniency of JSON parsing

Magri, Sebastian
Thanks a lot, Timo,

I will check those links out and create an issue with more information.

Best Regards,
Sebastian
