Data Quality Library in Flink

anuj.aj07
Hello All,

I want to run some data quality analysis on streaming data, for example:

1. Fill rate of a particular column
2. How many events go to the error queue because schema validation failed
3. Various statistical measures of a column
4. Alert if a particular threshold is breached (for example, if the fill rate for a column drops below 90%)
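
To make the checks in this list concrete: the fill rate is the share of events in which the column is non-null, and the alert compares that share with a limit such as 90%. Below is a minimal, framework-agnostic sketch in plain Python. It is not Flink or deequ API; the names `ColumnStats` and `breaches_threshold` are hypothetical, and it assumes numeric column values with `None` marking a missing value. The mean and variance are updated incrementally (Welford's method), so no events need to be buffered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Running statistics for one column, updated one event at a time."""
    total: int = 0      # events seen so far
    filled: int = 0     # events where the column was non-null
    mean: float = 0.0   # running mean of the non-null values
    m2: float = 0.0     # running sum of squared deviations (Welford's method)

    def update(self, value: Optional[float]) -> None:
        """Fold one event's column value into the running statistics."""
        self.total += 1
        if value is None:
            return  # counts toward total, but not toward filled
        self.filled += 1
        delta = value - self.mean
        self.mean += delta / self.filled
        self.m2 += delta * (value - self.mean)

    @property
    def fill_rate(self) -> float:
        """Fraction of events with a non-null value (0.0 before any event)."""
        return self.filled / self.total if self.total else 0.0

    @property
    def variance(self) -> float:
        """Population variance of the non-null values."""
        return self.m2 / self.filled if self.filled else 0.0


def breaches_threshold(stats: ColumnStats, min_fill_rate: float = 0.9) -> bool:
    """True when at least one event was seen and the fill rate is below the limit."""
    return stats.total > 0 and stats.fill_rate < min_fill_rate
```

In a streaming job, one such object would typically be kept per column and per window, and `breaches_threshold` would be evaluated when the window closes.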

Is there any library on top of Flink for data quality? I am looking for something like deequ, which is built on top of Spark: https://github.com/awslabs/deequ

It provides everything I am looking for.

--
Thanks & Regards,
Anuj Jain



Re: Data Quality Library in Flink

Andrey Zagrebin-5
Hi Anuj,

I am not familiar with data quality measurement methods or with deequ in depth.
What you describe looks like monitoring data metrics.
Maybe other community users are aware of a better solution.
Meanwhile, I would recommend implementing the checks and failures as separate operators and side outputs (for streaming) [1], if you have not done so already.
Then you could also use Flink metrics to aggregate and monitor the data [2].
Metrics systems usually let you define alerts on metrics, as in Prometheus [3], [4].
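
The pattern suggested here (validate each event, send failures to a separate output, and count both so a metrics system can alert on the ratio) can be sketched in plain Python. This only illustrates the logic and does not use the Flink side-output or metrics API; the schema in `REQUIRED_FIELDS`, the function names `validate` and `route`, and the counter names are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]

# Hypothetical schema: every event must carry these non-null fields.
REQUIRED_FIELDS = ("user_id", "amount")


def validate(record: Record) -> List[str]:
    """Return the schema problems found; an empty list means the record is valid."""
    return [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if record.get(name) is None
    ]


def route(
    records: Iterable[Record],
) -> Tuple[List[Record], List[Tuple[Record, List[str]]], Counter]:
    """Split records into a main output and an error output, counting both."""
    main_output: List[Record] = []
    error_output: List[Tuple[Record, List[str]]] = []  # plays the role of the side output
    metrics: Counter = Counter()
    for record in records:
        metrics["events_total"] += 1
        problems = validate(record)
        if problems:
            metrics["events_rejected"] += 1
            error_output.append((record, problems))
        else:
            main_output.append(record)
    return main_output, error_output, metrics
```

Dividing `events_rejected` by `events_total` gives the rejection rate that an alerting rule in the metrics system could watch.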

Best,
Andrey

On Sat, Jun 6, 2020 at 9:23 AM aj <[hidden email]> wrote:



Re: Data Quality Library in Flink

anuj.aj07
Thanks, Andrey, I will check it out. 

On Mon, Jun 8, 2020 at 8:10 PM Andrey Zagrebin <[hidden email]> wrote:


--
Thanks & Regards,
Anuj Jain