Hello All,
I want to do some data quality analysis on streaming data, for example:
1. The fill rate of a particular column.
2. How many events go to the error queue because schema validation failed.
3. Various statistical measures of a column.
4. An alert when a particular threshold is breached (for example, when the fill rate of a column drops below 90%).
Is there any library on top of Flink for data quality? As far as I can see, there is such a library on top of Spark: https://github.com/awslabs/deequ
Hi Anuj,
I am not familiar in depth with data quality measurement methods or with deequ. What you describe looks like monitoring some data metrics. Perhaps other community users are aware of a better solution.
Meanwhile, I would recommend implementing the checks and failures as separate operators and side outputs (for streaming) [1], if you have not done so already. You could then also use Flink metrics to aggregate and monitor the data [2]. Metrics systems usually let you define alerts on metrics, as in Prometheus [3], [4].
Best,
Andrey
On Sat, Jun 6, 2020 at 9:23 AM aj <[hidden email]> wrote:
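To make the fill-rate check and the 90% threshold from the question concrete, here is a minimal, framework-independent sketch in plain Java. It is not taken from the thread, from Flink, or from deequ; the class `FillRateTracker` and its method names are hypothetical. In a Flink job, logic like this would presumably live inside the checking operator, with the resulting ratio exposed as a metric and the alert rule defined in the metrics system rather than in the job itself.

```java
// Hypothetical helper for illustration only; not part of Flink or deequ.
class FillRateTracker {
    private final double minFillRate; // e.g. 0.9 for a 90% threshold
    private long total;               // values observed so far
    private long filled;              // values that were non-null and non-blank

    FillRateTracker(double minFillRate) {
        this.minFillRate = minFillRate;
    }

    // Record one value of the monitored column; null or blank counts as unfilled.
    void observe(String value) {
        total++;
        if (value != null && !value.trim().isEmpty()) {
            filled++;
        }
    }

    // Fraction of observed values that were filled; 1.0 when nothing was seen yet.
    double fillRate() {
        return total == 0 ? 1.0 : (double) filled / total;
    }

    // True once at least one value was observed and the fill rate is below the threshold.
    boolean thresholdBreached() {
        return total > 0 && fillRate() < minFillRate;
    }

    public static void main(String[] args) {
        FillRateTracker tracker = new FillRateTracker(0.9);
        for (String v : new String[] {"a", "b", null, "", "c"}) {
            tracker.observe(v);
        }
        System.out.println("fill rate = " + tracker.fillRate());
        System.out.println("threshold breached = " + tracker.thresholdBreached());
    }
}
```

This sketch counts over the whole stream. For a streaming job it would likely make more sense to reset or window the counters, so that an old backlog of good records does not mask a recent drop in fill rate.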
Thanks, Andrey,
I will check it out.
On Mon, Jun 8, 2020 at 8:10 PM Andrey Zagrebin <[hidden email]> wrote: