Flink and Directory Monitors

phiroc
Hello,

has anyone ever used Flink with file/directory monitoring applications such as Directory Monitor (https://directorymonitor.com/)?

Is it conceivable to process file-update events with Flink? For instance, let's say hundreds of users simultaneously modify files on a filesystem. Directory Monitor detects those modifications and sends them as events, streams, or log entries to Flink, which processes them to extract, say, the names of the files that have been modified the most over a period of time, or the names of the biggest filesystem hogs (i.e., users who consume the most filesystem space).
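
To make the question concrete, here is a rough sketch of the "most-modified files over a period of time" computation. It is plain Python, not Flink code, and the event fields (timestamp, user, path) are an assumed shape rather than Directory Monitor's actual output format. It groups events into fixed, non-overlapping time windows and counts modifications per file in each window.

```python
from collections import Counter, defaultdict
from typing import Iterable, NamedTuple


class FileEvent(NamedTuple):
    """Hypothetical shape of one change event (not Directory Monitor's real format)."""
    timestamp: float  # seconds since some reference point
    user: str
    path: str


def top_modified_per_window(events: Iterable[FileEvent], window_seconds: int, n: int):
    """Count modifications per file in fixed windows and return the top n files of each window."""
    counts = defaultdict(Counter)
    for event in events:
        # Align each event to the start of the window it falls into.
        window_start = int(event.timestamp // window_seconds) * window_seconds
        counts[window_start][event.path] += 1
    return {start: counter.most_common(n) for start, counter in sorted(counts.items())}


events = [
    FileEvent(0, "alice", "/data/a.txt"),
    FileEvent(10, "bob", "/data/a.txt"),
    FileEvent(20, "alice", "/data/b.txt"),
    FileEvent(70, "bob", "/data/b.txt"),
]
result = top_modified_per_window(events, window_seconds=60, n=1)
print(result)
```

A streaming engine would compute the same per-window counts incrementally as events arrive instead of over a finished list.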

Would Hadoop be needed between Directory Monitor and Flink, to store historical, filesystem-change data?

Many thanks.

Philippe

Re: Flink and Directory Monitors

Fabian Hueske-2
Hi Philippe,

I am not aware of anybody using Directory Monitor with Flink. However, the application you described sounds reasonable and I think it should be possible to implement that with Flink.

You would need to either implement a SourceFunction that forwards events from DM to Flink, or push the DM events into Kafka and use Flink's Kafka SourceFunction. Using Kafka has the benefit that fault tolerance and exactly-once behavior are much easier to achieve, because Kafka buffers events for some time and Flink's Kafka source can replay the events if necessary. If you implement a direct DM source for Flink, you would need to implement the buffering yourself to achieve exactly-once or at-least-once guarantees.
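
To illustrate what "implement the buffering yourself" could involve, here is a minimal sketch of a replayable buffer in plain Python. It does not use any Flink or Kafka API; the class name ReplayBuffer and its methods are hypothetical. Events stay in the buffer until a completed checkpoint acknowledges them, so a consumer can re-read everything after the last acknowledged offset following a failure.

```python
class ReplayBuffer:
    """Hypothetical in-memory buffer that keeps events until they are acknowledged."""

    def __init__(self):
        self._events = []      # events not yet acknowledged
        self._base_offset = 0  # offset of self._events[0]

    def append(self, event):
        """Store an event and return the offset assigned to it."""
        self._events.append(event)
        return self._base_offset + len(self._events) - 1

    def read_from(self, offset):
        """Return (offset, event) pairs starting at `offset`, which allows replay after a failure."""
        start = offset - self._base_offset
        if start < 0:
            raise ValueError("events before the committed offset were already discarded")
        return [(self._base_offset + i, e) for i, e in enumerate(self._events) if i >= start]

    def commit(self, next_offset):
        """Discard everything before `next_offset` once a checkpoint covering it has completed."""
        drop = min(max(0, next_offset - self._base_offset), len(self._events))
        del self._events[:drop]
        self._base_offset += drop


buffer = ReplayBuffer()
for name in ["a.txt", "b.txt", "c.txt"]:
    buffer.append(name)

buffer.commit(1)                # a checkpoint covered offset 0
replayed = buffer.read_from(1)  # after a failure, resume from the committed offset
print(replayed)
```

Replaying from the last committed offset gives at-least-once delivery on its own. Getting exactly-once results additionally requires that the consumer's state is restored from the same checkpoint, so replayed events are not counted twice.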

You do not need HDFS to communicate between DM and Flink; events can be consumed directly without going through a filesystem. However, Flink requires a persistent state backend to back up checkpoints for failure recovery. This is usually HDFS, but that component is pluggable.
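
As a loose illustration of what "pluggable" means here, the sketch below puts checkpoint storage behind a small interface and provides one implementation that writes to a local file. This is plain Python with hypothetical names (CheckpointStore, LocalFileStore); it is not how Flink's state backends are actually configured or implemented.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class CheckpointStore(ABC):
    """Hypothetical interface; any durable storage could sit behind it."""

    @abstractmethod
    def save(self, state: dict) -> None: ...

    @abstractmethod
    def load(self) -> dict: ...


class LocalFileStore(CheckpointStore):
    """Stores the latest checkpoint as a JSON file on the local filesystem."""

    def __init__(self, path):
        self.path = path

    def save(self, state):
        # Write to a temporary file, then rename it over the old checkpoint,
        # so a crash in the middle of a write does not corrupt the previous one.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, self.path)

    def load(self):
        # No checkpoint yet means starting from empty state.
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)


store = LocalFileStore(os.path.join(tempfile.mkdtemp(), "checkpoint.json"))
store.save({"offset": 2, "counts": {"/data/a.txt": 2}})
restored = store.load()
print(restored)
```

Swapping in a different storage system would mean writing another class with the same two methods, while the processing code that calls save and load stays unchanged.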

Cheers, Fabian

2016-03-07 15:53 GMT+01:00 <[hidden email]>: