Flink design questions - parallel processing and state management


lgfmt@yahoo.com
Hi,

We are new to Flink and would like some ideas on how to achieve parallelism and maintain state in Flink for a few interesting problems.

We receive files that contain multiple types of events. We need to process them in the following way:

1. Multiple event sources send event files to our system.
2. Files from different event sources should be processed in parallel.
3. Each file contains events with one or more event types. We want to process each event file as follows:
3.1. Group all events of the same event type (e.g. EvTypeA, EvTypeB, EvTypeC, etc.).
3.2. Sequence the event groups in a predefined order and process them in that order (EvTypeA-->EvTypeB-->EvTypeC).
3.3. Based on an external configuration, the events in each event type group should be processed either sequentially (ordered by the event time specified in each event) or in parallel. Each event type will be processed by a different Java class.
3.4. After all events for one event type are processed, process the events for the next event type group, and so on. It is important to note that all events in each event type group need to be processed before the events in the next event group can be processed. Also, as mentioned above, events in each event type group can be processed sequentially or in parallel based on a configuration.
3.5. Different files may contain different event types and a different number of event types. E.g. one file may contain EvTypeA+EvTypeB and another file may contain EvTypeA+EvTypeB+EvTypeC. So the execution plan for each file will be different.
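
To make steps 3.1 to 3.5 concrete, here is a rough plain-Java sketch (no Flink involved) of the per-file behaviour we are after. Everything in it is hypothetical and only for illustration: the class EventFileSketch, the Event fields, the handle() method standing in for our per-type Java classes, and the hard-coded TYPE_ORDER, which in reality would come from the external configuration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.Collectors;

class EventFileSketch {
    // Hypothetical event shape: event type, event time, payload.
    static final class Event {
        final String type;
        final long eventTime;
        final String payload;
        Event(String type, long eventTime, String payload) {
            this.type = type;
            this.eventTime = eventTime;
            this.payload = payload;
        }
    }

    // Assumed predefined group order (3.2); would come from configuration.
    static final List<String> TYPE_ORDER = List.of("EvTypeA", "EvTypeB", "EvTypeC");

    // Stand-in for the per-type processing class (3.3).
    static String handle(Event e) {
        return e.type + ":" + e.payload;
    }

    // Processes one file's events and returns the results in handling order.
    static List<String> processFile(List<Event> events, Map<String, Boolean> parallelByType) {
        // 3.1: group all events of the same type.
        Map<String, List<Event>> groups =
                events.stream().collect(Collectors.groupingBy(e -> e.type));
        ConcurrentLinkedQueue<String> handled = new ConcurrentLinkedQueue<>();
        // 3.2 and 3.4: one group at a time, in the predefined order.
        for (String type : TYPE_ORDER) {
            List<Event> group = groups.get(type);
            if (group == null) {
                continue; // 3.5: this file does not contain this event type
            }
            if (parallelByType.getOrDefault(type, false)) {
                // 3.3: parallel; forEach returns only after the whole group is done.
                group.parallelStream().forEach(e -> handled.add(handle(e)));
            } else {
                // 3.3: sequential, ordered by event time.
                group.stream()
                        .sorted(Comparator.comparingLong((Event e) -> e.eventTime))
                        .forEach(e -> handled.add(handle(e)));
            }
        }
        return new ArrayList<>(handled);
    }

    public static void main(String[] args) {
        List<Event> file = List.of(
                new Event("EvTypeB", 5L, "b1"),
                new Event("EvTypeA", 2L, "a2"),
                new Event("EvTypeA", 1L, "a1"));
        System.out.println(processFile(file, Map.of("EvTypeA", false, "EvTypeB", true)));
    }
}
```

Files from different event sources would each go through processFile independently and in parallel. What we do not know is how to express the same thing as a Flink job.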

Questions:
1. The parallelism should be achieved across multiple event sources and also within each event type group. How can we achieve this type of parallelism using Flink?
2. We want to record the progress of each file's processing, so that if a system failure occurs while processing a file, we can resume from where we left off. For example, if a file contains 1000 events (100 events of EvTypeA, 700 events of EvTypeB, 200 events of EvTypeC) and the system failure occurs after processing all 100 events of EvTypeA and 150 events of EvTypeB, the system should resume processing from the 151st event of EvTypeB. How can we achieve this with Flink?
3. We also want to show the processing status of each file on a dashboard that lists all the files being processed and the progress of each file (e.g. 250 out of 1000 events processed). Should we use Flink state management to keep the status of each processed event, or should we use our own state management, and why?
4. If the Flink task node that was processing an event crashes, will Flink route that task to another node automatically?
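
To illustrate what we mean in questions 2 and 3, here is a small plain-Java sketch of the bookkeeping we have in mind for a single file. It is not Flink code, and the names (FileProgressSketch, addGroup, markProcessed, resumePoint, dashboardLine) are made up for this example. It assumes the events inside each group have a stable order, so that a per-type counter is enough to find the resume point.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FileProgressSketch {
    // Total and processed event counts per type, kept in processing order.
    private final Map<String, Integer> totals = new LinkedHashMap<>();
    private final Map<String, Integer> processed = new LinkedHashMap<>();

    // Registers an event type group and how many events it contains.
    void addGroup(String type, int total) {
        totals.put(type, total);
        processed.put(type, 0);
    }

    // Records that `count` more events of this type were processed.
    void markProcessed(String type, int count) {
        processed.merge(type, count, Integer::sum);
    }

    // Next event to process as "type#n" (1-based), or "DONE" if the file is finished.
    String resumePoint() {
        for (Map.Entry<String, Integer> group : totals.entrySet()) {
            int done = processed.get(group.getKey());
            if (done < group.getValue()) {
                return group.getKey() + "#" + (done + 1);
            }
        }
        return "DONE";
    }

    // Status text for the dashboard.
    String dashboardLine() {
        int done = processed.values().stream().mapToInt(Integer::intValue).sum();
        int total = totals.values().stream().mapToInt(Integer::intValue).sum();
        return done + " out of " + total + " events processed";
    }

    public static void main(String[] args) {
        FileProgressSketch progress = new FileProgressSketch();
        progress.addGroup("EvTypeA", 100);
        progress.addGroup("EvTypeB", 700);
        progress.addGroup("EvTypeC", 200);
        progress.markProcessed("EvTypeA", 100);
        progress.markProcessed("EvTypeB", 150);
        System.out.println(progress.resumePoint());   // prints "EvTypeB#151"
        System.out.println(progress.dashboardLine()); // prints "250 out of 1000 events processed"
    }
}
```

Our question is whether counters like these should live in Flink state or in a store that we manage ourselves.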

A few lines of Flink code that show how to solve the above problems would be really helpful. Thank you for any help.

Search phrase: EventParallelism


- lgfmt