DataStream transformation isolation in Flink Streaming


Juan Rodríguez Hortalá
Hi, 

I was thinking about a problem and how to solve it with Flink Streaming. Imagine you have a stream of data to which you want to apply several transformations, where some transformations depend on previous ones and there is a final set of actions. This is naturally modeled as a DAG, which could be implemented in Flink Streaming, Storm, or Spark Streaming. So far this is a typical problem, but now imagine I have the additional requirement that each path in the graph must be able to fail without affecting the other paths. For example, consider the following DAG:

I -f1 -> A -f2-> B ->a1
  \g1-> C -g2-> D ->a2
  \h1-> E /
         \-h2-> F ->a3
 
here I have 
- a single input DataStream I 
- several derived DataStreams A, B, C, D, E, F
- several DataStream transformations f1, f2, g1, g2, h1, h2, each of arity 1 except for g2, which defines D = g2(C, E)
- final data sinks a1, a2, a3
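To pin down the dependencies, here is the topology as plain Python functions. The bodies are made-up placeholders just to make the sketch runnable; they are not Flink code:

```python
# Plain-Python model of the DAG; each function stands in for a
# DataStream transformation, with arbitrary placeholder arithmetic.
def f1(i): return i + 1          # I -> A
def f2(a): return a * 2          # A -> B
def g1(i): return i - 1          # I -> C
def h1(i): return i * 10         # I -> E
def g2(c, e): return c + e       # (C, E) -> D, the only arity-2 transformation
def h2(e): return e // 2         # E -> F

def run_dag(i):
    a = f1(i)
    b = f2(a)
    c = g1(i)
    e = h1(i)
    d = g2(c, e)                 # D depends on both C and E
    f = h2(e)
    return {"B": b, "D": d, "F": f}   # values delivered to sinks a1, a2, a3
```

The point of the sketch is just the wiring: the path I -> A -> B -> a1 shares nothing with h2, yet in a single job they live in one program.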

I don't know much about Flink, but I assume that if one of the transformations starts failing temporarily, then the whole program will fail until that transformation goes back to normal. For example, if h2 starts failing to compute F = h2(E) because h2 has some dependency that is temporarily unavailable (a database, a service, ...), then there is no guarantee that B will keep being computed correctly and sending records to the sink a1, even though no path from I to B prevents those records from being computed. At least that is what I would expect to happen with Spark Streaming. My first question is whether this would in fact happen with Flink Streaming too.
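To make the isolation I have in mind concrete, here is a toy sketch in plain Python (the transformation bodies are placeholders, not Flink API): each sink's path is evaluated independently, so when h2's dependency is down, only a3 is affected while a1 and a2 keep producing results:

```python
# Self-contained toy model of per-path failure isolation.
def f1(i): return i + 1          # I -> A
def f2(a): return a * 2          # A -> B
def g1(i): return i - 1          # I -> C
def h1(i): return i * 10         # I -> E
def g2(c, e): return c + e       # (C, E) -> D

def h2(e):
    # Simulates a transformation whose external dependency
    # (a database, a service, ...) is temporarily unavailable.
    raise RuntimeError("dependency unavailable")

def run_isolated(i):
    paths = {
        "a1": lambda: f2(f1(i)),         # I -> A -> B
        "a2": lambda: g2(g1(i), h1(i)),  # (C, E) -> D
        "a3": lambda: h2(h1(i)),         # E -> F, currently failing
    }
    results = {}
    for sink, path in paths.items():
        try:
            results[sink] = path()
        except RuntimeError:
            results[sink] = None         # only this path is marked failed
    return results
```

In a single streaming job the runtime decides whether a failing operator takes down the whole topology; this sketch only shows the behavior I would like to obtain.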

It would also be nice to be able to update the code of each DataStream transformation independently; I guess you can't, because you have a single Flink program. Hence, if you want to modify the implementation of a single transformation, even if you still respect its input-output interface, you have to stop and restart the whole topology.

You could approximate that topology by defining a microservice per DataStream transformation and connecting the services with queues. You could also make this more scalable by running several server instances for each service behind a per-service load balancer. Or you could use Akka actors or something similar, which is basically equivalent to these groups of processes communicating through queues. But then you lose the high-level programming interface and all the other benefits of Flink, and the system infrastructure gets way more complicated than a single YARN cluster.

I was wondering whether it would be possible to split the DAG into several sub-DAGs, implement each sub-DAG as a Flink Streaming program, and then connect those programs without an intermediate external queue like Kafka, using instead the internal queues Flink already uses. In other words, is it possible to connect several Flink Streaming programs without using an external queue? That could be an interesting compromise, allowing different kinds of modularity (in functions, in separate physical components) and isolation levels.
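As an illustration of the queue-connected decomposition, here is a sketch with plain Python threads and an in-process queue; it is not Flink internals (whether Flink's own channels can be exposed between jobs like this is exactly my question), just the shape of two sub-DAGs that could be redeployed independently:

```python
import queue
import threading

def upstream_job(out_q):
    # Sub-DAG 1: computes E = h1(I) and hands records off on a queue.
    for i in range(5):
        out_q.put(i * 10)        # placeholder for h1
    out_q.put(None)              # end-of-stream marker

def downstream_job(in_q, results):
    # Sub-DAG 2: consumes E and computes F = h2(E); it could be stopped
    # and restarted with new code without touching sub-DAG 1.
    while (e := in_q.get()) is not None:
        results.append(e // 2)   # placeholder for h2

q = queue.Queue()
results = []
t1 = threading.Thread(target=upstream_job, args=(q,))
t2 = threading.Thread(target=downstream_job, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
```

With an external broker like Kafka the queue would also be durable across restarts; the in-memory queue here only shows the decoupling, not the delivery guarantees.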
This is quite a speculative problem, but I think situations like this are not uncommon in practice.
Thanks for your help in any case.

Greetings, 

Juan