http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/General-Data-questions-streams-vs-batch-tp6384.html
Hi guys,
I have a general question to get a better understanding of stream vs. finite data transformation. More specifically, I am trying to understand the lifecycle of entities during processing.
1) For example, in the case of streams: suppose we start with some key-value source and split it into 2 streams by key. Each stream modifies the entries' values, let's say adds some fields, and we want to merge them back later. How does that happen?
Does the merging point keep some finite buffer of entries? Based on time or size?
I understand that the right solution in this case would probably be to have one stream and get more performance by increasing parallelism, but what if I have 2 sources from the beginning?
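
To make (1) more concrete, here is a rough, hypothetical sketch of what I mean, using the Java DataStream API (the records, keys and the added field are made up):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SplitEnrichMergeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // pretend these are key-value entries, with the key before the colon
        DataStream<String> source = env.fromElements("a:1", "b:2", "a:3", "b:4");

        // "split into 2 streams by key" -- here simply two filters over the same source
        DataStream<String> streamA = source.filter(r -> r.startsWith("a:"));
        DataStream<String> streamB = source.filter(r -> r.startsWith("b:"));

        // each branch adds its own field to the entry
        DataStream<String> enrichedA = streamA.map(r -> r + ",enrichedBy=A");
        DataStream<String> enrichedB = streamB.map(r -> r + ",enrichedBy=B");

        // "merge it back later" -- is union the right operator here, and does the
        // merging point buffer entries (by time or by size) before emitting them?
        enrichedA.union(enrichedB).print();

        env.execute("split-enrich-merge sketch");
    }
}

(In my real case the two branches would come from two different sources rather than from filters over one source.)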
2) I also assume that in the case of streaming each entry is considered 'processed' once it has passed through the whole chain and been emitted into some sink, so afterwards it no longer consumes resources. Basically similar to what Storm does.
But in the case of finite data (DataSets): how much data will the system keep in memory? The whole set?
I probably have an example of a DataSet vs. stream 'mix': I need to *transform* a big but finite chunk of data, and I don't really need any joins, grouping or anything like that, so I never need to store the whole dataset in memory/storage. What would my choice be in this case?
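
For that 'mix' case, what I have in mind is essentially a map-only job, something like this hypothetical DataSet sketch (the paths and the transformation are made up):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class MapOnlyTransformSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // big but finite input, processed one record at a time --
        // no joins or grouping, so (I hope) nothing has to hold the whole set
        DataSet<String> input = env.readTextFile("hdfs:///path/to/big-but-finite-input");

        DataSet<String> transformed = input.map(record -> record.toUpperCase());

        transformed.writeAsText("hdfs:///path/to/output");
        env.execute("map-only transform sketch");
    }
}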
Thanks!
Konstantin