Hi Experts,
In batch computing, there are products like Azkaban or Airflow to manage batch job dependencies. With such a dependency management tool, we can build a large-scale system out of many small jobs. In stream processing it is not practical to put all the logic into one job, because the job would become too complicated and its state too large. I want to build a large-scale realtime system consisting of many Kafka sources and many streaming jobs, and the first question I run into is how to build the dependencies and connections between the streaming jobs.

The only method I can think of is a self-implemented retract Kafka sink, with each streaming job connected to the next by a Kafka topic. But because each job may fail and retry, the messages in the Kafka topic may look like this:

{"retract":"false", "id":"1", "amount":100}
{"retract":"false", "id":"2", "amount":200}
{"retract":"true",  "id":"1", "amount":100}
{"retract":"true",  "id":"2", "amount":200}
{"retract":"false", "id":"1", "amount":100}
{"retract":"false", "id":"2", "amount":200}

If the topic is "topic_1", the SQL in the downstream job may look like this:

select id, latest(amount) from topic_1 where retract = 'false' group by id

But this also builds up big state, because every id is kept in the GROUP BY. I wonder whether using Kafka to connect streaming jobs is a workable approach, and how to build a large-scale realtime system consisting of many streaming jobs.

Thanks a lot.

Best
Henry
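For illustration, here is a minimal sketch of what that downstream "latest value per id" step could look like in Flink's DataStream API, assuming a hypothetical AmountEvent POJO for the JSON messages above; the state TTL shown here is one way to keep the per-id state from growing without bound (the 24 h TTL and the field names are just assumptions):

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical POJO matching the JSON messages on topic_1.
class AmountEvent {
    public String id;
    public long amount;
    public boolean retract;
}

public class LatestAmountFunction
        extends KeyedProcessFunction<String, AmountEvent, AmountEvent> {

    private transient ValueState<Long> latestAmount;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("latest-amount", Long.class);
        // Expire idle keys after a while so the per-id state stays bounded.
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.hours(24))
                .cleanupFullSnapshot()
                .build();
        descriptor.enableTimeToLive(ttlConfig);
        latestAmount = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(AmountEvent event, Context ctx, Collector<AmountEvent> out)
            throws Exception {
        if (event.retract) {
            return; // drop retraction markers produced by upstream retries
        }
        Long previous = latestAmount.value();
        if (previous == null || previous != event.amount) {
            latestAmount.update(event.amount);
            out.collect(event); // forward only when the value for this id actually changed
        }
    }
}

The function would be applied after keyBy(e -> e.id) on the stream read from topic_1, so duplicates produced by a retried upstream job are collapsed before the downstream logic runs.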
Hi Henry,

Apache Kafka or another message queue like Apache Pulsar or AWS Kinesis is in general the most common way to connect multiple streaming jobs. The dependencies between streaming jobs are, in my experience, of a different nature though. For batch jobs it makes sense to schedule one after the other, or to model more complicated relationships. Streaming jobs all process data continuously, so the "coordination" happens on a different level.

To avoid duplication, you can use the Kafka exactly-once sink, but this comes with a latency penalty (transactions are only committed on checkpoint completion).

Generally, I would advise you to always attach meaningful timestamps to your records, so that you can use watermarking [1] to trade off between latency and completeness. These could also be used to identify late records (resulting from catch-up after recovery), which should be ignored by downstream jobs.

There are other users who assign a unique ID to every message going through their system and only use idempotent operations (set operations) within Flink, because messages are sometimes already duplicated before reaching the stream processor. For downstream jobs where an upstream job might duplicate records, this could be a viable, yet limiting, approach as well.

Hope this helps, and let me know what you think.

Cheers,

Konstantin
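As a concrete illustration of these two points (an exactly-once Kafka sink whose transactions commit on checkpoints, and watermarks derived from record timestamps), here is a rough sketch using the KafkaSource/KafkaSink builder API from recent Flink releases. The broker address, topic names, group id, checkpoint interval and out-of-orderness bound are placeholders, not values from the thread:

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToKafkaJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // With EXACTLY_ONCE, Kafka transactions are committed on checkpoint completion,
        // so the checkpoint interval (here 10 s) is roughly the added end-to-end latency.
        env.enableCheckpointing(10_000L);

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")   // assumed broker address
                .setTopics("TOPIC_ORIG")
                .setGroupId("job-1")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("TOPIC_JOB_1_SINK")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("job-1-sink")
                // Must not exceed the broker's transaction.max.timeout.ms.
                .setProperty("transaction.timeout.ms", "900000")
                .build();

        env.fromSource(
                        source,
                        // Allow 30 s of out-of-orderness; downstream jobs can drop records
                        // that arrive behind the watermark, e.g. during catch-up after recovery.
                        WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(30)),
                        "TOPIC_ORIG source")
                // ... the actual job logic would go here ...
                .sinkTo(sink);

        env.execute("Job_1");
    }
}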
Hi Knauf,
The solution I can think of to coordinate between different streaming jobs is the following. Say there are two streaming jobs, Job_1 and Job_2.

Job_1: receives data from the original Kafka topic, TOPIC_ORIG for example, and sinks the data to another Kafka topic, TOPIC_JOB_1_SINK for example. It should be mentioned that:
① I implement a retract Kafka sink.
② I do not use the Kafka exactly-once sink.
③ Every record in TOPIC_JOB_1_SINK has one unique key.
④ Each record with the same key is sent to the same Kafka partition (see the sketch below).

Job_2: receives data from TOPIC_JOB_1_SINK, first groups by the unique key and takes the latest value, then continues with the logic of Job_2, and finally sinks the data to the final sink (ES, HBase, MySQL for example). I group by the unique key first because Job_1 may fail and retry, so some dirty data may end up in TOPIC_JOB_1_SINK.

So the overview is:

Job_1: TOPIC_ORIG -> Logic_Job_1 -> TOPIC_JOB_1_SINK
Job_2: TOPIC_JOB_1_SINK -> GROUP_BY_UNIQUE_KEY_GET_LATEST -> Logic_Job_2 -> FINAL_JOB_2_SINK

Would you please help review the solution? If there are better solutions, kindly let me know about it, thank you.

Best
Henry
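To make point ④ concrete: with a KafkaSink like the one sketched earlier, one way to route all records with the same unique key to the same partition is to write that key as the Kafka message key, so Kafka's default partitioner does the routing. A rough sketch, reusing the hypothetical AmountEvent POJO from the first example (topic name and JSON layout are just assumptions):

import java.nio.charset.StandardCharsets;

import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

// Writes each record with its unique id as the Kafka message key, so all records
// with the same key end up in the same partition.
public class KeyedRetractSerializationSchema
        implements KafkaRecordSerializationSchema<AmountEvent> {

    private final String topic;

    public KeyedRetractSerializationSchema(String topic) {
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(
            AmountEvent element, KafkaSinkContext context, Long timestamp) {
        // JSON matching the example messages: {"retract":"false","id":"1","amount":100}
        String json = String.format(
                "{\"retract\":\"%s\",\"id\":\"%s\",\"amount\":%d}",
                element.retract, element.id, element.amount);
        byte[] key = element.id.getBytes(StandardCharsets.UTF_8);
        byte[] value = json.getBytes(StandardCharsets.UTF_8);
        return new ProducerRecord<>(topic, key, value);
    }
}

The schema would be passed to the sink via setRecordSerializer(new KeyedRetractSerializationSchema("TOPIC_JOB_1_SINK")), and Job_2 could then apply something like the LatestAmountFunction from the first sketch after keyBy(e -> e.id) to collapse the dirty data before its own logic.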