Flink pipeline;

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink pipeline;

Aissa Elaffani
Hello Guys,
I am new to the real-time streaming field, and I am trying to build a BIG DATA architecture for processing real-time streaming. I have some sensors that generate data in json format, they are sent to Apache kafka cluster then i want to consume them with Apache flinkin ordre to do some aggregation. The probleme is that the data coming from kafka contains " the sensor ID , the equipement ID in wiche it is installed, and the status of the equipment..", knowing that the each sensor is installed in an equipement, and the equipement is linked to an workshop that it self linked to factory. So i need an other data source for the workshop and factories, because i want to do aggregation on factories, and the data sent by the sensors contains just the sensorIDand the equipementID...
Guys I am new to the this field, and i am stuck in this. Can someone please help me to achieve my goal, and explain to me how can i do that. And how can i do this complexed aggregation??And if there is any optmisation to do? Sorry for disturbing you !!!
AISSA

Reply | Threaded
Open this post in threaded view
|

Re: Flink pipeline;

hemant singh
You will have to enrich the data coming in for eg- { "equipment-id" : "1-234", "sensor-id" : "1-vcy", ..... } . Since you will most likely have a keyedstream based on equipment-id+sensor-id or equipment-id, you can have a control stream with data about equipment to workshop/factory mapping something like this - { "equipment-id" : "1-234", "workshop-id" : "1-234","factory-id" : "1-vcy", ..... } and then you can use CoProcess function to join these two streams to have the enriched stream. Once you have the enriched stream you can do aggregations at level you want to.
You can refer here [1] or [2] for some sample and reference.


Hemant

On Wed, May 6, 2020 at 3:10 AM Aissa Elaffani <[hidden email]> wrote:
Hello Guys,
I am new to the real-time streaming field, and I am trying to build a BIG DATA architecture for processing real-time streaming. I have some sensors that generate data in json format, they are sent to Apache kafka cluster then i want to consume them with Apache flinkin ordre to do some aggregation. The probleme is that the data coming from kafka contains " the sensor ID , the equipement ID in wiche it is installed, and the status of the equipment..", knowing that the each sensor is installed in an equipement, and the equipement is linked to an workshop that it self linked to factory. So i need an other data source for the workshop and factories, because i want to do aggregation on factories, and the data sent by the sensors contains just the sensorIDand the equipementID...
Guys I am new to the this field, and i am stuck in this. Can someone please help me to achieve my goal, and explain to me how can i do that. And how can i do this complexed aggregation??And if there is any optmisation to do? Sorry for disturbing you !!!
AISSA

Reply | Threaded
Open this post in threaded view
|

Re: Flink pipeline;

Leonard Xu
Hi Aissa

Looks like your requirements is to enrich a real stream data(from kafka) with dimension data(your case will like: {sensor_id, equipment_id, workshop_id, factory_id} ), you can achieve your purpose by Flink DataStream API or just use FLINK SQL. I think use pure SQL will be esaier if your dimension data is in 
DB(like mysql, postgresql, hbase which can be treated as temporal table in flink),   and you can use first join a temporal table[1] to enrich your real stream and 
then write the aggregation sql [2] to finish your application.


Best,
Leonard Xu
[2] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/queries.html#aggregations

在 2020年5月6日,21:39,hemant singh <[hidden email]> 写道:

You will have to enrich the data coming in for eg- { "equipment-id" : "1-234", "sensor-id" : "1-vcy", ..... } . Since you will most likely have a keyedstream based on equipment-id+sensor-id or equipment-id, you can have a control stream with data about equipment to workshop/factory mapping something like this - { "equipment-id" : "1-234", "workshop-id" : "1-234","factory-id" : "1-vcy", ..... } and then you can use CoProcess function to join these two streams to have the enriched stream. Once you have the enriched stream you can do aggregations at level you want to.
You can refer here [1] or [2] for some sample and reference.


Hemant

On Wed, May 6, 2020 at 3:10 AM Aissa Elaffani <[hidden email]> wrote:
Hello Guys,
I am new to the real-time streaming field, and I am trying to build a BIG DATA architecture for processing real-time streaming. I have some sensors that generate data in json format, they are sent to Apache kafka cluster then i want to consume them with Apache flinkin ordre to do some aggregation. The probleme is that the data coming from kafka contains " the sensor ID , the equipement ID in wiche it is installed, and the status of the equipment..", knowing that the each sensor is installed in an equipement, and the equipement is linked to an workshop that it self linked to factory. So i need an other data source for the workshop and factories, because i want to do aggregation on factories, and the data sent by the sensors contains just the sensorIDand the equipementID...
Guys I am new to the this field, and i am stuck in this. Can someone please help me to achieve my goal, and explain to me how can i do that. And how can i do this complexed aggregation??And if there is any optmisation to do? Sorry for disturbing you !!!
AISSA