Where to put live model and business logic in Hadoop/Flink BigData system


Where to put live model and business logic in Hadoop/Flink BigData system

palle
Hi there.

We are putting together some BigData components to handle a large amount of incoming data from different log files and perform some analysis on it.

All data fed into the system will go into HDFS. We plan on using Logstash, Kafka and Flink to bring the data from the log files into HDFS. All the data in HDFS we will designate as our historic data, and we will use batch processing (probably Flink, but it could also be Hadoop MapReduce) to create some aggregate views of it. These views we will probably store in HBase or MongoDB.
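
For the ingestion step we have something like the following Flink job in mind (just a rough sketch: the topic name, broker address and HDFS path are made up, and it assumes a Kafka 0.9 broker, Flink's FlinkKafkaConsumer09 and the rolling file sink):

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.RollingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class LogIngestionJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // made-up broker address
        props.setProperty("group.id", "hdfs-ingest");

        // Log lines pushed by Logstash into a Kafka topic (topic name is made up)
        DataStream<String> logs = env.addSource(
                new FlinkKafkaConsumer09<>("logs", new SimpleStringSchema(), props));

        // Append the raw lines to HDFS; this becomes the "historic data"
        logs.addSink(new RollingSink<String>("hdfs:///data/raw-logs"));

        env.execute("Log ingestion into HDFS");
    }
}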

These views of the historic data (also called batch views in the Lambda Architecture, if any of you are familiar with that) will be used by the live model in the system. The live model is fed with the same data (through Kafka), and when it detects a certain value in the incoming data, it will perform some analysis using the HBase/MongoDB views of the historic data.

Now, could anyone share some knowledge about where it would be possible to implement such a live model, given the components we plan on using? Apart from the business logic that performs the analysis, our live model will at all times also hold a Java object structure of maybe 5-10 Java collections (maps, lists) containing approximately 5 million objects.

So, where is it possible to implement our live model? Can we do this in Flink, or with another component within the Hadoop big data ecosystem?

Thanks.

/Palle

RE: Where to put live model and business logic in Hadoop/Flink BigData system

Lohith Samaga M
Hi Palle,
I am a beginner in Flink.

However, I can say something about your other questions:
1. It is better to use Spark to create the aggregate views; it is a lot faster than classic MapReduce. You could use either batch or streaming mode in Spark, depending on your needs.
2. If your aggregate data is in tabular format, you could store it in Hive.
3. For your live model, you could use either Spark Streaming (micro-batches) or Storm (which processes individual tuples). It is easy to put business logic into Storm bolts that work on each tuple; see the sketch after this list.
4. But please take care of latency (and other issues) when the live model accesses the aggregate data. Your model should be able to handle the latency of these lookups without building up a backlog of streaming data, which could otherwise lead to Storm failing tuples.
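
A minimal sketch of such a bolt (assuming Storm 1.x package names; the field names and the trigger threshold are made up, and a plain HashMap stands in for the real HBase/Hive lookup):

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AnalysisBolt extends BaseBasicBolt {

    // Placeholder for the aggregate view; in practice this would be a lookup
    // into HBase/Hive, with the connection opened once per worker in prepare(),
    // not once per tuple.
    private transient Map<String, Double> historicView;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        historicView = new HashMap<>();
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String key = input.getStringByField("key");
        double value = input.getDoubleByField("value");
        if (value > 100.0) { // the "certain value" trigger; the threshold is invented
            // A real lookup against HBase/Hive adds latency here (see point 4)
            double historic = historicView.getOrDefault(key, 0.0);
            collector.emit(new Values(key, value, historic));
        }
        // BaseBasicBolt acks the tuple automatically when execute() returns normally
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("key", "value", "historicAggregate"));
    }
}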

        Hope this helps.


Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
 

Re: Where to put live model and business logic in Hadoop/Flink BigData system

Deepak Sharma
In reply to this post by palle
I see the flow as below:

Logstash -> log stream -> Flink -> Kafka -> Live Model
                                                |
                                          MongoDB/HBase

The live model would again be a Flink streaming job consuming from Kafka.
There you analyze the incoming stream for the certain value, and once you find it, you read the historical view and do the analysis in Flink itself.
For your Java objects, I guess you can use the Checkpointed interface (I have not used it myself yet); a rough sketch of what I mean is below.
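
Something like this, maybe (a sketch only: the Tuple2 input, the threshold and the single HashMap stand in for your real object structure, and it uses the Checkpointed interface as it exists in Flink 1.0):

import java.util.HashMap;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.checkpoint.Checkpointed;
import org.apache.flink.util.Collector;

// Input: (key, measurement) pairs coming from Kafka. Output: (key, count) whenever the trigger fires.
public class LiveModelFunction
        implements FlatMapFunction<Tuple2<String, Double>, Tuple2<String, Long>>,
                   Checkpointed<HashMap<String, Long>> {

    private static final double THRESHOLD = 100.0;           // the "certain value"; invented here
    private HashMap<String, Long> model = new HashMap<>();   // stands in for the 5-10 collections

    @Override
    public void flatMap(Tuple2<String, Double> event, Collector<Tuple2<String, Long>> out) {
        model.merge(event.f0, 1L, Long::sum);                // keep the in-memory model current

        if (event.f1 > THRESHOLD) {
            // Here you would also read the historic view from HBase/Mongo and combine it
            // with the in-memory state; that lookup is left out of the sketch.
            out.collect(Tuple2.of(event.f0, model.get(event.f0)));
        }
    }

    @Override
    public HashMap<String, Long> snapshotState(long checkpointId, long checkpointTimestamp) {
        return model;     // persisted with every successful checkpoint
    }

    @Override
    public void restoreState(HashMap<String, Long> state) {
        model = state;    // handed back after a failure/restart
    }
}

You would apply it right after the Kafka source, e.g. kafkaStream.flatMap(new LiveModelFunction()).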

Thanks
Deepak



Re: Where to put live model and business logic in Hadoop/Flink BigData system

Fabian Hueske
Hi Palle,

this sounds indeed like a good use case for Flink.

Depending on the complexity of the aggregated historical views, you can implement a Flink DataStream program that builds the views on the fly, i.e., you do not need to periodically trigger MR/Flink/Spark batch jobs to compute them. Instead, you can use the concept of windows to group data by time (and other attributes) and compute the aggregates (depending on the type of aggregate) on the fly while the data is arriving; a minimal sketch is below.
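
Just to make that concrete (the topic name, the comma-separated log format and the 1-hour window are assumptions, and print() stands in for an HBase/Mongo sink):

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ContinuousViewJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // assumed broker address
        props.setProperty("group.id", "view-builder");

        env.addSource(new FlinkKafkaConsumer09<>("logs", new SimpleStringSchema(), props))
           .map(new MapFunction<String, Tuple2<String, Long>>() {
               @Override
               public Tuple2<String, Long> map(String line) {
                   // assumed comma-separated log lines; the first field is the grouping key
                   return Tuple2.of(line.split(",")[0], 1L);
               }
           })
           .keyBy(0)
           .timeWindow(Time.hours(1))
           .sum(1)    // hourly count per key, emitted as windows close
           .print();  // stand-in for the sink that writes the view to HBase/Mongo

        env.execute("Continuous historic view");
    }
}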

The live model can also be computed by Flink. You can access the historic data from an external store (HBase/Mongo) and also cache parts of it in the Flink job to achieve lower latency; a rough sketch of such a lookup with a cache in front follows below. It is also possible to store the live model in your Flink job and query it from there (see this blog post [1], section "Winning Twitter Hack Week: Eliminating the key-value store bottleneck"). Flink will partition the data, so it should be able to handle the data sizes you mentioned.
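
A minimal sketch of the external store with a cache in front (assuming the HBase 1.x client; the table, column family and qualifier names as well as the LRU size are made up):

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Enriches a triggered (key, measurement) event with its historic aggregate from HBase,
// keeping a small LRU cache inside the operator to cut lookup latency.
public class HistoricEnricher extends RichMapFunction<Tuple2<String, Double>, Tuple2<String, Long>> {

    private transient Connection connection;
    private transient Table viewTable;
    private transient Map<String, Long> cache;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Picks up hbase-site.xml from the classpath
        connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        viewTable = connection.getTable(TableName.valueOf("historic_views"));
        cache = new LinkedHashMap<String, Long>(10_000, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > 10_000;  // simple LRU bound; tune to your memory budget
            }
        };
    }

    @Override
    public Tuple2<String, Long> map(Tuple2<String, Double> event) throws Exception {
        Long aggregate = cache.get(event.f0);
        if (aggregate == null) {
            Result row = viewTable.get(new Get(Bytes.toBytes(event.f0)));
            byte[] cell = row.getValue(Bytes.toBytes("v"), Bytes.toBytes("agg"));
            aggregate = (cell == null) ? 0L : Bytes.toLong(cell);
            cache.put(event.f0, aggregate);
        }
        return Tuple2.of(event.f0, aggregate);
    }

    @Override
    public void close() throws Exception {
        if (viewTable != null) viewTable.close();
        if (connection != null) connection.close();
    }
}

Whether a plain LinkedHashMap is enough or you want something like Guava's cache (with expiry) depends on how often the views are recomputed.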

Best, Fabian

[1] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
