GSoC project proposal: Query optimisation layer for Flink Streaming

bwepngong
CONTENTS DELETED
The author has deleted this message.
Re: GSoC project proposal: Query optimisation layer for Flink Streaming

rmetzger0
Just a quick ping on this for the streaming folks: The deadline for the proposal submissions is Friday, so the GSoC applicants need to get our feedback asap.
The student asked me today in the #flink channel whether we can review this proposal.


I have the following comments regarding the proposal:
- I don't exactly understand how you've chosen the dates for the milestones. According to https://www.google-melange.com/gsoc/events/google/gsoc2015 the coding phase begins on 25 May and ends on 21 August. It seems that you are proposing to start the implementation before the official GSoC start date.
I would suggest aligning the milestones with the official GSoC timeline (or at least justifying in the proposal why you're deviating from it).
- Can you explain a bit more how you are planning to do operator reordering and how the "rete algorithm" works? Some background on why you've chosen that algorithm would also be helpful.



On Sun, Mar 22, 2015 at 4:44 AM, Wepngong Benaiah <[hidden email]> wrote:
Hello,
I came up with the following proposal, which I believe needs a lot of review. I would appreciate it if you could help me make appropriate corrections before the submission deadline.
Thanks @gyfora, @pariscarbone


GSoC project: Query optimisation layer for Flink Streaming

NAME: Wepngong Ngeh Benaiah

EMAIL: [hidden email]

SYNOPSIS

I would very much like to participate in GSoC 2015 with Apache, working on Flink Streaming, as my way of contributing to open source.

Flink Streaming currently supports only a limited set of optimisations for streaming programs, such as operator chaining and several optimisations for windowing computations.

Also, there is currently no optimizer as a separate module on its own. Though operator chaining improves performance, alot more has to be done to further improve system performance.

My project will be to implement a Query Optimisation layer for Flink Streaming. The layer would perform statistical analysis of the stream graph and optimise it, which should bring major system performance improvements.

HOW WOULD THE COMMUNITY BENEFIT FROM THIS?

Much of “big data” is received in real time, and is most valuable at its time of arrival. For example, a social network may want to identify trending conversation topics within minutes, an ad provider may want to train a model of which users click a new ad, and a service operator may want to mine log files to detect failures within seconds.

Big Data analytics is gaining ground in all domains of industry today and Flink is the solution. Reducing overheads and system bottlenecks will improve the throughput that companies get from their jobs, and many more people will want to use and support the project.

ABOUT ME

I am an IT enthusiast and 3rd year Software Engineering student at the University of Buea, Cameroon, pursuing a Bachelor of Engineering in Computer Engineering. I have more than 2 years of experience programming in Java, and I have also worked with MySQL, PostgreSQL, web application development in PHP (Laravel and Yii frameworks), the C programming language (3 years), Linux system administration and, recently, stream processing. I'm currently in the 2nd semester of my 3rd year and will be on an internship at Orange Cameroon, a mobile telecommunications company, by September 2015.

I have contributed to https://github.com/ch3ck/sams where work on the student attendance management system is still going on, https://github.com/NetLogo/NetLogo and.

Finally, this is my GitHub account: https://github.com/bwepngong and my Google Plus profile: https://plus.google.com/+WepngongBenaiahNgeh

I am finishing my B.Eng at the University of Buea in Cameroon in December 2016.

Milestones

30th March – 27th April 2015

1. Understand how Flink Streaming works: look into the stream graph and the streaming job graph builder, and start doing simpler things with Flink.

2. Design and analysis of the entire system.

3. Ask questions in mailing lists for clarifications.

27th April – 26th June (mid-term)

Implement

  1. OPERATOR REORDERING: change the order in which operators appear in the stream graph to eliminate overheads.

  2. Perform unit testing for this algorithm.
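
As an illustration of what operator reordering can mean in the simplest case, here is a minimal Java sketch. It is not part of the proposal and does not use any Flink API: the classes `FilterOp` and `OperatorReorderingSketch`, the per-record costs and the selectivities are all hypothetical. It assumes a linear chain of independent filters that can be freely swapped, and it orders them by the classic rank `cost / (1 - selectivity)`, so that cheap, highly selective filters run first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a filter operator; none of these names are Flink API.
final class FilterOp {
    final String name;
    final double costPerRecord; // assumed average cost of processing one record
    final double selectivity;   // assumed fraction of records that pass, in [0, 1]

    FilterOp(String name, double costPerRecord, double selectivity) {
        this.name = name;
        this.costPerRecord = costPerRecord;
        this.selectivity = selectivity;
    }
}

final class OperatorReorderingSketch {

    // Expected cost per input record of running the filters in the given order.
    static double expectedCost(List<FilterOp> chain) {
        double cost = 0.0;
        double surviving = 1.0; // fraction of records reaching the next filter
        for (FilterOp op : chain) {
            cost += surviving * op.costPerRecord;
            surviving *= op.selectivity;
        }
        return cost;
    }

    // Rank used for ordering: lower rank runs earlier.
    // A filter that drops nothing (selectivity 1) is pushed to the end.
    static double rank(FilterOp op) {
        if (op.selectivity >= 1.0) {
            return Double.POSITIVE_INFINITY;
        }
        return op.costPerRecord / (1.0 - op.selectivity);
    }

    // Returns a reordered copy; valid only if the filters are independent
    // and commute, which is an assumption of this sketch.
    static List<FilterOp> reorder(List<FilterOp> chain) {
        List<FilterOp> result = new ArrayList<>(chain);
        result.sort(Comparator.comparingDouble(OperatorReorderingSketch::rank));
        return result;
    }

    public static void main(String[] args) {
        List<FilterOp> chain = Arrays.asList(
                new FilterOp("regexMatch", 5.0, 0.5),
                new FilterOp("blacklistCheck", 2.0, 0.9),
                new FilterOp("rangeCheck", 1.0, 0.1));
        List<FilterOp> best = reorder(chain);
        for (FilterOp op : best) {
            System.out.println(op.name);
        }
        System.out.println("cost before: " + expectedCost(chain));
        System.out.println("cost after:  " + expectedCost(best));
    }
}
```

A real optimiser would also have to check that swapping two operators preserves the program's semantics (for example, a filter can only move ahead of a map if it does not read fields that the map produces), and it would need a way to estimate costs and selectivities at runtime.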

27th June – 13th August

Implement

  1. REDUNDANCY ELIMINATION: eliminate redundant computations by analysing the streaming graph with the RETE algorithm and removing unnecessary duplicate operators. When several operators depend on the same operator, compute that operator only once and share its result among them.

2. Perform unit testing
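
The sharing idea behind redundancy elimination can be sketched as follows. This is not the RETE algorithm itself, only the node-sharing principle that RETE networks rely on, implemented as plain common-subexpression elimination over an operator graph. The classes `OpNode` and `RedundancyEliminationSketch` are hypothetical and not Flink API, and the sketch assumes that the `function` string fully describes a deterministic operator, so that two nodes with the same function and the same inputs always produce the same output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical operator node; none of these names are Flink API.
final class OpNode {
    final int id;
    final String function;      // assumed to fully describe a deterministic operator
    final List<Integer> inputs; // ids of the upstream nodes

    OpNode(int id, String function, List<Integer> inputs) {
        this.id = id;
        this.function = function;
        this.inputs = inputs;
    }
}

final class RedundancyEliminationSketch {

    // Nodes must be given in topological order (inputs before consumers).
    // Returns a map from each original node id to the id of the node that
    // should actually be executed in its place.
    static Map<Integer, Integer> canonicalize(List<OpNode> nodes) {
        Map<String, Integer> bySignature = new HashMap<>();
        Map<Integer, Integer> canonical = new LinkedHashMap<>();
        for (OpNode node : nodes) {
            // Rewrite inputs to their canonical ids first, so that duplicates
            // of duplicates are detected as well.
            List<Integer> canonicalInputs = new ArrayList<>();
            for (int input : node.inputs) {
                Integer c = canonical.get(input);
                if (c == null) {
                    throw new IllegalArgumentException(
                            "node " + node.id + " listed before its input " + input);
                }
                canonicalInputs.add(c);
            }
            String signature = node.function + canonicalInputs;
            Integer existing = bySignature.putIfAbsent(signature, node.id);
            canonical.put(node.id, existing == null ? node.id : existing);
        }
        return canonical;
    }

    public static void main(String[] args) {
        List<OpNode> graph = Arrays.asList(
                new OpNode(0, "source(logs)", Collections.<Integer>emptyList()),
                new OpNode(1, "map(parse)", Arrays.asList(0)),
                new OpNode(2, "map(parse)", Arrays.asList(0)),      // same work as node 1
                new OpNode(3, "filter(errors)", Arrays.asList(1)),
                new OpNode(4, "filter(errors)", Arrays.asList(2)),  // same work as node 3
                new OpNode(5, "count", Arrays.asList(3)),
                new OpNode(6, "sink(alerts)", Arrays.asList(4)));
        System.out.println(canonicalize(graph));
    }
}
```

Every node whose canonical id differs from its own id can be dropped, with its consumers rewired to the canonical node. Stateful or non-deterministic operators would need extra care before being merged this way.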

13th August – 21st August

  1. Integrate modules and do system testing

22nd August – 28th August (final evaluation)

Polish testing and get the required code samples ready

29th August – 8th November

  1. More testing

  2. Code documentation

  3. Debugging





Re: GSoC project proposal: Query optimisation layer for Flink Streaming

Márton Balassi-2
In reply to this post by bwepngong
Thanks for the proposal Wepngong and for the ping Robert. Sorry for my late reply. 

I like the general concept, and I do think that this topic is really "Flink-ish" in the sense of focusing on the optimization. Let me add some comments:

Synopsis:

   * By reducing the system overhead, the throughput of the job increases, thus potentially reducing the amount of resources needed to carry out the task. This is usually one of the main motivating factors for industry.
   * Be careful how you phrase operator chaining's effect on performance. It might be beneficial, but it can also be counter-productive, because in the typical case you trade an available thread for getting rid of network latency. Let us also be aware that you cannot beat carefully hand-optimized code in the general case. :)
   * Typo: alot -> a lot

How would the community benefit from this:

   * I would omit the "much of the big data is received in real time" part and make the "time is value" part more prominent. When you receive the data ultimately depends on your infrastructure and ingestion tools. :) Real-time ingestion is getting more popular today.
   * It is not the social network that tries to identify the trending conversations, but the provider behind such a service that tries to expose that as a feature to its users, be they the users of the network or the companies trying to gain insight from the network. It might sound a bit over-zealous, but it is nice to make the distinction. Someone is trying to extract some information; the social network itself does not do that.
   * "Flink is the solution": I personally really appreciate that quote, but the Storm, Spark, Samza etc. guys will also have their fair share. Be careful with such statements, because they might give the impression that you have not done your homework.

About me:
   
  * Nice, I like it. Good awareness for also mentioning the Google+ account. :)
  * Could you be more specific about your stream processing experience?

Milestones:

  * I feel that operator reordering in general is a bit less than 6 weeks of work. ;) Be a bit more specific and add more stuff there.
  * Great that you have specifically included testing in different phases

Please add a general paragraph justifying why you chose these particular optimizations, how they would affect the system, and what you expect from them.

Keep up the good work, regards:

Marton


On Tue, Mar 24, 2015 at 10:01 PM, Robert Metzger <[hidden email]> wrote:
[...]

Re: GSoC project proposal: Query optimisation layer for Flink Streaming

bwepngong
In reply to this post by rmetzger0
CONTENTS DELETED
The author has deleted this message.