Hi list,

I'm new to Flink, and I find this project very interesting. I have experience with Apache Spark, and from what I've seen so far, Flink provides an API at a similar abstraction level, but based on single-record processing instead of batch processing. I've read on Quora that Flink extends stream processing to batch processing, while Spark extends batch processing to streaming. Therefore I find Flink especially attractive for low-latency stream processing.

Anyway, I would appreciate it if someone could point me to a list of hardware requirements for the slave nodes in a Flink cluster, something along the lines of https://spark.apache.org/docs/latest/hardware-provisioning.html. Spark is known for having quite high minimal memory requirements (8 GB RAM and 8 cores minimum), and I was wondering whether that is also the case for Flink. Lower memory requirements would be very interesting for building small Flink clusters for educational purposes, or for small projects.

Apart from that, I wonder if there is some blog post by the community about transitioning from Spark to Flink. I think it could be interesting, as there are some similarities in the APIs, but also deep differences in the underlying approaches. I was thinking of something like Breeze's cheat sheet comparing its matrix operations with those available in Matlab and NumPy (https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet), or like http://rosettacode.org/wiki/Factorial. Just an idea, anyway.

Also, any pointer to an online course, book or training for Flink besides the official programming guides would be much appreciated.

Thanks in advance for your help.

Greetings,

Juan
Answering myself: I have found some nice training material at http://dataartisans.github.io/flink-training. There are even videos on YouTube for some of the slides.
The third lecture, http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html, more or less corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU, although not exactly, and there are more lessons at http://dataartisans.github.io/flink-training covering stream processing and the Table API, for which I haven't found videos. Does anyone have pointers to the missing videos?

Greetings,

Juan

2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <[hidden email]>:
Hi Juan,

Flink is quite nimble with hardware requirements; people have run it on old-ish laptops and also on the largest instances available from cloud providers. I will let others chime in with more details.

I am not aware of anything along the lines of the cheat sheet that you mention. If you actually try to do this, I would love to see it, and it might be useful to others as well. Both systems use similar abstractions at the API level (i.e., parallel collections), so if you stay true to the functional paradigm and don't try to "abuse" the system by exploiting knowledge of its internals, things should be straightforward. This applies to the batch APIs; the streaming API in Flink follows a true streaming paradigm, where you work with unbounded streams of records and operators over those streams.

Funny that you ask about a video for the DataStream slides. There is a Flink training happening as we speak, and a video is being recorded right now :-) Hopefully it will be made available soon.

Best,
Kostas

On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <[hidden email]> wrote:
Hi Juan,

I think the recommendations in the Spark guide are quite good, and they are similar to what I would recommend for Flink as well.

Depending on the workloads you are interested in running, you can certainly use Flink with less than 8 GB per machine. I think you can start Flink TaskManagers with 500 MB of heap space and they'll still be able to process some GB of data. Anything above 2 GB is probably good enough for some initial experimentation (again, depending on your workloads, network, disk speed, etc.).

On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas <[hidden email]> wrote:
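As a rough sketch of the kind of setup described above, the heap sizes can be lowered in `conf/flink-conf.yaml`. Note that the exact key names here are an assumption (they have varied across Flink versions), so check the configuration documentation for your release:

```yaml
# conf/flink-conf.yaml -- sketch for a small experimental cluster.
# Key names are from older Flink releases and may differ in yours.
jobmanager.heap.mb: 256         # the master needs little memory for small jobs
taskmanager.heap.mb: 512        # ~500 MB heap per worker, as suggested above
taskmanager.numberOfTaskSlots: 2
```

With settings in this range, a worker can still spill to disk and process a few GB of data, just not entirely in memory.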
We're also working on a BigPetStore implementation for Flink, which will help onboard Spark/MapReduce folks. I have prototypical code that runs a simple job in memory at https://github.com/bigpetstore/bigpetstore-flink; contributions are welcome. Right now there is a serialization error. On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger <[hidden email]> wrote:
jay vyas
I have actually run Flink nodes with 50 MB of memory and processed multiple gigabytes, but that is truly a toy setup for experimentation.

As Robert said, a mini-cluster with two local workers (each around 300-400 MB of memory) plus a master node (200-300 MB) gives you about 1 GB of total required memory and should allow you to go through many gigabytes (although naturally not in an in-memory fashion in that setting, but involving disk I/O).

On Wed, Sep 2, 2015 at 2:50 PM, Robert Metzger <[hidden email]> wrote:
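The memory budget above can be checked with quick back-of-the-envelope arithmetic; the per-process figures below are the rough midpoints from the description, not official Flink defaults:

```python
# Rough memory budget for a local Flink mini-cluster, using the
# approximate per-process figures from the message above.
worker_mb = 350    # each local worker: ~300-400 MB
master_mb = 250    # master node: ~200-300 MB
num_workers = 2

total_mb = num_workers * worker_mb + master_mb
print(total_mb)    # 950 MB, i.e. roughly 1 GB in total
```

Anything beyond that budget is handled by spilling to disk rather than in memory.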
Hey Jay,

How can I reproduce the error?

On Wed, Sep 2, 2015 at 2:56 PM, jay vyas <[hidden email]> wrote:
Just running the main class is sufficient.
@Jay: I've looked into your code, but I was not able to reproduce the issue. I'll start a new discussion thread on the user@flink list for the Flink-BigPetStore discussion. I don't want to take over Juan's hardware-requirements discussion ;) On Wed, Sep 2, 2015 at 3:01 PM, Jay Vyas <[hidden email]> wrote:
Hi Kostas,

Thanks a lot for your answer. It's nice to know there are more training videos on the way; they will be on my watch list. I guess you'll be using the data Artisans channel for the new videos too.

Greetings,

Juan

2015-09-02 14:30 GMT+02:00 Kostas Tzoumas <[hidden email]>:
Hi Robert and Jay,

Thanks for your answers. The BigPetStore jobs could indeed be used as a Rosetta Code for Flink and Spark.

Regarding the memory requirements, that is very good news to me: just 2 GB of RAM is certainly a modest amount of memory; you could even use some single-board computers for that. Are there any reference load-test programs or benchmarks that can be used to compare different deployments of Flink? Maybe the BigPetStore implementation mentioned by Jay could be used for that, and also to compare the performance of Flink with other systems like Spark or Hadoop MapReduce, which I understand is the current goal.

Greetings,

Juan

2015-09-02 14:56 GMT+02:00 jay vyas <[hidden email]>:
> Answering to myself, I have found some nice training material at
> http://dataartisans.github.io/flink-training.

Excellent resources! Somehow, I managed not to stumble over them by myself - either I was blind, or they are well hidden... :)

Best,
-Stefan
Well hidden! I have now added a link to the menu of http://data-artisans.com/. This material is provided for free by data Artisans, but it is not part of the official Apache Flink project. On Thu, Sep 3, 2015 at 2:20 PM, Stefan Winterstein <[hidden email]> wrote: