http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/DISCUSS-Towards-a-leaner-flink-dist-tp25615p25684.html
There are some points where a leaner approach could help.
There are many libraries and connectors currently being added to Flink, which makes the "include all" approach not entirely feasible in the long run:
- Connectors: For a proper experience with the Shell/CLI (for example for SQL) we need a lot of fat connector jars.
These often come in multiple versions, which alone account for 100s of MBs of connector jars.
- The pre-bundled FileSystems are also on the verge of adding 100s of MBs themselves.
- The metric reporters are growing bit by bit as well.
The following could be a compromise:
The flink-dist would include:
- the core flink libraries (core, apis, runtime, etc.)
- yarn / mesos etc. adapters
- examples (the examples should be a small set of self-contained programs without additional dependencies)
- default logging
- default metric reporter (jmx)
- shells (scala, sql)
The flink-dist would NOT include the following libs (these would be offered for individual download):
- Hadoop libs
- the pre-shaded file systems
- the pre-packaged SQL connectors
- additional metric reporters
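As a rough sketch, the "pick-what-you-need" workflow described above could look like the following. Note that the jar name, directory layout, and download step are purely illustrative assumptions, not actual Flink artifacts or URLs:

```shell
#!/bin/sh
# Hypothetical sketch of the proposed lean-dist workflow:
# optional components live outside flink-dist and the user drops
# them into lib/ to activate them. All names here are placeholders.
set -e

# Simulate a lean flink-dist layout (core libs only in lib/).
mkdir -p flink-dist/lib flink-dist/opt downloads

# Under the proposal, a pre-shaded filesystem would be fetched
# separately, e.g. something like:
#   curl -O https://<download-site>/flink-s3-fs-hadoop-<version>.jar
# Here we just create a placeholder file to illustrate the step.
touch downloads/flink-s3-fs-hadoop.jar

# "Installing" the optional component is then a plain copy into lib/,
# where it lands on the cluster classpath at startup.
cp downloads/flink-s3-fs-hadoop.jar flink-dist/lib/

ls flink-dist/lib
```

The appeal of this model is that the activation step stays trivial (a copy into lib/), while the base download no longer carries every connector, filesystem, and reporter.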
Thanks Chesnay for raising this discussion thread. I think there are 3 major use scenarios for the flink binary distribution:
1. Use it to set up a standalone cluster
2. Use it to experience features of flink, such as via the scala-shell or sql-client
3. Downstream projects use it to integrate with their systems
I did a size estimation of the flink dist folder: the lib folder takes around 100 MB and the opt folder around 200 MB. Overall I agree to make a thin flink dist.
So the next problem is which components to drop. I checked the opt folder, and I think the filesystem components and metrics components could be moved out, because they are pluggable components and are only used in scenario 1 I think (setting up a standalone cluster). Other components like flink-table, flink-ml, and flink-gelly we should still keep, IMHO, because new users may still use them to try out Flink's features. For me, the scala-shell is the first option for trying new features of Flink.
Hi Chesnay,
Thank you for the proposal.
I think this is a good idea.
We follow a similar approach already for Hadoop dependencies and connectors (although in application space).
+1
Fabian
On Fri, 18 Jan 2019 at 10:59, Chesnay Schepler <
[hidden email]> wrote:
Hello,
the binary distribution that we currently release contains quite a lot of
optional components, including various filesystems, metric reporters and
libraries. Most users will only use a fraction of these, so the rest
pretty much only increases the size of flink-dist.
With Flink growing more and more in scope I don't believe it to be
feasible to ship everything we have with every distribution, and instead
suggest more of a "pick-what-you-need" model, where flink-dist is rather
lean and additional components are downloaded separately and added by
the user.
This would primarily affect the /opt directory, but could also be
extended to cover flink-dist. For example, the yarn and mesos code could
be spliced out into separate jars that could be added to lib manually.
Let me know what you think.
Regards,
Chesnay
--
Best Regards
Jeff Zhang