Hi all,
is there any integration between Presto and Flink? I'd like to use Presto for the UI part (preview and so on) while using Flink for the batch processing. Or would you suggest something else? Best, Flavio
Hi Flavio, Presto contributor and Starburst partner here. Presto and Flink solve completely different challenges. Flink is about processing data streams as they come in; Presto is about ad-hoc / periodic querying of data sources. A typical architecture would use Flink to process data streams and write data and aggregations to some data stores (Redis, MemSQL, SQL databases, Elasticsearch, etc.) and then use Presto to query those data stores (and possibly others, using Query Federation). What kind of integration are you looking for? On Mon, Jan 27, 2020 at 1:44 PM Flavio Pompermaier <[hidden email]> wrote:
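For illustration, a minimal sketch of a Flink job in that architecture might look like the following (Flink 1.10-era DataStream API; the event data is made up and print() stands in for the Elasticsearch/Redis/JDBC sink that Presto would later query):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ClickAggregationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical (userId, clicks) events; a real job would read
        // from Kafka or a similar streaming source.
        env.fromElements(
                Tuple2.of("alice", 1), Tuple2.of("bob", 1), Tuple2.of("alice", 1))
           .keyBy(t -> t.f0)
           // Aggregate clicks per user over 1-minute tumbling windows.
           .timeWindow(Time.minutes(1))
           .sum(1)
           // In the architecture described above this would be a sink to
           // a data store that Presto then queries.
           .print();

        env.execute("click-aggregation");
    }
}
```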
Both Presto and Flink make use of a Catalog in order to read/write data from a source/sink. I don't agree that "Flink is about processing data streams": Flink is competitive also for batch workloads (and this will be further improved in the next releases). I'd like to register my data sources/sinks in one single catalog (e.g. Presto) and then be able to reuse it also in Flink (with a simple translation). My idea of integration here is thus more at the catalog level: I would use Presto for exploring data from the UI and Flink to process it once the configuration part is finished (since I have many Flink jobs that I don't want to throw away or rewrite). On Mon, Jan 27, 2020 at 2:30 PM Itamar Syn-Hershko <[hidden email]> wrote:
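To show the kind of catalog registration I mean on the Flink side, here is a minimal sketch with the Flink 1.10-era Table API. The in-memory catalog just stands in for the Presto-backed catalog I have in mind, which does not exist today:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.GenericInMemoryCatalog;

public class CatalogRegistration {
    public static void main(String[] args) {
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .useBlinkPlanner().inBatchMode().build();
        TableEnvironment tableEnv = TableEnvironment.create(settings);

        // Flink puts external systems behind its Catalog interface; the
        // idea above would mean backing this registration with Presto's
        // catalog definitions instead of an in-memory one.
        tableEnv.registerCatalog("mycatalog", new GenericInMemoryCatalog("mycatalog"));
        tableEnv.useCatalog("mycatalog");
    }
}
```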
Yes, Flink does batch processing by "reevaluating a stream", so to speak. Presto doesn't have sources and sinks, only catalogs (which always allow reads, and sometimes also writes). Presto catalogs are configuration - they are managed as configuration files on the node filesystem and nowhere else. Flink sources/sinks are programmatically configured and are compiled into your Flink program. So that is not possible at the moment; all you can do is get that info from the APIs of both products and visualize it. Definitely not manage them from a single place. On Mon, Jan 27, 2020 at 3:54 PM Flavio Pompermaier <[hidden email]> wrote:
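To illustrate the contrast: unlike a Presto catalog (a .properties file on the coordinator's filesystem), a Flink table definition is part of the program itself. A sketch with Flink 1.10-era DDL follows; the connector property keys are version-dependent and the connection details are made up:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ProgrammaticSourceExample {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build());

        // This definition is compiled into the job, not managed as a
        // standalone configuration file the way a Presto catalog is.
        tableEnv.sqlUpdate(
            "CREATE TABLE users (" +
            "  id BIGINT," +
            "  name STRING" +
            ") WITH (" +
            "  'connector.type' = 'jdbc'," +
            "  'connector.url' = 'jdbc:mysql://localhost:3306/mydb'," +
            "  'connector.table' = 'users'," +
            "  'connector.username' = 'flink'," +
            "  'connector.password' = 'secret'" +
            ")");
    }
}
```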
Hi Flavio, If I understand correctly, your requirement is to use Blink batch to read the tables registered in Presto? I'm not familiar with Presto's catalog. Is it like the Hive Metastore? If so, what needs to be done is similar to the Hive connector: you would need to implement a Flink catalog for Presto, which translates Presto tables into Flink tables. You may need to deal with partitions, statistics, and so on. Best, Jingsong Lee On Mon, Jan 27, 2020 at 9:58 PM Itamar Syn-Hershko <[hidden email]> wrote:
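A rough sketch of that translation step, using Flink 1.10-era types. The schema and connector properties here are hard-coded assumptions; a real version would read them from Presto's metadata and live inside an implementation of Flink's org.apache.flink.table.catalog.Catalog interface, also mapping partitions and statistics:

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.table.catalog.CatalogTable;
import org.apache.flink.table.catalog.CatalogTableImpl;

import java.util.HashMap;
import java.util.Map;

public class PrestoTableTranslator {

    public static CatalogTable translate(String jdbcUrl, String tableName) {
        // The schema would be obtained from Presto's metadata;
        // hard-coded here for illustration.
        TableSchema schema = TableSchema.builder()
                .field("id", DataTypes.BIGINT())
                .field("name", DataTypes.STRING())
                .build();

        // Connector properties telling Flink how to actually read the data
        // (1.10-era property keys).
        Map<String, String> props = new HashMap<>();
        props.put("connector.type", "jdbc");
        props.put("connector.url", jdbcUrl);
        props.put("connector.table", tableName);

        return new CatalogTableImpl(schema, props, "translated from Presto");
    }
}
```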
Hi,
Yes, Presto (in the presto-hive connector) just uses the Hive Metastore to get the table definitions/metadata. If you connect Flink to the same Hive Metastore, both systems should be able to see the same tables. Piotrek
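For example, pointing Flink at the same Hive Metastore that Presto's hive connector uses (a minimal sketch; the conf dir path is just a placeholder):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class SharedMetastoreExample {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build());

        // Use the same hive-site.xml that Presto's hive connector points at.
        HiveCatalog hive = new HiveCatalog("hive", "default", "/etc/hive/conf");
        tableEnv.registerCatalog("hive", hive);
        tableEnv.useCatalog("hive");

        // Tables defined through the shared metastore (e.g. via Presto's
        // hive connector) are now visible to Flink as well, e.g.:
        // tableEnv.sqlQuery("SELECT * FROM some_table");
    }
}
```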
The Hive Metastore is the de facto standard for Hadoop, but in my use case I have to query other databases as well (like MySQL, Oracle and SQL Server). So Presto would be a good choice (apart from the fact that you need to restart it when you add a new catalog...), and I'd like to have an easy translation of the catalogs. Another fear I have is that I could have different versions of the same database type (e.g. Oracle or SQL Server) and I'll probably hit an incompatibility when using the latest jar of a connector. From what I see this corner case doesn't have a clear solution, but I have some workarounds in mind that I need to verify (e.g. shading jars, or allocating source reader tasks to different Task Managers based on the deployed jar versions). On Tue, Jan 28, 2020 at 11:05 AM Piotr Nowojski <[hidden email]> wrote: