http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/DISCUSS-FLIP-115-Filesystem-connector-in-Table-tp33625p33659.html
Thanks, Piotr and Yun, for getting involved.
Hi Piotr and Yun, regarding the implementation:
FLINK-14254 [1] introduces the batch sink to the table world. It deals with partitions, the metastore, and so on, and it simply reuses the DataSet/DataStream FileInputFormat and FileOutputFormat. The filesystem connector cannot do without FileInputFormat, because it needs to handle files and splits. Formats like ORC and Parquet need to read the whole file and have different split logic.
So, back to the filesystem connector:
- It needs to introduce FilesystemTableFactory, FilesystemTableSource and FilesystemTableSink.
- For sources, it reuses the DataSet/DataStream FileInputFormats; there is no other interface for reading files.
For file sinks:
- The batch sink uses FLINK-14254.
- The streaming sink can be done in two ways.
The first way is to reuse the batch sink from FLINK-14254, which already handles the partition and metastore logic well:
- It unifies batch and streaming.
- Using FileOutputFormat is consistent with using FileInputFormat.
- It requires adding exactly-once logic, which is only 200+ lines of code.
- It is natural to support more table features, such as partition commit and auto compaction.
The second way is to reuse the DataStream StreamingFileSink:
- It unifies the streaming sink between Table and DataStream.
- It may be hard to introduce table-related features into StreamingFileSink.
I slightly prefer the first way. What do you think?
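To make the "exactly-once related logic" of the first way a bit more concrete, here is a minimal sketch of the general idea: write data to a hidden in-progress file at snapshot time, and make it visible only once the checkpoint is confirmed. It is plain Python and not Flink code. The class name, the file naming scheme, and the method names (which only loosely mirror the snapshot and checkpoint-complete callbacks) are assumptions for illustration. Recovery after a failure, where uncommitted in-progress files would be discarded, is left out.

```python
import os
import tempfile


class ExactlyOnceFileSink:
    """Toy sink: data becomes visible only after its checkpoint completes."""

    def __init__(self, base_dir):
        self.base_dir = base_dir
        self.buffer = []   # records written since the last snapshot
        self.pending = {}  # checkpoint id -> path of hidden in-progress file

    def write(self, record):
        self.buffer.append(record)

    def snapshot_state(self, checkpoint_id):
        # Flush buffered records into a hidden file; readers ignore dot-files.
        path = os.path.join(self.base_dir, f".inprogress-{checkpoint_id}")
        with open(path, "w") as f:
            f.writelines(r + "\n" for r in self.buffer)
        self.buffer = []
        self.pending[checkpoint_id] = path

    def notify_checkpoint_complete(self, checkpoint_id):
        # Commit every pending file up to and including this checkpoint
        # by renaming it to its final, visible name.
        for cid in sorted(c for c in self.pending if c <= checkpoint_id):
            final = os.path.join(self.base_dir, f"part-{cid}")
            os.replace(self.pending.pop(cid), final)

    def visible_files(self):
        return sorted(n for n in os.listdir(self.base_dir)
                      if not n.startswith("."))


if __name__ == "__main__":
    sink = ExactlyOnceFileSink(tempfile.mkdtemp())
    sink.write("a")
    sink.snapshot_state(1)
    print(sink.visible_files())        # nothing committed yet
    sink.notify_checkpoint_complete(1)
    print(sink.visible_files())        # the committed part file
```

Table features such as partition commit would naturally hook into the same commit step, which is part of why I find this way attractive.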
Hi Yun,
> Watermark mechanism might not be enough.
The watermarks of all subtasks are the same in "snapshotState".
> we might need to also do some coordination between subtasks.
Yes, the JobMaster is the role that controls the subtasks. The metastore is a very fragile single point that cannot be accessed by distributed subtasks, so all access to it goes through the JobMaster.
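As a rough illustration of this coordination, the sketch below has each subtask report the partitions it finished for a checkpoint to a single coordinator, and only the coordinator talks to the metastore, once all subtasks have reported. It is plain Python and not Flink code. The coordinator class, the in-memory metastore, its add_partition method, and the partition names in the usage example are all hypothetical.

```python
class InMemoryMetastore:
    """Hypothetical stand-in for a real metastore client."""

    def __init__(self):
        self.partitions = []
        self.calls = 0

    def add_partition(self, partition):
        self.calls += 1
        if partition not in self.partitions:
            self.partitions.append(partition)


class PartitionCommitCoordinator:
    """Toy single coordinator (the JobMaster role described above)."""

    def __init__(self, num_subtasks, metastore):
        self.num_subtasks = num_subtasks
        self.metastore = metastore
        self.reports = {}  # checkpoint id -> {subtask index: set of partitions}

    def report(self, checkpoint_id, subtask, partitions):
        by_subtask = self.reports.setdefault(checkpoint_id, {})
        by_subtask[subtask] = set(partitions)
        if len(by_subtask) == self.num_subtasks:
            # Every subtask has reported for this checkpoint, so the
            # partitions are committed once, from a single place.
            finished = set().union(*by_subtask.values())
            for partition in sorted(finished):
                self.metastore.add_partition(partition)
            del self.reports[checkpoint_id]


if __name__ == "__main__":
    metastore = InMemoryMetastore()
    coordinator = PartitionCommitCoordinator(2, metastore)
    coordinator.report(1, 0, ["dt=2020-03-16"])
    coordinator.report(1, 1, ["dt=2020-03-16", "dt=2020-03-17"])
    print(metastore.partitions)
```

With this shape, the metastore sees one caller and one request per partition, no matter how many subtasks wrote into it.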
Best,
Jingsong Lee