Hi all,

In case it is useful to some of you: I have a big batch job that needs to use globs (*.parquet, for example) to read its input files. It seems that globs do not work out of the box (see https://issues.apache.org/jira/browse/FLINK-6417).

But there is a workaround:

    /* extractDir(filePath) extracts the parent dir; any subclass of FileInputFormat can be used */
    final FileInputFormat inputFormat = new FileInputFormat(new Path(extractDir(filePath)));
    /* filePath contains the glob; the whole path needs to be provided to GlobFilePathFilter */
    inputFormat.setFilesFilter(new GlobFilePathFilter(Collections.singletonList(filePath), Collections.emptyList()));
    inputFormat.setNestedFileEnumeration(true);

Hope it helps some people.

Etienne Chauchot
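The workaround above relies on an extractDir helper that is not shown in the message. A minimal sketch of what it might look like follows, using only the JDK. The class name GlobPathSketch, the body of extractDir and the matchesGlob method are all assumptions made for illustration, not Flink code. matchesGlob uses java.nio glob matching on the assumption that GlobFilePathFilter behaves similarly, which would explain why the filter needs the whole path and not just "*.parquet".

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobPathSketch {

    /**
     * Hypothetical helper: returns the deepest directory of the path that
     * contains no glob metacharacter. Assumes '/'-separated paths.
     */
    static String extractDir(String globPath) {
        int firstGlob = -1;
        for (int i = 0; i < globPath.length(); i++) {
            if ("*?[{".indexOf(globPath.charAt(i)) >= 0) {
                firstGlob = i;
                break;
            }
        }
        // Without any glob character, fall back to the parent of the whole path.
        String prefix = firstGlob < 0 ? globPath : globPath.substring(0, firstGlob);
        int lastSlash = prefix.lastIndexOf('/');
        if (lastSlash < 0) {
            return "."; // relative pattern such as "*.parquet"
        }
        return lastSlash == 0 ? "/" : prefix.substring(0, lastSlash);
    }

    /** Matches a file path against a glob with the JDK's own glob syntax. */
    static boolean matchesGlob(String glob, String file) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(file));
    }

    public static void main(String[] args) {
        String filePath = "/data/in/*.parquet";
        System.out.println(extractDir(filePath));
        // The full-path glob matches a full file path ...
        System.out.println(matchesGlob(filePath, "/data/in/a.parquet"));
        // ... while a bare "*.parquet" does not, because '*' stops at '/'.
        System.out.println(matchesGlob("*.parquet", "/data/in/a.parquet"));
    }
}
```

With this sketch, the directory handed to the input format is the non-glob prefix of filePath, and the filter still receives the complete pattern.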
But this workaround only works when you have access to the underlying FileInputFormat. With the SQL and Table APIs you don't, so you'll be unable to apply it. So what we could do is make a PR to support globs at the FileInputFormat level, so that all APIs benefit. I'll do it if everyone agrees.

Best
Etienne Chauchot

On 25/03/2021 13:12, Etienne Chauchot wrote:
Hi Etienne,

In general, any small PR on this subject is very welcome. I don't think that the community as a whole will invest much in FileInputFormat, as the whole DataSet API is being phased out.

Afaik the SQL and Table APIs only use InputFormat for the legacy compatibility layer (e.g. when it comes to translating into DataSet). All the new batchy stuff is based on BulkFormat and the unified source/sink interface. I'm CC'ing Timo, who can correct me if I'm wrong.

So if you just want to add glob support on FileInputFormat /only/ for the SQL and Table APIs, I don't think it's worth the effort. It would be more interesting to see whether the new FileSource supports it properly, and rather add it there.

On Mon, Mar 29, 2021 at 4:57 PM Etienne Chauchot <[hidden email]> wrote:
> But still this workaround would only work when you have access to the