Partitioned tables in SQL client configuration.


Maciek Próchniak
Hello,

I'm trying to configure the SQL Client to query partitioned ORC data on the local
filesystem. My directory structure looks like this:

/tmp/table1/startdate=2020-11-28

/tmp/table1/startdate=2020-11-27

etc.


If I start an SQL Client session and create the table by hand:

create table tst (column1 string, startdate string) partitioned by
(startdate) with ('connector'='filesystem', 'format'='orc',
'path'='/tmp/table1');

everything runs fine:

explain select * from tst where startdate='2020-11-27'

shows only one partition in 'readPartitions'.


However, I'm struggling to configure the table in the .yaml config.

I tried the following (after some struggle, as the "partition.keys" setting
doesn't seem to be documented...):

tables:
   - name: tst2
     type: source-table
     connector: filesystem
     path: "/tmp/table1"
     format: orc
     partition.keys:
       - name: startdate
     schema:
       - name: column1
         data-type: string
       - name: startdate
         data-type: string

and it more or less works: queries execute properly. However, partitions are
not pruned:

explain select * from tst2 where startdate='2020-11-27'

shows all partitions in 'readPartitions'.


Any idea what could be wrong? I'm using Flink 1.11.2.


thanks,

maciek



Re: Partitioned tables in SQL client configuration.

Till Rohrmann
Hi Maciek,

I am pulling in Timo, who might be able to help you with this problem.

Cheers,
Till




Re: Partitioned tables in SQL client configuration.

Jark Wu
Only legacy connectors (`connector.type=kafka` instead of `connector=kafka`) are supported in the YAML configuration at the moment. You can use regular DDL instead. There is a similar ongoing discussion in https://issues.apache.org/jira/browse/FLINK-20260.

Best,
Jark
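
To make the distinction concrete, here is a rough DDL sketch of the two property
styles. The legacy keys are written from memory of the Flink 1.11 "connect" page
that Maciek links below, so treat them as a sketch rather than exact options;
tst_legacy and tst_new are made-up names, and the legacy filesystem connector is
documented there as CSV-only.

create table tst_legacy (column1 string) with (
  -- legacy property style: flat "connector.*"/"format.*" keys, which is what
  -- the YAML environment file translates to
  'connector.type' = 'filesystem',
  'connector.path' = '/tmp/table1',
  'format.type' = 'csv'   -- csv is the only format the legacy filesystem connector handles
);

create table tst_new (column1 string, startdate string)
partitioned by (startdate) with (
  -- current option style, as in the DDL from the first message
  'connector' = 'filesystem',
  'format' = 'orc',
  'path' = '/tmp/table1'
);

If Jark's point holds, the YAML environment file only goes through the first,
legacy style, which would explain why the flattened keys in Maciek's YAML don't
yield the same partition pruning as the DDL-defined table.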



Re: Partitioned tables in SQL client configuration.

Maciek Próchniak

Hi Jark,

Thanks for the answer. I'm a bit puzzled, because in my YAML I'm using "connector: filesystem" (not connector.type), and I don't think I end up using the connector described at https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connect.html#file-system-connector, since partitioning and the ORC format are handled correctly.


It's also not clear to me what the "non-legacy" connector for reading files directly from the filesystem (without Hive) would be. I don't see any implementation of DynamicTableSourceFactory that would do this.

I assumed that the DDL from my original message also gives me FileSystemTableFactory; am I wrong?


thanks,

maciek


