(DEPRECATED) Apache Flink User Mailing List archive.

How to rebalance a table without converting to dataset

Classic

List

Threaded

3 messages Options

Darshan Singh

How to rebalance a table without converting to dataset

I have a table and I want to rebalance the data so that each partition is equal. I cna convert to dataset and rebalance and then convert to table.

I couldnt find any rebalance on table api. Does anyone know any better idea to rebalance table data?

Thanks

Fabian Hueske-2

Re: How to rebalance a table without converting to dataset

Hi Darshan,

You are right. there's currently no rebalancing operation on the Table API.

I see that this might be a good feature, not sure though how easy it would be to integrate because we need to pass it through the Calcite optimizer and rebalancing is not a relational operation.

For now, converting to DataSet and back to Table is the only option.

Best, Fabian

2018-04-13 14:33 GMT+02:00 Darshan Singh <[hidden email]>:

Hi

I have a table and I want to rebalance the data so that each partition is equal. I cna convert to dataset and rebalance and then convert to table.

I couldnt find any rebalance on table api. Does anyone know any better idea to rebalance table data?

Thanks

Shuyi Chen

Re: How to rebalance a table without converting to dataset

Hi Darshan, thanks for raising the problem. We do have similar use of rebalancing in Flink SQL, where we want to rebalance the Kafka input with more partitions to increase parallelism in streaming.

As Fabian suggests, rebalancing is not relation algebra. The closest use of the operation I can find in databases is in vertica (REBALANCE_TABLE), but it's used more as a one-time rebalance operation of the data after adding/removing nodes. However, on the data processing side, I think this might deserve more attention because we can't easily modify the input data source, e.g. number of Kafka partitions.

The closest I can think of to enable this feature is DDL in SQL, e.g. something like ALTER TABLE REBALANCE in SQL. With this DDL statement, it will cause a rebalance() call when StreamTableSource.getDataStream or BatchTableSource.getDataSet is invoked. In such case, we dont need to touch the parser or planner in Calcite. For the table API, a possible solution would be to add a rebalance() api in table API, but will need to close out the LogicalPlan every time rebalance() is called, so we won't need to touch the Calcite planner.

On Mon, Apr 16, 2018 at 5:41 AM, Fabian Hueske <[hidden email]> wrote:

Hi Darshan,

You are right. there's currently no rebalancing operation on the Table API.
I see that this might be a good feature, not sure though how easy it would be to integrate because we need to pass it through the Calcite optimizer and rebalancing is not a relational operation.

For now, converting to DataSet and back to Table is the only option.

Best, Fabian

2018-04-13 14:33 GMT+02:00 Darshan Singh <[hidden email]>:
Hi

I have a table and I want to rebalance the data so that each partition is equal. I cna convert to dataset and rebalance and then convert to table.

I couldnt find any rebalance on table api. Does anyone know any better idea to rebalance table data?

Thanks

"So you have to trust that the dots will somehow connect in your future."