How to rebalance a table without converting to dataset

Darshan Singh
Hi

I have a table and I want to rebalance the data so that every partition holds roughly the same number of rows. I can convert it to a DataSet, rebalance, and then convert it back to a Table.

I couldn't find any rebalance operation in the Table API. Does anyone know a better way to rebalance table data?

Thanks

Re: How to rebalance a table without converting to dataset

Fabian Hueske-2
Hi Darshan,

You are right, there is currently no rebalancing operation in the Table API.
I can see this being a useful feature, but I'm not sure how easy it would be to integrate: the operation would have to pass through the Calcite optimizer, and rebalancing is not a relational operation.

For now, converting to DataSet and back to Table is the only option.
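
To make concrete what the rebalance step in that round trip is expected to achieve, here is a minimal sketch in plain Java. It is not Flink code and not Flink's implementation: the class `RoundRobinRebalance` and its `rebalance` method are hypothetical names for illustration, and a single global round-robin counter is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: plain Java, not part of Flink and not Flink's implementation. */
final class RoundRobinRebalance {

    /**
     * Redistributes all records round-robin over targetPartitions partitions,
     * so the resulting partition sizes differ by at most one record.
     */
    static <T> List<List<T>> rebalance(List<List<T>> partitions, int targetPartitions) {
        if (targetPartitions <= 0) {
            throw new IllegalArgumentException("targetPartitions must be positive");
        }
        List<List<T>> result = new ArrayList<List<T>>();
        for (int i = 0; i < targetPartitions; i++) {
            result.add(new ArrayList<T>());
        }
        int next = 0; // index of the partition that receives the next record
        for (List<T> partition : partitions) {
            for (T record : partition) {
                result.get(next).add(record);
                next = (next + 1) % targetPartitions;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A skewed input: one large partition, one tiny one, one empty one.
        List<List<Integer>> skewed = new ArrayList<List<Integer>>();
        skewed.add(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
        skewed.add(Arrays.asList(8));
        skewed.add(new ArrayList<Integer>());

        // prints [[1, 4, 7], [2, 5, 8], [3, 6]]
        System.out.println(rebalance(skewed, 3));
    }
}
```

Note that the sketch does not preserve which partition a record came from, which is the point: only the partition sizes are evened out, not any key-based grouping.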

Best, Fabian


Re: How to rebalance a table without converting to dataset

Shuyi Chen
Hi Darshan, thanks for raising this. We have a similar need for rebalancing in Flink SQL: we want to rebalance a Kafka input across more partitions than the topic provides, to increase parallelism in streaming jobs.

As Fabian suggests, rebalancing is not part of relational algebra. The closest counterpart I can find in databases is in Vertica (REBALANCE_TABLE), but there it is used more as a one-time redistribution of the data after adding or removing nodes. On the data processing side, however, I think this deserves more attention, because we can't easily modify the input data source, e.g. the number of Kafka partitions.

The closest I can think of to enable this feature is DDL in SQL, e.g. something like ALTER TABLE REBALANCE. Such a DDL statement would cause a rebalance() call when StreamTableSource.getDataStream or BatchTableSource.getDataSet is invoked, so we wouldn't need to touch the parser or planner in Calcite. For the Table API, a possible solution would be to add a rebalance() method to the Table API, but it would need to close out the LogicalPlan every time rebalance() is called, so that we don't need to touch the Calcite planner.
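
A rough sketch of the source-side idea, in plain Java rather than against the real Flink interfaces: a wrapper redistributes records at the moment the source is read, so nothing downstream (parser, planner) has to know about it. `PartitionedSource` and `RebalancingSource` are hypothetical stand-ins for illustration only; they are not `StreamTableSource` or `BatchTableSource`, and the constructor flag plays the role that the proposed DDL statement would play.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a table source; NOT a Flink interface. */
interface PartitionedSource<T> {
    /** Returns the records of the source, grouped by partition. */
    List<List<T>> read();
}

/** Wraps another source and spreads its records round-robin over `parallelism` partitions. */
final class RebalancingSource<T> implements PartitionedSource<T> {
    private final PartitionedSource<T> delegate;
    private final int parallelism;

    RebalancingSource(PartitionedSource<T> delegate, int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        this.delegate = delegate;
        this.parallelism = parallelism;
    }

    @Override
    public List<List<T>> read() {
        List<List<T>> out = new ArrayList<List<T>>();
        for (int i = 0; i < parallelism; i++) {
            out.add(new ArrayList<T>());
        }
        int next = 0; // partition that receives the next record
        // The rebalance happens here, when the source is read, not in any query plan.
        for (List<T> partition : delegate.read()) {
            for (T record : partition) {
                out.get(next).add(record);
                next = (next + 1) % parallelism;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Pretend input with only two partitions, e.g. a topic with two partitions.
        final List<List<String>> twoPartitions = new ArrayList<List<String>>();
        twoPartitions.add(Arrays.asList("a", "b", "c", "d"));
        twoPartitions.add(Arrays.asList("e", "f"));
        PartitionedSource<String> raw = () -> twoPartitions;

        // Read the same data with a parallelism of four.
        // prints [[a, e], [b, f], [c], [d]]
        System.out.println(new RebalancingSource<String>(raw, 4).read());
    }
}
```

The wrapper leaves the underlying source untouched, which matches the motivation above: the number of input partitions cannot easily be changed, so the redistribution is applied on the consuming side.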

--
"So you have to trust that the dots will somehow connect in your future."