A few questions about minibatch

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

A few questions about minibatch

Rex Fenley
Hi,

Our job was experiencing high write amplification on aggregates so we decided to give mini-batch a go. There's a few things I've noticed that are different from our previous job and I would like some clarification.

1) Our operators now say they have Watermarks. We never explicitly added watermarks, and our state is essentially unbounded across all time since it consumes from Debezium and reshapes our database data into another store. Why does it say we have Watermarks then?

2) In our sources I see MiniBatchAssigner(interval=[1000ms], mode=[ProcTime], what does that do?

3) I don't really see anything else different yet in the shape of our plan even though we've turned on
configuration.setString(
"table.optimizer.agg-phase-strategy",
"TWO_PHASE"
)
is there a way to check that this optimization is on? We use user defined aggregate functions, does it work for UDAF?

Thanks!

--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US

Reply | Threaded
Open this post in threaded view
|

Re: A few questions about minibatch

Rex Fenley
Hello,

Does anyone have any more information here?

Thanks!

On Wed, Jan 20, 2021 at 9:13 PM Rex Fenley <[hidden email]> wrote:
Hi,

Our job was experiencing high write amplification on aggregates so we decided to give mini-batch a go. There's a few things I've noticed that are different from our previous job and I would like some clarification.

1) Our operators now say they have Watermarks. We never explicitly added watermarks, and our state is essentially unbounded across all time since it consumes from Debezium and reshapes our database data into another store. Why does it say we have Watermarks then?

2) In our sources I see MiniBatchAssigner(interval=[1000ms], mode=[ProcTime], what does that do?

3) I don't really see anything else different yet in the shape of our plan even though we've turned on
configuration.setString(
"table.optimizer.agg-phase-strategy",
"TWO_PHASE"
)
is there a way to check that this optimization is on? We use user defined aggregate functions, does it work for UDAF?

Thanks!

--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US



--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US

Reply | Threaded
Open this post in threaded view
|

Re: A few questions about minibatch

Dawid Wysakowicz-2

I am pulling Jark and Godfrey who are more familiar with the planner internals.

Best,

Dawid

On 22/01/2021 20:11, Rex Fenley wrote:
Hello,

Does anyone have any more information here?

Thanks!

On Wed, Jan 20, 2021 at 9:13 PM Rex Fenley <[hidden email]> wrote:
Hi,

Our job was experiencing high write amplification on aggregates so we decided to give mini-batch a go. There's a few things I've noticed that are different from our previous job and I would like some clarification.

1) Our operators now say they have Watermarks. We never explicitly added watermarks, and our state is essentially unbounded across all time since it consumes from Debezium and reshapes our database data into another store. Why does it say we have Watermarks then?

2) In our sources I see MiniBatchAssigner(interval=[1000ms], mode=[ProcTime], what does that do?

3) I don't really see anything else different yet in the shape of our plan even though we've turned on
configuration.setString(
"table.optimizer.agg-phase-strategy",
"TWO_PHASE"
)
is there a way to check that this optimization is on? We use user defined aggregate functions, does it work for UDAF?

Thanks!

--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US



--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US


signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: A few questions about minibatch

Jark Wu-3
Hi Rex,

Could you share your query here? It would be helpful to identify the root cause if we have the query. 

1) watermark
The framework automatically adds a node (the MiniBatchAssigner) to generate watermark events as the mini-batch id to broadcast and trigger mini-batch in the pipeline. 

2) MiniBatchAssigner(interval=[1000ms], mode=[ProcTime]
It generates a new mini-batch id in an interval of 1000ms in system time. The mini-batch id is represented by the watermark event. 

3) TWO_PHASE optimization
If users want to have TWO_PHASE optimization, it requires the aggregate functions all support the merge() method and the mini-batch is enabled.

Best,
Jark




On Tue, 26 Jan 2021 at 19:01, Dawid Wysakowicz <[hidden email]> wrote:

I am pulling Jark and Godfrey who are more familiar with the planner internals.

Best,

Dawid

On 22/01/2021 20:11, Rex Fenley wrote:
Hello,

Does anyone have any more information here?

Thanks!

On Wed, Jan 20, 2021 at 9:13 PM Rex Fenley <[hidden email]> wrote:
Hi,

Our job was experiencing high write amplification on aggregates so we decided to give mini-batch a go. There's a few things I've noticed that are different from our previous job and I would like some clarification.

1) Our operators now say they have Watermarks. We never explicitly added watermarks, and our state is essentially unbounded across all time since it consumes from Debezium and reshapes our database data into another store. Why does it say we have Watermarks then?

2) In our sources I see MiniBatchAssigner(interval=[1000ms], mode=[ProcTime], what does that do?

3) I don't really see anything else different yet in the shape of our plan even though we've turned on
configuration.setString(
"table.optimizer.agg-phase-strategy",
"TWO_PHASE"
)
is there a way to check that this optimization is on? We use user defined aggregate functions, does it work for UDAF?

Thanks!

--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US



--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US

Reply | Threaded
Open this post in threaded view
|

Re: A few questions about minibatch

Rex Fenley
Thanks, that all makes sense!

On Wed, Jan 27, 2021 at 7:00 PM Jark Wu <[hidden email]> wrote:
Hi Rex,

Could you share your query here? It would be helpful to identify the root cause if we have the query. 

1) watermark
The framework automatically adds a node (the MiniBatchAssigner) to generate watermark events as the mini-batch id to broadcast and trigger mini-batch in the pipeline. 

2) MiniBatchAssigner(interval=[1000ms], mode=[ProcTime]
It generates a new mini-batch id in an interval of 1000ms in system time. The mini-batch id is represented by the watermark event. 

3) TWO_PHASE optimization
If users want to have TWO_PHASE optimization, it requires the aggregate functions all support the merge() method and the mini-batch is enabled.

Best,
Jark




On Tue, 26 Jan 2021 at 19:01, Dawid Wysakowicz <[hidden email]> wrote:

I am pulling Jark and Godfrey who are more familiar with the planner internals.

Best,

Dawid

On 22/01/2021 20:11, Rex Fenley wrote:
Hello,

Does anyone have any more information here?

Thanks!

On Wed, Jan 20, 2021 at 9:13 PM Rex Fenley <[hidden email]> wrote:
Hi,

Our job was experiencing high write amplification on aggregates so we decided to give mini-batch a go. There's a few things I've noticed that are different from our previous job and I would like some clarification.

1) Our operators now say they have Watermarks. We never explicitly added watermarks, and our state is essentially unbounded across all time since it consumes from Debezium and reshapes our database data into another store. Why does it say we have Watermarks then?

2) In our sources I see MiniBatchAssigner(interval=[1000ms], mode=[ProcTime], what does that do?

3) I don't really see anything else different yet in the shape of our plan even though we've turned on
configuration.setString(
"table.optimizer.agg-phase-strategy",
"TWO_PHASE"
)
is there a way to check that this optimization is on? We use user defined aggregate functions, does it work for UDAF?

Thanks!

--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US



--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US



--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US