Multiple Async IO

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Multiple Async IO

Maxim Parkachov
Hi everyone,

I'm writing streaming job which needs to query Cassandra for each event multiple times, around 20. I would like to use Async IO for that but not sure which option to choose:

1. Implement One AsyncFunction with 20 queries inside
2. Implement 20 AsyncFunctions, each with 1 query inside

Taking into account that each event needs all queries. Reduce amount of queries for each record is not an option. 

In this case I would like to minimise processing time of event, even if throughput will suffer. Any advice or consideration is greatly appreciated.

Thanks,
Maxim.
 
Reply | Threaded
Open this post in threaded view
|

Re: Multiple Async IO

Ken Krugler
Hi Maxim,

If reducing latency is the goal, then option #1 seems better.

Though you’d need additional logic inside of your AsyncFunction to run all 20 queries in parallel.

I’d also consider a third option...

Use a FlatMapFunction to create 20 copies of the event (assuming it’s not large), with an additional field indicating which query should be made.

Follow that with a rebalance(), and a single AsyncFunction that makes the appropriate query for the event, based on this new field.

Then make sure you’ve got sufficient parallelism for your AsyncFunction to handle this fan-out.

This should let you run the queries for a single event in parallel.

— Ken


> On Apr 3, 2018, at 9:59 AM, Maxim Parkachov <[hidden email]> wrote:
>
> Hi everyone,
>
> I'm writing streaming job which needs to query Cassandra for each event multiple times, around 20. I would like to use Async IO for that but not sure which option to choose:
>
> 1. Implement One AsyncFunction with 20 queries inside
> 2. Implement 20 AsyncFunctions, each with 1 query inside
>
> Taking into account that each event needs all queries. Reduce amount of queries for each record is not an option.
>
> In this case I would like to minimise processing time of event, even if throughput will suffer. Any advice or consideration is greatly appreciated.
>
> Thanks,
> Maxim.
>  

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378

Reply | Threaded
Open this post in threaded view
|

Re: Multiple Async IO

Fabian Hueske-2
Hi Maxim,

I think Ken's approach is a good idea. However, you would need to a add a stateful operator to join the results of the individual queries if that is needed.
In order to join the results, you would need a unique id on which you can keyBy() to collect all 20 records that originated from the same input record.

Best, Fabian

2018-04-03 19:39 GMT+02:00 Ken Krugler <[hidden email]>:
Hi Maxim,

If reducing latency is the goal, then option #1 seems better.

Though you’d need additional logic inside of your AsyncFunction to run all 20 queries in parallel.

I’d also consider a third option...

Use a FlatMapFunction to create 20 copies of the event (assuming it’s not large), with an additional field indicating which query should be made.

Follow that with a rebalance(), and a single AsyncFunction that makes the appropriate query for the event, based on this new field.

Then make sure you’ve got sufficient parallelism for your AsyncFunction to handle this fan-out.

This should let you run the queries for a single event in parallel.

— Ken


> On Apr 3, 2018, at 9:59 AM, Maxim Parkachov <[hidden email]> wrote:
>
> Hi everyone,
>
> I'm writing streaming job which needs to query Cassandra for each event multiple times, around 20. I would like to use Async IO for that but not sure which option to choose:
>
> 1. Implement One AsyncFunction with 20 queries inside
> 2. Implement 20 AsyncFunctions, each with 1 query inside
>
> Taking into account that each event needs all queries. Reduce amount of queries for each record is not an option.
>
> In this case I would like to minimise processing time of event, even if throughput will suffer. Any advice or consideration is greatly appreciated.
>
> Thanks,
> Maxim.
>

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378