state size in relation to cluster size and processing speed

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

state size in relation to cluster size and processing speed

swiesman

Hi,

 

I’ve noticed something peculiar about the relationship between state size and cluster size and was wondering if anyone here knows of the reason. I am running a job with 1 hour tumbling event time windows which have an allowed lateness of 7 days. When I run on a 20-node cluster with FsState I can process approximately 1.5 days’ worth of data in an hour with the most recent checkpoint being ~20gb.  Now if I run the same job with the same configurations on a 40-node cluster I can process 2 days’ worth of data in 20 min (expected) but the state size is only ~8gb. Because allowed lateness is 7 days no windows should be purged yet and I would expect the larger cluster which has processed more data to have a larger state. Is there some why a slower running job or a smaller cluster would require more state?

 

This is more of a curiosity than an issue. Thanks’ in advance for any insights you may have.

 

Seth Wiesman

Reply | Threaded
Open this post in threaded view
|

Re: state size in relation to cluster size and processing speed

Aljoscha Krettek
Hi,
how are you generating your watermarks? Could it be that they advance faster when the job is processing more data?

Cheers,
Aljoscha

On Fri, 16 Dec 2016 at 21:01 Seth Wiesman <[hidden email]> wrote:

Hi,

 

I’ve noticed something peculiar about the relationship between state size and cluster size and was wondering if anyone here knows of the reason. I am running a job with 1 hour tumbling event time windows which have an allowed lateness of 7 days. When I run on a 20-node cluster with FsState I can process approximately 1.5 days’ worth of data in an hour with the most recent checkpoint being ~20gb.  Now if I run the same job with the same configurations on a 40-node cluster I can process 2 days’ worth of data in 20 min (expected) but the state size is only ~8gb. Because allowed lateness is 7 days no windows should be purged yet and I would expect the larger cluster which has processed more data to have a larger state. Is there some why a slower running job or a smaller cluster would require more state?

 

This is more of a curiosity than an issue. Thanks’ in advance for any insights you may have.

 

Seth Wiesman

Reply | Threaded
Open this post in threaded view
|

Re: state size in relation to cluster size and processing speed

swiesman

Watermarks are generated using the PeriodicWatermarkAssigner using a timestamp field from within the records. We are processing log data from an S3 bucket and logs are always processed in chronological order using a custom ContinuousFileMonitoringFunction but the standard ContinousFileReaderOperator. Certainly with a larger cluster splits would be processed more quickly and as such the watermark would advance at a quicker pace. Why do you think a more quickly advancing watermark would affect state size in this case?

 

Seth Wiesman

 

From: Aljoscha Krettek <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Friday, December 23, 2016 at 1:43 PM
To: "[hidden email]" <[hidden email]>
Subject: Re: state size in relation to cluster size and processing speed

 

Hi,

how are you generating your watermarks? Could it be that they advance faster when the job is processing more data?

 

Cheers,

Aljoscha

 

On Fri, 16 Dec 2016 at 21:01 Seth Wiesman <[hidden email]> wrote:

Hi,

 

I’ve noticed something peculiar about the relationship between state size and cluster size and was wondering if anyone here knows of the reason. I am running a job with 1 hour tumbling event time windows which have an allowed lateness of 7 days. When I run on a 20-node cluster with FsState I can process approximately 1.5 days’ worth of data in an hour with the most recent checkpoint being ~20gb.  Now if I run the same job with the same configurations on a 40-node cluster I can process 2 days’ worth of data in 20 min (expected) but the state size is only ~8gb. Because allowed lateness is 7 days no windows should be purged yet and I would expect the larger cluster which has processed more data to have a larger state. Is there some why a slower running job or a smaller cluster would require more state?

 

This is more of a curiosity than an issue. Thanks’ in advance for any insights you may have.

 

Seth Wiesman

Reply | Threaded
Open this post in threaded view
|

Re: state size in relation to cluster size and processing speed

Aljoscha Krettek
In reply to this post by swiesman
Hi Seth,
sorry for taking so long to get back to you on this. I think the watermark thing might have been misleading by me, I don't even know anymore what I was thinking back then.

Were you able to confirm that the results were in fact correct for the runs with the different parallelism? I know the results are not the same because you process different amounts of data, but still the correctness of the result can be confirmed.

Best,
Aljoscha

On Fri, 16 Dec 2016 at 21:01 Seth Wiesman <[hidden email]> wrote:

Hi,

 

I’ve noticed something peculiar about the relationship between state size and cluster size and was wondering if anyone here knows of the reason. I am running a job with 1 hour tumbling event time windows which have an allowed lateness of 7 days. When I run on a 20-node cluster with FsState I can process approximately 1.5 days’ worth of data in an hour with the most recent checkpoint being ~20gb.  Now if I run the same job with the same configurations on a 40-node cluster I can process 2 days’ worth of data in 20 min (expected) but the state size is only ~8gb. Because allowed lateness is 7 days no windows should be purged yet and I would expect the larger cluster which has processed more data to have a larger state. Is there some why a slower running job or a smaller cluster would require more state?

 

This is more of a curiosity than an issue. Thanks’ in advance for any insights you may have.

 

Seth Wiesman