Converting streaming to batch execution

Converting streaming to batch execution

Nicholas Walton
Hi,

I’ve been working with a pipeline that was initially aimed at processing high-speed sensor data, but for a proof of concept I’m feeding it simulated data from a CSV file. Each row of the file is a sample across a number of time series, and I’ve been using the streaming environment to process each column of the file in parallel. The columns are each around 6 million samples long. The outputs are sunk into a Derby database table.

However, I’m not getting very high throughput, even running across two 32 GB Dell laptops with the database on a 16 GB MacBook. According to the metrics I’m only writing records to the database at a rate of around 20-30 per second (I assume per parallel pipe, of which I’m running 16 across the two laptops). The pipeline runs a couple of windowing operations, one of length 10 and one of length 100, with a small amount of computation, but it does yield a considerable amount of output: a 10 GB CSV file produces a database of around 100 GB or more.
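For scale, a rough back-of-the-envelope (assuming roughly 16 columns of ~6 million samples each, and the observed 20-30 records per second per pipe across 16 pipes — my figures above, not measured totals):

```python
# Rough throughput estimate from the figures above; column count
# and per-pipe rates are assumptions, not measurements.
columns = 16
samples_per_column = 6_000_000
total_records = columns * samples_per_column  # ~96 million

low_rate = 16 * 20    # 320 records/s aggregate across all pipes
high_rate = 16 * 30   # 480 records/s aggregate across all pipes

hours_worst = total_records / low_rate / 3600
hours_best = total_records / high_rate / 3600
print(f"{hours_best:.0f}-{hours_worst:.0f} hours end to end")
```

Even at the high end that is more than two days end to end, which is why I suspect the execution mode rather than the hardware.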

I suspect the slow rate is due to using stream processing for what is really a batch job, so I’ve been looking at the batch support in Flink (1.8), intending to move the code over from stream to batch execution. However, I can’t make head nor tail of the batch DataSet documentation. For example, in the streaming environment I was setting a watermark after reading each line of the file to keep the time series in order, and using keyBy to split up the individual time series by column number. I can’t find the equivalent operations in the batch interface.
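To make the shape of the job concrete, here is a plain-Python sketch of what I mean: each CSV row is one sample across several series, keyBy splits them into per-column series in row order, and a small sliding window runs per key. The data, column count, window size, and the averaging are toy stand-ins, not the real computation:

```python
import csv
from collections import defaultdict
from io import StringIO

# Toy input standing in for the real CSV: each row is one sample
# across several time series (one series per column).
data = StringIO("1.0,10.0\n2.0,20.0\n3.0,30.0\n4.0,40.0\n")

# "keyBy(column)" becomes: collect each column's samples in row order.
series = defaultdict(list)
for row in csv.reader(data):
    for col, value in enumerate(row):
        series[col].append(float(value))

# Per-key sliding window (length 2 here; the real job uses 10 and 100),
# emitting one small aggregate per window -- the rows that reach the sink.
def windows(samples, size):
    return [sum(samples[i:i + size]) / size
            for i in range(len(samples) - size + 1)]

results = {col: windows(vals, 2) for col, vals in series.items()}
print(results)
```

Since the file is bounded and already in time order, the watermarks in my streaming version exist only to preserve that order, which is what makes me think a batch-style grouping should be able to replace them.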

Could someone point me to some relevant online documentation? And before anyone suggests it, I have already searched Google for a good while without a satisfactory outcome.

TIA

Nick Walton

Re: Converting streaming to batch execution

Caizhi Weng
Hi Nick,

It seems to me that the slow part of the whole pipeline is the Derby sink. Could you swap it for another sink (for example, a CSV sink or even a "discard everything" sink) and see whether the throughput improves?

If that is the case, are you using the JDBC connector? If so, you might consider calling the `JDBCOutputFormatBuilder#setBatchInterval` method and setting the batch interval to a larger value.
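The effect of batching is easy to see outside Flink. A minimal sketch using Python's sqlite3 as a stand-in for Derby/JDBC (the API differs, but the idea is the same): send one statement per batch of rows instead of one round trip per row, which is what a larger batch interval buys you on the JDBC sink.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (series INTEGER, value REAL)")

# Synthetic rows: (series id, value), 16 interleaved series.
rows = [(i % 16, float(i)) for i in range(10_000)]

# One executemany call per batch instead of one execute per row --
# the same idea as raising the JDBC sink's batch interval.
BATCH = 1000
for start in range(0, len(rows), BATCH):
    conn.executemany("INSERT INTO samples VALUES (?, ?)",
                     rows[start:start + BATCH])
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(count)  # 10000
```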

Thanks

Nicholas Walton <[hidden email]> wrote on Thu, 28 Nov 2019 at 1:57 AM: