Hi,
I’ve been working with a pipleline that was initially aimed at processing high speed sensor data, but for a proof of concept I’m feeding simulated data from a CSV file. Each row of the file is a sample across a number of time series, and I’ve been using the streaming environment to process each column of the file in parallel. The columns are each around 6 million samples in length. The outputs are being sunk into a Derby database table. However, I’m not getting a very high throughput even running across two 32G Dell laptops with the database on a 16Mb Macbook. According to the metrics it seems I’m only dropping records on to the database at a rate of around 20-30 per second (I assume per parallel pipe of which I’m running 16 across the two laptops). The pipeline runs a couple of windowing operations one of length 10 and one of length 100 with a small amount of computation, but it does yield a considerable amount of output a 10G CSV file yielding a database of around 100Gb+. I’m thinking that the slow rate is due to using stream process to process a batch. So I’ve I’ve been looking at the batch support in Flink (1.8) intending to move the code over from stream to batch execution. However, I can’t make head’n tail of the batch DataSet documentation. For example, in the streaming environment I was setting a watermark after I read each line of the file to keep the time series in order, and using keyBy to split up the individual time series by column number. I can’t find the equivalent operations in the batch interface. Could someone guide me to some relevant online documentation,and before anyone says I have chatted with Dr Google for a good while to no satisfactory outcome TIA Nick Walton |
Hi Nick,
It seems to me that the slow part of the whole pipeline is the Derby sink. Could you change it into other sinks (for example, csv sink or even a "discard everything" sink) and see if the throughput improves? If this is the case, are you using the JDBC connector? If yes, you might consider calling the `JDBCOutputFormatBuilder#setBatchInterval` method and make the batch interval to a larger value. Thanks Nicholas Walton <[hidden email]> 于2019年11月28日周四 上午1:57写道: Hi, |
Free forum by Nabble | Edit this page |