(DEPRECATED) Apache Flink User Mailing List archive.

Execution environments for testing: local vs collection vs mini cluster

Juan Rodríguez Hortalá

Execution environments for testing: local vs collection vs mini cluster

Hi,

In https://ci.apache.org/projects/flink/flink-docs-stable/dev/local_execution.html and https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/runtime/minicluster/MiniCluster.html I see there are 3 ways to create an execution environment for testing:

StreamExecutionEnvironment.createLocalEnvironment and ExecutionEnvironment.createLocalEnvironment create an execution environment running on a single JVM using different threads.
CollectionEnvironment runs on a single JVM on a single thread.
I haven't found much documentation on the Mini Cluster, but it sounds similar to the Hadoop MiniCluster. If that is the case, then it would run on many local JVMs, each of them running multiple threads.

Am I correct about the Mini Cluster? Is there any additional documentation about it? I discovered it while looking at the source code of AbstractTestBase, which is mentioned in https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/testing.html#integration-testing. Also, it looks like launching the mini cluster registers it somewhere, so that subsequent calls to `StreamExecutionEnvironment.getExecutionEnvironment` return an environment that uses the mini cluster. Is that performed by `executionEnvironment.setAsContext()` in https://github.com/apache/flink/blob/master/flink-test-utils-parent/flink-test-utils/src/main/java/org/apache/flink/test/util/MiniClusterWithClientResource.java#L56 ? Is that execution environment registration process documented anywhere?

Which test execution environment is recommended for each test use case? For example, I don't see why I would use CollectionEnvironment when I have the local environment available and running on several threads. What is a good use case for CollectionEnvironment?

Are all 3 of these environments equally supported, or are some of them expected to be deprecated?

Are there any additional execution environments that could be useful for testing on a single host?

Thanks,

Juan

Biao Liu

Re: Execution environments for testing: local vs collection vs mini cluster

Hi Juan,

I'm not sure what you really want. Before giving some suggestions, could you answer the questions below first?

1. Do you want to write a unit test (or integration test) case for your project or for Flink? Or do you just want to run your job locally?

2. Which mode do you want to test? DataStream or DataSet?

On Tue, Jul 23, 2019 at 1:12 PM Juan Rodríguez Hortalá <[hidden email]> wrote:


Juan Rodríguez Hortalá

Re: Execution environments for testing: local vs collection vs mini cluster

Hi Biao,

Thanks for your answer.

1. Integration tests for my project.

2. Both DataStream and DataSet.

On Mon, Jul 22, 2019 at 11:44 PM Biao Liu <[hidden email]> wrote:


Biao Liu

Re: Execution environments for testing: local vs collection vs mini cluster

Hi Juan,

Sorry for the late reply.

1. The environments for DataStream and DataSet are not the same. An obvious difference is that the environments for DataStream always have a "Stream" prefix. For example, StreamExecutionEnvironment is for DataStream, while ExecutionEnvironment and CollectionEnvironment are for DataSet.

You could use StreamExecutionEnvironment.createLocalEnvironment to run or test a DataStream job, and ExecutionEnvironment.createLocalEnvironment or CollectionEnvironment to run or test a DataSet job.

Actually, you could also use StreamExecutionEnvironment.getExecutionEnvironment or ExecutionEnvironment.getExecutionEnvironment, because they choose the local environment automatically if you run the job standalone (in an IDE, or by executing the main method directly).
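
One way to picture this fallback, and the registration that `setAsContext()` appears to perform, is a static factory slot that `getExecutionEnvironment` consults before defaulting to a local environment. The following is a Flink-free sketch of that pattern only. Every name in it (`ContextEnvironmentSketch`, `Env`, `contextFactory`, `unsetAsContext`) is an assumption made for illustration, and it should not be read as Flink's actual implementation.

```java
import java.util.function.Supplier;

// Simplified, Flink-free model of "use the registered context environment,
// otherwise fall back to a local one". All names are hypothetical.
final class ContextEnvironmentSketch {

    // Stand-in for an execution environment; it only records where it came from.
    static final class Env {
        final String kind;

        Env(String kind) {
            this.kind = kind;
        }
    }

    // Factory registered by a launcher or a test resource; null means none registered.
    private static Supplier<Env> contextFactory = null;

    // Models a test resource installing its own environment as the context.
    static void setAsContext(Supplier<Env> factory) {
        contextFactory = factory;
    }

    // Models removing the registration when the test resource shuts down.
    static void unsetAsContext() {
        contextFactory = null;
    }

    // Returns the registered context environment, or a local one if none is registered.
    static Env getExecutionEnvironment() {
        return contextFactory != null ? contextFactory.get() : new Env("local");
    }

    public static void main(String[] args) {
        System.out.println(getExecutionEnvironment().kind); // prints "local"
        setAsContext(() -> new Env("mini-cluster"));
        System.out.println(getExecutionEnvironment().kind); // prints "mini-cluster"
        unsetAsContext();
        System.out.println(getExecutionEnvironment().kind); // prints "local"
    }
}
```

Under this model, job code that always calls `getExecutionEnvironment` needs no changes between an IDE run and a test that registers its own environment first.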

2. Regarding the MiniCluster, IMO it's a bit internal. The MiniCluster runs as the backend behind the local environment. I think the mini cluster of Flink is positioned somewhat differently from the mini cluster of Hadoop.

3. I will try to answer your questions below.

> Which test execution environment is recommended for each test use case?

It depends on which mode you are testing, data stream or data set.

> For example I don't see why would I use CollectionEnvironment when I have the local environment available and running on several threads, what is a good use case for CollectionEnvironment?

The official documentation says "CollectionEnvironment is a low-overhead approach for executing Flink programs". As I don't have much experience with DataSet, I just checked the relevant code. The CollectionEnvironment does not seem to start a mini cluster, so I believe it executes the job in a more lightweight way. BTW, there is no equivalent environment for DataStream.

> Are all these 3 environments supported equality, or maybe some of them is expected to be deprecated?

Obviously they are not the same, as mentioned above.

If a class is deprecated, it is marked with the `@Deprecated` annotation.
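
If it helps, the same check can be done programmatically with plain Java reflection, since the standard `Deprecated` annotation is retained at runtime. This is a minimal sketch: `LegacyEnvironment` and `CurrentEnvironment` are invented stand-ins, not Flink classes, and in practice the argument would be whichever environment class is being checked.

```java
// Checks at runtime whether a class carries the standard Deprecated annotation.
final class DeprecationCheckSketch {

    // Hypothetical stand-in for a class that has been marked as deprecated.
    @Deprecated
    static final class LegacyEnvironment {}

    // Hypothetical stand-in for a class that is still supported.
    static final class CurrentEnvironment {}

    // The Deprecated annotation has runtime retention, so reflection can see it.
    static boolean isDeprecated(Class<?> type) {
        return type.isAnnotationPresent(Deprecated.class);
    }

    public static void main(String[] args) {
        System.out.println(isDeprecated(LegacyEnvironment.class));  // prints "true"
        System.out.println(isDeprecated(CurrentEnvironment.class)); // prints "false"
    }
}
```

The Javadoc of a deprecated class usually also says what to use instead, which this check does not reveal.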

> Are there any additional execution environments that could be useful for testing on a single host?

I would suggest following the official documents [1][2] that you have already discovered, even though there might be other ways that seem equivalent. If you depend on some internal implementation, it might change over time without any notice.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/testing.html#integration-testing

[2] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/local_execution.html

On Tue, Jul 23, 2019 at 11:30 PM Juan Rodríguez Hortalá <[hidden email]> wrote:


Juan Rodríguez Hortalá

Re: Execution environments for testing: local vs collection vs mini cluster

Hi,

Thanks for your answer. I hadn't noticed that the collection environment only works for the batch API. It's also nice to know that the mini cluster is more of an internal tool. The local execution environments for batch and streaming are working very well for me, so I was just curious. Thanks for the clarifications.

Greetings,

Juan

On Fri, Jul 26, 2019 at 1:32 AM Biao Liu <[hidden email]> wrote:
