Why does flink-quickstart-scala suggest adding connector dependencies in the default scope, while the Flink Hive integration docs suggest the opposite?


Why does flink-quickstart-scala suggest adding connector dependencies in the default scope, while the Flink Hive integration docs suggest the opposite?

Yik San Chan
The question is cross-posted on Stack Overflow https://stackoverflow.com/questions/67001326/why-does-flink-quickstart-scala-suggests-adding-connector-dependencies-in-the-de.

## Connector dependencies should be in default scope

This is what [flink-quickstart-scala](https://github.com/apache/flink/blob/d12eeedfac6541c3a0711d1580ce3bd68120ca90/flink-quickstart/flink-quickstart-scala/src/main/resources/archetype-resources/pom.xml#L84) suggests:

```
<!-- Add connector dependencies here. They must be in the default scope (compile). -->

<!-- Example:
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
  <version>${flink.version}</version>
</dependency>
-->
```

It also aligns with [Flink project configuration](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/project-configuration.html#adding-connector-and-library-dependencies):

> We recommend packaging the application code and all its required dependencies into one jar-with-dependencies which we refer to as the application jar. The application jar can be submitted to an already running Flink cluster, or added to a Flink application container image.
>
> Important: For Maven (and other build tools) to correctly package the dependencies into the application jar, these application dependencies must be specified in scope compile (unlike the core dependencies, which must be specified in scope provided).
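
Concretely, I understand this to mean something like the following in the quickstart pom (a sketch; `flink-streaming-scala` stands in for the core dependencies, and the Kafka connector is just an example):

```
<!-- Core Flink dependency: provided scope, since the cluster already ships it. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
  <version>${flink.version}</version>
  <scope>provided</scope>
</dependency>

<!-- Connector dependency: default (compile) scope, so it gets packaged into the application jar. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
  <version>${flink.version}</version>
</dependency>
```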

## Hive connector dependencies should be in provided scope

However, the [Flink Hive Integration docs](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/hive/#program-maven) suggest the opposite:

> If you are building your own program, you need the following dependencies in your mvn file. It’s recommended not to include these dependencies in the resulting jar file. You’re supposed to add dependencies as stated above at runtime.
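
If I follow that, the Hive-related entries in the pom would look roughly like this (a sketch based on the artifacts the Hive docs list; `${hive.version}` is just a placeholder for whatever Hive version is in use):

```
<!-- Flink's Hive connector: provided, since the jar is expected to be available on the cluster at runtime. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-hive_${scala.binary.version}</artifactId>
  <version>${flink.version}</version>
  <scope>provided</scope>
</dependency>

<!-- Hive itself, also provided at runtime rather than bundled. -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive.version}</version>
  <scope>provided</scope>
</dependency>
```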

## Why?

Thanks!

Best,
Yik San
Re: Why does flink-quickstart-scala suggest adding connector dependencies in the default scope, while the Flink Hive integration docs suggest the opposite?

Till Rohrmann
Hi Yik San,

For future reference, I'm copying my answer from Stack Overflow here:

The reason for this difference is that for Hive it is recommended to start the cluster with the respective Hive dependencies. The documentation [1] states that it's best to put the dependencies into the lib directory before you start the cluster. That way the cluster is enabled to run jobs which use Hive. At the same time, you don't have to bundle this dependency in the user jar which reduces its size. However, there shouldn't be anything preventing you from bundling the Hive dependency with your user code if you want to.


Cheers,
Till

Re: Why does flink-quickstart-scala suggest adding connector dependencies in the default scope, while the Flink Hive integration docs suggest the opposite?

Yik San Chan
Hi Till, I have 2 follow-ups.

(1) Why is Hive special? For connectors such as Kafka, the docs suggest simply bundling the connector dependency with my user code.

(2) It seems the document misses the "before you start the cluster" part - does changing the lib directory always require a cluster restart?


Thanks.

Best,
Yik San

Re: Why does flink-quickstart-scala suggest adding connector dependencies in the default scope, while the Flink Hive integration docs suggest the opposite?

Till Rohrmann
Hi Yik San,

(1) You could do the same with Kafka. For Hive, I believe the dependency is simply quite large, so it hurts more if you bundle it with your user code.

(2) If you change the content in the lib directory, then you have to restart the cluster.

Cheers,
Till

Re: Why does flink-quickstart-scala suggest adding connector dependencies in the default scope, while the Flink Hive integration docs suggest the opposite?

Yik San Chan
Thank you, Till!
