Interact with different S3 buckets from a shared Flink cluster

Interact with different S3 buckets from a shared Flink cluster

Ricardo Cardante-2

Hi!


We are working on a use case where we have a shared Flink cluster to which multiple teams deploy their jobs. With this strategy, we are facing a challenge regarding the interaction with S3. Since we already configured S3 for the state backend (through flink-conf.yaml), every time we use API functions that communicate with the file system (e.g., DataStream readFile), the application's configuration appears to be overridden by the cluster's when attempting to communicate with external S3 buckets. What we've thought of so far:


1. Provide a core-site.xml resource file targeting the external S3 buckets we want to interact with. We've tested this, and the credentials ultimately seem to be ignored in favor of the IAM roles that are pre-loaded on the instances;

2. Load the cluster instances with multiple IAM roles. The problem with this is that every job would then be allowed to interact with buckets outside its scope;

3. Spin up multiple clusters with different configurations - we would like to avoid this, since we started from the premise of sharing a single cluster per context;


What would be a clean/recommended solution for interacting with multiple S3 buckets, each with a different security policy, from a shared Flink cluster?


Thanks in advance.
Re: Interact with different S3 buckets from a shared Flink cluster

Arvid Heise-3
Hi Ricardo,

One option is to use s3p (the Presto filesystem) for checkpointing and s3a (the Hadoop filesystem) for custom applications, and attach a different configuration to each.

In general, I'd recommend using a cluster per application precisely to avoid such issues. I'd use K8s and put the respective IAM role on each application's pod (e.g., with kiam).
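A rough sketch of the scheme split Arvid describes, assuming both the flink-s3-fs-presto and flink-s3-fs-hadoop plugins are installed (bucket names and keys are illustrative; consult the Flink filesystem docs for the exact option names supported by your version):

```yaml
# flink-conf.yaml: route checkpoints through the Presto filesystem
# by using the s3p:// scheme for the checkpoint directory.
state.backend: rocksdb
state.checkpoints.dir: s3p://checkpoint-bucket/checkpoints
```

```xml
<!-- core-site.xml: application code reads via s3a:// paths, so the
     Hadoop S3A connector can carry its own, independent settings.
     S3A also supports per-bucket overrides via the
     fs.s3a.bucket.<bucket-name>.* prefix. -->
<configuration>
  <property>
    <name>fs.s3a.bucket.team-a-bucket.assumed.role.arn</name>
    <value>arn:aws:iam::111111111111:role/team-a-role</value>
  </property>
</configuration>
```

Application code would then address external data with `s3a://team-a-bucket/...` paths while checkpoints continue to use `s3p://`, so the two configurations no longer collide.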

On Thu, Jun 18, 2020 at 1:46 AM Ricardo Cardante <[hidden email]> wrote:


--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng   
Re: Interact with different S3 buckets from a shared Flink cluster

Steven Wu

Internally, we have our own ConfigurableCredentialsProvider. Based on the config in core-site.xml, it assumes the proper IAM role for each bucket using STSAssumeRoleSessionCredentialsProvider. We just need to grant the instance credentials permission to assume the IAM roles used for bucket access. We have a single core-site.xml that lays out all the mappings:

  <property>
    <name>aws.iam.role.arn.${BUCKET_NAME}</name>
    <value>arn:aws:iam::${ACCOUNT_NUMBER}:role/${BUCKET_ROLE_NAME}</value>
  </property>
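The bucket-to-role lookup at the heart of such a provider can be sketched as below. The class and method names are hypothetical (Steven's actual provider is internal); the real implementation would wrap the resolved ARN in an STSAssumeRoleSessionCredentialsProvider from the AWS SDK, which is omitted here to keep the sketch self-contained:

```java
import java.util.Properties;

/**
 * Sketch: resolves the IAM role ARN to assume for a given S3 bucket,
 * based on "aws.iam.role.arn.<bucket>" properties as loaded from
 * core-site.xml. A full credentials provider would pass the resolved
 * ARN to STSAssumeRoleSessionCredentialsProvider (AWS SDK).
 */
public class BucketRoleResolver {
    private static final String PREFIX = "aws.iam.role.arn.";
    private final Properties conf;

    public BucketRoleResolver(Properties conf) {
        this.conf = conf;
    }

    /** Returns the role ARN mapped to the bucket, or null if none is configured. */
    public String resolveRoleArn(String bucketName) {
        return conf.getProperty(PREFIX + bucketName);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // In practice these entries come from core-site.xml.
        conf.setProperty("aws.iam.role.arn.team-a-bucket",
                "arn:aws:iam::111111111111:role/team-a-role");

        BucketRoleResolver resolver = new BucketRoleResolver(conf);
        // prints arn:aws:iam::111111111111:role/team-a-role
        System.out.println(resolver.resolveRoleArn("team-a-bucket"));
    }
}
```

Because each job only ever receives credentials for the roles its buckets map to, this keeps a single shared cluster while scoping bucket access per team.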

On Mon, Jun 22, 2020 at 7:07 AM Arvid Heise <[hidden email]> wrote:

