Using Azure Blob Storage with Flink

Using Azure Blob Storage with Flink

Joshua Griffith
I’m attempting to write to Azure Blob Storage using Flink's FileOutputFormat. I’ve included hadoop-azure within the jar I submit to Flink and configured the paths to be prefixed with wasb://{CONTAINERNAME}@{ACCOUNTNAME}.blob.core.windows.net/.
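
For reference, the path layout described above can be sketched with nothing but the JDK. This is a minimal illustration, not part of the original job: the class name `WasbPaths`, the helper `wasbUri`, and the account name `myaccount` are hypothetical. It only shows how the container ends up in the user-info position and the account in the host.

```java
import java.net.URI;
import java.net.URISyntaxException;

class WasbPaths {
    // Builds wasb://{container}@{account}.blob.core.windows.net/{path}.
    // Helper name and parameters are illustrative assumptions.
    static URI wasbUri(String container, String account, String path) {
        String host = account + ".blob.core.windows.net";
        String absolutePath = path.startsWith("/") ? path : "/" + path;
        try {
            // URI(scheme, userInfo, host, port, path, query, fragment)
            return new URI("wasb", container, host, -1, absolutePath, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid wasb path components", e);
        }
    }

    public static void main(String[] args) {
        URI uri = wasbUri("blob", "myaccount", "out/data");
        System.out.println(uri);               // full output path
        System.out.println(uri.getScheme());   // the scheme a file system must be registered for
        System.out.println(uri.getUserInfo()); // the container name
    }
}
```

The scheme printed here (`wasb`) is the value the error below complains about: the lookup fails on the scheme before the container or account are ever contacted.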

When the file output format initializes, I get the following error:

ERROR ROOT - Run 4bfb099a-8d07-11e7-8d3a-fb4d07562cc0 failed with error: 'org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Cannot initialize task 'DataSink (/out/data)': No file system found with scheme wasb, referenced in file URI 'wasb://blob@{ACCOUNTNAME}.blob.core.windows.net/out/data'.

Can I register the format programmatically from within the job (without putting credentials into a core-site.xml file on the task manager)? Can I still use Flink’s FileOutputFormat or should I be using a Hadoop OutputFormat?

Thanks,

Joshua

Re: Using Azure Blob Storage with Flink

Ted Yu
Was the hadoop-azure jar on the classpath?

The built jar file, named hadoop-azure.jar, also declares transitive dependencies on the additional artifacts it requires, notably the Azure Storage SDK for Java.

On Tue, Aug 29, 2017 at 3:24 PM, Joshua Griffith <[hidden email]> wrote:
I’m attempting to write to Azure Blob Storage using Flink's FileOutputFormat. I’ve included hadoop-azure within the jar I submit to Flink and configured the paths to be prefixed with wasb://{CONTAINERNAME}@{ACCOUNTNAME}.blob.core.windows.net/.

When the file output format initializes, I get the following error: ERROR ROOT - Run 4bfb099a-8d07-11e7-8d3a-fb4d07562cc0 failed with error: 'org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Cannot initialize task 'DataSink (/out/data)': No file system found with scheme wasb, referenced in file URI 'wasb://blob@{ACCOUNTNAME}.blob.core.windows.net/out/data’.

Can I register the format programmatically from within the job (without putting credentials into a core-site.xml file on the task manager)? Can I still use Flink’s FileOutputFormat or should I be using a Hadoop OutputFormat?

Thanks,

Joshua

Re: Using Azure Blob Storage with Flink

Joshua Griffith
Yes, hadoop-azure and azure-storage are both on the classpath. hadoop-azure is declared as a dependency in my build.sbt file and I’m using assembly to copy all of the dependencies into a single jar which is submitted to Flink. I suspect the wasb format needs to be explicitly registered with Hadoop. I think that’s accomplished by inserting the following into core-site.xml (I’m not that familiar with Hadoop):

<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>

However, I’m wondering if it’s possible to achieve the same result from within the job since it’s difficult to modify files on the task manager in our configuration.
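
As a rough sketch of what setting this from code would involve, the snippet below only assembles the key/value pairs that would otherwise live in core-site.xml. The class name `WasbHadoopSettings`, the helper `wasbSettings`, and the environment variable `AZURE_STORAGE_KEY` are hypothetical. The property names are the ones hadoop-azure documents (with the lowercase `.impl` suffix that Hadoop's lookup expects, as far as I know). Whether Flink's FileOutputFormat would see values set this way from inside a job is not established in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WasbHadoopSettings {
    // Collects the Hadoop properties that register the wasb scheme and
    // supply the storage account key. Helper name is an illustrative assumption.
    static Map<String, String> wasbSettings(String account, String accountKey) {
        Map<String, String> settings = new LinkedHashMap<>();
        // Implementation used by the Hadoop FileSystem API
        settings.put("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
        // Implementation used by the Hadoop FileContext / AbstractFileSystem API
        settings.put("fs.AbstractFileSystem.wasb.impl", "org.apache.hadoop.fs.azure.Wasb");
        // Credential for the storage account
        settings.put("fs.azure.account.key." + account + ".blob.core.windows.net", accountKey);
        return settings;
    }

    public static void main(String[] args) {
        // AZURE_STORAGE_KEY is a hypothetical variable name; the point is only
        // that the key need not be written to a file on the task manager.
        String key = System.getenv().getOrDefault("AZURE_STORAGE_KEY", "<account-key>");
        // Print the property names only, never the secret value.
        wasbSettings("myaccount", key).keySet().forEach(System.out::println);
    }
}
```

Each pair could then be applied to an `org.apache.hadoop.conf.Configuration` with `set(key, value)`. That configuration would most plausibly take effect through a Hadoop OutputFormat, which carries its own job configuration, rather than through Flink's own file system lookup.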

On Aug 29, 2017, at 5:32 PM, Ted Yu <[hidden email]> wrote:

Was the hadoop-azure jar on the classpath?

The built jar file, named hadoop-azure.jar, also declares transitive dependencies on the additional artifacts it requires, notably the Azure Storage SDK for Java.

Re: Using Azure Blob Storage with Flink

Ted Yu
There is HADOOP-14753, which is still open.

FYI

On Tue, Aug 29, 2017 at 3:41 PM, Joshua Griffith <[hidden email]> wrote:
Yes, hadoop-azure and azure-storage are both on the classpath. hadoop-azure is declared as a dependency in my build.sbt file and I’m using assembly to copy all of the dependencies into a single jar which is submitted to Flink. I suspect the wasb format needs to be explicitly registered with Hadoop. I think that’s accomplished by inserting the following into core-site.xml (I’m not that familiar with Hadoop):

<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>

However, I’m wondering if it’s possible to achieve the same result from within the job since it’s difficult to modify files on the task manager in our configuration.
