Problem with flink-orc and hive

Problem with flink-orc and hive

Сергей Чернов
Hello,
 
My situation is the following:
  1. I write data in ORC format with Flink into HDFS:
    • I implement the Vectorizer interface to process my data and convert it into a VectorizedRowBatch.
    • I create an OrcBulkWriterFactory:
      OrcBulkWriterFactory<MyData> orcBulkWriterFactory = new OrcBulkWriterFactory<>(new MyVectorizerImpl(orcSchemaString));
    • I configure a StreamingFileSink:
      StreamingFileSink.forBulkFormat(hdfsPath, orcBulkWriterFactory).withBucketAssigner(new BasePathBucketAssigner<>()).build();
    • I deploy my job to the Flink cluster and see the resulting ORC files in the hdfsPath directory.
  2. I create a Hive table with the following command:
    CREATE EXTERNAL TABLE flink_orc_test (a STRING, b BIGINT) STORED AS ORC LOCATION 'hdfsPath';
  3. I try to execute a query:
    SELECT * FROM flink_orc_test LIMIT 10;
  4. I get an error:
    Bad status for request TFetchResultsReq(fetchType=0, operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None, operationType=0,
    operationId=THandleIdentifier(secret='a\x08\xc3U\xbb\xa7I\xce\x96\xa6\xdb\x82\xa4\xa9\xd1x', guid='\xcc:\xca\xcb\x08\xa5KI\x8a}7\x95\xc5\xcd\xd2\xf0')),
    orientation=4, maxRows=100): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage='java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 6',
    sqlState=None, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 6:25:24',
    'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:496',
    'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:297',
    'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:868',
    'sun.reflect.GeneratedMethodAccessor25:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43',
    'java.lang.reflect.Method:invoke:Method.java:498',
    'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78',
    'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63',
    'java.security.AccessController:doPrivileged:AccessController.java:-2',
    'javax.security.auth.Subject:doAs:Subject.java:422',
    'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1731',
    'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59',
    'com.sun.proxy.$Proxy37:fetchResults::-1',
    'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:507',
    'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:708',
    'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1717',
    'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1702',
    'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39',
    'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39',
    'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56',
    'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286',
    'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149',
    'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624',
    'java.lang.Thread:run:Thread.java:748',
    '*java.io.IOException:java.lang.ArrayIndexOutOfBoundsException: 6:29:4',
    'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:521', 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:428',
    'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:146',
    'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2277', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:491', '*java.lang.ArrayIndexOutOfBoundsException:6:37:8', 'org.apache.orc.OrcFile$WriterVersion:from:OrcFile.java:145',
    'org.apache.orc.impl.OrcTail:getWriterVersion:OrcTail.java:74',
    'org.apache.orc.impl.ReaderImpl:<init>:ReaderImpl.java:385',
    'org.apache.hadoop.hive.ql.io.orc.ReaderImpl:<init>:ReaderImpl.java:62',
    'org.apache.hadoop.hive.ql.io.orc.OrcFile:createReader:OrcFile.java:89', 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:getRecordReader:OrcInputFormat.java:1690', 'org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit:getRecordReader:FetchOperator.java:695',
    'org.apache.hadoop.hive.ql.exec.FetchOperator:getRecordReader:FetchOperator.java:333',
    'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:459'], statusCode=3), results=None, hasMoreRows=None)
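For context, the write path in step 1 can be sketched roughly as follows (a minimal sketch against the flink-orc 1.11 API; MyData, its fields, and the schema string are placeholders matching the Hive DDL above, not the poster's actual code):

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

import org.apache.flink.orc.vector.Vectorizer;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

// Hypothetical record matching the table's two columns: a STRING, b BIGINT.
public class MyData implements Serializable {
    public String a;
    public long b;
}

// The Vectorizer writes one element into the next row of the batch;
// flink-orc flushes the batch into the ORC file once it is full.
class MyVectorizerImpl extends Vectorizer<MyData> {

    MyVectorizerImpl(String schema) {
        super(schema); // e.g. "struct<a:string,b:bigint>"
    }

    @Override
    public void vectorize(MyData element, VectorizedRowBatch batch) {
        BytesColumnVector colA = (BytesColumnVector) batch.cols[0];
        LongColumnVector colB = (LongColumnVector) batch.cols[1];
        int row = batch.size++;
        colA.setVal(row, element.a.getBytes(StandardCharsets.UTF_8));
        colB.vector[row] = element.b;
    }
}
```

The factory and sink wiring then follows the snippet in the message: the vectorizer is passed to `new OrcBulkWriterFactory<>(...)`, which in turn goes to `StreamingFileSink.forBulkFormat`.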
Dependencies:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>3.0.0-cdh6.1.1</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-orc_2.12</artifactId>
  <version>1.11.2</version>
</dependency>
 

I think the general problem is that flink-orc depends on orc-core, but to work correctly with Hive I need hive-orc.

Unfortunately, I cannot simply replace orc-core with hive-orc, because it is incompatible with the flink-orc classes.

 

How can I solve it?

 

It is preferable for me to use StreamingFileSink for writing ORC files rather than the Flink Table API with Hive.

 

Hive version: 2.1.1-cdh6.1.1

Flink Version: 1.11.2

 
--
Best regards,
Sergei Chernov

Re: Problem with flink-orc and hive

Arvid Heise-3
For cross-referencing, here is the SO thread[1]. Unfortunately, I don't have a good answer for you, other than to try to align the ORC versions somehow.
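Mechanically, aligning the versions usually means excluding orc-core from flink-orc and pinning one whose output the cluster's Hive reader accepts. This is only a sketch of the mechanics, not a verified fix: the pinned version below is hypothetical, and whether flink-orc 1.11.2 still compiles and runs against it has to be tested.

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-orc_2.12</artifactId>
  <version>1.11.2</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.orc</groupId>
      <artifactId>orc-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <!-- Hypothetical pin: choose a version whose writer version the CDH Hive reader recognizes. -->
  <version>1.4.4</version>
</dependency>
```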




--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng   

Re: Problem with flink-orc and hive

Sivaprasanna
Let me try this out on my standalone Hive. I remember reading something similar on SO[1]. In that case, the ORC files had been generated externally by Spark and an external table was created on a CDH cluster. The OP answered, referring to a community post[2] on Cloudera. It may be worth checking.

[1] https://stackoverflow.com/questions/62791744/problem-of-compatibility-of-an-external-orc-and-claudera-s-hive 
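Before trying to align versions, it may also help to confirm which writer version Flink actually produced. A sketch using the ORC project's standalone orc-tools jar (the path, file name, and jar version are placeholders; run a recent orc-tools locally, since the cluster's old Hive binaries may hit the same array bug when dumping the file):

```shell
# Copy one of the Flink-written files out of HDFS.
hdfs dfs -get /hdfsPath/part-0-0 /tmp/sample.orc

# org.apache.orc:orc-tools (the "uber" jar from Maven Central) prints the footer
# metadata, including the writer version that the old Hive reader fails to map.
java -jar orc-tools-1.5.6-uber.jar meta /tmp/sample.orc
```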
