Why does the program display an error when run on big data (customer 2.5 GB, orders 5 GB)?


hagersaleh
Why does my program display an error when I run it on big data (customer 2.5 GB, orders 5 GB)?


DataSource (at getCustomerDataSet(TPCHQuery3.java:252) (org.apache.flink.api.java.io.CsvInputFormat)) (1/1) switched to FAILED
org.apache.flink.api.common.io.ParseException: Row too short: 14999999|Customer#014999999|3emQ49UZtlfjeK
at org.apache.flink.api.common.io.GenericCsvInputFormat.parseRecord(GenericCsvInputFormat.java:283)
at org.apache.flink.api.java.io.CsvInputFormat.readRecord(CsvInputFormat.java:236)
at org.apache.flink.api.java.io.CsvInputFormat.readRecord(CsvInputFormat.java:42)
at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:489)
at org.apache.flink.api.java.io.CsvInputFormat.nextRecord(CsvInputFormat.java:204)
at org.apache.flink.api.java.io.CsvInputFormat.nextRecord(CsvInputFormat.java:42)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:148)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:724)


ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Customer> customers = getCustomerDataSet(env, mask);
DataSet<Orders> orders = getOrdersDataSet(env, maskorder);

DefaultJoin<Customer, Orders> result = customers.join(orders).where(0).equalTo(1);

result.print();
result.writeAsCsv("/home/hadoop/Desktop/Dataset/inoptmization.csv", "\n", "|", WriteMode.OVERWRITE);
env.execute();
}

public static class Customer extends Tuple3<Long, String, String> {
}

public static class Orders extends Tuple2<Long, Long> {
}

private static DataSet<Customer> getCustomerDataSet(ExecutionEnvironment env, String mask) {
    return env.readCsvFile("/home/hadoop/Desktop/Dataset/customer.csv")
        .fieldDelimiter('|')
        .includeFields(mask)
        .ignoreFirstLine()
        .tupleType(Customer.class);
}
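One way to narrow down a "Row too short" failure on a multi-gigabyte file is to scan it outside Flink and report every line whose field count is below what the schema expects. The sketch below is plain stdlib Java with no Flink dependency; the file path and expected field count are placeholders you would substitute for your own data:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvRowChecker {

    // Scans a '|'-delimited file and prints every line that has fewer
    // fields than expected; returns the number of such short rows.
    public static int countShortRows(String path, int expectedFields) throws IOException {
        int shortRows = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                // limit -1 keeps trailing empty fields, matching CSV semantics
                int fields = line.split("\\|", -1).length;
                if (fields < expectedFields) {
                    System.out.println("Line " + lineNo + " has only "
                            + fields + " field(s): " + line);
                    shortRows++;
                }
            }
        }
        return shortRows;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path and field count; adjust for your dataset.
        int bad = countShortRows("/home/hadoop/Desktop/Dataset/customer.csv", 3);
        System.out.println(bad + " short row(s) found");
    }
}
```

Running the checker over a 2.5 GB file is a single sequential pass, so it is cheap compared to re-running the failing job, and it tells you whether the problem is one corrupted record or many.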

Re: Why does the program display an error when run on big data (customer 2.5 GB, orders 5 GB)?

rmetzger0
Hi,

The problem is that Flink is parsing the input data as CSV, but some rows in the file do not conform to the specified schema. A "Row too short" ParseException means the parser reached the end of a line before it had read all the fields that the schema (and your includeFields mask) require, so check the file around the record quoted in the exception for truncated or corrupted lines. Depending on your Flink version, the CsvReader also offers an ignoreInvalidLines() option to skip malformed rows instead of failing the job.
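Concretely, the record quoted in the exception ("14999999|Customer#014999999|3emQ49UZtlfjeK") contains only three '|'-separated fields, while the standard TPC-H customer table has eight columns (an assumption based on the usual TPC-H layout, not stated in the thread), so the parser runs out of input before the schema is satisfied. A minimal stand-alone illustration of the count mismatch, not Flink's actual parser:

```java
public class RowLengthCheck {

    // Counts '|'-separated fields; limit -1 keeps trailing empty fields.
    public static int fieldCount(String row) {
        return row.split("\\|", -1).length;
    }

    public static void main(String[] args) {
        // The row reported in the ParseException:
        String row = "14999999|Customer#014999999|3emQ49UZtlfjeK";
        // Against an 8-column schema, 3 fields is "too short".
        System.out.println("fields = " + fieldCount(row));
    }
}
```

If the quoted row really is truncated in the file (rather than merely cut off in the log), fixing or removing that record, or enabling lenient parsing, should let the job proceed.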


On Thu, Jun 18, 2015 at 12:51 PM, hagersaleh <[hidden email]> wrote:





--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/when-run-progrm-on-big-data-customer-2-5GB-orders-5GB-disply-error-why-tp1699.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.