streaming using DeserializationSchema


streaming using DeserializationSchema

Martin Neumann
Hej,

I have a streaming program that reads data from Kafka, where the data is in Avro, and I have my own DeserializationSchema to handle it.

For testing purposes I want to read a dump from HDFS instead. Is there a way to use the same DeserializationSchema to read from an Avro file stored on HDFS?

Cheers, Martin

Re: streaming using DeserializationSchema

Gyula Fóra
Hey,

A very simple thing you could do is set up a small Kafka producer in a Java program that feeds the data into a topic. This has the added benefit that you are actually testing against Kafka.

Cheers,
Gyula
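
A minimal sketch of the replay producer Gyula describes. The broker address, topic name, and line-based record splitting are assumptions; an Avro binary dump would need Avro's DataFileReader to split records instead of reading lines.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class DumpReplayer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        // Replay each line of the dump as one Kafka record.
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                producer.send(new ProducerRecord<>("test-topic", line.getBytes("UTF-8")));
            }
        }
    }
}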


Re: streaming using DeserializationSchema

Martin Neumann
It's not only about testing; I will also need to run things against different datasets. I want to reuse as much of the code as possible while loading the same data from a file instead of Kafka.

Is there a simple way to load the data from a file using the same conversion classes I would use to transform the records when reading them from Kafka, or do I have to write a new Avro deserializer (InputFormat)?


Re: streaming using DeserializationSchema

Nick Dimiduk
Hi Martin,

I have the same use case: I wanted to be able to load dumps of data in the same format as on the Kafka queue. I created a new application main; call it the "job" as opposed to the "flow". I refactored my flow-building code a bit so that all of it can be reused via a factory method. I then implemented a MapFunction that simply calls my existing deserializer. Create a new DataStream from the flat file and tack on the MapFunction step; the resulting DataStream is type-compatible with the Kafka consumer that starts the "flow" application, so I pass it into the factory method. Tweak the ParameterTool options for the "job" application, et voilà!

Sorry I don't have example code for you; this would be a good example to contribute back to the community's example library though.

Good luck!
-n
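
A hedged sketch of the refactor Nick describes. All names here (FileBackedJob, buildFlow, MyEvent, MyDeserializationSchema, the HDFS path) are hypothetical stand-ins for the user's own code, not anything from Nick's actual application.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FileBackedJob {

    // The "flow" logic, factored out so that both the Kafka main and the
    // file main can hand it a type-compatible DataStream<MyEvent>.
    static void buildFlow(DataStream<MyEvent> events) {
        events.print(); // the real pipeline goes here
    }

    // The file-backed "job" main: read the dump as text, then reuse the
    // existing deserializer inside a MapFunction.
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<MyEvent> events = env
                .readTextFile("hdfs:///path/to/dump") // assumed location
                .map(new MapFunction<String, MyEvent>() {
                    private final MyDeserializationSchema schema = new MyDeserializationSchema();
                    @Override
                    public MyEvent map(String line) throws Exception {
                        return schema.deserialize(line.getBytes("UTF-8"));
                    }
                });

        buildFlow(events); // the same factory method the Kafka "flow" main calls
        env.execute("file-backed job");
    }
}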


Re: streaming using DeserializationSchema

Martin Neumann
I'm trying the same thing now.

I guess you need to read the file as byte arrays somehow to make it work. What read function did you use? The mapper is not hard to write, but the byte-array part gives me a headache.

Cheers, Martin

Re: streaming using DeserializationSchema

Nick Dimiduk
My input file contains newline-delimited JSON records, one per text line. The records on the Kafka topic are the same JSON blobs, encoded as UTF-8 and written as bytes.
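
For contrast with the file-backed sketch above, this is roughly what the Kafka-backed "flow" entry point could look like when the topic carries those same UTF-8 bytes. The consumer class matches the Flink 1.0-era 0.8 connector; the topic name and connection properties are assumptions, and MyDeserializationSchema / MyEvent are again stand-ins.

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;

public class KafkaBackedFlow {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed
        props.setProperty("zookeeper.connect", "localhost:2181"); // the 0.8 consumer needs ZooKeeper
        props.setProperty("group.id", "test-group");

        // The topic values are the same bytes as the file lines, so the very
        // same DeserializationSchema implementation is plugged in here.
        DataStream<MyEvent> events = env.addSource(
                new FlinkKafkaConsumer08<>("events", new MyDeserializationSchema(), props));

        FileBackedJob.buildFlow(events); // shared with the file-backed sketch
        env.execute("kafka-backed flow");
    }
}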


Re: streaming using DeserializationSchema

Martin Neumann
I ended up not using the DeserializationSchema and instead going with an AvroInputFormat when reading from a file. I would have preferred to keep the code simpler, but the map solution turned out a lot more complicated: the raw data I have is in Avro binary format, so I cannot just read it as text lines and map them afterwards.

Cheers, Martin
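
A sketch of the AvroInputFormat route Martin settled on. MyAvroRecord stands in for the generated Avro class and the HDFS path is an assumption; note that AvroInputFormat's package has moved between Flink versions (org.apache.flink.api.java.io in the 1.0 era, org.apache.flink.formats.avro in later releases).

import org.apache.flink.api.java.io.AvroInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AvroFileJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Avro container files carry their own schema, so the input format can
        // hand back deserialized records directly; no byte-level map is needed.
        AvroInputFormat<MyAvroRecord> format = new AvroInputFormat<>(
                new Path("hdfs:///path/to/dump.avro"), MyAvroRecord.class);

        DataStream<MyAvroRecord> events = env.createInput(format);

        events.print(); // or hand the stream to the shared flow-building code
        env.execute("avro-file job");
    }
}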

