Accessing RDF triples using Flink

Ritesh Kumar Singh
Hi,

I need some suggestions about accessing RDF triples from Flink. I'm trying to integrate Flink into a pipeline where the input for Flink comes from a SPARQL query on a Jena model, and after modifying the triples with Flink, I will perform a SPARQL update through Jena to save my changes.
  • Is there a recommended input format for loading the triples into Flink?
  • Would this use case be classified as a Flink streaming job or a batch processing job?
  • How will loading the dataset vary with the input size?
  • Are there any recommended packages/projects for this type of project?
Any suggestions will be of great help.
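To make my setup concrete, here is a rough sketch of what I currently have in mind (the file name, the queries, and the transformation are just placeholders, and I am not sure this is the right way to hand the triples to Flink, which is exactly my question):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.update.UpdateAction;

import java.util.ArrayList;
import java.util.List;

public class JenaFlinkSketch {

    public static void main(String[] args) throws Exception {
        // 1) SPARQL SELECT on a Jena model (file name and query are placeholders).
        Model model = ModelFactory.createDefaultModel();
        model.read("data.ttl");

        List<Tuple3<String, String, String>> triples = new ArrayList<>();
        QueryExecution qe = QueryExecutionFactory.create(
                "SELECT ?s ?p ?o WHERE { ?s ?p ?o }", model);
        try {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                QuerySolution sol = rs.next();
                triples.add(new Tuple3<>(sol.get("s").toString(),
                                         sol.get("p").toString(),
                                         sol.get("o").toString()));
            }
        } finally {
            qe.close();
        }

        // 2) Modify the triples with Flink (here as a batch DataSet job).
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple3<String, String, String>> modified = env
                .fromCollection(triples)
                .filter(t -> !t.f2.isEmpty());   // placeholder transformation

        // 3) Write the changes back with SPARQL UPDATE via Jena (naive: one INSERT per triple).
        for (Tuple3<String, String, String> t : modified.collect()) {
            String update = "INSERT DATA { <" + t.f0 + "> <" + t.f1 + "> \"" + t.f2 + "\" }";
            UpdateAction.parseExecute(update, model);
        }
    }
}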

Regards,
Ritesh

Re: Accessing RDF triples using Flink

Flavio Pompermaier

Hi Ritesh,
I have some experience with RDF and Flink. What do you mean by accessing a Jena model? How do you create it?

In my experience, reading triples directly from Jena models is troublesome because of garbage-collection problems.

Re: Accessing RDF triples using Flink

Ritesh Kumar Singh
Hi Flavio,
  1. How do you access your RDF dataset from Flink? Do you read it as a normal input file and split the records yourself, or do you have wrappers in place to convert the RDF data into triples? Could you please share some code samples if possible?
  2. I am using the Jena TDB command-line utilities to run queries against the dataset, in order to avoid Java garbage-collection issues. I also use the Jena Java APIs as a dependency, but the command-line utilities are much faster (though this adds the requirement of having the Jena command-line tools installed on the system). The main reason for this approach is being able to pass the string output from the command line to Flink as part of my pipeline. Can you tell me your approach to this?
  3. Should I dump my query output to a file and then consume it as a normal input source for Flink? (A rough sketch of what I mean follows this list.)
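This is roughly what I mean by points 2 and 3 (the TDB location, query file, and output path are placeholders, and I am assuming tdbquery's TSV result format here):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

import java.io.File;

public class TdbDumpToFlink {

    public static void main(String[] args) throws Exception {
        // 1) Run the Jena TDB command-line query tool and redirect its output to a file.
        //    The TDB location, query file, and output path are placeholders.
        ProcessBuilder pb = new ProcessBuilder(
                "tdbquery", "--loc=/data/tdb", "--query=select-triples.rq", "--results=TSV");
        pb.redirectOutput(ProcessBuilder.Redirect.to(new File("/tmp/triples.tsv")));
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IllegalStateException("tdbquery failed with exit code " + exit);
        }

        // 2) Consume the dump as a normal text input source in a Flink batch job.
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple3<String, String, String>> triples = env
                .readTextFile("/tmp/triples.tsv")
                .filter(line -> !line.startsWith("?"))   // skip the TSV header row (?s ?p ?o)
                .map(new MapFunction<String, Tuple3<String, String, String>>() {
                    @Override
                    public Tuple3<String, String, String> map(String line) {
                        String[] f = line.split("\t", 3);
                        return new Tuple3<>(f[0], f[1], f[2]);
                    }
                });

        triples.first(10).print();
    }
}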

Basically, any help regarding this would be appreciated.

Regards,
Ritesh

Re: Accessing RDF triples using Flink

Flavio Pompermaier
Hi Ritesh,
Jena provides an NQuadsInputFormat, which is a Hadoop InputFormat, so you can read the data efficiently with Flink's Hadoop compatibility. Unfortunately, I remember having some problems using it, so I simply export my Jena model as N-Quads and then parse it efficiently with Flink as a text file.
However, parsing with Sesame 4 is more efficient in terms of speed and garbage collection.

What I do is convert every quad into a Tuple5, group the triples/quads by subject, and then apply some logic. The quads grouped by subject are what we call an "entiton atom", and combining atoms leads to an "entiton molecule" (i.e. a graph rooted in some entiton atom).
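Just to give you an idea, here is a simplified sketch (this is not our actual code: the Tuple5 layout, the file path, and the per-atom logic are only assumptions for illustration). It reads the exported N-Quads file as text, parses each line with Sesame's Rio parser, and groups the quads by subject to build the atoms:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.util.Collector;

import org.openrdf.model.Literal;
import org.openrdf.model.Model;
import org.openrdf.model.Statement;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.Rio;

import java.io.StringReader;

public class NQuadsToAtoms {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Each N-Quads line is one self-contained statement, so lines can be parsed independently.
        DataSet<Tuple5<String, String, String, String, String>> quads = env
                .readTextFile("/tmp/export.nq")   // placeholder path
                .flatMap(new FlatMapFunction<String, Tuple5<String, String, String, String, String>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple5<String, String, String, String, String>> out) throws Exception {
                        if (line.trim().isEmpty()) {
                            return;
                        }
                        // Parse the single line with Sesame Rio (Sesame 4 / org.openrdf API).
                        Model m = Rio.parse(new StringReader(line), "", RDFFormat.NQUADS);
                        for (Statement st : m) {
                            out.collect(new Tuple5<>(
                                    st.getSubject().stringValue(),
                                    st.getPredicate().stringValue(),
                                    st.getObject().stringValue(),
                                    st.getObject() instanceof Literal ? "literal" : "resource",  // assumed 5th field
                                    st.getContext() == null ? "" : st.getContext().stringValue()));
                        }
                    }
                });

        // Group all quads that share a subject: this group is the "entiton atom".
        // The logic applied per atom is a placeholder (here: just count the quads).
        DataSet<Tuple2<String, Integer>> atomSizes = quads
                .groupBy(0)
                .reduceGroup(new GroupReduceFunction<Tuple5<String, String, String, String, String>, Tuple2<String, Integer>>() {
                    @Override
                    public void reduce(Iterable<Tuple5<String, String, String, String, String>> atom,
                                       Collector<Tuple2<String, Integer>> out) {
                        String subject = null;
                        int n = 0;
                        for (Tuple5<String, String, String, String, String> q : atom) {
                            subject = q.f0;
                            n++;
                        }
                        out.collect(new Tuple2<>(subject, n));
                    }
                });

        atomSizes.print();
    }
}

Since every N-Quads line is a complete statement, this kind of parsing parallelizes nicely over the input splits.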

We presented our work at Flink Forward 2015 in Berlin.
If you need some code that reads the N-Quads with Flink, just write me in private and I can share it!

Best,
Flavio
