Reading from sockets using the DataSet API

Posted by kaansancak on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Reading-from-sockets-using-dataset-api-tp34560.html

Hi,

I have been running experiments on large graph data; the smallest graph I use has around 70 billion edges. I have a graph generator that produces the graph in parallel and feeds it to the running system. However, reading the edges takes a long time: even though graph generation is parallel, in Flink I can only listen from the master node (correct me if I am wrong). Another option is to dump the generated data to a file and read it with readCsvFile, but that is not feasible in terms of storage management.

What I want to do is invoke my graph generator over IPC/TCP and read the generated data from sockets. Since the graph data is also generated in parallel on each node, I want to use IPC to read the data in parallel on each node. I did some digging online but couldn't find anything similar that uses the DataSet API. I would be glad to hear about similar use cases or examples.
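As a rough illustration of the pattern described above (one reader per generator socket, all running concurrently), here is a minimal sketch in plain Java. It does not use any Flink API. The class name ParallelSocketEdgeReader, the "src dst" line format, and the stand-in generator are assumptions made up for illustration only, not part of the actual setup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSocketEdgeReader {

    /** Hypothetical edge record: source and target vertex ids. */
    public static final class Edge {
        public final long src;
        public final long dst;
        public Edge(long src, long dst) { this.src = src; this.dst = dst; }
    }

    /** Reads "src dst" lines (assumed format) from one socket until the peer closes it. */
    public static List<Edge> readEdges(String host, int port) throws IOException {
        List<Edge> edges = new ArrayList<>();
        try (Socket socket = new Socket(host, port);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length != 2) {
                    continue; // skip blank or malformed lines
                }
                edges.add(new Edge(Long.parseLong(parts[0]), Long.parseLong(parts[1])));
            }
        }
        return edges;
    }

    /** Starts one reader thread per port and returns the total number of edges read. */
    public static long readAll(String host, int[] ports) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(ports.length);
        try {
            List<Future<List<Edge>>> futures = new ArrayList<>();
            for (int port : ports) {
                Callable<List<Edge>> task = () -> readEdges(host, port);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<List<Edge>> f : futures) {
                total += f.get().size(); // blocks until that reader has finished
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Stand-in for the real graph generator: serves edgeCount chain edges to one client. */
    public static ServerSocket startGenerator(long firstVertex, int edgeCount) throws IOException {
        ServerSocket server = new ServerSocket(0); // 0 = pick any free port
        Thread t = new Thread(() -> {
            try (Socket client = server.accept();
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                for (int i = 0; i < edgeCount; i++) {
                    out.println((firstVertex + i) + " " + (firstVertex + i + 1));
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        t.setDaemon(true);
        t.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        ServerSocket a = startGenerator(0, 3);
        ServerSocket b = startGenerator(100, 5);
        long total = readAll("127.0.0.1", new int[] {a.getLocalPort(), b.getLocalPort()});
        System.out.println("edges read: " + total);
        a.close();
        b.close();
    }
}
```

In this sketch every reader collects its edges in memory, which would not scale to billions of edges; it only shows the access pattern, where each parallel worker pulls from its own local generator rather than everything going through a single node.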

Is it possible to use the streaming environment to create the data in parallel and then switch to the DataSet API?

Thanks in advance!

Best
Kaan