Hi,
I have an API that emits output that I want to use as a data source for Flink. I have written a custom source function along the following lines:

    public class DynamicRuleSource extends AlertingRuleSource {
        ...
    }

The run() method in this source polls for any data to be ingested. The same object instance is shared between the API and the Flink execution environment; however, the output of the API does not get ingested into the Flink DataStream.

Is this the right pattern to use, or is Kafka the recommended way of streaming data into Flink?

--Aarti
Director, Engineering, Correlation
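For reference, a minimal sketch of the polling pattern described above, written directly against Flink's RichSourceFunction since the body of AlertingRuleSource is not shown; the String record type, field names, and queue-based hand-off are illustrative assumptions, not the actual code from the question:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

    // Illustrative polling source: the API thread offers records into a queue,
    // and the source's run() loop drains that queue into the DataStream.
    public class DynamicRuleSource extends RichSourceFunction<String> {

        // Shared with the API layer; only meaningful while the API and the job
        // run inside the same JVM.
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        private volatile boolean running = true;

        // Called by the API whenever it has new output to feed into the stream.
        public void addRecord(String record) {
            queue.offer(record);
        }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            while (running) {
                String record = queue.poll(100, TimeUnit.MILLISECONDS);
                if (record != null) {
                    // Emit under the checkpoint lock so emission does not
                    // interleave with checkpointing.
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(record);
                    }
                }
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

Note that when such a source is registered with env.addSource(...), Flink serializes it and re-instantiates it inside the runtime, so the instance the API writes to is generally not the instance the job reads from unless everything runs in a single local JVM, which is the issue the reply below points at.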
Hi Aarti,

I would imagine that the described approach (sharing the same object instance between the API and the Flink runtime) would only work in toy executions, such as running the job within the IDE. Moreover, you would not be able to get exactly-once semantics with this source, which, for most users, is one of the main advantages of choosing Kafka as the source for Flink jobs. Given the way Flink checkpointing works [1], Flink's exactly-once guarantees rely on the fact that Kafka records can be replayed from a deterministic record offset.

Please let me know if you have any further questions.

Cheers,
Gordon

On Mon, Nov 12, 2018 at 3:16 PM Aarti Gupta <[hidden email]> wrote:
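For illustration, a minimal Kafka-backed version of such a job, assuming the API publishes its output to a Kafka topic instead of handing records to a shared object; the topic name "rules", group id, broker address, and checkpoint interval are placeholders, and older Flink releases use a version-suffixed consumer class (e.g. FlinkKafkaConsumer011) instead of the universal FlinkKafkaConsumer:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class KafkaRuleSourceJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpointing is what lets Flink rewind to a consistent Kafka offset
            // after a failure, which is the basis of the exactly-once behaviour
            // described above.
            env.enableCheckpointing(10_000);

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");  // placeholder broker
            props.setProperty("group.id", "alerting-rules");           // placeholder group id

            // The API writes its output to the "rules" topic; Flink reads it back
            // as a replayable DataStream.
            FlinkKafkaConsumer<String> consumer =
                    new FlinkKafkaConsumer<>("rules", new SimpleStringSchema(), props);

            DataStream<String> rules = env.addSource(consumer);
            rules.print();

            env.execute("Kafka-backed rule source");
        }
    }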