Timeout Exception When Producing/Consuming Messages to Hundreds of Topics


Claude Murad
Hello,

I'm trying to run an experiment w/ two flink jobs:
  • A producer producing messages to hundreds of topics
  • A consumer consuming the messages from all the topics
After the job runs for a few minutes, it fails w/ the following error:

Caused by: org.apache.kafka.common.errors.TimeoutException: Topic <topic-name> not present in metadata after 60000 ms

If I run the job w/ only a few topics, it works. I have tried setting the following properties in the job but still encounter the problem:

properties.setProperty("retries", "20");
properties.setProperty("request.timeout.ms", "300000");
properties.setProperty("metadata.fetch.timeout.ms", "300000");
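For reference, the way I build the producer config looks roughly like the sketch below (the broker address is a placeholder). I've also read that in newer Kafka clients the 60000 ms metadata wait is bounded by `max.block.ms` (default 60000) rather than `metadata.fetch.timeout.ms`, which was removed, so I tried raising that too:

```java
import java.util.Properties;

public class ProducerConfigSketch {
    public static Properties producerProps() {
        Properties props = new Properties();
        // Placeholder broker address, not my real cluster.
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("retries", "20");
        props.setProperty("request.timeout.ms", "300000");
        // In recent Kafka clients, max.block.ms bounds how long send()
        // blocks waiting for topic metadata; metadata.fetch.timeout.ms
        // no longer exists as a producer config.
        props.setProperty("max.block.ms", "300000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerProps());
    }
}
```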

Any ideas about this?

Thanks
Re: Timeout Exception When Producing/Consuming Messages to Hundreds of Topics

Piotr Nowojski-4
Hi,

Sorry, I don't know. I've heard that this kind of pattern is discouraged by Confluent. At least it used to be.

Maybe someone else from the community will be able to help from their experience; however, keep in mind that under the hood Flink simply uses the KafkaConsumer and KafkaProducer provided by the Apache Kafka project, only in a distributed fashion. You might have better luck finding an answer on the Kafka user mailing list, or searching for how to tune Kafka to work with hundreds of topics or how to resolve timeout problems.

Unless this is a well-known and common issue, you will need to dig out more information on your own. For starters, try to determine whether the problem is caused by overloaded Kafka clients running in Flink, or by overloaded Kafka brokers. Start with the standard Java/system checks:
- CPU overloaded
- memory pressure / GC pauses
- machines swapping
- blocking disk I/O (iostat)
- a single thread overloaded
- network issues
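If it helps, the GC-pause item on that list can even be checked from inside the running JVM via the standard management beans, without any extra tooling. A rough self-contained sketch (the allocation loop is just there to provoke some collections):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {
    // Total accumulated GC time in ms across all collectors in this JVM.
    public static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector doesn't report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcTimeMs();
        // Allocate short-lived garbage to trigger minor collections.
        for (int i = 0; i < 1_000_000; i++) {
            byte[] junk = new byte[1024];
        }
        long after = totalGcTimeMs();
        System.out.println("GC time delta: " + (after - before) + " ms");
    }
}
```

Logging this delta periodically from a task would show whether metadata timeouts coincide with long collection pauses on the client side.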

Once you identify the source of the problem, it will be easier to find a solution. But again, most likely you will find your answer quicker asking on Kafka mailing lists.

Best,
Piotrek

On Mon, 1 Mar 2021 at 18:01, Claude M <[hidden email]> wrote:
Re: Timeout Exception When Producing/Consuming Messages to Hundreds of Topics

Arvid Heise-4
Hi Claude,

Are you sure this level of abstraction is the right fit for your application? Maybe you could describe your use case a bit.

Since you use one consumer, I'm assuming all topics share the same schema. In that case, hundreds of similarly structured topics with X partitions each sounds counter-intuitive to me. Couldn't you just put everything into a single topic with X*100 partitions and use a composite key?
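To sketch what I mean (the key format and partition count below are made up): encode the former topic name into the record key, so that records which previously had their own topic still hash to a stable partition of the single large topic:

```java
public class CompositeKeySketch {
    // Combine the former topic name and the record's own key into one key.
    public static String compositeKey(String formerTopic, String recordKey) {
        return formerTopic + "|" + recordKey;
    }

    // Illustrates hash partitioning: equal composite keys always map to the
    // same partition, preserving per-key ordering within the big topic.
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String key = compositeKey("sensor-topic-17", "device-42");
        System.out.println(key + " -> partition " + partitionFor(key, 300));
    }
}
```

Note that Kafka's default partitioner actually hashes the serialized key bytes with murmur2 rather than using String.hashCode; the sketch only illustrates the idea that equal composite keys stay together.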

On Thu, Mar 4, 2021 at 9:42 AM Piotr Nowojski <[hidden email]> wrote: