Scale up REGEX pipeline

Antón Rodríguez Yuste
Hi community,

I'm working on a pipeline that needs to match each message against several regular expressions. I have around 10K regexes but, depending on some metadata in the message, I only need to apply 5-10 of them to any given message.

I've been doing some research with java.util.regex, com.google.re2j and org.apache.regexp.RE, but none of them fits my requirements:
  • If I don't compile the patterns ahead of time, each operation takes too long.
  • If I compile them all, they occupy a huge amount of memory, which makes the process too expensive.
It's OK if I skip some messages, so I was considering implementing a lazy cache that keeps in memory only the compiled patterns that are "hot". This solution seems quite complex and not ideal.
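
To make the lazy-cache idea concrete, here is a minimal sketch using java.util.regex and an access-ordered LinkedHashMap as an LRU cache. The class name PatternCache, its methods and the maxEntries limit are hypothetical names chosen for illustration, not part of any library, and the right cache size would depend on how much memory each compiled pattern actually takes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical LRU cache of compiled patterns (illustrative only). */
class PatternCache {
    private final Map<String, Pattern> cache;

    PatternCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most
        // recently used, so the "eldest" entry is the coldest pattern.
        this.cache = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
                // Evict the coldest pattern once the cache exceeds its limit.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the compiled pattern, compiling it only on a cache miss. */
    synchronized Pattern get(String regex) {
        Pattern p = cache.get(regex);      // a hit also marks the entry as recently used
        if (p == null) {
            p = Pattern.compile(regex);    // miss: pay the compile cost once
            cache.put(regex, p);           // may evict the least recently used entry
        }
        return p;
    }

    /** Reports whether a regex is currently cached, without touching LRU order. */
    synchronized boolean isCached(String regex) {
        return cache.containsKey(regex);
    }

    synchronized int size() {
        return cache.size();
    }
}
```

With something like this, the per-message code would look up only the 5-10 regexes selected by the metadata, for example `cache.get(regex).matcher(text).find()`, so only recently used patterns stay compiled in memory.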

Do you know of any other alternative or idea to tackle this?

Cheers,

Antón