Lookahead and Lookbehind Zero-Length Assertions
Robotic Online Intelligence Ltd

Text. Patterns. Search. The Power of Regular Expressions.

Few people in the business world are familiar with regular expressions (regex), yet they may be among the most useful tools for searching, filtering and extracting information.

Put simply, we think of regex as a text search method that focuses on patterns of characters in the text, as opposed to just keywords.

Technically, regex is typically a library or built-in feature that you load in Python or any other major language, with a syntax that varies slightly from one language to another. For a proper introduction to regex, there are plenty of references online, in blogs and in books.

Textbooks typically introduce regex in the context of checking whether a given string of characters is a valid email address or phone number, based on the character patterns that are normal for such cases.
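As a rough illustration of that textbook use case, here is a minimal sketch in Python. The pattern and the function name `looks_like_email` are our own simplifications for this example; real email validation is considerably more involved.

```python
import re

# Deliberately simple, illustrative pattern: something before a single "@",
# no whitespace, and at least one dot in the domain part.
# This is NOT a complete validator for real-world email addresses.
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(text: str) -> bool:
    """Return True if the whole string has the rough shape of an email address."""
    return EMAIL_SHAPE.fullmatch(text) is not None


print(looks_like_email("jane.doe@example.com"))  # True
print(looks_like_email("not an email"))          # False
```

The point is that the check says nothing about specific keywords; it only describes the shape of the string.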

At Robotic Online Intelligence (ROI), we use regex heavily to detect the nature of a text and the topics it covers, ultimately for text classification, tagging and relevance scoring, as we do on the Kubro™ platform.

For example, say the objective is to identify headlines or tweets that are likely about one company acquiring another. We would then require the text to contain three things: some indication of an acquisition or a deal (acquired, purchased, merged, or one of 50 other words or expressions); some numerical value, as typically stated in such cases (e.g. 20, 250); and some notion of currency or another indication that the figure relates to value (e.g. $, USD, billion, million), in all kinds of variations.
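The three conditions above can be sketched in Python as three separate patterns that must all be found in the text. The word lists here are hypothetical and heavily shortened for illustration, and the function name `looks_like_acquisition` is our own for this example; a production rule set would be much longer and more carefully tuned.

```python
import re

# Hypothetical, heavily shortened lists; a real rule set would cover many more variants.
ACTION = re.compile(
    r"\b(?:acquir(?:e|es|ed|ing)|purchas(?:e|es|ed|ing)|merg(?:e|es|ed|ing)|buys?|bought|takeover)\b",
    re.IGNORECASE,
)
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
VALUE_HINT = re.compile(
    r"[$€£]|\b(?:USD|EUR|GBP)\b|\b(?:billion|million|bn|mn)\b",
    re.IGNORECASE,
)


def looks_like_acquisition(text: str) -> bool:
    """Return True only if the text has a deal word, a number and a value hint."""
    return all(p.search(text) for p in (ACTION, NUMBER, VALUE_HINT))


print(looks_like_acquisition("Acme acquires Widget Co for $250 million"))  # True
print(looks_like_acquisition("Acme opens new office in Berlin"))           # False
```

Keeping the conditions as separate patterns makes each list easy for a domain expert to extend without touching the others.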

Human expertise is required to define such cases properly, but we find that this works well in B2B and finance domains, where the ‘user’ in a particular role knows their domain well enough to define the logic.

In Python, the regex for the ‘value figure’ could look like this:

[Image: Python regex for the ‘value figure’]

Source: Robotic Online Intelligence

The expression above also includes an additional condition: the text must be at least 20 characters long. You can visualize the same regex as:

[Image: visualization of the same regex]

Source: Generated on https://www.debuggex.com/
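To give a feel for what such an expression can look like in text form, here is a minimal sketch in Python. It is not the expression used on Kubro™; the currency list, the magnitude words and the function name `find_value_figure` are assumptions made for this example. It expresses the length condition as a lookahead at the start of the text, which is one possible way to do it.

```python
import re

# Illustrative sketch only; the currency and magnitude lists are assumptions.
VALUE_FIGURE = re.compile(
    r"""
    \A(?=.{20,})                                # lookahead: text has at least 20 characters
    .*?                                         # skip lazily to the first value figure
    (?P<figure>
        (?:[$€£]|\b(?:USD|EUR|GBP)\s?)          # currency symbol or code
        \d+(?:[.,]\d+)*                         # the number itself
        (?:\s?(?:billion|million|bn|mn|[bmk])\b)?   # optional magnitude word
    )
    """,
    re.VERBOSE | re.IGNORECASE | re.DOTALL,
)


def find_value_figure(text: str):
    """Return the first value figure in the text, or None if there is no match."""
    match = VALUE_FIGURE.match(text)
    return match.group("figure") if match else None


print(find_value_figure("Acme to buy Widget Co for $250 million in cash"))  # $250 million
print(find_value_figure("Deal worth $5bn"))                                 # None (too short)
```

The lookahead checks the length without consuming any characters, so the rest of the pattern still starts scanning from the beginning of the text.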

For example, here’s how this expression would pick up (highlights in green) the relevant figures in some tweets on Kubro™:

[Image: tweets on Kubro™ with the matched figures highlighted in green]

Source: Robotic Online Intelligence

Why isn’t regex more widely known and used from a business perspective? The flexibility of regex comes at the cost of complexity once you go beyond the simple use cases. Besides, running it at web scale is not practical (general search engines such as Google or Bing do not accept regex queries).

However, when applied repeatedly and in a focused manner to a specific domain, regex can be amazingly powerful. At Robotic Online Intelligence, we are big fans of regex.

What about the “Lookahead and Lookbehind Zero-Length Assertions”? If you are curious about that title, check a comprehensive explanation here.
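For a quick taste, here is a minimal Python sketch of both assertions; the sample sentence is made up for this example. A lookbehind checks what comes before the match and a lookahead checks what comes after it, and in both cases the checked text is not part of the match itself (hence “zero-length”).

```python
import re

text = "Paid USD 250 for 30 licences, then 40 more."

# Lookbehind: digits that are preceded by "USD " (the prefix is checked, not captured).
after_usd = re.findall(r"(?<=USD )\d+", text)

# Lookahead: digits that are followed by " licences" (the suffix is checked, not captured).
before_licences = re.findall(r"\d+(?= licences)", text)

print(after_usd)        # ['250']
print(before_licences)  # ['30']
```

Note that Python's built-in `re` module requires the lookbehind pattern to have a fixed width, which "USD " does.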

--- END ---
