How to get the string/part/word/text within brackets in Python using Regex

PROBLEM DEFINITION

For example, you have a string like the following:

[lagril] L.A. Girl Pro. Setting HD Matte Finish Spray

While scanning the line, you would like to extract the word ‘lagril’ from it. How do you do that?

GETTING TEXT WITHIN BRACKETS USING REGEX IN PYTHON

Our problem is a common string extraction problem in software engineering, and we usually solve it with regular expressions. Let’s build the regular expression first, using regex101.com

We need to match a string that starts with a ‘[’ bracket and ends with a ‘]’ bracket, with zero or more alphanumeric characters (lowercase letters, uppercase letters, or digits) in between. Because square brackets have a special meaning in regex, the literal ones are escaped with a backslash. So, the pattern should be as simple as the following:

\[[A-Za-z0-9]*\]

Now, this should help us target the bracketed word in a sentence or larger string, but the match includes the brackets themselves. The trick to grabbing only the text inside the brackets is to group it. To create a group in regex, we wrap part of the pattern in () parentheses, without a backslash in front. So the regex becomes the following:

\[([A-Za-z0-9]*)\]

This puts the text inside the brackets in group 1. Now, how do you read group 1 from the match? Let’s dive into Python now:

# let's import regular expression engine first
import re

# our string
txt = '[lagril] L.A. Girl Pro. Setting HD Matte Finish Spray'

# our regex search is as follows:
x = re.search(r"\[([A-Za-z0-9]*)\]", txt)

# The pattern puts the inner text in group 1. re.search returns a match object
# (or None when nothing matches), and its group() method returns a captured group.
print(x.group(1))  # prints lagril
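
The snippet above assumes the string always contains a bracketed word. As a minimal sketch of a more defensive version, the helper names `first_bracketed` and `all_bracketed` below are made up for this illustration; they handle the no-match case and strings with several bracketed words, using the same pattern:

```python
import re

# Same pattern as above: capture the alphanumeric text between [ and ].
BRACKET_RE = re.compile(r"\[([A-Za-z0-9]*)\]")


def first_bracketed(text):
    """Return the text inside the first [..] pair, or None if there is none."""
    match = BRACKET_RE.search(text)
    # re.search returns None when nothing matches, so guard before calling group().
    return match.group(1) if match else None


def all_bracketed(text):
    """Return the text inside every [..] pair, in order of appearance."""
    # With exactly one capturing group, findall returns the group contents.
    return BRACKET_RE.findall(text)


txt = "[lagril] L.A. Girl Pro. Setting HD Matte Finish Spray"
print(first_bracketed(txt))             # lagril
print(first_bracketed("no brackets"))   # None
print(all_bracketed("[a1] foo [b2]"))   # ['a1', 'b2']
```

Returning None instead of raising keeps the caller in charge of what a missing bracket means, which is usually what you want when scanning many lines.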

Grep Non ASCII Character Sets

I had an interesting challenge today about filtering a list using grep with a set like the following:

senegalese footballer|সেনেগালীয় ফুটবলার
species of insect|কীটপতঙ্গের প্রজাতি
indian cricket player|ভারতীয় ক্রিকেটার
Ajit Manohar Pai|অজিত পৈ|অজিত মনোহর পাই|অজিত ভরদ্বাজ পাই
ajit pai|অজিত পাই

The task was to grep these lines based on the pipe character: my target was to match lines that had multiple pipes, at least 2. I took a more involved approach than necessary so that I could also learn how to match Bengali characters with grep. So, instead of just counting the pipes, I started by matching an alphanumeric field followed by a pipe, and then a Bengali field followed by a pipe.
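
For comparison, the simpler check of just counting the pipes could be sketched in Python as follows. The function name `has_at_least_two_pipes` and the `sample_lines` list are made up for this illustration; the lines are taken from the sample set above:

```python
def has_at_least_two_pipes(line):
    """Return True when the line contains two or more pipe characters."""
    return line.count("|") >= 2


sample_lines = [
    "senegalese footballer|সেনেগালীয় ফুটবলার",
    "Ajit Manohar Pai|অজিত পৈ|অজিত মনোহর পাই|অজিত ভরদ্বাজ পাই",
    "ajit pai|অজিত পাই",
]

# Keep only the lines with at least two pipes.
matches = [line for line in sample_lines if has_at_least_two_pipes(line)]
print(len(matches))  # 1
```

This ignores what sits between the pipes, which is exactly the difference from the field-by-field regex approach.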

As you may know, many regex engines can match Unicode scripts: for example, to match Greek characters you can write \p{Greek}. But for some reason, this wasn’t matching the Bengali text in the following grep:

grep -Ei "\p{Bengali}" test.txt

I then looked at the grep manual and found a key piece of information. By default, grep uses POSIX basic regular expressions, and -E only switches to POSIX extended regular expressions. Unfortunately, neither supports \p{...} Unicode property escapes, which are a Perl/PCRE feature. With POSIX syntax you would have to spell out the range of characters yourself, which can get pretty difficult for non-ASCII characters. To make it simpler, you can use the -P option, which makes GNU grep interpret the pattern as a Perl-compatible regular expression (PCRE). To use that, you may do the following:

grep -Pi "\p{Bengali}" test.txt
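
The same limitation shows up in Python: the standard `re` module does not support \p{Bengali} either. As a rough sketch of the explicit-range alternative, the snippet below assumes the Bengali Unicode block (U+0980 to U+09FF) is a good enough stand-in for the script; the name `contains_bengali` is made up for this illustration:

```python
import re

# Assumption: the Bengali Unicode block, U+0980 to U+09FF, stands in for \p{Bengali}.
BENGALI_CHAR = re.compile(r"[\u0980-\u09FF]")


def contains_bengali(line):
    """Return True when the line has at least one character from the Bengali block."""
    return BENGALI_CHAR.search(line) is not None


print(contains_bengali("ajit pai|অজিত পাই"))  # True
print(contains_bengali("ajit pai"))           # False
```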

For the list of Unicode scripts available in a PCRE-compatible regex engine, you may check the following:

Regex Unicode Scripts

Now, let’s come back to the original task of matching lines with at least 2 pipes. The first field is the simpler one: basic alphanumeric characters and whitespace, followed by a pipe:

grep -Pi "^[A-Za-z0-9\s]+\|" test.txt

Then, we add the first Bengali field: Bengali characters and whitespace, followed by a pipe:

grep -Pi "^[A-Za-z0-9\s]+\|[\p{Bengali}\s]+\|" test.txt

This suffices for our purpose here: it matches lines where the first field is alphanumeric and ends with a pipe, and the second field is in the Bengali script and ends with a pipe, so the line has at least 2 pipes.
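
To check the pattern against the sample set without grep, here is a minimal Python sketch. It is not a drop-in equivalent: since Python's standard `re` module has no \p{Bengali}, it assumes the Bengali Unicode block (U+0980 to U+09FF) as a substitute, and the names `TWO_FIELD_RE` and `lines` are made up for this illustration:

```python
import re

# First field: alphanumerics and whitespace, then a pipe.
# Second field: Bengali-block characters and whitespace, then another pipe.
# Assumption: the range U+0980 to U+09FF replaces \p{Bengali} from the grep version.
TWO_FIELD_RE = re.compile(r"^[A-Za-z0-9\s]+\|[\u0980-\u09FF\s]+\|")

lines = [
    "senegalese footballer|সেনেগালীয় ফুটবলার",
    "species of insect|কীটপতঙ্গের প্রজাতি",
    "indian cricket player|ভারতীয় ক্রিকেটার",
    "Ajit Manohar Pai|অজিত পৈ|অজিত মনোহর পাই|অজিত ভরদ্বাজ পাই",
    "ajit pai|অজিত পাই",
]

# Only lines with an alphanumeric field, a Bengali field, and a second pipe survive.
matched = [line for line in lines if TWO_FIELD_RE.search(line)]
for line in matched:
    print(line)
```

On this sample, only the "Ajit Manohar Pai" line has a second pipe after the first Bengali field, so it is the only one kept.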