Grep Non-ASCII Character Sets

I had an interesting challenge today: filtering a list like the following with grep:

senegalese footballer|সেনেগালীয় ফুটবলার
species of insect|কীটপতঙ্গের প্রজাতি
indian cricket player|ভারতীয় ক্রিকেটার
Ajit Manohar Pai|অজিত পৈ|অজিত মনোহর পাই|অজিত ভরদ্বাজ পাই
ajit pai|অজিত পাই

Each line is a set of labels separated by pipes, and the filtering has to be based on those pipes. My target was to match lines that have multiple pipes, at least two. Simply counting the pipes would have been enough, but I took a more roundabout approach so I could learn how to match Bengali characters with grep: first match an alphanumeric field followed by a pipe, then a Bengali field followed by a pipe.
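For comparison, the plain pipe-counting approach can be sketched in a single pattern. This is a minimal sketch, not what the rest of the post uses; the temporary file and the variable names are only illustrative. In grep's default (basic) syntax an unescaped | is a literal pipe, so '|.*|' means "a pipe, anything, then another pipe".

```shell
# Write the sample data from the post to a temporary file (illustrative path).
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
senegalese footballer|সেনেগালীয় ফুটবলার
species of insect|কীটপতঙ্গের প্রজাতি
indian cricket player|ভারতীয় ক্রিকেটার
Ajit Manohar Pai|অজিত পৈ|অজিত মনোহর পাই|অজিত ভরদ্বাজ পাই
ajit pai|অজিত পাই
EOF

# Keep only lines that contain at least two pipes.
# No -E here: in basic regex syntax a bare | is an ordinary character.
two_pipe_lines=$(grep '|.*|' "$sample_file")
printf '%s\n' "$two_pipe_lines"

rm -f "$sample_file"
```

This does not care what sits between the pipes, which is exactly why it teaches nothing about matching Bengali text.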

As you may know, many regex engines can match Unicode scripts by name. To match Greek characters, for example, you can write \p{Greek}. But for some reason, this wasn’t matching the Bengali in the following grep:

grep -Ei "\p{Bengali}" test.txt

I then looked at the grep manual and found a key piece of information. By default grep uses POSIX basic regular expressions, and -E only switches to POSIX extended regular expressions. Neither of them supports the Perl-style \p{...} Unicode properties used here. With POSIX syntax you would have to spell out the characters or their code point ranges yourself, which gets awkward for non-ASCII scripts. To make it simpler, you can use the -P option, which tells grep (GNU grep, when built with PCRE support) to interpret the pattern as a Perl-compatible regular expression (PCRE). To use that, you may do the following:

grep -Pi "\p{Bengali}" test.txt
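If you would rather spell out the boundaries than rely on the script name, the Unicode Bengali block runs from U+0980 to U+09FF and can be written as an explicit range. The sketch below assumes GNU grep with -P support and that a C.UTF-8 locale is available on the system; the file and variable names are illustrative. Note that a block range and the \p{Bengali} script property are not identical in general, they just agree on this sample.

```shell
# Two-line sample: one ASCII-only line, one Bengali line (illustrative file).
range_file=$(mktemp)
printf '%s\n' 'ajit pai' 'অজিত পাই' > "$range_file"

# Assumption: the C.UTF-8 locale exists; -P needs a UTF-8 locale to treat
# the input as Unicode characters rather than raw bytes.
range_matches=$(LC_ALL=C.UTF-8 grep -P '[\x{0980}-\x{09FF}]' "$range_file")
script_matches=$(LC_ALL=C.UTF-8 grep -P '\p{Bengali}' "$range_file")

printf 'range:  %s\n' "$range_matches"
printf 'script: %s\n' "$script_matches"

rm -f "$range_file"
```

The named script is easier to read and harder to get wrong, which is the main reason to prefer it over hand-written ranges.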

To see all the Unicode scripts that a PCRE-style regex engine can match by name, you may check the following:

Regex Unicode Scripts

Now, let’s come back to the original task: we have to match at least 2 pipes. The first field is the simpler one, basic alphanumerics and whitespace followed by a pipe:

grep -Pi "^[A-Za-z0-9\s]+\|" test.txt

Then, we need to add the first Bengali field, again with whitespace and a pipe:

grep -Pi "^[A-Za-z0-9\s]+\|[\p{Bengali}\s]+\|" test.txt

This serves our purpose here: the first field must be alphanumeric and end with a pipe, and the second must be from the Bengali Unicode set and also end with a pipe, so any matching line has at least 2 pipes.
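To see the two stages side by side, here is a small sketch that runs both patterns against the sample list and counts the matching lines. It assumes GNU grep with -P support and an available C.UTF-8 locale; the temporary file and variable names are illustrative.

```shell
# Write the sample data from the post to a temporary file (illustrative path).
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
senegalese footballer|সেনেগালীয় ফুটবলার
species of insect|কীটপতঙ্গের প্রজাতি
indian cricket player|ভারতীয় ক্রিকেটার
Ajit Manohar Pai|অজিত পৈ|অজিত মনোহর পাই|অজিত ভরদ্বাজ পাই
ajit pai|অজিত পাই
EOF

# Stage one: an alphanumeric field followed by a pipe.
# Assumption: C.UTF-8 exists so that -P works on Unicode characters.
stage_one_count=$(LC_ALL=C.UTF-8 grep -Pic '^[A-Za-z0-9\s]+\|' "$sample_file")

# Stage two: additionally require a Bengali field followed by a second pipe.
stage_two_count=$(LC_ALL=C.UTF-8 grep -Pic '^[A-Za-z0-9\s]+\|[\p{Bengali}\s]+\|' "$sample_file")

printf 'stage one matches: %s\n' "$stage_one_count"
printf 'stage two matches: %s\n' "$stage_two_count"

rm -f "$sample_file"
```

On this sample, stage one should match every line, since each starts with an alphanumeric field and a pipe, while stage two should keep only the "Ajit Manohar Pai" line, the one with a second pipe after its first Bengali field.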