Regex for selecting words with accents (diacritics)

I recently came across a curious problem.

I had a simple regex to extract singular and plural terms from a string. This string is translated into multiple languages.

The terms to be pluralized were marked with parentheses and separated by a pipe character:

(cake|cakes)

The regex was the following:

/\((\w*)\|(\w*)\)/

This way I can capture the terms that need pluralization from the string and display the correct word.

For now, let us assume the number is always greater than one and that we are only dealing with languages with one plural form. If you require more plural forms, you could add them between the parentheses, e.g. (ciasteczko|ciasteczka|ciasteczek).

This is a generic implementation of the concept in JavaScript:

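const sentenceTemplate = "I am going to eat %quantity% (cake|cakes)";
const quantity = 10;

// Fill in the number, then keep the singular or the plural capture group.
let finalSentence = sentenceTemplate.replace( '%quantity%', String( quantity ) );
finalSentence = finalSentence.replace( /\((\w*)\|(\w*)\)/, ( match, singular, plural ) => {
    return ( quantity > 1 ) ? plural : singular;
} );

console.log( finalSentence ); // "I am going to eat 10 cakes"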

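The problem arose when the translated sentence contained accented characters. For instance, in French: Je vais manger %quantity% (gâteau|gâteaux).

In this case, the \w character class will not match the “â” in “gâteau”. In JavaScript, \w only covers the characters A-Z, a-z, 0-9 and the underscore, so the whole pattern fails to match and the sentence is left untouched.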
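A quick way to confirm the behaviour is to run the original pattern against both sentences. This is only a small check, and the variable names are made up for the illustration:

```javascript
// \w is shorthand for [A-Za-z0-9_], so accented letters fall outside it.
const pluralPattern = /\((\w*)\|(\w*)\)/;

const englishMatch = "(cake|cakes)".match(pluralPattern);
const frenchMatch = "(gâteau|gâteaux)".match(pluralPattern);

console.log(englishMatch !== null); // true
console.log(frenchMatch);           // null, the pattern stops at "â"
console.log(/\w/.test("â"));        // false
```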

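Luckily, regex in JavaScript allows us to define ranges of characters. Ranges follow character code order, so by consulting a character code table (the accented letters sit just past the ASCII range, in the Latin-1 block) we can come up with the following line: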

/\(([A-ú]*)\|([A-ú]*)\)/

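Be aware that the range [A-ú] runs from character code 65 to 250, so it also includes many special characters such as square brackets, braces and the pipe character, and it leaves out the digits that \w accepted. Still, it is a much simpler way to capture most diacritics in many languages that use the Latin alphabet.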
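To get a feel for what the [A-ú] range lets through, here is a small sketch. The variable names and the (a|b|c) sample are only illustrations, not part of the original code:

```javascript
// Test single characters against the wide range.
const wideRange = /^[A-ú]$/;

console.log(wideRange.test("â")); // true
console.log(wideRange.test("|")); // true, the pipe sits between "z" and "À"
console.log(wideRange.test("[")); // true, so do the square brackets
console.log(wideRange.test("5")); // false, digits come before "A"
console.log(wideRange.test("ü")); // false, "ü" comes after "ú"

// The full pattern still splits a two-form term correctly thanks to backtracking.
const widePattern = /\(([A-ú]*)\|([A-ú]*)\)/;
const twoForms = "(gâteau|gâteaux)".match(widePattern);
console.log(twoForms[1], twoForms[2]); // gâteau gâteaux

// With three forms the first group swallows a pipe, because "|" is inside the range.
const threeForms = "(a|b|c)".match(widePattern);
console.log(threeForms[1], threeForms[2]); // a|b c
```

The last case is the practical cost of the wide range: it works for two forms, but anything that relies on the pipe being excluded from a group will not behave as expected.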

You could be more strict and use multiple ranges instead:

[A-Za-zÀ-ú]
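As a rough sketch, the stricter range can be dropped into the same replace logic. The pluralize helper and the German sentence below are hypothetical examples. The second pattern swaps in a different technique, Unicode property escapes (\p{L} for letters and \p{M} for combining marks, available with the u flag since ES2018), for cases where a hand-picked range is too narrow:

```javascript
// Stricter range from the article: ASCII letters plus À to ú.
const strictPattern = /\(([A-Za-zÀ-ú]*)\|([A-Za-zÀ-ú]*)\)/;

// Alternative technique (not from the article): any Unicode letter or combining mark.
const unicodePattern = /\(([\p{L}\p{M}]*)\|([\p{L}\p{M}]*)\)/u;

// Hypothetical helper wrapping the replace logic used earlier.
function pluralize(template, quantity, pattern) {
  return template
    .replace("%quantity%", String(quantity))
    .replace(pattern, (match, singular, plural) => (quantity > 1 ? plural : singular));
}

const french = "Je vais manger %quantity% (gâteau|gâteaux)";
console.log(pluralize(french, 10, strictPattern));  // "Je vais manger 10 gâteaux"

// "ü" comes after "ú", so the strict range misses it and the term is left as-is.
const german = "Ich lese %quantity% (Buch|Bücher)";
console.log(pluralize(german, 3, strictPattern));   // "Ich lese 3 (Buch|Bücher)"
console.log(pluralize(german, 3, unicodePattern));  // "Ich lese 3 Bücher"

// The strict range is not letters-only either: it still contains "×" and "÷".
console.log(/^[A-Za-zÀ-ú]$/.test("×")); // true
```

Which option fits depends on the languages you translate into: the explicit range is easy to read and audit, while the property escapes cover letters from any script at the cost of requiring the u flag.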

Check the ASCII code reference to create the range that best suits your particular needs.

References

ASCII Code Reference
