A python regex to validate roman numerals

Note: This is my first post, I hope you'll like it :)

I'm not gonna lie to you... I LOVE REGEX!

As a kid, I grew up playing adventure games full of puzzles and riddles. Looking for the solution was a personal quest, a treasure hunt. Finding it was so exciting, but not as much as jumping on another riddle!

When I discovered regular expressions on my journey to become a (good) python programmer, I felt that same excitement. I was blown away by the countless possibilities these were offering. Deciphering one was like suddenly being able to read hieroglyph, writing one was like discovering I could speak a foreign language. Although I know they should be used with caution and in special cases only, I keep pushing myself to use them everywhere I can.

That's why, when Codewars challenged me to write a function to convert roman numerals from/to Arabic numbers, I could not resist writing a regex to help me solve that problem.

Alt Text

Enough chit-chat, let's get our hands dirty.

My first task was to validate if the user input was a valid roman numeral. To sum up, roman numerals consists of the following symbols:Source: Wikipedia

It seems that the thousands unit [M] does not extend past [MMM], which means that the biggest roman number would be [MMMCMXCIX], or 3999. I'm not sure if numbers could go higher than that and why the limit, anyway for the sake of this problem I limited myself with numbers between 1 and 3999.

Now the trick is that the symbols placement is very important. If you don't put them in the right order, the resulting number would be invalid and unreadable. As listed in the table up there, numerals should start with thousands [M] 1000, followed by hundreds [D/C] 500/100, then dozens [L/X] 50/10, and finally units [V/1] 5/1.

BUT, that's not it! Numerals can only repeat 3 times like [CCC] 300 before switching to a combo of two numerals like [CD] 400. So you can still have a [C] 100 before an [M] 1000, like in [CM] 900 for instance.

Bit confusing, isn't it?

Alright, let's recap our conditions:

Roman numbers are ranging from [I] 1 to [MMMCMXCIX] 3999
Numerals should follow a precise order: [M] 1000 / [D] 500 / [C] 100 / [L] 50 / [X] 10 / [V] 5 / [I] 1
A numeral cannot repeat more than 3 times, it then uses a pair
The following pairs are allowed: [CM] 900 / [CD] 400 / [XC] 90 / [XL] 40 / [IX] 9 / [IV] 4

Do you start to see our REGEX showing up? :)

Let's translate this into code.

For this we are going to use a tag I find really helpful when writing regex is the verbose one (re.VERBOSE or re.X) It allows you to spread your pattern on multiple lines and be more readable. Let's try it!

import re

def is_roman_number(num):

    pattern = re.compile(r"""    
                                ^M{0,3}
                                (CM|CD|D?C{0,3})?
                                (XC|XL|L?X{0,3})?
                                (IX|IV|V?I{0,3})?$
            """, re.VERBOSE)

    if re.match(pattern, num):
        return True

    return False

Wow, that looks amazing already! Let's take a closer look at these 4 lines:

^M{0,3} = Between 0 and 3 [M] at the beginning [^] of the string
(CM|CD|D?C{0,3})? = One pair [CM] or one pair [CD] or [D], followed by up to 3 [C]. Each element is optional [?], as well as the whole block [()?]
(XC|XL|L?X{0,3})? = One pair [XC] or one pair [XL] or [L], followed by up to 3 [X]. Each element is optional [?], as well as the whole block [()?]
(IX|IV|V?I{0,3})?$ = One pair [IX] or one pair [IV] or [V], followed by up to 3 [I]. Each element is optional [?], as well as the whole block [()?], which should be at the end of the string [$]

Let's test our code

I'm using a simple fstring calling my function and comparing the string against our pattern to validate the numeral or not:


num_valid = 'MMDCCLXXIII'
num_invalid = 'CCCMMVIIVV'

print(f"{num_valid} is {'not' if not is_roman_number(num_valid) else ''}a roman number")
print(f"{num_invalid} is {'not ' if not is_roman_number(num_invalid) else ''}a roman number")

# Output:
# MMDCCLXXIII is a roman number
# CCCMMVIIVV is not a roman number

That wasn't so bad after all! Now look at this and tell me it's not the most beautiful thing you've seen in your life:

^M{0,3}(CM|CD|D?C{0,3})?(XC|XL|L?X{0,3})?(IX|IV|V?I{0,3})?$'

Alt Text

That's all folks! Let me know if you are interested by the second part of the challenge: converting roman numerals from/to Arabic numbers and I will share my solution.

Stay safe out there and read you soon :)