String Matching in OCR for Biometrics
Biometrics in the Identity space has become more and more prevalent. Face Match, liveness, and OCR (optical character recognition) are all really cool technologies. However, they do have their limitations.
OCR Challenges in Biometric Onboarding
The Issue
Recently, one of my customers who had implemented Biometrics into their onboarding flow encountered a significant hurdle: a high volume of OCR errors.
Users would submit photos of their documents, only to encounter errors due to poor lighting, glare (especially on Australian documents with holograms), shaky hands, and blurry images. These issues often led to incorrect OCR readings, such as mistaking an “F” for an “E”, or reading “cl” as a “d”, in names on a Driver’s Licence or Passport.
This situation led to a TONNE of manual review workload for their team. Imagine 40–60% of all onboarding attempts getting flagged for issues; that’s going to cost you a lot, in both time and money.
While humans can usually spot and correct these discrepancies easily, I’m just not confident enough in machines doing this yet (although I am super keen to see what AI does in this space).
General Improvements
Now, the general process for most OCR journeys is to try to fix user error at the source.
The two main ideas here are:
- Enhancing User Experience (UX): Directing users on how to take clearer photos.
- Detail Verification: Presenting the extracted information back to the user for verification or correction, similar to how we double-check our email addresses when signing up for services. This step introduces minimal friction into the process.
The question of fraud tolerance arises: is changing a letter or two acceptable? While one or two letter changes might be fine, the line becomes blurrier with three or four modifications.
How do we create a highly efficient onboarding process whilst still keeping a pretty good grip on the reins for what is clearly just an OCR typo?
This is where string matching/similarity algorithms come into play.
Introducing String Similarity Algorithms
Some Background
String matching algorithms are far from the mundane exercises you might have been faced with in an Intro to Algorithms class — and yes, I used to think they were boring too.
Surprise, surprise, they’re pretty darn common. From search engines to protein and DNA sequence matching, even the spell check I used to write this blog used a form of string similarity/matching (thak god 4 spell chek).
They range from simple exact matches to complex analyses based on ‘edit distance’ — a way to measure how two strings diverge by counting the steps needed to transform one into the other.
Comparing Common Algorithms - Apples to Apples? 🍏🍎
Look, there are so many different types of these algorithms that I’m just going to name a few common ones. Let’s do a quick comparison to explain how some of them work.
Levenshtein Distance
It measures the minimum number of edits needed to change one word into another: insertions, deletions, and substitutions of letters.
If you’re comparing “Apple” to “Appel”, Levenshtein will of course note that it’s a small change. Essentially, it sees “Appel” as just two edits away from “Apple”, because swapping the ‘l’ and ‘e’ has to be done as two separate operations.
ie
Step 1: APPEL → APPLEL (insertion of 'L')
Step 2: APPLEL → APPLE (deletion of the trailing 'L')
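To make this concrete, here’s what a minimal, unweighted Levenshtein implementation might look like in Python. This is just a sketch of the classic dynamic programming approach (the function name is mine, not from any particular library):

def levenshtein(s1, s2):
    rows, cols = len(s1) + 1, len(s2) + 1
    # dist[i][j] = edits needed to turn the first i chars of s1 into the first j chars of s2
    dist = [[0 for _ in range(cols)] for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete everything
    for j in range(cols):
        dist[0][j] = j  # insert everything
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution
    return dist[-1][-1]

print(levenshtein("APPEL", "APPLE"))  # 2

Keep this function in mind, because the weighted version later in this post is just a small tweak on top of it.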
Damerau-Levenshtein Distance:
Damerau-Levenshtein builds on Levenshtein by adding the ability to swap adjacent letters, covering common typos like “ei” for “ie”. This makes it super useful for detecting typos and misspellings.
In contrast to the above, for “Apple” and “Appel”, Damerau-Levenshtein recognises that the letters ‘l’ and ‘e’ are simply swapped. This algorithm is particularly handy here because it specifically allows for swapping adjacent letters, counting it as just one simple edit rather than two. So, “Appel” is a single swap away from being “Apple”.
ie
Step 1: APPEL → APPLE (swap 'E' and 'L')
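If you’d rather not hand-roll these, libraries cover both variants. A quick sketch using the jellyfish library (assuming a reasonably recent version, where these are the function names):

import jellyfish  # pip install jellyfish

# Plain Levenshtein counts the adjacent swap as two edits...
print(jellyfish.levenshtein_distance("APPLE", "APPEL"))          # 2
# ...while Damerau-Levenshtein counts it as just one
print(jellyfish.damerau_levenshtein_distance("APPLE", "APPEL"))  # 1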
Jaro-Winkler Distance:
This one is super useful for matching strings that are highly similar at the start. It’s pretty common for people to get the start of a string right but fumble the middle sections.
Comparing “Apple” to “Appel” works a bit differently here, as Jaro-Winkler adds more weight to the beginnings of the strings. Since both words start with “App”, this is obvious for us to see. Whereas “Paple” to “Apple” is considered less similar, even though it’s the same number of edits; under Levenshtein, they would have the same score.
APPEL (Input String)
|||
APPLE (Target String)
We can see the first 3 characters match
PAPLE (Input String)
|||
APPLE (Target String)
We can see the first character does not match
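Sticking with jellyfish for a quick sanity check (again, assuming recent function names), note that Jaro-Winkler returns a similarity score between 0 and 1 rather than an edit count, so higher means more alike:

import jellyfish

# Same two edits in both cases, but the shared "APP" prefix boosts the first pair
print(jellyfish.jaro_winkler_similarity("APPEL", "APPLE"))  # ~0.95
print(jellyfish.jaro_winkler_similarity("PAPLE", "APPLE"))  # ~0.93
# Plain Levenshtein can't tell them apart
print(jellyfish.levenshtein_distance("APPEL", "APPLE"))     # 2
print(jellyfish.levenshtein_distance("PAPLE", "APPLE"))     # 2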
Addressing OCR Limitations
OCR technology, despite its advancements, can struggle with accuracy due to various factors. These include, but are not limited to, image quality, the physical condition of the document itself, glare, and poor lighting. Misidentifications, such as confusing ‘0’ with ‘O’ or even a “d” with a “cl”, pose significant data-quality challenges for verification.
It’s not hard to see why OCR engines might have trouble perfectly extracting documents when the images look like this:
Now of course you can always ask the user to take a better photo, which in some cases is the best course of action. However, insisting on a perfect photo — requiring users to take 10, 20, or even 100 attempts — is unrealistic.
That would be a frustratingly horrendous user experience in a general onboarding process.
SO, I guess we do have to build in some flexibility and tolerance here to accommodate users with shaky hands or terrible photography skills.
(🙋‍♂️me. I am the user that we speak of)
String Matching to Mitigate OCR Issues
Now, I mentioned that a pretty common process is to OCR whatever you’ve captured from the document and re-represent it back to the user so they can fix up any issues.
ie something like this:
This general process assumes you’ll either allow or disallow these changes. For something as simple as the above, it seems acceptable, no? As humans, we can generally see that it’s visually close.
And for some people, these types of automatic approvals are fine. On the other hand, if you’re stricter, you would probably flag this attempt for review, and then someone would have to go check it out.
However, would you approve something like this?
It’s rarely a simple yes or no decision; more often, it’s a “depends” situation.
Interestingly, even with these visual disparities, something like the Levenshtein algorithm would assign both instances a score of 3, highlighting the nuances in OCR corrections.
If only there were a way to provide further flexibility and tolerance whilst still maintaining a strong hand on the reins. (Guess what? There is!)
Weighted String Matching: Practical Applications
As we can see — not all errors/changes are created equally. Weighted string matching allows us to assign different ‘costs’ to errors based on their impact or likelihood. This approach is especially useful in OCR contexts, where visually similar characters (‘E’ and ‘F’, for example) are less severely penalised.
For the above examples for Mr Walter White, here’s how it might look in a Python script.
Basic Example of a Weighted Levenshtein Algorithm
def weighted_levenshtein(s1, s2, weight_matrix):
    rows = len(s1) + 1
    cols = len(s2) + 1
    dist = [[0 for _ in range(cols)] for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            if s1[i - 1] == s2[j - 1]:
                cost = 0
            else:
                # Look up the substitution cost for this pair; default to 1 if it's not in the matrix
                cost = weight_matrix.get((s1[i - 1], s2[j - 1]), 1)
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution
    return dist[-1][-1]
# Super simple weight matrix example
# These weights are arbitrary, purely to illustrate the point
weight_matrix = {
    ('V', 'W'): 0.25, ('W', 'V'): 0.25,
    ('v', 'w'): 0.25, ('w', 'v'): 0.25,
    ('E', 'C'): 0.25, ('C', 'E'): 0.25,
    ('e', 'c'): 0.25, ('c', 'e'): 0.25,
    # Add other weights for other visually similar characters
}
# Sample strings
s1 = "VVhitc"
s2 = "Chase"
target = "White"
# Calculate weighted Levenshtein distances
distance_s1 = weighted_levenshtein(s1, target, weight_matrix)
distance_s2 = weighted_levenshtein(s2, target, weight_matrix)
print(f"Distance between '{s1}' and '{target}' is {distance_s1}")
print(f"Distance between '{s2}' and '{target}' is {distance_s2}")
When running this we get the following output:
Distance between 'VVhitc' and 'White' is 1.5
Distance between 'Chase' and 'White' is 3
This honestly makes for a much better OCR extraction comparison, providing a more accurate and nuanced representation of these changes.
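And to tie it back to the manual-review problem from the start: once you have a weighted distance, you can sit a simple triage rule on top of it. The thresholds below are completely made up for illustration; in practice you’d tune them against your own fraud tolerance:

def review_decision(ocr_value, user_value, weight_matrix,
                    approve_below=2.0, review_below=3.0):
    # Hypothetical triage rule: auto-approve small, visually plausible
    # corrections; send mid-range ones to manual review; reject the rest
    distance = weighted_levenshtein(ocr_value, user_value, weight_matrix)
    if distance < approve_below:
        return "auto-approve"
    if distance < review_below:
        return "manual-review"
    return "reject"

print(review_decision("VVhitc", "White", weight_matrix))  # auto-approve (distance 1.5)
print(review_decision("Chase", "White", weight_matrix))   # reject (distance 3)

Only the flagged middle band ever reaches a human, which is exactly where you want your review team spending their time.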
Wrapping up
Look, I’m not trying to sell you a miracle solution here. And to be honest, this approach isn’t the right fit for every scenario out there — in most cases this doesn’t really fix the root cause of the problem.
They might sound a bit nerdy or niche, but these algos definitely have their place. What I’ve shared with you is essentially a straightforward and effective approach to a particular problem we faced, demonstrating that often a simple solution is all that’s required.
Hopefully, you’ve picked up something new about these algorithms along the way. They’re a great demonstration of the idea that sometimes the simplest tools can bring about significant improvements — provided you’re not trying to use a hammer when what you really need is a scalpel 🔨
It’s all about finding those clever workarounds that make life just a little bit easier 🤷‍♂️