Using the Rabin Karp Algorithm to speed up your code[Technique Tuesdays]

Despite all the hype around Text-Based AI, this technique is still king in many cases.

Apr 19, 2023

Hey, it’s your favorite cult leader here 🐱‍👤

On Tuesdays, I will cover problem-solving techniques that show up in software engineering, computer science, and Leetcode Style Questions📚📚📝📝.

To get access to all my articles and support my crippling chocolate milk addiction, consider subscribing if you haven’t already!

Help me buy chocolate milk

p.s. you can learn more about the paid plan here.

Text Processing is a challenge that has a lot of potential for future growth. ChatGPT, PaLM, and Llama are all AI Models that gained a lot of attention. They are also primarily text-based.

However, these AI Models are very expensive and have a whole host of problems that you don’t want to deal with (more on that here). So, in practice, it makes sense to implement other algorithms when possible to reduce overhead and improve the safety of our systems. Today’s topic will be about one such algorithm, one that has helped me a lot. Today, we will talk about the Rabin Karp algorithm, a string-searching algorithm that uses hashing to find patterns in texts.

colored pencil lined up on top of white surface — Photo by Jess Bailey on Unsplash

Highlights

What is the Rabin Karp algorithm?- The Rabin Karp algorithm is a string-searching algorithm created by Richard M. Karp and Michael O. Rabin in 1987 that uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern and then checks for a match at the remaining positions. Generalizations of the same idea can be used to find more than one match of a single pattern, or to find matches for more than one pattern.
Why is it useful? The Rabin Karp algorithm is useful for applications that require searching for multiple patterns in large texts, such as detecting plagiarism, finding keywords in documents, or searching for DNA sequences. The algorithm can perform faster than naive string-matching algorithms, especially when the pattern is long or there are many patterns to search for. The algorithm can also handle different alphabets and character sets easily by using appropriate hash functions.
How does it work? - The Rabin Karp algorithm works by computing, at each position of the text, the hash value of a string starting at that position with the same length as the pattern. If this hash value equals the hash value of the pattern, it performs a full comparison at that position. The key to the algorithm's performance is the efficient computation of hash values of the successive substrings of the text. The algorithm uses a rolling hash function that can update the hash value from one substring to the next by adding and subtracting some terms, without having to recompute the whole hash value from scratch.
How do we implement the Rabin-Karp Algorithm?- To implement the Rabin Karp algorithm, we need to choose a hash function that suits our problem. A common choice is to use a polynomial hash function that assigns a numerical value to each character in the alphabet and computes the hash value as a weighted sum of these values modulo some prime number. For example, if we use the first ten letters of the alphabet (A-J) and assign them values from 1 to 10, we can compute the hash value of a string as follows:
```
hash("ABC") = (1 * 10^2 + 2 * 10^1 + 3 * 10^0) mod 13 = 6
```
To update the hash value from one substring to the next, we can use this formula:
```
hash("BCD") = (10 * (hash("ABC") - 1 * 10^2) + 4 * 10^0) mod 13 = 12
```
We also need to store the hash value of the pattern and compare it with each substring's hash value. If they match, we need to verify that the characters also match. Here is a pseudocode implementation of the algorithm:

n = length of text
m = length of pattern
d = number of characters in alphabet
q = prime number
h = d^(m-1) mod q
p = 0 // hash value for pattern
t = 0 // hash value for text window
for i = 1 to m
    p = (d * p + pattern[i]) mod q
    t = (d * t + text[i]) mod q
for s = 0 to n - m
    if p == t // hash values match
        if pattern[1..m] == text[s+1..s+m] // characters match
            print "pattern found at position" s
    if s < n-m // update hash value for next window
        t = (d * (t - text[s+1] * h) + text[s+m+1]) mod q

That's all for today's newsletter. We hope you enjoyed learning about the Rabin Karp algorithm and how to implement it. Stay tuned for more algorithm spotlights in the future!

That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.

Upgrade your tech career with a premium subscription ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and tech industry with expert insights, tips, and resources. 20% off for new subscribers by clicking this link. Subscribe now and simplify your tech journey!

Using this discount will drop the prices-

800 INR (10 USD) → 640 INR (8 USD) per Month

8000 INR (100 USD) → 6400INR (80 USD) per year

Get 20% off for 1 year