To learn more about the newsletter, check our detailed About Page + FAQs
To help me understand you better, please fill out this anonymous, 2-min survey. If you liked this post, make sure you hit the heart icon in this email.
Recommend this publication to Substack over here
Take the next step by subscribing here
We’re almost at another weekend aren’t we,
That one thought makes me happier than anything else.
Also, to those of you that reached out to me about the payment problems you had trying to subscribe- I have reached out to Substack and Stripe. I will resolve this issue ASAP. There’s a lot going on right now, so just give me some time.
Problem
Given a string and a pattern, find the starting indices of all occurrences of the pattern in the string. For example, given the string "abracadabra" and the pattern "abr", you should return [0, 7]
.
Step 1: Simple Solution
This is a relatively straightforward problem to understand. Therefore, we can generally go straight into the brute force solution. This just saves some time. I recommend this for problems that are very clear with what they’re asking.
Here, the obvious approach is to just go over the string and check for the pattern in every slice of the same size. When we find a match, we append the starting index to a list of matches.
In the example above, we would compare "abr" against "abr", "bra", rac", "aca", and so on. If k
is the length of the pattern, we are making up to k
comparisons for each slice, so our time complexity would O(k * N).
Once you have the brute force solution sketched out in your mind, I suggest moving on to the optimization step. Don’t waste time coding up anything, until that is the solution you are sure is the best. Just explain your algorithm to the interviewer and move on.
There is one exception to this, however- if you can’t figure out the optimal solution. In that case, it makes sense to code the brute-force solution. Remember, a bad solution is better than no solution. And this way, you at least get to show off your coding skills. Too many people make the mistake of completely shutting down and giving up in their interviews once they hit a roadblock. Don’t make this mistake. Handling adversity is an important skill in itself.
Lastly, if you’re someone who struggles with coding- code this solution up while you’re practicing. It will help reinforce muscle memory and will help you get better at coding quickly.
With that out of the way, let’s move on to the next optimizing step.
So… how can we do better?
Step 2: Looking at extra steps
Here is one of my favorite ways of identifying what can be done to optimize your solution- look at your solution. All you do is look at the solution you’ve created, and identify what aspect is slowing it down. This sounds very simple, but it helps with something very important: it gives you a clear plan to attack next. Don’t underestimate the power of clarity.
In our current solution, the slowdown comes from the fact that we make a lot of comparisons, even when the slice is nothing like the pattern. To avoid such an issue, it would be nice if we could have a way to check if two strings were similar, and compare them only if they were. So this gives us the plan- write a way to judge if strings are similar and compare them if they are.
So what does it mean for strings to be similar to each other? Let’s figure that out next.
Step 3: Defining Similarity
There are multiple ways to get here. We’re going to go with a very simple definition- two strings are similar if they have the same characters in the same locations.
So is there a way of encoding the strings to match this? Funny that you ask. If only there was a way to convert strings to numbers based on some properties/rules.
Remember, we are sliding across a string. Thus we don’t need to rehash from scratch. We just need to drop the leftmost character value from the hash and add the value created by the rightmost value. This idea is called the rolling hash, which is also covered in the above post. Attaching it here for those of you too lazy to read that post. You’re welcome. Yes, I love you too.
Rolling Hash Explained
To explain this idea, suppose we wanted to match the pattern "cat" against the string "scatter". First, let's assign each letter in the alphabet a value: a = 1, b = 2, ..., z = 26
. Now let's make a very simple hash function that adds up the values of each letter in the hash. For example, simple_hash("cat")
would be 3 + 1 + 20 = 24
.
To find our pattern matches, we start by comparing "cat" to "sca". Since simple_hash("sca")
= 19 + 3 + 1 = 23
, we know we have not found a match, and we can continue. This is where the rolling part comes in. Instead of computing the hash value for the next window from scratch, there is a constant-time operation we can do: subtract the value of "s" and add the value of "t". And in fact, we find that 23 - value(s) + value(t) = 24
, so we have a match!
Continuing in this way, we slide along the string, computing the hash of each slice and comparing it to the pattern hash. If the two values are equal, we check character by character to ensure that there is indeed a match and add the appropriate index to our result.
The problem with our current hash function is that it will generate too many false positives. Both “cat” and “tca” will have the same hash values. All anagrams in fact will have identical hash values, making it an inefficient proposition. So how do we better?
Step 4: Enter the Rabin-Karp Algorithm
Rabin-Karp is a rolling hash function that is used for string searching.
Suppose we have a string of length k
. Each letter in the string is first mapped to 1 - 26
as above, then multiplied by a power of 26
. The first character is multiplied by 26^(k-1)
, the second by 26^(k-2)
, and so on, all the way to the k
th letter, which is multiplied by 26^0
. Finally, we add all these values together to get the hash of the string.
So for example, "cat" will evaluate to: (3 * 26^2)+ (1 * 26^1)+ (21 * 26^0)= 2075
.
Now suppose we are sliding a rolling window over the word "cats", and we next want to find the hash value of "ats". Instead of computing the value from scratch, we carry out the following steps:
subtract the value of the first letter of the current hash (
3 * 26^2
)multiply the current hash value by
26
add the value of the last letter in the shifted window (
19
)
Using these steps, the value of "ats" will be (2075 - 3 * 26^2) * 26 + 19 = 1241
.
This works because we are essentially shifting the powers leftward. It may be easier to see algebraically that these two equations are equal:
((3 * 262+ 1 * 261+ 21 * 260) - 3 * 262) * 26 + 19
1 * 262+ 21 * 262+ 19 * 260
The calculation is inspired by how number systems work. For example, take the number 132. It can be written as (1 *10^2) + (3* 10^1) + (2* 10^0). We are doing the same thing. Except we are using a base 26 and using letters to represent quantity instead of numbers.
With that done, let’s start coding things up.
Step 5: Coding
Our solution now looks like this:
def value(letter, power):
return (26 ** power) * (ord(letter) - 96) #we use 96 because the #ASCII value of 'a' is 97
def rabin_hash(prev, start, new, k):
return (prev - value(start, k - 1)) * 26 + value(new, 0)
def find_matches(pattern, string):
matches = []
k = len(pattern)
pattern_val = 0
for i, char in enumerate(pattern):
pattern_val += value(char, k - i - 1)
hash_val = 0
for i, char in enumerate(string[:k]):
hash_val += value(char, k - i - 1)
if pattern_val == hash_val:
if string[:k] == pattern:
matches.append(0)
for i in range(len(string) - k):
hash_val = rabin_hash(hash_val, string[i], string[i + k], k)
if pattern_val == hash_val:
if string[i + 1 : i + k + 1] == pattern:
matches.append(i + 1)
return matches
In the worst case, if our hash function produces many false positives, we will still have to check each substring against the pattern, so this would take O(k * N)
.
Practically, however, we should not see too many false positives. In the average case, since our algorithm takes O(k)
to compute the hash of the pattern and the start of the string, and O(N)
to roll the hash over the rest of the string, our running time should be O(k + N).
Time Complexity- O(k+N).
Space Complexity- O(max(M,k)), where M is the length of the return.
Loved the solution? Hate it? Want to talk to me about your crypto portfolio? Get my predictions on the UCL or MMA PPVs? Want an in-depth analysis of the best chocolate milk brands? Reach out to me by replying to this email, in the comments, or using the links below.
Stay Woke. I’ll see you living the dream.
Go kill all,
Devansh <3
Reach out to me on:
Instagram: https://www.instagram.com/iseethings404/
Message me on Twitter: https://twitter.com/Machine01776819
My LinkedIn: https://www.linkedin.com/in/devansh-devansh-516004168/
My content:
Read my articles: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Get a free stock on Robinhood. No risk to you, so not using the link is losing free money: https://join.robinhood.com/fnud75