Problem 2: Find all occurrences of Substring Solution [Microsoft] (Example of Solutions in Paid Version)
Medium Difficulty
Following is an example of the solutions you can get by becoming a subscriber to this newsletter. To gain access to a library of high class solutions, make sure you subscribe here
Question:
This problem was asked by Microsoft.
Given a string and a pattern, find the starting indices of all occurrences of the pattern in the string. For example, given the string "abracadabra" and the pattern "abr", you should return [0, 7]
.
Step 1: Simple Solution
Such problems can be solved in steps. When there is an obvious solution, there’s no harm in going for it. We can do the optimization later, and having a brute force solution gives us something to build from.
Here, a fairly obvious approach is to just go over the string and check for the pattern in every slice of the same size. When we find a match, we append the starting index to a list of matches.
In the example above, we would compare "abr" against "abr", "bra", rac", "aca", and so on. If k
is the length of the pattern, we are making up to k
comparisons for each slice, so our time complexity is O(k * N)
The code here would look like this:
def find_matches(pattern, string):
matches = []
k = len(pattern)
for i in range(len(string) - k + 1):
if string[i : i + k] == pattern:
matches.append(i)
return matches
This code isn’t hard to come up with. And that’s kind of the point. We now have a solution that works without spending too much time. This takes some pressure off our minds (remember an inefficient working solution is much better than no solution).
Step 2: Optimization
The next thing we need to truly nail those interviews is the optimal solution. Optimizing is something that scares a lot of people, but there is a method to the madness. Often, we can optimize problems by just taking care out the redundant parts of your code.
In our current solution, the problem comes from the fact that we make a lot of comparisons, even when the slice is nothing like the pattern. To avoid such an issue, it would be nice if we could have a way to check if two strings were similar, and compare them only if they were. Luckily, there is hashing. Furthermore, since we are sliding across a string, we don’t need to rehash from scratch. We just need to drop the leftmost character value from the hash and add the value created by the rightmost value. This idea is called the rolling hash.
Rolling Hash Explained
To explain this idea, suppose we wanted to match the pattern "cat" against the string "scatter". First, let's assign each letter in the alphabet a value: a = 1, b = 2, ..., z = 26
. Now let's make a very simple hash function that adds up the values of each letter in the hash. For example, simple_hash("cat")
would be 3 + 1 + 20 = 24
.
To find our pattern matches, we start by comparing "cat" to "sca". Since simple_hash("sca")
= 19 + 3 + 1 = 23
, we know we have not found a match, and we can continue. This is where the rolling part comes in. Instead of computing the hash value for the next window from scratch, there is a constant-time operation we can do: subtract the value of "s" and add the value of "t". And in fact, we find that 23 - value(s) + value(t) = 24
, so we have a match!
Continuing in this way, we slide along the string, computing the hash of each slice and comparing it to the pattern hash. If the two values are equal, we check character by character to ensure that there is indeed a match and add the appropriate index to our result.
The problem with our current hash function is that it will generate too many false positives. Both “cat” and “tca” will have the same hash values. All anagrams in fact will have identical hash values, making it an inefficient proposition. So how do we better?
Enter the Rabin-Karp Algorithm
Rabin-Karp is a rolling hash function that is used for string searching.
Suppose we have a string of length k
. Each letter in the string is first mapped to 1 - 26
as above, then multiplied by a power of 26
. The first character is multiplied by 26^(k-1)
, the second by 26^(k-2)
, and so on, all the way to the k
th letter, which is multiplied by 26^0
. Finally, we add all these values together to get the hash of the string.
So for example, "cat" will evaluate to: (3 * 26^2)+ (1 * 26^1)+ (21 * 26^0)= 2075
.
Now suppose we are sliding a rolling window over the word "cats", and we next want to find the hash value of "ats". Instead of computing the value from scratch, we carry out the following steps:
subtract the value of the first letter of the current hash (
3 * 26^2
)multiply the current hash value by
26
add the value of the last letter in the shifted window (
19
)
Using these steps, the value of "ats" will be (2075 - 3 * 26^2) * 26 + 19 = 1241
.
This works because we are essentially shifting the powers leftward. It may be easier to see algebraically that these two equations are equal:
((3 * 262+ 1 * 261+ 21 * 260) - 3 * 262) * 26 + 19
1 * 262+ 21 * 262+ 19 * 260
The calculation is inspired by how number systems work. For example, take the number 132. It can be written as (1 *10^2) + (3* 10^1) + (2* 10^0). We are doing the same thing. Except we are using a base 26 and using letters to represent quantity instead of numbers.
Our solution now looks like this:
def value(letter, power):
return (26 ** power) * (ord(letter) - 96) #we use 96 because the #ASCII value of 'a' is 97
def rabin_hash(prev, start, new, k):
return (prev - value(start, k - 1)) * 26 + value(new, 0)
def find_matches(pattern, string):
matches = []
k = len(pattern)
pattern_val = 0
for i, char in enumerate(pattern):
pattern_val += value(char, k - i - 1)
hash_val = 0
for i, char in enumerate(string[:k]):
hash_val += value(char, k - i - 1)
if pattern_val == hash_val:
if string[:k] == pattern:
matches.append(0)
for i in range(len(string) - k):
hash_val = rabin_hash(hash_val, string[i], string[i + k], k)
if pattern_val == hash_val:
if string[i + 1 : i + k + 1] == pattern:
matches.append(i + 1)
return matches
In the worst case, if our hash function produces many false positives, we will still have to check each substring against the pattern, so this would take O(k * N)
.
Practically, however, we should not see too many false positives. In the average case, since our algorithm takes O(k)
to compute the hash of the pattern and the start of the string, and O(N)
to roll the hash over the rest of the string, our running time should be O(k + N).