Wouldn't it be even faster to use a dictionary with 10 keys, m = {0:1, 1:2, 2:2, 3:2, 4:2, 5:2, 6:3, 7:3, 8:4, 9:4}, and access it with m[int(r * 10)]? Something like
from random import random

def solution2(nums, probs):
    times = int(input("Enter number of random numbers generated: "))
    # Replicate nums[i] int(probs[i] * 10) times and flatten into one list.
    plist = [n for plist in [[nums[i]]*int(p*10) for i,p in enumerate(probs)] for n in plist]
    # Keys 0..9 each map to one slot of the replicated list.
    rnd_smpl = {i: n for i, n in enumerate(plist)}
    # Tally how often each num gets drawn.
    ma = {n: 0 for n in nums}
    for _ in range(times):
        ma[rnd_smpl[int(random() * 10)]] += 1
    print(ma)
This is an interesting approach. You've clearly put a lot of thought into this. However, I don't fully understand what you've done here. Can you elaborate? What is the significance of the m dictionary? What are the keys and values there?
Also, as a general piece of advice (in both coding and in your interviews), try to avoid lines like this: plist = [n for plist in [[nums[i]]*int(p*10) for i,p in enumerate(probs)] for n in plist]. It's very cleverly written, but that also means it's not obvious what it does or why it works. You want individual lines of code to be as bland as possible, because that makes them easy to review. It'll also save you a lot of effort in your interviews, since you won't have to explain what you're doing. How would you explain this line in an interview? Can you prove this line would work for every set of input values?
In general, what do you think the time complexity of this algorithm is? What about the space requirements?
I agree I should have done the plist comprehension in two steps. That one was just for you ;-), but thanks for the advice.
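For the record, the same list built in two bland steps would look like this (a sketch; the bucket name is mine):

    # Step 1: one sub-list per num, replicated in proportion to its probability.
    buckets = [[nums[i]] * int(p * 10) for i, p in enumerate(probs)]  # e.g. [[1], [2]*5, [3]*2, [4]*2]
    # Step 2: flatten into a single list, e.g. [1, 2, 2, 2, 2, 2, 3, 3, 4, 4].
    plist = [n for bucket in buckets for n in bucket]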
Initially, the dictionary used floats as keys: {0.0: 1, 0.1: 2, etc.}. The purpose is to divide the probability range (rounded to the first decimal) into 10 buckets and make the values equal to the nums.
If a num must be sampled with a probability of 0.5, we place it 5 times in the dictionary (the keys themselves don't matter, as long as the value is present 5 times), and there is an easy way to access it using random().
So, if you generate 10 random numbers, floor them to the first decimal, and use those as keys, you will hit the value that is present 5 times about 5 times, on average.
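Concretely, that first float-keyed attempt would look something like this (a sketch of the version described above, not the final code):

    from random import random

    # 10 buckets of width 0.1; each value appears as many times as its probability requires.
    m_float = {0.0: 1, 0.1: 2, 0.2: 2, 0.3: 2, 0.4: 2,
               0.5: 2, 0.6: 3, 0.7: 3, 0.8: 4, 0.9: 4}
    r = random()
    key = int(r * 10) / 10  # floor to the first decimal, e.g. 0.537 -> 0.5
    print(m_float[key])     # works: int(r * 10) / 10 reproduces the key floats exactly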
Then I realized it was even easier if I used int keys 0, 1, ..., 9, so I can just access them by doing int(random() * 10). The final form of the dictionary is the m I defined above (i.e. rnd_smpl in the code), where the value 1 is present 1 time, 2 is present 5 times, and 3 and 4 are present 2 times each.
So, since you avoid the binary search, I think the complexity is O(n + k), where n is len(nums) and k is the number of samples drawn, assuming dictionary access is O(1); memory is O(n).
Now, if the probabilities have 2 decimals, you'd have to split the probability range into 100 'buckets' and use 2-digit keys. So the complexity does actually depend on how many decimals we use for the probabilities: with d decimals you need 10^d entries. If k is high, I think it's still worth it.
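A minimal sketch of that generalization, assuming every probability has at most `decimals` decimal places and the probabilities sum to 1 (build_lookup and sample are names I'm making up here; a plain list works just as well as the integer-keyed dict, since the keys are just 0..10**decimals - 1):

    from random import random

    def build_lookup(nums, probs, decimals=1):
        # One table slot per 10**-decimals of probability mass.
        scale = 10 ** decimals
        table = []
        for n, p in zip(nums, probs):
            table.extend([n] * round(p * scale))  # replicate n in proportion to p
        return table

    def sample(table):
        # O(1) per draw, but the table itself holds 10**decimals entries.
        return table[int(random() * len(table))]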
I kinda see what you're doing, but it doesn't fully click for me. Maybe we can get on a call and you can explain this to me?
Sure, always a pleasure to talk to you.
It just occurred to me that I could have described it differently.
Instead of using len(nums) buckets like you did and finding the bucket to which random() corresponds with binary search (thanks for the bisect functions, which I didn't know about, btw), I create a new virtual uniform distribution (1, 2, 2, 2, 2, 2, 3, 3, 4, 4), in which all values have equal probability 1/10.
Every value with probability 0.n is replicated n times, so now you can sample it uniformly without any further processing. Sampling this "distribution" 10 times will, on average, produce that value n times (because the replica counts sum to 10, i.e. ∑n = 10).
The dictionary is simply a mapping between any possible random number and a given value, such that any random number, once formatted, corresponds to one of the keys. You're actually sampling the keys, but this correspondence means you're also sampling the virtual distribution.
The only drawback, as I mentioned, is that if you use probabilities with, say, 3 decimals, you have to create a dictionary with a thousand keys. From what I've seen of ML algorithms, I don't think probabilities with 5 decimals make sense, even with algorithms such as particle filters, imo.
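For contrast, here is roughly what I understood your bisect-based version to be (a sketch reconstructed from your description, since your code isn't reproduced in this thread):

    import bisect
    from itertools import accumulate
    from random import random

    def sample_bisect(nums, probs):
        # Cumulative probabilities, e.g. [0.1, 0.6, 0.8, 1.0] for probs [0.1, 0.5, 0.2, 0.2].
        cdf = list(accumulate(probs))
        # Binary search for the first bucket whose cumulative probability exceeds random().
        return nums[bisect.bisect(cdf, random())]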
I'll message you on LinkedIn. I am very interested in understanding your solution in more detail. Appreciate your inputs here.