Dan Rubio Dan Rubio - 6 months ago 11
Ruby Question

How can I use an efficent alogirthm to find the difference between two very large arrays?

I have a problem concerning efficiency and algorithms when it comes to finding the difference between two very large arrays. I'm hoping someone with a good understanding of algorithms can point me to the right direction in how to solve this as my current implementations is taking an extremely long amount of time.

Problem:

I have two very large arrays. One contains a list of emails that have invalid domain names and the other is a mixed list that I need to check against the first array.

accounts_with_failed_email_domains = [279,000 records in here]

unchecked_account_domains = [149,000 records in here]


What I need to do is I need to go through the list of
unchecked_account_domains
and then compare each entry to see if there is a match or not in the
accounts_with_failed_email_domains
. If there is, I need to keep track of it and if not, then I'll leave it.

I hate to admit that I didn't spend any time with algorithms and now its cleary showing, hence I need the help. How can I efficiently write something that can quickly check through these accounts. Here is what I have tried so far.

unchecked_account_domains = [really big array]
unchecked_account_domains = unchecked_account_domains.sort

accounts_with_failed_email_domains = [another huge array].sort

unchecked_account_domains.keep_if do |email|
accounts_with_failed_email_domains.any? { |failed_email| email == failed_email }
end

# Count to see how many accounts are left
puts unchecked_account_domains.count


This above implementation has been running forever. Here is the second attempt which still proved not to be any better.

unchecked_account_domains = [really big array]
unchecked_account_domains = unchecked_account_domains.sort

accounts_with_failed_email_domains = [another huge array].sort

unchecked_account_domains.each do |email|
accounts_with_failed_email_domains.bsearch do |failed_email|
final_check << email if email == failed_email
end
end

# Count to see how many accounts are left
puts final_check.count


bsearch
seemed to be promisint, but I'm pretty sure I'm not using this correctly. Also, I tried looking into this question comparing large lists but this is in python and there doesn't seem to be a ruby equivalent for
set
. Does anyone have any ideas I could try or reasearch? Perhaps a gem, library, or something?

Answer

It seems like you can use Array#-:

result = unchecked_account_domains - accounts_with_failed_email_domains