Megatron Megatron - 1 month ago 16
C# Question

Check if one collection of values contains another

Suppose I have two collections as follows:

Collection1:
"A1"
"A1"
"M1"
"M2"

Collection2:
"M2"
"M3"
"M1"
"A1"
"A1"
"A2"

all the values are string values. I want to know if all the elements in Collection1 are contained in Collection2, but I have no guarantee on the order and a set may have multiple entries with the same value. In this case, Collection2 does contain Collection1 because Collection2 has two A1's, M1 and M2. Theres the obvious way: sorting both collections and popping off values as i find matches, but I was wondering if there's a faster more efficient way to do this. Again with the initial collections I have no guarantee on the order or how many times a given value will appear

EDIT: Changed set to collection just to clear up that these aren't sets as they can contain duplicate values

Answer

Yes, there is a faster way, provided you're not space-constrained. (See space/time tradeoff.)

The algorithm:

Just insert all the elements in Set2 into a hashtable (in C# 3.5, that's a HashSet<string>), and then go through all the elements of Set1 and check if they're in the hashtable. This method is faster (Θ(m + n) time complexity), but uses O(n) space.

Alternatively, just say:

bool isSuperset = new HashSet<string>(set2).IsSupersetOf(set1);

Edit 1:

For those people concerned about the possibility of duplicates (and hence the misnomer "set"), the idea can easily be extended:

Just make a new Dictionary<string, int> representing the count of each word in the super-list (add one to the count each time you see an instance of an existing word, and add the word with a count of 1 if it's not in the dictionary), and then go through the sub-list and decrement the count each time. If every word exists in the dictionary and the count is never zero when you try to decrement it, then the subset is in fact a sub-list; otherwise, you had too many instances of a word (or it didn't exist at all), so it's not a real sub-list.


Edit 2:

If the strings are very big and you're concerned about space efficiency, and an algorithm that works with (very) high probability works for you, then try storing a hash of each string instead. It's technically not guaranteed to work, but the probability of it not working is pretty darn low.