Codje Codje - 7 months ago 8
Python Question

Efficient way to loop on comparing string list element to a string list sub-element

I am currently struggling to find an efficient way to compare part of a string element attached to a list, to another string element. The current code computation is very long (1 hour with 4,8 millions elements in first list and 5000 elements in second one).

What I need to do: If 8 first characters of the first string element is equal to the full second element, a third list is updated with the full first element. Once it is found, we test another element of the first list.

Here is the code:

for first_element in first_List :
for second_element in second_List:
if first_element[:8] == second_element :
third_List.append(first_element)
break


I know those kinds of loops are not the best way to deal with very big lists. The number of
if
tests is really huge.
I was wondering if there is an efficient way to do this.

I think intersection with sets won't work since I'm comparing a part of an element to a full one and I need to copy the full first element in a third list.

Do you have some suggestions or ideas please?

Answer

This works:

second_set = set(second_list)
third_list = [value for value in first_list if value[:8] in second_set]

Example:

>>> first_list = ['abcdfghij', 'xyzxyzxyz', 'fjgjgggjhhh']
>>> second_list = ['abcdfghi', 'xyzxyzxy', 'xxx']
>>> second_set = set(second_list)
>>> third_list = [value for value in first_list if value[:8] in second_set]
>>> third_list
['abcdfghij', 'xyzxyzxyz']

This should be much more efficient. The conversion of the list second_list into the set is O(n). There is one loop over first_list that is O(n). The lookup in the set, i.e. in second_set is O(1).

Comments