nik nik - 5 months ago 8
Ruby Question

highlight strings from one df to another df

I have a text file like the one below. In this example it has 6 rows, and each row contains one or more strings separated by commas: the first row has a single string, while the second has two. The row numbers are included only to make the example clear.

1 P41182
2 P41152,Q9UQL6
3 P41172
4 Q92793,Q09472,Q9Y6Q9
5 Q15021,TQ9472
6 Q15021,Q9BPX3,Q15003,O95347,Q9NTJ3


I have a second text file with the same structure, shown below. Again, the first row has a single string and the second has two.

1 P41182
2 P41152,Q9UYIU
3 P41172
4 Q9IO93,Q9Y6IT
5 P30561
6 Q15021,Q9BPX3,Q15003,O95347,Q9NTJ3
7 HT8971
8 HLI872


I want to know the index of the strings in the second file that also appear in the first file. The rules are as follows:

If a row of the first file contains only one string and it matches a string in the second file, I don't want the index. If a row of the first file contains more than one string and one or more of them match strings in the second file, I want the index of each match, written as the row number and the position within that row of the first file. For example, the output should look like the following:

df3

1 P41182
2 P41152_2_1,Q9UYIU
3 P41172
4 Q9IO93,Q9Y6IT
5 P30561
6 Q15021_5_1_6_1,Q9BPX3_6_2,Q15003_6_3,O95347_6_4,Q9NTJ3_6_5
7 HT8971
8 HLI872


The string in the first row of the second file matches a row of the first file that has only 1 member (the strings in each row are separated by commas), so I leave it as it is and I don't want the index.

The first string in the second row of the second file matches the first string in the second row of the first file, so it gets 2_1.

The first string in the sixth row of the second file matches the first string in the fifth row of the first file and also the first string in the sixth row of the first file, so it gets 5_1 and 6_1.

And so on.

Answer

I think the following Ruby code should work. Make sure df1.txt and df2.txt contain only the comma-separated strings, one row per line, without the row numbers. The output is written to df3.txt. Sample files are shown below.

# Read df1.txt: map each 1-based row number to its list of strings, and
# record every row number in which each string occurs.
df1_hash = {}
df1_term_positions_hash = Hash.new { |hash, key| hash[key] = [] }
File.readlines("df1.txt").each_with_index do |line, i|
    terms = line.strip.split(",")
    df1_hash[i + 1] = terms
    terms.each { |term| df1_term_positions_hash[term] << (i + 1) }
end

# Read df2.txt the same way (row number => list of strings).
df2_hash = {}
File.readlines("df2.txt").each_with_index do |line, i|
    df2_hash[i + 1] = line.strip.split(",")
end

# Build the output rows: for every df1 row that contains the string,
# append "_<row>_<position>" to it.
df3_hash = {}
df2_hash.each do |row, terms|
    df3_hash[row] = terms.map do |term|
        updated_term = term.dup
        df1_term_positions_hash.fetch(term, []).each do |df1_row|
            df1_terms = df1_hash[df1_row]
            # Rows of df1.txt with a single string get no index. Skipping them
            # (rather than resetting the string) keeps indices already added.
            next if df1_terms.size <= 1
            updated_term << "_#{df1_row}_#{df1_terms.index(term) + 1}"
        end
        updated_term
    end
end

# Write one comma-separated row per line to df3.txt.
File.open("df3.txt", "w") do |file|
    df3_hash.each_value { |terms| file.puts terms.join(",") }
end
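
The sample rows in the question carry a leading row number (for example "2 P41152,Q9UQL6"), while the script above expects bare comma-separated strings. If your files still contain those numbers, a minimal sketch for stripping them could look like the following. Note that strip_row_number is a hypothetical helper name, and the sketch assumes the number is made of plain digits separated from the strings by whitespace.

```ruby
# Hypothetical helper: drop a leading "<digits><whitespace>" prefix from a line.
# Assumption: row numbers are plain digits followed by at least one space or tab.
def strip_row_number(line)
    line.strip.sub(/\A\d+\s+/, "")
end

puts strip_row_number("2 P41152,Q9UQL6")  # prints P41152,Q9UQL6
puts strip_row_number("P41182")           # no prefix, prints P41182
```

You could apply it to each line before splitting on commas, in place of the plain line.strip call.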

df1.txt

P41182
P41152,Q9UQL6
P41172
Q92793,Q09472,Q9Y6Q9
Q15021,TQ9472
Q15021,Q9BPX3,Q15003,O95347,Q9NTJ3

df2.txt

P41182
P41152,Q9UYIU
P41172
Q9IO93,Q9Y6IT
P30561
Q15021,Q9BPX3,Q15003,O95347,Q9NTJ3
HT8971
HLI872

output in df3.txt

P41182
P41152_2_1,Q9UYIU
P41172
Q9IO93,Q9Y6IT
P30561
Q15021_5_1_6_1,Q9BPX3_6_2,Q15003_6_3,O95347_6_4,Q9NTJ3_6_5
HT8971
HLI872
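
To try the rule without creating any files, here is a small in-memory sketch of the same idea. The function name annotate and the shortened sample rows are assumptions made for illustration only; it is not a replacement for the script, just a way to check the behaviour on a few rows.

```ruby
# Hypothetical helper: annotate each string in rows2 with "_<row>_<position>"
# for every multi-string row of rows1 that contains it (indices are 1-based).
def annotate(rows1, rows2)
    positions = Hash.new { |hash, key| hash[key] = [] }
    rows1.each_with_index do |terms, i|
        terms.each_with_index do |term, j|
            # Rows holding a single string are never reported.
            positions[term] << "#{i + 1}_#{j + 1}" if terms.size > 1
        end
    end
    rows2.map do |terms|
        terms.map { |term| ([term] + positions.fetch(term, [])).join("_") }
    end
end

# Shortened sample data (an assumption, not the full files above).
rows1 = [%w[P41152 Q9UQL6], %w[P41172], %w[Q15021 TQ9472], %w[Q15021 Q9BPX3]]
rows2 = [%w[P41152 Q9UYIU], %w[P41172], %w[Q15021 Q9BPX3]]
annotate(rows1, rows2).each { |terms| puts terms.join(",") }
# prints:
# P41152_1_1,Q9UYIU
# P41172
# Q15021_3_1_4_1,Q9BPX3_4_2
```

P41172 stays unchanged because its only match is a single-string row, while Q15021 collects one index for each multi-string row in which it appears.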

Sorry for the slightly messy code, but it works. Hope it helps :)