bg49ag bg49ag - 19 days ago 5
R Question

Combining headerless .csv's by UIDs and removing duplicates within rows in PowerShell or RStudio

I have some large (300,000+ line) .csv's I've been working on; they're too big to import into Excel to filter / get unique values and have been taking several hours each to concatenate in PowerShell. I've been making progress on them but have run into two problems, one to do merging them and the second to do with removing duplicates within (independent) rows; any help with either would be great! I should say for anyone concerned, these are not confirmed side effects listed below, they may not even be related, but thought it'd be helpful to include them to fully illustrate the formatting.

They're now in the format:

drugs.csv

10005323,rituximab,rituximab,rituximab,tocilizumab,methotrexate,methotrexate,prednisone,abatacept,certolizumab pegol,etanercept
10005479,rituximab,tocilizumab
11848415,(1-43)-(18-2)-blood-coagulation factor viii (synthetic human) fusion protein with immunoglobulin g/1 (synthetic human fc domain fragment)?,(1-73)-(18-22)-blood-coagulation factor/viii (synthetic human) fusion protein with immunoglobulin g/1 (synthetic human fc domain fragment)?


sideeffects.csv

10005323,asthenia,bronchitis,cholecystitis,cholelithiasis,gastroenteritis,hypertension,lung disorder,pterygium
10005513,dengue fever,feeling hot,headache,osteoarthritis,pain,pruritus,pyrexia,rash macular,weight increased,wrong technique in product usage process
10005479,chest pain,larynx irritation,urticaria


Problem 1, merging (combining) by unique identifier, without headers

The number at the start of each line is a unique ID for each entry, followed by a list of drugs. They're "," delimited, but there are also quite a few other characters in there. There aren't any headers.

The first problem is to do with merging the .csv's together; I need to tack the line of side effects from the second.csv onto the associated line of drugs. I've found lots of examples regarding how to do this but they all seem to relate to combining using header names. There aren't any headers present here and the length of each line varies from case to case. Is there a way to combine an entire row in PowerShell by specifying the first delimited value and then the rest of the line, rather than by specifying individual headers to match? E.g. tacking side effects of case 10005323 onto list of drugs 10005323 to give:

Example output 1

10005323,rituximab,rituximab,rituximab,tocilizumab,methotrexate,methotrexate,prednisone,abatacept,certolizumab,pegol,etanercept,asthenia,bronchitis,cholecystitis,cholelithiasis,gastroenteritis,hypertension,lung disorder,pterygium


Problem 2, removing duplicates present in independent rows

The second problem is each line currently isn't sorted and contains multiple entries; e.g. the line starting 10005323 contains rituximab a few times.

Most of the similar solutions I've looked at will try removing duplicates from ALL lines simultaneously (e.g. they'd also remove rituximab from the second line; 10005479), or they are related to deleting entire rows, or require headers to be specified. But I only need to get the unique entries for each line, independently of the others. Retaining the unique ID, after the files are merged, is not entirely necessary, but it would be handy if it could be retained in the first position of each line. I think this will be a "sort | get-unique" operation but I can't figure out how to apply it line by line to give:

Example output 2

10005323,abatacept,asthenia,bronchitis,certolizumab,cholecystitis,cholelithiasis,etanercept,gastroenteritis,hypertension,lung disorder,methotrexate,pegol,prednisone,pterygium,rituximab,tocilizumab
10005479,chest pain,larynx irritation,rituximab,tocilizumab,urticaria

Answer

In R:

# read data
dr=read.csv("drugs.csv", header = F, stringsAsFactors = F)
se=read.csv("sideeffects.csv", header=F,stringsAsFactors = F)

# rename first column as id
colnames(dr)[1] <- "id";
colnames(se)[1] <- "id";

#load needed libs
library(reshape2)
library(sqldf)

#transpose data from columns to rows
drColsToRows=melt(dr, id=c("id"))
seColsToRows=melt(se, id=c("id"))

#remove empty rows and  deduplicate/concat and order by (to get alphabetical order of elements)
drDedup=sqldf("select id, group_concat(distinct value) dr from (select * from drColsToRows order by value) where value<>'' group by id")
seDedup=sqldf("select id, group_concat(distinct value) se from (select * from seColsToRows order by value) where value<> '' group by id")

# join two data sets on id column
merged=merge(drDedup, seDedup, by="id", all=TRUE)

#concatentate drugs with sideeffect
mergedConcat=sqldf("select id, trim(ifnull(dr,'')||','||ifnull(se,''),',') from merged")

#save output
write.table(mergedConcat, "out.txt",  row.names=FALSE, col.names=FALSE, sep=",",quote = F)
Comments