shashwat vajpeyi - 1 month ago
R Question

Transpose data using an identifier

I have a text file (BLAST software output) with a single column and about 40,000 rows, which looks like the input below.

Essentially, I would like to use R or the terminal to convert this to multiple columns, with the first column containing the query name and the other columns containing the query hits, each hit in its own column.

Input is this:

Query1
result1
result2
result3

Query2
result1
result2
result3
result4
result5

Query3
result1
result2
result3
result4


Expected output:

Query1 result1 result2 result3
Query2 result1 result2 result3 result4 result5
Query3 result1 result2 result3 result4

Answer

Consider using readLines() to read the text file line by line, building a list of character vectors. The code below also uses each section header (i.e. Query1, Query2) as the name of the corresponding character vector:

con <- file("/path/to/text/file.txt", open = "r")

datalist <- list()
query <- c()
qName <- NULL                                                 # NO OPEN SECTION YET
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {

  if (grepl("Query", line)) {
    query <- c()                                              # RESET VECTOR
    qName <- line                                             # CAPTURE QUERY NAME
  }
  else if (grepl("[A-Za-z]", line)) {
    query <- c(query, line)                                   # APPEND LINE TO VECTOR
  }
  else if (line == "" && !is.null(qName)) {
    datalist <- c(datalist, setNames(list(query), qName))     # APPEND NAMED VECTOR TO LIST
    qName <- NULL                                             # SECTION STORED; SKIP EXTRA BLANK LINES
  }
}

if (!is.null(qName)) {                                        # ONLY IF NOT ALREADY STORED
  datalist <- c(datalist, setNames(list(query), qName))       # REMAINING LAST SECTION
}
close(con)

datalist

# $Query1
# [1] "result1" "result2" "result3"

# $Query2
# [1] "result1" "result2" "result3" "result4" "result5"

# $Query3
# [1] "result1" "result2" "result3" "result4"
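
To get from this list to the one-row-per-query layout asked for in the question, each query name can be pasted in front of its hits. The following is a minimal sketch, not part of the original answer; the sample list simply mirrors the output shown above, and the variable name rows is made up for illustration:

```r
# Sample list in the same shape as the list printed above (illustrative data).
datalist <- list(
  Query1 = c("result1", "result2", "result3"),
  Query2 = c("result1", "result2", "result3", "result4", "result5"),
  Query3 = c("result1", "result2", "result3", "result4")
)

# Prepend each query name to its hits and join everything with single spaces.
rows <- mapply(function(name, hits) paste(c(name, hits), collapse = " "),
               names(datalist), datalist, USE.NAMES = FALSE)

# Print one row per query.
writeLines(rows)

# Query1 result1 result2 result3
# Query2 result1 result2 result3 result4 result5
# Query3 result1 result2 result3 result4
```

writeLines() also accepts a file path as its second argument, so the same rows can be written to disk instead of printed. Because the queries have different numbers of hits, the rows are kept as plain text lines here rather than forced into a rectangular data frame.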