user3723011 - 7 months ago
Python Question

Searching titles in medline database with entrez and biopython

I am trying to search for papers with specific words in the title. More precisely, I want papers published between 2010 and 2015 whose title contains the word viral or virus. Here is the code I have:

import re
from Bio import Entrez, Medline

handle = Entrez.esearch(db="pubmed", # database to search
                        term="2010[Date - Publication]:2015[Date - Publication]"
                        )
record = Entrez.read(handle)
handle.close()

pmid_list = record["IdList"] #list of records

handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)

titles = [] # start with empty list of titles
for record in records:
    ti_list = record['TI'] #titles
    for title in ti_list:
        if title == "virus" and title not in titles: #searching viral/virus
            titles.append(title)

print('Publications with viral or virus in the title:')
for record in records:
    print(" ", title)


If I simply print(record['TI']), I get a list of all titles in my search query. However, I'm not able to search for the specific word. I think my mistake may be in the if title == "virus" test (because obviously no paper will be titled with that word alone).

I am pretty stuck. Is there a better way to search for this word in the titles of the papers I've queried?

Thanks.

Edit: Updated code (and still no luck)

import re
from Bio import Entrez, Medline

handle = Entrez.esearch(db="pubmed", # database to search
                        term="2010[Date - Publication]:2015[Date - Publication]"
                        )
record = Entrez.read(handle)
handle.close()

pmid_list = record["IdList"] #list of records

handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)

r = re.compile(r"\bvir(al|us)\b")
titles = set() # start with empty set of titles
for record in records:
    ti_list = record['TI'] # titles
    for title in ti_list:
        if r.search(title): #
            titles.add(title)

print('Publications with viral or virus in the title:')
for record in records:
    print(" ", title)


New code:

import re
from Bio import Entrez, Medline
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text",
                       term="2010[Date - Publication]:2015[Date - Publication]")
records = Medline.parse(handle)
titles = []
for record in records:
    ti_list = record['TI']
    for title in ti_list:
        titles.append(title)
handle.close()
for title in titles:
    print(title)

Answer

If you want to match substrings, use the in operator to check whether any of the words is contained in the title:

words  = ("viral","virus")
if any(w in title for w in words) and title not in titles: #

But you seem to want to filter the records, keeping any record whose title contains viral or virus:

st = {"viral", "virus"}

filtered_records = [record for record in records if any(w in record['TI'] for w in st)]
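
To see the substring filter run on something concrete, here is a minimal sketch. The sample_records list and the title_has_word helper are made up for illustration; the dictionaries stand in for the dict-like records that Medline.parse yields, and the sketch assumes each 'TI' value is a single string and that a record may lack a title altogether.

```python
# Hypothetical stand-ins for the dict-like records yielded by Medline.parse.
sample_records = [
    {"PMID": "1", "TI": "Structure of a virus capsid protein."},
    {"PMID": "2", "TI": "Bacterial growth in soil."},
    {"PMID": "3", "TI": "A survey of viral load assays."},
    {"PMID": "4"},  # assumed case: a record without a 'TI' field
]

words = ("viral", "virus")


def title_has_word(record, words=words):
    """Return True if any of the words occurs as a substring of the title."""
    title = record.get("TI", "")  # fall back to "" when the title is missing
    return any(w in title for w in words)


filtered_records = [rec for rec in sample_records if title_has_word(rec)]

for rec in filtered_records:
    print(rec["PMID"], rec["TI"])
```

Note that the in test is case-sensitive, so a title starting with "Viral" would be skipped unless the title is lowercased first.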

If you want to match substrings with a pattern, you actually need to compile it into a regex; "vir(al|us)" on its own is just a string in your code:

import re

r = re.compile(r"vir(al|us)")
filtered_records = [record for record in records if r.search(record['TI'])]

The regex in your own loop would go where your if is:

import re

r = re.compile(r"vir(al|us)")
if r.search(title) and title not in titles: 
      .......

If you don't want viruses etc. to match, use word boundaries in your regex:

r = re.compile(r"\bvir(al|us)\b")
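
The effect of the word boundaries is easiest to see side by side. The sketch below uses invented sample titles and a hypothetical matches helper; the third pattern adds re.IGNORECASE, which is not part of the answer above but is worth considering because titles often begin with a capitalised "Viral" or "Virus".

```python
import re

plain = re.compile(r"vir(al|us)")                          # substring match
bounded = re.compile(r"\bvir(al|us)\b")                    # whole words only
bounded_ci = re.compile(r"\bvir(al|us)\b", re.IGNORECASE)  # whole words, any case

# Invented titles, for illustration only.
sample_titles = [
    "Entry of a virus into host cells",
    "Emerging viruses in bats",
    "Antiviral drug resistance",
    "Viral evolution in real time",
]


def matches(pattern, titles):
    """Return the titles in which the compiled pattern is found."""
    return [t for t in titles if pattern.search(t)]


plain_hits = matches(plain, sample_titles)
bounded_hits = matches(bounded, sample_titles)
bounded_ci_hits = matches(bounded_ci, sample_titles)

print("plain:     ", plain_hits)
print("bounded:   ", bounded_hits)
print("bounded_ci:", bounded_ci_hits)
```

The plain pattern also picks up "viruses" and "Antiviral"; the bounded pattern keeps only the exact lowercase words; the case-insensitive variant additionally catches the capitalised "Viral".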

You should also make titles a set, which cannot hold duplicates. A working example using your own code:

r = re.compile(r"\bvir(al|us)\b")
titles = set()  # start with an empty set of titles
for record in records:
    ti_list = record['TI']  # titles
    for title in ti_list:
        if r.search(title):  #
            titles.add(title)

Which can become a set comprehension:

r = re.compile(r"\bvir(al|us)\b")

titles = {title for record in records for title in record['TI']  if r.search(title)} # titles

Since record['TI'] actually returns a string and not a list, drop the inner loop and search the title directly:

r = re.compile(r"\bvir(al|us)\b")
titles = set()
for record in records:
    title = record['TI']  # title is a str not a list
    if r.search(title):
        titles.add(title)

Apply the same change to the set comprehension and to any other example that loops over record['TI'].
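
As a rough sketch of what that looks like for the set comprehension, assuming record['TI'] is a string: the sample_records list below is invented and stands in for the output of Medline.parse, and the "TI" in rec check is an added guard for records that might have no title. The final loop prints from titles rather than iterating records a second time, because Medline.parse yields each record only once.

```python
import re

r = re.compile(r"\bvir(al|us)\b")

# Invented sample data standing in for the records from Medline.parse.
sample_records = [
    {"TI": "Entry of a virus into host cells"},
    {"TI": "Bacterial growth in soil"},
    {"TI": "Entry of a virus into host cells"},  # duplicate title
    {"TI": "Hosts with high viral load"},
    {},  # assumed case: record without a 'TI' field
]

# One title string per record, so there is no inner loop over record['TI'].
titles = {
    rec["TI"]
    for rec in sample_records
    if "TI" in rec and r.search(rec["TI"])
}

print("Publications with viral or virus in the title:")
for title in sorted(titles):
    print(" ", title)
```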
