which_command - 20 days ago
Python Question

Python: removing "Vancouver style" references from text

I am copying and pasting text from a scientific journal into a text file, and I would like to remove the references from it. The referencing style is "Vancouver", so when the text is copied and pasted into a text file it looks as follows:


The problem was solved by distance geometry12 or classical
multidimensional scaling13,14. However previously in the context of 3C
experiments1,10,11 and other presented evidence, a confidence rate of
20 was given to the…


My desired output is:


The problem was solved by distance geometry or classical
multidimensional scaling. However previously in the context of 3C
experiments and other presented evidence, a confidence rate of 20 was
given to the…


I have tried the following based on previous posts:

import re

# Read the file once; a second read() call would return an empty string
with open("paper.txt", "r") as file:
    x = file.read()

x = re.sub(r'[-\d,]+', '', x)


However, this removes every digit in the text (including those in '3C' and '20'), along with every comma and hyphen, when all I want is to remove the reference numbers, e.g.:

geometry12... scaling13,14... experiments1,10,11 -> geometry... scaling... experiments


So how can I remove the reference numbers that immediately follow words without removing the normal numbers?

Answer

We'll look for a character that is neither a digit nor a space (so that standalone numbers such as "20" and tokens that start with a digit such as "3C" are preserved), followed by one or more digits and then zero or more repetitions of a comma followed by digits. We then replace the whole match with the first character that was captured.

import re

s = '''The problem was solved by distance geometry12 or classical multidimensional scaling13,14. However previously in the context of 3C experiments1,10,11 and other presented evidence, a confidence rate of 20 was given to the...'''

# \1 keeps the captured non-digit, non-space character; the digits after it are dropped
print(re.sub(r'([^ 0-9])(\d+(?:,\d+)*)', r'\1', s))

Result:

The problem was solved by distance geometry or classical multidimensional scaling. However previously in the context of 3C experiments and other presented evidence, a confidence rate of 20 was given to the...
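
One caveat when applying this to the text as pasted in the question: there, "20" sits at the start of a line, and a newline also counts as "neither a digit nor a space", so the pattern above would strip that "20" as well. A possible variant, sketched here as an assumption rather than as part of the original answer, uses a lookbehind that requires a letter directly before the digits. The helper name strip_citations is hypothetical. Note that this variant would still strip digits from tokens such as "CO2", so it only suits text where digits glued to a word are always citations.

```python
import re

# Assumed variant: only digits that directly follow a letter are treated as citations.
CITATION = re.compile(r'(?<=[A-Za-z])\d+(?:,\d+)*')

def strip_citations(text):
    """Remove digit groups (e.g. 12 or 1,10,11) glued to the end of a word."""
    return CITATION.sub('', text)

# Same sample as in the question, with its original line breaks
sample = (
    "The problem was solved by distance geometry12 or classical\n"
    "multidimensional scaling13,14. However previously in the context of 3C\n"
    "experiments1,10,11 and other presented evidence, a confidence rate of\n"
    "20 was given to the..."
)

cleaned = strip_citations(sample)
print(cleaned)
```

Because the lookbehind consumes no characters, no backreference is needed in the replacement, and the "20" at the start of the last line is left untouched.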