utkarsh13 - 4 months ago
Python Question

Efficient way to extract data within double quotes

I need to extract the data within double quotes from a string.

Input:

<a href="Networking-denial-of-service.aspx">Next Page →</a>


Output:

Networking-denial-of-service.aspx


Currently, I am using the following method, and it works fine.

atag = '<a href="Networking-denial-of-service.aspx">Next Page →</a>'
start = 0
end = 0

for i in range(len(atag)):
    if atag[i] == '"' and start==0:
        start = i
    elif atag[i] == '"' and end==0:
        end = i

nxtlink = atag[start+1:end]


So my question is: is there a more efficient way to do this?

Thank you.

Answer

I am taking the question exactly as written - how to get data between two double quotes. I agree with the comments that an HTMLParser might be better...
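For the HTML case in the question, the parser route could look roughly like the sketch below. It assumes Python 3's standard-library html.parser module (on Python 2 the module is named HTMLParser). The class name HrefCollector and its hrefs attribute are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []  # hypothetical attribute name, not part of HTMLParser

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


atag = '<a href="Networking-denial-of-service.aspx">Next Page →</a>'
collector = HrefCollector()
collector.feed(atag)
nxtlink = collector.hrefs[0]
print(nxtlink)
```

This is more code than a quote search, but it does not depend on where the quotes sit in the tag or on href being the first attribute.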

A regular expression might help, particularly if you want to find more than one quoted string. For example, this is one possible way to write it:

import re
string_with_quotes = 'Some "text" "with inverted commas"\n "some text \n with a line break"'

# The flags do not change this particular pattern ([^"] already matches newlines), but they can be useful in other patterns.
Find_double_quotes = re.compile(r'"([^"]*)"', re.DOTALL|re.MULTILINE|re.IGNORECASE)

list_of_quotes = Find_double_quotes.findall(string_with_quotes)

list_of_quotes

['text', 'with inverted commas', 'some text \n with a line break']

If you have an odd number of double quotes, then the last double quote is ignored. If none are found, then an empty list is produced.
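Applied to the tag from the question, the same pattern might be used as in the sketch below. The variable names are only illustrative, and re.search is used instead of findall because only the first quoted value is wanted. The last two calls check the odd-quote and no-quote cases described above.

```python
import re

# Same pattern as above, written without flags.
find_double_quotes = re.compile(r'"([^"]*)"')

atag = '<a href="Networking-denial-of-service.aspx">Next Page →</a>'

# search() stops at the first match; group(1) is the text between the quotes.
match = find_double_quotes.search(atag)
nxtlink = match.group(1) if match else None
print(nxtlink)

# Odd number of quotes: the trailing unmatched quote is skipped.
odd_result = find_double_quotes.findall('a "b" c "d')

# No quotes at all: findall returns an empty list.
empty_result = find_double_quotes.findall('no quotes here')
```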

Various references

http://www.regular-expressions.info/ is really good for learning regular expressions

The question "Regex - Does not contain certain Characters" showed me how to match anything except a given character.

https://docs.python.org/2/library/re.html#re.MULTILINE tells you what re.MULTILINE and re.DOTALL (underneath) do.
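As a quick illustration of those two flags, here is a small sketch. The sample text and variable names are made up for the example.

```python
import re

text = 'first line\nsecond line'

# Without re.DOTALL, "." stops at a newline; with it, "." matches newlines too.
dot_default = re.findall(r'first.*', text)
dot_all = re.findall(r'first.*', text, re.DOTALL)

# Without re.MULTILINE, "^" matches only at the start of the string;
# with it, "^" also matches right after each newline.
caret_default = re.findall(r'^\w+', text)
caret_multiline = re.findall(r'^\w+', text, re.MULTILINE)

print(dot_default)      # ['first line']
print(dot_all)          # ['first line\nsecond line']
print(caret_default)    # ['first']
print(caret_multiline)  # ['first', 'second']
```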
