jcmetz21 jcmetz21 - 2 months ago 7
Python Question

Split a string into segments in python

I'm trying to split a molecule as a string into it's individual atom components. Each atom starts at a capital letter and ends at the last number.

For example,

'SO4'
would become
['S', 'O4']
.

And
'C6H12O6'
would become
['C6', 'H12', 'O6']
.

Pretty sure I need to use the regex module. This answer is close to what I'm looking for: Python: Split a string at uppercase letters

Answer

Use re.findall() with the pattern:

[A-Z][a-z]?\d*
  • [A-Z] matches any uppercase character

  • [a-z]? matches zero or one lowercase character

  • \d* matches zero or more digits

Based on your example this should work, although you should look out for any specific library for this purpose.

Example:

>>> re.findall(r'[A-Z][a-z]?\d*', 'C6H12O6')
['C6', 'H12', 'O6']

>>> re.findall(r'[A-Z][a-z]?\d*', 'SO4')
['S', 'O4']

>>> re.findall(r'[A-Z][a-z]?\d*', 'HCl')
['H', 'Cl']