Jacopo Terrinoni - 3 months ago
Python Question

Regular expression does not match Unicode characters

I am trying to run a regular expression over a text that contains special characters such as à, è, ù, etc.

filter_2 = ur'(?:^\|\s+)?(?:(?:main_interests)|(?:influenced)|(?:influences))\s+?=[\s\W]+?(?:[\w}])*?([\d\w\s\-()*–&;\[\]|.<>:/",\']*)(?=\n)'
compiled = re.compile(filter_2, flags=re.U | re.M)
filter_list = re.findall(compiled, information)


The text below is the result of the evaluation of the expression.


[[Pedro Calderón de la Barca|Calderón]], [[Christian Fürchtegott Gellert|Gellert]], [[Oliver Goldsmith|Goldsmith]], [[Hafez]], [[Johann Gottfried Herder|Herder]], [[Homer]], [[Kālidāsa]], [[Kant]], [[Friedrich Gottlieb Klopstock|Klopstock]], [[Gotthold Ephraim Lessing|Lessing]], [[Carl Linnaeus|Linnaeus]], [[James Macpherson|Macpherson]], [[Jean-Jacques Rousseau|Rousseau]], [[Friedrich Schiller|Schiller]], [[William Shakespeare|Shakespeare]], [[Spinoza]], [[Emanuel Swedenborg|Swedenborg]],[[Karl Robert Mandelkow]], Bodo Morawe: Goethes Briefe. 2. edition. Vol. 1: Briefe der Jahre 1764–1786. ''Christian Wegner'', Hamburg 1968, p. 709 [[Johann Joachim Winckelmann|Winckelmann]]


Now, when I run another regular expression over the above text to extract the names inside the double square brackets, the result is wrong. Every name that contains a special character, like à, ù or è, is cut off at that character, so the result is not the one expected.

filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
another_compiled = re.compile(filter_6, flags=re.U | re.M)
another_filtered_list = re.findall(another_compiled, (str(filter_list)))


These are my results:


[('Pedro Calder', ''), ('Christian F', ''), ('Oliver Goldsmith', ''), ('Hafez', ''), ('Johann Gottfried Herder', ''), ('Homer', ''), ('K', ''), ('Kant', ''), ('Friedrich Gottlieb Klopstock', ''), ('Gotthold Ephraim Lessing', ''), ('Carl Linnaeus', ''), ('James Macpherson', ''), ('Jean-Jacques Rousseau', ''), ('Friedrich Schiller', ''), ('William Shakespeare', ''), ('Spinoza', ''), ('Emanuel Swedenborg', ''), ('Karl Robert Mandelkow', ''), ('Johann Joachim Winckelmann', ''), ('Thomas Carlyle', ''), ('Ernst Cassirer', ''), ('Charles Darwin', ''), ('Sigmund Freud', ''), ('G', ''), ('Andr', ''), ('Hermann Hesse', ''), ('G.W.F. Hegel', ''), ('Muhammad Iqbal', ''), ('Daisaku Ikeda', ''), ('Carl Gustav Jung', ''), ('Milan Kundera', ''), ('S', ''), ('Jean-Baptiste Lamarck', ''), ('Joaquim Maria Machado de Assis', ''), ('Thomas Mann', ''), ('Friedrich Nietzsche', ''), ('France Pre', ''), ('Grigol Robakidze', ''), ('Friedrich Schiller', ''), ('Oswald Spengler', ''), ('Max Stirner', ''), ('Friedrich Wilhelm Joseph Schelling', ''), ('Arthur Schopenhauer', ''), ('Oswald Spengler', ''), ('Rudolf Steiner', ''), ('Henry David Thoreau', ''), ('Nikola Tesla', ''), ('Ivan Turgenev', ''), ('Ludwig Wittgenstein', ''), ('Richard Wagner', ''), ('Leopold von Ranke', '')]


These are the results I would like to achieve:


MATCH 1
1. [2-28]
Pedro Calderón de la Barca

MATCH 2
1. [43-72]
Christian Fürchtegott Gellert

MATCH 3
1. [86-102]
Oliver Goldsmith

MATCH 4
1. [118-123]
Hafez

MATCH 5
1. [129-152]
Johann Gottfried Herder

MATCH 6
1. [165-170]
Homer

MATCH 7
1. [176-184]
Kālidāsa

MATCH 8
1. [190-194]
Kant

MATCH 9
1. [200-228]
Friedrich Gottlieb Klopstock

MATCH 10
1. [244-268]
Gotthold Ephraim Lessing

MATCH 11
1. [282-295]
Carl Linnaeus

MATCH 12
1. [310-326]
James Macpherson

MATCH 13
1. [343-364]
Jean-Jacques Rousseau

MATCH 14
1. [379-397]
Friedrich Schiller

MATCH 15
1. [412-431]
William Shakespeare

MATCH 16
1. [449-456]
Spinoza

MATCH 17
1. [462-480]
Emanuel Swedenborg

MATCH 18
1. [501-522]
Karl Robert Mandelkow

MATCH 19
1. [659-685]
Johann Joachim Winckelmann



All the regular expressions were tested online and they work perfectly there. Is there a way to actually include these special characters?

Answer

In Python 3, the regex doesn't compile, because the ur'' string prefix is a syntax error there. This seemed to work for me when I changed:

filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'

to just a unicode (not raw) string:

filter_6 = u'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
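As a hedged alternative for Python 3, where every str is already Unicode, a plain raw string avoids both the ur'' prefix and the invalid-escape warnings that a non-raw pattern can trigger. The sketch below is not the original pattern: it escapes the | in the lookahead (unescaped, it is an empty alternation that always matches) and drops the extra capture group so findall returns plain strings. The names sample and link_target are made up for illustration.

```python
import re

# Hypothetical sample in the same wiki-link format as the question's text.
sample = (
    "[[Pedro Calderón de la Barca|Calderón]], "
    "[[Christian Fürchtegott Gellert|Gellert]], [[Hafez]]"
)

# Raw string pattern: text right after "[[" up to either "]]" or the "|" separator.
# In Python 3, \w matches Unicode letters for str patterns by default.
link_target = re.compile(r"(?<=\[\[)[\w\s.-]+(?=\]\]|\|)")

names = link_target.findall(sample)
print(names)
```

With no capture groups in the pattern, findall returns the matched names directly instead of tuples with an empty second element.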

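In Python 2, I believe the issue is the casting of the list to a string. str(filter_list) returns the repr of the list, in which non-ASCII characters appear as escape sequences such as \xf3, so [\w\s.-]+ stops at the backslash. Changing str(filter_list) to ' '.join(filter_list) seemed to work for me.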
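The difference between the two conversions can be sketched in Python 3 as follows. This is an emulation under stated assumptions: Python 3's str() on a list no longer escapes non-ASCII characters, so ascii() stands in for what Python 2's str() did to a list of unicode strings. The names matches, escaped, joined and LINK_TARGET are illustrative, and the pattern is a simplified version of filter_6.

```python
import re

# Simplified stand-in for filter_6: the text that follows "[[".
LINK_TARGET = re.compile(r"(?<=\[\[)[\w\s.-]+")

# Hypothetical first-stage result: a list holding one matched string.
matches = ["[[Pedro Calderón de la Barca|Calderón]], [[Hafez]]"]

# ascii() escapes non-ASCII characters, much like Python 2's str() on a list
# of unicode strings: "ó" becomes the four characters \xf3.
escaped = ascii(matches)

# Joining keeps the real characters in the text.
joined = " ".join(matches)

truncated = LINK_TARGET.findall(escaped)
complete = LINK_TARGET.findall(joined)

print(truncated)  # names are cut at the backslash of the escape sequence
print(complete)   # names are matched in full
```

The first list reproduces the truncation seen in the question ('Pedro Calder'), while the joined text yields the full name.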
