Vladislav Ladenkov - 11 months ago
Python Question

Error applying simple regexp

I had a function using regular expressions that worked perfectly:

import re

def preprocess(topic, sample, RegSample):
    topic = re.sub(RegSample, '?X?', topic, flags=re.I)  # "" «» quotes for agent X
    topic = re.sub(sample, '?X?', topic, flags=re.I)  # without brackets
    topic = re.sub(r'[ЗАО]*[АО]О\s?X?', '?X? ', topic, flags=re.I)  # ЗАО, ОАО, ООО, etc. for X
    topic = re.sub(r'\?X\?\?X\?', '?X?', topic)  # doubled agent X markers
    topic = re.sub(r'групп[^\s]\s?X?', '?X? ', topic, flags=re.I)  # group of agent X

    topic = re.sub(r'\s[a-zA-Z\s\d]*[\s\.$]', ' ?Y? ', topic)  # English words + digits = agent Y
    topic = re.sub(r'["«][^"»]*["»]', '?Y?', topic, flags=re.I)  # "" «» quotes for agent Y
    topic = re.sub(r'[ЗАО][ЗАО]О\s?Y?', '?Y?', topic)  # ЗАО, ОАО, ООО, etc. for Y
    topic = re.sub(r'\s[А-Я][^\s]*[\s.$]', ' ?Y? ', topic)  # replace Russian names/titles with agent Y
    topic = re.sub(r'\s[А-Я]\S*', '?Y?', topic)
    topic = re.sub(r'\s[a-zA-Z][^\s]*', ' ?Y?', topic)
    topic = re.sub(r'\?Y\?\?Y\?', '?Y?', topic)  # doubled agent Y markers

    topic = re.sub(r'[a-zA-Z\d\.-]*[\d][a-zA-Z\d\.-]*', '?D?', topic)  # English names containing digits (not companies)
    topic = re.sub(r'[а-яА-Я\d\.-]*[\d][а-яА-Я\d\.-]*', '?D?', topic)  # Russian names containing digits (not companies)
    return topic

But then I needed a few more regular expressions:

def final_preprocessing(topic):
    topic = re.sub('?X?', 'лол', topic)  # лол is the word encoding the company whose agent we are examining
    topic = re.sub('?Y?', 'кек', topic)  # кек is the word encoding all the other agent companies
    topic = re.sub('?D?', 'd ', topic)  # encodes all the junk collected in ?D?
    return topic

And got an error:
error: nothing to repeat at position 0

According to some existing answers (e.g. "Python regex strange behavior"), I had to make sure that those patterns actually occur in my text. I checked, and I can say with confidence that they do.
So what's the problem now?

P.S. The other regular expressions could also match zero substrings, but they did not raise an error!

Answer

? means "repeat zero or one times". When that is the first character of the regular expression, what do you expect to be repeated zero or one times? That's what "nothing to repeat at position zero" means: at position zero you are asking for something to repeat zero or one times, but there's nothing there to repeat.
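
To illustrate the point, here is a minimal sketch showing that the error is raised while the pattern is compiled, before any text is examined, so it does not matter whether ?X? occurs in the input. The helper name compile_error is hypothetical and exists only for this example.

```python
import re

def compile_error(pattern):
    """Return the re.error message for an invalid pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# A leading ? has nothing before it to repeat, so compilation fails.
print(compile_error('?X?'))

# With both question marks escaped, the pattern compiles and None is printed.
print(compile_error(r'\?X\?'))
```

This is also why the patterns in preprocess did not fail: in something like \s?X? each ? follows an item it can apply to, so it is a valid quantifier (although it then means "optional" rather than a literal question mark).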

You need to escape the question mark if you are looking for a literal question mark:

topic = re.sub(r'\?X\?', 'лол', topic)
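
If the markers are always fixed strings, a regular expression may not be needed at all. The sketch below assumes that plain substring replacement is all final_preprocessing is meant to do, so it uses str.replace. The last lines show re.escape as an alternative for cases where re.sub has to stay.

```python
import re

def final_preprocessing(topic):
    # str.replace treats '?X?' as literal text, so nothing needs escaping.
    topic = topic.replace('?X?', 'лол')
    topic = topic.replace('?Y?', 'кек')
    topic = topic.replace('?D?', 'd ')
    return topic

# Alternative when a regex is still wanted: let re.escape add the backslashes.
escaped = re.escape('?X?')
replaced = re.sub(escaped, 'лол', '?X? and ?X?')
```

Using str.replace keeps the intent obvious and avoids the quantifier problem entirely, while re.escape is handy when the literal marker has to be combined with other regex syntax.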