user2961008 user2961008 - 3 months ago 22
Python Question

Exact string search in XML files?

I need to search into some XML files (all of them have the same name, pom.xml) for the following text sequence exactly (also in subfolders), so in case somebody write some text or even a blank, I must get an alert:

<!--
| Start of user code (user defined modules)
|-->
<!--
| End of user code
|-->


I'm running the following Python script, but still not matching exactly, I also get alert even when it's partially the text inside:

import re
import os
from os.path import join
comment=re.compile(r"<!--\s+| Start of user code (user defined modules)\s+|-->\s+<!--\s+| End of user code\s+|-->")
tag="<module>"

for root, dirs, files in os.walk("."):

if "pom.xml" in files:
p=join(root, "pom.xml")
print("Checking",p)
with open(p) as f:
s=f.read()
if tag in s and comment.search(s):
print("Matched",p)


UPDATE #3

I am expecting to print out, the content of tag
<module>
if it exists between
|--> <!--


into the search:

<!--
| Start of user code (user defined modules)
|-->
<!--
| End of user code
|-->


for instance print after Matched , and the name of the file, also print "example.test1" in the case below :

<!--
| Start of user code (user defined modules)
|-->
<module>example.test1</module>
<!--
| End of user code
|-->


----LAST UPDATE ---#4

Should be using the following :

import re
import os
from os.path import join
comment=re.compile(r"<!--\s+\| Start of user code \(user defined modules\)\s+\|-->\s+<!--\s+\| End of user code\s+\|-->", re.MULTILINE)
tag="<module>"

for root, dirs, files in os.walk("/home/temp/test_folder/"):
for skipped in ("test1", "test2", ".repotest"):
if skipped in dirs: dirs.remove(skipped)

if "pom.xml" in files:
p=join(root, "pom.xml")
print("Checking",p)
with open(p) as f:
s=f.read()
if tag in s and comment.search(s):
print("The following file contains user code modules:-------------> ",p)

Answer

Don't parse a XML file with regular expression. The best Stackoverflow answer ever can explain you why

You can use BeautifulSoup to help on that task

Look how simple would be extract something from your code

from bs4 import BeautifulSoup

content = """
    <!--
     | Start of user code (user defined modules)
     |-->

    <!--
     | End of user code
     |-->
"""

bs = BeautifulSoup(content, "html.parser")
print(''.join(bs.contents))

Of course you can use your xml file instead of the literal I'm using

bs = BeautifulSoup(open("pom.xml"), "html.parser")

A small example using your expected input

from bs4 import BeautifulSoup
from bs4 import Comment

bs = BeautifulSoup(open(p), "html.parser")
# Extract all comments
comments=soup.find_all(string=lambda text:isinstance(text,Comment))
for c in comments:
    # Check if it's the start of the code
    if "Start of user code" in c:
        modules = [m for m in c.findNextSiblings(name='module')]
        for mod in modules:
            print(mod.text)

But if your code is always in a module tag I don't know why you should care about the comments before/after, you can just find the code inside the module tag directly

Comments