Analyzer - 13 days ago
Python Question

Extracting file names from text using a regular expression in Python

I am trying to extract source code file names stored in a Python string variable. However, the variable contains HTML-style tags and a lot of other content, as shown below:

<p> Result = FAILURE<br/ hshreedharan : <a href="http://git-wip-
<ul>
<li>flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java</li>
<li>flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestBucketWriter.java</li>
<li>flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java</li>
<li>sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java</li>
<li>sink.src.main.java.org.apache.flume.sink.hdfs.BucketWriter.java</li>
</ul>


I am looking for a regular expression, using Python's "re" library, that ignores all other text and HTML tags and extracts only the source code file names contained in the variable:

flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java
flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestBucketWriter.java
flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java
sink.src.main.java.org.apache.flume.sink.hdfs.BucketWriter.java
sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java


Currently, I am using the following code:

import re

htmlText = ""  # placeholder: the variable containing the text above

matchSrcFiles = re.findall('\\.[^.]*.java$', htmlText)  # text ending in .java


Help with a proper regular expression, or with a different function such as re.sub, to extract the relevant source code files would be appreciated.

Answer

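Try this pattern: ([a-zA-Z-.\/]+.java)

It matches a run of letters, hyphens, dots, and slashes followed by any one character and "java". There is no $ anchor, so a name is still found when it is followed by </li> or other text.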

import re

a="""<p> Result = FAILURE<br/ hshreedharan : <a href="http://git-wip-
<ul>
<li>flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java</li>     
<li>flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestBucketWriter.java</li>
<li>flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java</li>
</ul>
channel/src/main/java/org/apache/flume/channel/file/protoProtosFactory.java.
sink.src.main.java.apache.flume.sink.java
"""

# Raw string so the backslash reaches the regex engine unchanged.
pat = r"([a-zA-Z-.\/]+.java)"
c = re.findall(pat, a)
print(c)

output:

['flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java', 'flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestBucketWriter.java', 'flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java', 'channel/src/main/java/org/apache/flume/channel/file/protoProtosFactory.java', 'sink.src.main.java.apache.flume.sink.java']
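For comparison, here is a small sketch of why the pattern from the question finds nothing on this kind of input. The sample string is a shortened stand-in for the real text, and the variable names are only illustrative.

```python
import re

# Shortened stand-in for the HTML text from the question (two entries only).
html_text = """<ul>
<li>flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java</li>
<li>sink.src.main.java.org.apache.flume.sink.hdfs.BucketWriter.java</li>
</ul>
"""

# Pattern from the question. Without flags, "$" only matches at the end of
# the whole string, where the text is "</ul>" rather than ".java".
question_pattern = r'\.[^.]*.java$'
no_flag = re.findall(question_pattern, html_text)

# With re.MULTILINE, "$" matches at the end of every line, but each line
# ends in "</li>", so ".java" is still never directly before "$".
multiline = re.findall(question_pattern, html_text, re.MULTILINE)

# The answer's pattern has no anchor and lists the allowed path characters,
# so the trailing "</li>" does not get in the way.
answer_pattern = r'([a-zA-Z-.\/]+.java)'
found = re.findall(answer_pattern, html_text)

print(no_flag)
print(multiline)
print(found)
```

The first two calls return empty lists and the third returns both file names, which suggests the "$" anchor, not re.findall itself, was the main obstacle.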

Demo on Regex101: https://regex101.com/r/zzFpKJ/3
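If the real data may contain file names with digits or underscores, or paths that merely pass through a "java" directory, a slightly stricter variant might be worth trying. This is only a sketch under those assumptions: the sample names below are made up, and strict_pattern is a suggested refinement, not part of the answer above.

```python
import re

# Hypothetical sample: these file names are invented for illustration.
sample = """<li>src/main/java/org/example/Utf8_Reader2.java</li>
<li>src/main/java/org/example/notes.txt</li>
<li>docs/javascript-guide.javadoc</li>
"""

# Pattern from the answer: the unescaped "." accepts any character before
# "java", and the character class has no digits or underscores.
answer_pattern = r'[a-zA-Z-.\/]+.java'

# Suggested variant (an assumption, not from the answer):
#   [\w./-]+  letters, digits, underscores, dots, slashes, hyphens
#   \.java    a literal dot followed by "java"
#   \b        a word boundary, so ".javadoc" is not accepted
strict_pattern = r'[\w./-]+\.java\b'

loose = re.findall(answer_pattern, sample)
strict = re.findall(strict_pattern, sample)

print(loose)   # picks up directory prefixes and misses the real file name
print(strict)  # ['src/main/java/org/example/Utf8_Reader2.java']
```

On this sample the answer's pattern returns fragments such as the "src/main/java" directory prefix, while the stricter variant returns only the one .java file. Whether the extra strictness is needed depends on what the real input looks like.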