Sina Sh - 2 months ago
Python Question

Extract multiple substrings from a file and list them in another place using python/shell

I've got a log file similar to below:

/* BUG: axiom too complex: SubClassOf(ObjectOneOf([NamedIndividual(http://www.sem.org/sina/onto/2015/7/TSB-GCL#t_Xi_xi)]),DataHasValue(DataProperty(http://www.code.org/onto/ont.owl#XoX_type),^^(periodic,http://www.mdos.org/1956/21/2-rdf-syntax-ns#PlainLiteral))) */
/* BUG: axiom too complex: SubClassOf(ObjectOneOf([NamedIndividual(http://www.sem.org/sina/onto/2015/7/TSB-GCL#t_Ziz)]),DataHasValue(DataProperty(http://www.co-ode.org/ontologies/ont.owl#YoY_type),^^(latency,http://www.w3.org/1956/01/11-rdf-syntax-ns#PlainLiteral))) */
....


I want to extract the fields t_Xi_xi, t_Ziz, XoX_type and YoY_type, and also the values after ^^( which in this case are periodic and latency.

Note: X and Y stand for different alphabetic values throughout the file (e.g. X="sina", Y="Boom", so t_Xi_xi ~ t_Sina_sina), so I guess a regex would be the better choice.

So the final result must be something like below:

t_Xi_xi XoX_type periodic
t_Ziz YoY_type latency


I've tried the regex below to extract them, hoping to then replace everything else in the file with " " using sed in the shell, but it didn't work.

([a-zA-Z]_[a-zA-Z]*_[a-zA-Z]*)|(\#[a-zA-Z]*_[a-zA-Z]*)|(\^\([a-zA-Z]*)+


Any kind of help is appreciated on how to do this in Python (or even shell itself).

Answer
$ awk -F'#|\\^\\^\\(' '{for (i=2; i<NF; i++) printf "%s%s", gensub(/[^[:alnum:]_].*/,"",1,$i), (i<(NF-1) ? OFS : ORS) }' file
t_Xi_xi XoX_type periodic
t_Ziz YoY_type latency

The above uses GNU awk for gensub(); with other awks you'd use sub() and a separate printf statement.
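
Since the question also asks about Python, here is a rough sketch of the same idea as the awk command: split each line on # or ^^(, drop the first and last pieces, and keep only the leading run of letters, digits and underscores from each remaining piece. The names extract_fields and extract_from_file are made up for illustration, and the sketch assumes every log line follows the layout of the two samples shown in the question.

```python
import re

# Same separators as the awk -F argument: a literal '#' or the sequence '^^('.
SEPARATORS = re.compile(r"#|\^\^\(")
# Leading run of letters, digits and underscores (what gensub() leaves behind).
LEADING_WORD = re.compile(r"[A-Za-z0-9_]*")


def extract_fields(line):
    """Return the wanted fields from one log line (empty list if none)."""
    parts = SEPARATORS.split(line)
    # Skip the text before the first separator and after the last one,
    # mirroring the awk loop over fields 2 .. NF-1.
    return [LEADING_WORD.match(part).group() for part in parts[1:-1]]


def extract_from_file(path):
    """Yield one space-joined result per matching line of the file at 'path'."""
    with open(path) as handle:
        for line in handle:
            fields = extract_fields(line)
            if fields:
                yield " ".join(fields)
```

Looping over extract_from_file("file") and printing each item should give one line per log entry, in the same field order as the awk version. Lines without any separators (such as the "...." placeholder) are skipped.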