mercury0114 - 17 days ago
Linux Question

What's a short way in Linux to extract one pattern string and another pattern string that appears later?

Suppose we have one line of text stored in a file:

// In the actual file this will be one line
{unrelated_text1,ID:13, unrelated_text2,TIMESTAMP:1476280500,unrelated_text3},
{other_unrelated_text1,other_unrelated_text2,ID:25,TIMESTAMP:1476280600},
{ID:30,more_unrelated_text1,TIMESTAMP:1476280700},
{ID:40,final_unrelated_text}


For this particular input, I want to extract 3 entries:

// The details, such as whether to put a { character in front or not, do not matter.
// Any form of output which extracts only these 3 entries and groups them in a
// visually nice way will do the job.
{ID:13, TIMESTAMP:1476280500}
{ID:25, TIMESTAMP:1476280600}
{ID:30, TIMESTAMP:1476280700}
// I do not want the last entry, because it does not contain a TIMESTAMP field.


So far the closest command I found is

grep -Po '\{[^{}]*ID:[0-9]+[^{}]*\}' input_file


which gives the output

{unrelated_text1,ID:13, unrelated_text2,TIMESTAMP:1476280500,unrelated_text3}
{other_unrelated_text1,other_unrelated_text2,ID:25,TIMESTAMP:1476280600}
{ID:30,more_unrelated_text1,TIMESTAMP:1476280700}
{ID:40,final_unrelated_text}


The next improvement I am looking for is how to remove the unrelated_text parts from each entry and also drop the last entry.

Question: what's the shortest way to do that in Linux?

Answer

With GNU awk for multi-char RS and RT and word boundaries:

$ awk -v RS='\\<(ID|TIMESTAMP):[0-9]+' 'NR%2{id=RT;next} RT{printf "{%s, %s}\n", id, RT}' file
{ID:13, TIMESTAMP:1476280500}
{ID:25, TIMESTAMP:1476280600}
{ID:30, TIMESTAMP:1476280700}

The above works whether the input is on one line or spread across several lines, and regardless of what other text the file contains. It relies only on each ID appearing before its related TIMESTAMP, and that is not hard to change if necessary.
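One caveat about the awk command: the NR%2 test assumes that ID and TIMESTAMP matches strictly alternate. An entry with an ID but no TIMESTAMP is harmless at the very end of the input, as in the sample, but in the middle of the input it would shift the pairing for everything after it. If that can happen, or if GNU awk is not available, a rough alternative is to split the input into {...} entries first and then keep only the entries that contain both fields. The sketch below does that with grep -o and sed. The function name extract_pairs is made up for illustration, and the sketch assumes that entries are not nested, that ID comes before TIMESTAMP inside an entry, and that no other field name ends in "ID:".

```shell
#!/bin/sh
# Sketch: pair each ID with the TIMESTAMP from the same {...} entry.
# extract_pairs is a hypothetical helper name, not part of the original answer.
extract_pairs() {
    # Step 1: grep -o prints every brace-delimited entry on its own line.
    # Step 2: sed -n with the p flag prints only the entries in which an ID
    #         is followed by a TIMESTAMP, rewritten as "{ID:n, TIMESTAMP:n}".
    grep -o '{[^{}]*}' |
        sed -n 's/.*\(ID:[0-9][0-9]*\).*\(TIMESTAMP:[0-9][0-9]*\).*/{\1, \2}/p'
}

# Sample input taken from the question, all on one line.
sample='{unrelated_text1,ID:13, unrelated_text2,TIMESTAMP:1476280500,unrelated_text3},{other_unrelated_text1,other_unrelated_text2,ID:25,TIMESTAMP:1476280600},{ID:30,more_unrelated_text1,TIMESTAMP:1476280700},{ID:40,final_unrelated_text}'

printf '%s\n' "$sample" | extract_pairs
```

For the sample input this should print the same three {ID, TIMESTAMP} lines as the awk command. With a real file, extract_pairs < input_file does the same job. Because each entry is matched independently, an entry that lacks a TIMESTAMP is simply skipped wherever it occurs.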