dimid - 6 months ago
Bash Question

Grep with conditional context

I'd like to grep a file for a regex MR (main) and get all consecutive preceding lines that match a regex BR (before), and all consecutive following lines that match a regex AR (after).

i.e., something like this:

grep -B [BR] -A [AR] [MR] file


e.g. for the following segment (taken from the CHILDES project):

8|10|SUBJ 9|10|AUX 10|6|ROOT 11|10|PUNCT
*CHI: here .
%mor: adv|here .
%gra: 1|0|INCROOT 2|1|PUNCT
*URS: ask her (.) okay ?
%mor: v|ask pro:poss:det|her adj|okay ?
%gra: 1|0|ROOT 2|3|MOD 3|1|OBJ 4|1|PUNCT
*URS: ask her what she can eat .
%mor: v|ask pro:obj|her pro:wh|what pro:sub|she mod|can v|eat .
%gra: 1|0|ROOT 2|1|OBJ 3|6|LINK 4|6|SUBJ 5|6|AUX 6|1|COMP 7|1|PUNCT
*URS: but what is it ?
%mor: conj|but pro:wh|what aux|be&3S pro|it ?
%gra: 1|3|LINK 2|3|OBJ 3|0|ROOT 4|3|OBJ 5|3|PUNCT
*CHI: it's peaches and pears .


The query

grep -B '^*' -A '^%' '^%mor:\s+v' file


will return

*URS: ask her (.) okay ?
%mor: v|ask pro:poss:det|her adj|okay ?
%gra: 1|0|ROOT 2|3|MOD 3|1|OBJ 4|1|PUNCT
*URS: ask her what she can eat .
%mor: v|ask pro:obj|her pro:wh|what pro:sub|she mod|can v|eat .
%gra: 1|0|ROOT 2|1|OBJ 3|6|LINK 4|6|SUBJ 5|6|AUX 6|1|COMP 7|1|PUNCT


In other words, I'm looking for all utterances (lines starting with *) that begin with a verb, and each utterance should be followed by its dependent tiers (lines starting with %). Feel free to suggest other command-line tools instead of grep (e.g. awk).

Answer

You can use awk:

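awk -v br='^\\*' -v ar='^%' -v mr='^%mor:[[:blank:]]+v' '
# BR line: start a new record with this line
$0 ~ br {
   data = $0
}
# MR line: append it to the record and set the print flag
$0 ~ mr {
   data = data RS $0
   p = 1
   next
}
# AR line: print the record plus this line if the flag is set, then reset
$0 ~ ar {
   if (p)
      print data RS $0
   p = 0
   data = ""
}' file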

For the sample input, this prints:

*URS:   ask her (.) okay ?
%mor:   v|ask pro:poss:det|her adj|okay ?
%gra:   1|0|ROOT 2|3|MOD 3|1|OBJ 4|1|PUNCT
*URS:   ask her what she can eat .
%mor:   v|ask pro:obj|her pro:wh|what pro:sub|she mod|can v|eat .
%gra:   1|0|ROOT 2|1|OBJ 3|6|LINK 4|6|SUBJ 5|6|AUX 6|1|COMP 7|1|PUNCT

This awk script works as follows:

  • When a line matches br, it initializes the variable data with that line, i.e. data = $0.
  • When a line matches mr, it appends that line to data, sets the flag p = 1, and skips to the next input line (next) so that the ar block, which a %mor line would also match, does not run on it.
  • When a line matches ar, it prints data followed by the current line if the flag is set, then resets p and data.
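
The script above keeps only the most recent br line and prints only the first ar line after a match, which is enough for the sample. If runs of several consecutive before/after lines are needed, as the question phrases it, a minimal sketch could look like the one below. The helper name ctxgrep, the temporary sample file, and the variables before and after are illustrative assumptions, not part of the original answer. The patterns are written as '^[*]' and '[ \t]+' instead of '^\\*' and '[[:blank:]]+' only to sidestep escaping and character-class support differences between awk implementations.

```shell
#!/bin/sh
# Hypothetical sample file; the lines are copied from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
*CHI: here .
%mor: adv|here .
%gra: 1|0|INCROOT 2|1|PUNCT
*URS: ask her (.) okay ?
%mor: v|ask pro:poss:det|her adj|okay ?
%gra: 1|0|ROOT 2|3|MOD 3|1|OBJ 4|1|PUNCT
*URS: ask her what she can eat .
%mor: v|ask pro:obj|her pro:wh|what pro:sub|she mod|can v|eat .
%gra: 1|0|ROOT 2|1|OBJ 3|6|LINK 4|6|SUBJ 5|6|AUX 6|1|COMP 7|1|PUNCT
*URS: but what is it ?
%mor: conj|but pro:wh|what aux|be&3S pro|it ?
%gra: 1|3|LINK 2|3|OBJ 3|0|ROOT 4|3|OBJ 5|3|PUNCT
EOF

# ctxgrep is a made-up helper name, not an existing tool.
# Usage: ctxgrep BR AR MR FILE
ctxgrep() {
  awk -v br="$1" -v ar="$2" -v mr="$3" '
    # Main match: flush the buffered "before" lines, print this line,
    # and start the run of "after" lines.
    $0 ~ mr { printf "%s", before; print; before = ""; after = 1; next }
    # Still inside the run of "after" lines that follows a main match.
    after && $0 ~ ar { print; next }
    # Any other line ends the "after" run.
    { after = 0 }
    # Buffer consecutive "before" lines; a non-matching line clears the buffer.
    $0 ~ br { before = before $0 ORS; next }
    { before = "" }
  ' "$4"
}

result=$(ctxgrep '^[*]' '^%' '^%mor:[ \t]+v' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

On the question's sample this should print the same six lines as the script above. One design choice to be aware of: a line consumed as an "after" line is not also buffered as a "before" line for a later match, so patterns where br and ar overlap may need a different policy.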