
Thursday, October 11, 2012

do a grep on a column

Consider the following data:
$ cat data.txt
1,fruit,apple red,spherical
2,fruit,apple green,spherical
3,vegetable,peppers green,irregular
4,vegetable,peppers yellow,irregular
5,vegetable,peppers red,irregular
6,vegetable,broccoli,irregular and green
7,plant,green spinach,leaves
8,plant,very green spinach,leaves
9,plant,verygreenspinach,leaves
10,seed,green pea,spherical
11,unknown,green,undefined
The problem is to keep only the lines where the third field ends in the word green, with a space before it. So the output should be:
2,fruit,apple green,spherical
3,vegetable,peppers green,irregular
Short answer:- use awk with regular expression support
$ awk -F"," '{if ($3 ~ /\sgreen$/) print $0}' data.txt
2,fruit,apple green,spherical
3,vegetable,peppers green,irregular
Long answer:-
Naive use of grep gives a lot of false positives.
$ grep green data.txt
2,fruit,apple green,spherical
3,vegetable,peppers green,irregular
6,vegetable,broccoli,irregular and green
7,plant,green spinach,leaves
8,plant,very green spinach,leaves
9,plant,verygreenspinach,leaves
10,seed,green pea,spherical
11,unknown,green,undefined
Line 6 should not be printed, as the word green appears in the 4th field (and not the 3rd).

Lines 7, 8 and 10 have the word green in the 3rd field, but they should not be printed since green does not appear at the end of the field.

Line 9: the letters green are present in the third field, but they are not preceded by a space (nor do they end the field), so the line should not be printed.

Line 11 is most likely a data error. The third field is the word green on its own, not associated with any fruit, vegetable, plant, etc. It should not be printed either, since there is no space before green.
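For comparison, grep itself can be pinned to the third column by spelling out the two fields in front of it. This is only a sketch, assuming the fields never contain embedded or quoted commas (true for the sample data, not for CSV in general). The variable name col3_green is just for illustration.

```shell
# Recreate the sample file so the sketch is self-contained.
cat > data.txt <<'EOF'
1,fruit,apple red,spherical
2,fruit,apple green,spherical
3,vegetable,peppers green,irregular
4,vegetable,peppers yellow,irregular
5,vegetable,peppers red,irregular
6,vegetable,broccoli,irregular and green
7,plant,green spinach,leaves
8,plant,very green spinach,leaves
9,plant,verygreenspinach,leaves
10,seed,green pea,spherical
11,unknown,green,undefined
EOF

# [^,]* matches the contents of one field (anything but a comma).
# Two of them, each followed by a comma, skip columns 1 and 2.
# The third [^,]* consumes the start of column 3; " green" must then be
# followed by a comma or by the end of the line.
col3_green=$(grep -E '^[^,]*,[^,]*,[^,]* green(,|$)' data.txt)
printf '%s\n' "$col3_green"
```

This works, but the pattern has to grow with the column number, which is why awk with a field separator is easier to read.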


Further, to print all the spherical objects, one can use
$ awk -F"," '{if ($4=="spherical") print $0}' data.txt
1,fruit,apple red,spherical
2,fruit,apple green,spherical
10,seed,green pea,spherical
Here a full match on the 4th field is performed. However, this trick cannot be extended to the present problem, as only a partial match on the third field is desired.
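The difference between a full match (==) and a partial match (~) can be seen by applying both to the third field of the same data. This is a small sketch that prints only the first field (the line id). The variable names exact_ids and partial_ids are made up for the example.

```shell
# Recreate the sample file so the sketch is self-contained.
cat > data.txt <<'EOF'
1,fruit,apple red,spherical
2,fruit,apple green,spherical
3,vegetable,peppers green,irregular
4,vegetable,peppers yellow,irregular
5,vegetable,peppers red,irregular
6,vegetable,broccoli,irregular and green
7,plant,green spinach,leaves
8,plant,very green spinach,leaves
9,plant,verygreenspinach,leaves
10,seed,green pea,spherical
11,unknown,green,undefined
EOF

# == compares the whole field with the string: only a field that is
# exactly "green" qualifies.
exact_ids=$(awk -F, '$3 == "green" {print $1}' data.txt)

# ~ succeeds if the regular expression matches anywhere in the field.
partial_ids=$(awk -F, '$3 ~ /green/ {print $1}' data.txt)

printf 'full match on field 3:\n%s\n' "$exact_ids"
printf 'partial match on field 3:\n%s\n' "$partial_ids"
```

The full match finds only line 11, while the unanchored partial match finds every line with the letters green anywhere in the third field. Neither is the desired result, hence the anchored regular expression.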

How the solution works:-
$ awk -F"," '{if ($3 ~ /\sgreen$/) print $0}' data.txt 
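~        tests whether the field matches the regular expression
/ ... /  delimiters of the regular expression
\s       matches a single whitespace character
$        anchors the match at the end of the string being tested (here, the end of field 3)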
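One caveat: \s is a GNU awk extension, so the command above may behave differently with other awk implementations. The following is a sketch of a more portable variant, assuming that a space or a tab is the only whitespace that matters here. It also uses the shorter pattern-only form, where awk prints the line by default when the pattern matches. The variable name portable is just for illustration.

```shell
# Recreate the sample file so the sketch is self-contained.
cat > data.txt <<'EOF'
1,fruit,apple red,spherical
2,fruit,apple green,spherical
3,vegetable,peppers green,irregular
4,vegetable,peppers yellow,irregular
5,vegetable,peppers red,irregular
6,vegetable,broccoli,irregular and green
7,plant,green spinach,leaves
8,plant,very green spinach,leaves
9,plant,verygreenspinach,leaves
10,seed,green pea,spherical
11,unknown,green,undefined
EOF

# [ \t] is a bracket expression matching one space or one tab,
# used here in place of the gawk-specific \s.
# With no action given, awk prints each line for which the pattern is true.
portable=$(awk -F, '$3 ~ /[ \t]green$/' data.txt)
printf '%s\n' "$portable"
```

On the sample data this selects the same two lines (2 and 3) as the original command.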

Tested on Debian Wheezy using
$ awk --version
GNU Awk 4.0.1
Copyright (C) 1989, 1991-2012 Free Software Foundation.
