Most text processing can be done with awk and sed. Sed is a non-interactive stream editor: you specify all editing instructions up front and it applies them in a single pass through the file. Awk is a pattern-matching programming language.
Using sed and awk requires some understanding of regular expressions. Here are the basics of the regular expression metacharacters:
| sign | operation |
| --- | --- |
| `^` | matches the beginning of the line |
| `$` | matches the end of the line |
| `.` | matches any single character (wildcard) |
| `*` | repeats the previous token zero or more times |
| `+` | repeats the previous token one or more times |
| `?` | repeats the previous token zero or one time |
| `[…]` | matches any one of the characters enclosed in the brackets; `^` as the first character negates the match; `-` indicates a range of characters |
| `()` | groups regexes |
| `\|` | matches either the preceding or the following regex |
| `{n,m}` | matches a range of occurrences of the token that immediately precedes it: `{n}` matches exactly n occurrences, `{n,}` at least n, and `{n,m}` between n and m |
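The anchors, classes and quantifiers above can be tried out directly with `grep -E`; here is a quick check with made-up sample input:

```shell
# ^ and $ anchor the match to the whole line, [..] would be a class,
# and + repeats the previous token one or more times.
printf 'cat\ncart\ncaat\nconcat\n' | grep -E '^ca+t$'
# prints:
# cat
# caat
```

"cart" fails because `a+` is followed immediately by `t`, and "concat" fails the `^` anchor.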
Common expressions
| exp | interpretation |
| --- | --- |
| `[^0-9]` | matches any character that is not a digit |
| `[15]00*` | matches "10", "50", "100", "500", "1000", "5000", and so on. Here the first 0 is literal, the second is modified by `*`, see the table above |
| `.*` | any number (including zero) of any characters |
| `<.*>` | any HTML tag |
| ` book ` (with spaces) | matches "book" with a preceding and a following space |
| `books*` | matches "books" or "book", but not "book." or "book?" |
| `book.*` | matches "book" followed by any number of characters, including none |
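For example, the `[15]00*` pattern from the table can be verified with plain grep (anchors added here so whole lines must match; the input is made up):

```shell
# [15] matches a literal 1 or 5, the first 0 is literal,
# and 0* allows any number of further zeros.
printf '10\n50\n100\n500\n7\n' | grep '^[15]00*$'
# prints:
# 10
# 50
# 100
# 500
```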
Note that regular expressions come in several different flavors, which can be confusing and frustrating. There are DFA (Deterministic Finite Automaton) based engines and NFA (Non-Deterministic Finite Automaton) based engines:
- NFA-based engines can "go back" (backtrack) in the regex; they are used in Perl, Python, vim, sed and GNU grep.
- DFA-based engines cannot "go back" in the regex; they are used in awk and BSD grep.
| Standard | IEEE POSIX BRE | IEEE POSIX ERE | PCRE |
| --- | --- | --- | --- |
| Detail | Basic Regular Expressions | Extended Regular Expressions, which add repetition and alternation on top of BRE | Perl Compatible Regular Expressions |
| Engine | DFA | DFA | NFA |
| GNU grep | grep by default, or grep -G | egrep, grep -E | grep -P |
| BSD grep | grep | egrep | |
| GNU sed | sed | sed -r | NA. Just use perl |
| BSD sed | sed | sed -E | NA. Just use perl |
| awk | | awk | |
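One concrete consequence of the table: interval quantifiers are spelled differently in BRE and ERE. Both commands below are POSIX-specified, so they should behave the same under GNU and BSD grep:

```shell
# BRE: braces must be backslash-escaped.
printf 'foo\nfoooo\n' | grep 'fo\{3\}'     # prints: foooo
# ERE: braces are metacharacters as-is.
printf 'foo\nfoooo\n' | grep -E 'fo{3}'    # prints: foooo
```

"foo" has only two o's, so neither pattern matches it.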
The best way to check is to consult the BSD and GNU manuals.
Here are several examples of sed and awk I came across at work:
Find and remove duplicate lines:
awk '!x[$0]++' input_file.txt > output_file.txt
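A quick demonstration with made-up input: `x[$0]++` is zero (false) the first time a line is seen, so the `!` makes awk's default print action fire only on first occurrences:

```shell
printf 'a\nb\na\nc\nb\n' | awk '!x[$0]++'
# prints:
# a
# b
# c
```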
Trim leading and trailing whitespace on each line:
awk '{$1=$1}1' input_file.txt > output_file.txt
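Assigning `$1=$1` forces awk to rebuild `$0` with the default output separator, and the trailing `1` is a true pattern that triggers the default print. Note that this also squeezes internal runs of whitespace into single spaces:

```shell
printf '   hello    world   \n' | awk '{$1=$1}1'
# prints: hello world
```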
Print with multiple delimiters (;, comma, and |):
awk -F '[;,|]' '{print $1, $3, $5}'
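`-F` accepts a regular expression, so the bracket expression makes any of the three characters a field separator. With made-up input:

```shell
echo 'a;b,c|d;e' | awk -F '[;,|]' '{print $1, $3, $5}'
# prints: a c e
```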
Print with calculation between columns
awk '{res=$1-$2;print res,$0}'
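For instance, given two numeric columns, the computed difference is printed in front of the original line:

```shell
echo '10 3' | awk '{res=$1-$2;print res,$0}'
# prints: 7 10 3
```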
Print rows conditionally
awk '$1>20{print;}'
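Lines whose first field is not greater than 20 are silently skipped (sample input is made up):

```shell
printf '30 keep\n10 drop\n25 keep\n' | awk '$1>20{print;}'
# prints:
# 30 keep
# 25 keep
```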
Prefix each line of a file
awk '$0="PREFIX|"$0' input.txt > prefix.input.txt
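This works because the assignment to `$0` evaluates to the new, non-empty string, which counts as true, so every line is printed with the prefix:

```shell
printf 'one\ntwo\n' | awk '$0="PREFIX|"$0'
# prints:
# PREFIX|one
# PREFIX|two
```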
Replace the string original with new throughout a file, in place:
sed -i 's/original/new/g' file.txt
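One portability caveat: GNU sed accepts `-i` with no argument, while BSD/macOS sed requires an explicit (possibly empty) backup suffix, e.g. `sed -i '' 's/original/new/g' file.txt`. Without `-i` the result goes to stdout, which works everywhere:

```shell
echo 'original text, original idea' | sed 's/original/new/g'
# prints: new text, new idea
```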
Remove multiple patterns
sed 's/pattern1\|pattern2\|pattern3//g'
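The `\|` alternation is a GNU extension to BRE, so the command above is GNU-sed-only. A more portable spelling switches to ERE with `-E` (supported by both GNU and BSD sed); here with made-up patterns:

```shell
echo 'foo bar baz' | sed -E 's/foo|baz//g'
# prints " bar " (the surrounding spaces remain)
```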
Delete only the first match of the pattern on each line:
sed 's/pattern//'
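Without the `g` flag, sed replaces only the first match on each line:

```shell
echo 'foo foo foo' | sed 's/foo//'
# prints " foo foo" (first occurrence removed, the rest left alone)
```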
Remove blank lines:
sed '/^$/d' input_file.txt > output_file.txt
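The address `/^$/` selects lines where the start of the line is immediately followed by the end of the line, and `d` deletes them:

```shell
printf 'a\n\nb\n\n\nc\n' | sed '/^$/d'
# prints:
# a
# b
# c
```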
Copy lines 100 through 500 of the input file to the output file:
sed -n '100,500p' input.log > output.log
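A scaled-down version of the same idea: `-n` suppresses sed's default printing, and `p` prints only the addressed range:

```shell
printf '1\n2\n3\n4\n5\n' | sed -n '2,4p'
# prints:
# 2
# 3
# 4
```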
Merge every three lines (each N appends one more line to the pattern space, so two Ns gather three lines):
sed 'N;N;s/\n/ /g'
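Each `N` appends the next input line to the pattern space, so two `N`s accumulate three lines before the embedded newlines are replaced with spaces:

```shell
printf '1\n2\n3\n4\n5\n6\n' | sed 'N;N;s/\n/ /g'
# prints:
# 1 2 3
# 4 5 6
```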
