Linux Admin Basics 3 of 3 - text processing, regex, sed & awk

Most text processing can be done with awk and sed. Sed is a non-interactive stream editor that lets you specify all editing instructions in one place and execute them in a single pass through the file. Awk is a pattern-matching programming language.

Using sed and awk requires some understanding of regular expressions. Here are the basic regular expression signs:

sign	operation
^	matches beginning of line
$	matches end of line
.	matches any single character (wildcard)
*	repeat previous token zero, one or more times.
+	repeat previous token one or more times
?	repeat previous token zero or one time
[…]	matches any one of the characters enclosed between the brackets. ^ as the first character negates the match; - is used to indicate a range of characters
()	groups regex
|	either the preceding or the following regex can be matched
{n,m}	matches a range of occurrences of the token that immediately precedes it: {n} matches exactly n occurrences, {n,} matches at least n occurrences, and {n,m} matches between n and m occurrences
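
To see a few of these signs in action, here is a small sketch using grep -E (extended syntax, so +, ?, |, () and {n} need no backslashes). The sample words are made up for illustration.

```shell
# Sample input: one made-up word per line.
words=$(printf '%s\n' color colour colouur cat cot ct 2024 20)

# ? : "u" zero or one time -> color, colour
printf '%s\n' "$words" | grep -E '^colou?r$'

# + : "u" one or more times -> colour, colouur
printf '%s\n' "$words" | grep -E '^colou+r$'

# [...] : exactly one of a or o between c and t -> cat, cot
printf '%s\n' "$words" | grep -E '^c[ao]t$'

# {n} : exactly four digits -> 2024
printf '%s\n' "$words" | grep -E '^[0-9]{4}$'

# | and () : alternation inside a group -> cat, cot
printf '%s\n' "$words" | grep -E '^c(a|o)t$'
```

The ^ and $ anchors pin each pattern to the whole line, which is why "ct" and "20" never match.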

Common expressions

exp	interpretation
[^0-9]	matches any single character that is not a digit
[15]00*	matches “10”, “50”, “100”, “500”, “1000”, “5000”. Here the first 0 is literal, the second is modified by *, see the table above
.*	any number (including 0) of any character
<.*>	any HTML tag; because .* is greedy, on a line with several tags it matches from the first < to the last >
" book "	matches book with preceding and following spaces (the quotes only make the spaces visible)
" books* "	matches books or book, but not “book.”, “book?” etc.
" book.* "	matches book, followed by any number of characters or none, followed by a space
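
A quick way to check expressions like these is to feed them a few sample strings. The sketch below tries [15]00* and <.*> on made-up input; the [^>]* variant is one common way to match a single tag at a time.

```shell
# [15]00* : a 1 or 5, a literal 0, then zero or more further 0s.
# grep -x requires the whole line to match.
printf '%s\n' 10 50 100 5000 15 20 | grep -x '[15]00*'
# prints 10, 50, 100 and 5000 (15 and 20 do not match)

# <.*> is greedy: it runs from the first < to the last > on the line.
echo '<b>bold</b> text' | sed 's/<.*>/X/'
# prints: X text

# [^>]* stops at the first >, so each tag is matched on its own.
echo '<b>bold</b> text' | sed 's/<[^>]*>//g'
# prints: bold text
```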

Note that regular expressions come in several different flavours, which can be confusing and frustrating. Engines are either DFA (Deterministic Finite Automaton) based or NFA (Non-deterministic Finite Automaton) based:

NFA-based engines can “go back” (backtrack) in the regex; they are used in Perl, Python, vim, sed and GNU grep.
DFA-based engines cannot “go back” in the regex; they are used in awk and BSD grep.

Standard	IEEE POSIX BRE	IEEE POSIX ERE	PCRE
Detail	Basic Regular Expressions	Extended Regular Expressions, which add repetition and alternation on top of BRE	Perl Compatible Regular Expressions
Engine	DFA	DFA	NFA
GNU grep	grep by default, or grep -G	egrep or grep -E	grep -P
BSD grep	grep	egrep
GNU sed	sed	sed -r	N/A; use perl instead
BSD sed	sed	sed -E	N/A; use perl instead
awk		awk

The best way to check is to consult the BSD and GNU manuals.
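
As a rough illustration of the BRE/ERE difference, the sketch below runs the same ideas through both syntaxes. It assumes a grep and sed that accept -E, which current GNU and BSD versions do.

```shell
# In BRE (plain grep), + is a literal plus sign.
printf '%s\n' 'aaa' 'a+' | grep 'a+'        # matches only the line "a+"

# In ERE (grep -E), + means "one or more of the previous token".
printf '%s\n' 'aaa' 'a+' | grep -E '^a+$'   # matches only the line "aaa"

# Groups are \( \) in BRE and ( ) in ERE; both commands swap the words.
echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/'
echo 'hello world' | sed -E 's/(hello) (world)/\2 \1/'
# both print: world hello
```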

Here are several examples of sed and awk I came across at work:

Find and remove duplicate lines:

awk '!x[$0]++' input_file.txt > output_file.txt
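
The one-liner is terse, so here is a sketch of why it works, on made-up input. The array name seen in the second command is only an illustrative choice.

```shell
# x[$0] counts how often the current line has been seen so far.
# The post-increment returns the old count, so !x[$0]++ is true
# only the first time a line appears, and awk prints it by default.
printf '%s\n' apple banana apple cherry banana | awk '!x[$0]++'
# prints apple, banana, cherry (first occurrences, original order)

# The same logic spelled out step by step.
printf '%s\n' apple banana apple cherry banana |
    awk '{ if (seen[$0] == 0) print; seen[$0]++ }'
```

Unlike sort -u, this keeps the original line order, at the cost of holding every distinct line in memory.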

Remove white space at the beginning and end of each line (runs of white space between fields are also squeezed to a single space):

awk '{$1=$1}1' input_file.txt > output_file.txt
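
A small sketch of what happens here, on a made-up line. The sed variant is an alternative for when the spacing inside the line has to be preserved.

```shell
# Assigning to a field makes awk rebuild $0 from its fields,
# joined by the output separator (a single space by default).
printf '   hello    world   \n' | awk '{$1=$1}1'
# prints "hello world": ends trimmed, inner spaces squeezed

# Trim only the ends and keep the inner spacing untouched.
printf '   hello    world   \n' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
# prints "hello    world"
```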

Print fields from input with multiple delimiters (;, , and |):

awk -F '[;,|]' '{print $1, $3, $5}'
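
For example, on a made-up record that uses all three delimiters (inside the brackets, | is a literal character, not alternation):

```shell
# Fields are a, b, c, d, e; print the 1st, 3rd and 5th.
echo 'a;b,c|d;e' | awk -F '[;,|]' '{print $1, $3, $5}'
# prints: a c e

# Setting OFS changes the separator used by print, here to a tab.
echo 'a;b,c|d;e' | awk -F '[;,|]' -v OFS='\t' '{print $1, $3, $5}'
```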

Print with calculation between columns

awk '{res=$1-$2;print res,$0}'

Print rows conditionally

awk '$1>20{print;}'
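
Both the column calculation and the row filter can be tried on a few made-up rows. A bare condition with no action prints the matching line, so the {print;} part is optional.

```shell
# Made-up two-column data.
data=$(printf '%s\n' '30 10' '15 5' '25 40')

# Prepend the difference of columns 1 and 2 to every row.
printf '%s\n' "$data" | awk '{res=$1-$2; print res, $0}'
# prints: "20 30 10", "10 15 5", "-15 25 40"

# Keep only rows whose first column is greater than 20.
printf '%s\n' "$data" | awk '$1>20'
# prints: "30 10", "25 40"
```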

Prefix each line of a file

awk '$0="PREFIX|"$0' input.txt > prefix.input.txt

Replace the string original with new in a file

sed -i 's/original/new/g' file.txt
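
One caveat: BSD sed requires an argument after -i, so the bare -i form is GNU-specific. A sketch that should work with both, and keeps a backup, attaches a suffix directly to -i. The scratch directory and file content are made up for the demo.

```shell
tmp=$(mktemp -d)                      # scratch directory for the demo
printf 'original text, original plan\n' > "$tmp/file.txt"

# -i.bak edits in place and keeps the old content in file.txt.bak.
sed -i.bak 's/original/new/g' "$tmp/file.txt"

cat "$tmp/file.txt"        # new text, new plan
cat "$tmp/file.txt.bak"    # original text, original plan
```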

Remove multiple patterns

sed 's/pattern1\|pattern2\|pattern3//g'
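
The \| alternation in basic regex is a GNU extension. As a sketch, the same removal can be written with -E and a plain |, or with one -e expression per pattern. The strings below are made up.

```shell
# Extended regex: plain | separates the alternatives.
echo 'foo-bar-baz-qux' | sed -E 's/foo-|baz-//g'
# prints: bar-qux

# Equivalent here: one substitution per pattern, applied in order.
echo 'foo-bar-baz-qux' | sed -e 's/foo-//g' -e 's/baz-//g'
# prints: bar-qux
```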

Delete only the first match on each line

sed 's/pattern//'
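
Without a flag, the substitution applies once per line, not once per file. A short sketch on made-up lines shows the difference between no flag, a numeric flag and g:

```shell
# No flag: first match on each line.
printf '%s\n' 'a.a.a' 'b.a' | sed 's/a//'     # prints .a.a and b.

# Numeric flag: only the 2nd match on each line (if there is one).
printf '%s\n' 'a.a.a' 'b.a' | sed 's/a//2'    # prints a..a and b.a

# g flag: every match on each line.
printf '%s\n' 'a.a.a' 'b.a' | sed 's/a//g'    # prints .. and b.
```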

Remove blank lines:

sed '/^$/d' input_file.txt > output_file.txt
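
Note that /^$/ only matches lines that are completely empty. A sketch on a made-up file, writing to a new file rather than editing in place, shows how to also drop lines that contain only spaces or tabs:

```shell
tmp=$(mktemp -d)                      # scratch directory for the demo
printf 'one\n\n   \ntwo\n' > "$tmp/in.txt"

# /^$/ removes the empty line; the line of spaces survives.
sed '/^$/d' "$tmp/in.txt" > "$tmp/empty_removed.txt"

# [[:space:]]* also catches lines holding only spaces or tabs.
sed '/^[[:space:]]*$/d' "$tmp/in.txt" > "$tmp/blank_removed.txt"

wc -l < "$tmp/empty_removed.txt"   # 3 lines left
wc -l < "$tmp/blank_removed.txt"   # 2 lines left
```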

Copy from line 100 to line 500 of input file to output file

sed -n '100,500p' input.log > output.log
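
On a large log, sed keeps reading to the end of the file even after line 500. A possible refinement is to quit right after the last wanted line; the sketch below uses a made-up 1000-line file to check the result.

```shell
tmp=$(mktemp -d)                      # scratch directory for the demo
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$tmp/input.log"

# Print lines 100..500; 500q stops reading once line 500 is printed.
sed -n '100,500p;500q' "$tmp/input.log" > "$tmp/output.log"

head -n 1 "$tmp/output.log"   # 100
tail -n 1 "$tmp/output.log"   # 500
```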

Merge every three lines:

sed 'N;N;s/\n/ /g'
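
The number of N commands matters: the pattern space already holds one line, and each N appends one more, so two N commands make three lines. A sketch on made-up input, with paste as an alternative:

```shell
# Line 1 is in the pattern space; N;N appends lines 2 and 3,
# then the embedded newlines are replaced by spaces.
printf '%s\n' 1 2 3 4 5 6 | sed 'N;N;s/\n/ /g'
# prints "1 2 3" and "4 5 6"

# paste does the same: one "-" per input line to merge.
printf '%s\n' 1 2 3 4 5 6 | paste -d ' ' - - -
```

The input here has a multiple of three lines; how sed treats a leftover partial group at the end of the file differs between implementations.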
