Linux Admin Basics 3 of 3 – text processing, regex, sed & awk

Most text processing tasks can be handled with awk and sed. sed is a non-interactive stream editor that lets you specify all editing instructions in one place and execute them in a single pass through the file. awk is a pattern-matching programming language.

Using sed and awk requires some understanding of regular expressions. Here are the basic regular expression metacharacters:

^        matches the beginning of a line
$        matches the end of a line
.        matches any single character (wildcard)
*        repeats the previous token zero or more times
+        repeats the previous token one or more times
?        repeats the previous token zero or one time
[...]    matches any one of the characters enclosed between the brackets
           ^ as the first character inside the brackets negates the match
           - indicates a range of characters
()       groups a regex
|        matches either the preceding or the following regex
{n,m}    matches a range of occurrences of the single character (or group) that immediately precedes it
           {n} matches exactly n occurrences
           {n,} matches at least n occurrences
           {n,m} matches any number of occurrences between n and m
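
To see a few of these metacharacters in action, here is a small sketch using grep -E (extended syntax, so +, ?, | and {n} need no backslashes). The word list and the helper name words are made up for illustration.

```shell
# Hypothetical sample data: one word per line.
words() {
  printf '%s\n' cat cart caat ct dog 'cat!'
}

# ^ and $ anchor the match to the whole line.
words | grep -E '^cat$'             # cat
# ? makes the previous token optional.
words | grep -E '^car?t$'           # cat, cart
# {2} requires exactly two of the previous token.
words | grep -E '^ca{2}t$'          # caat
# | is alternation inside a group; [^a] is a negated class.
words | grep -E '^(dog|c[^a]*t)$'   # ct, dog
```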

Common expressions

[^0-9]      matches any character that is not a digit
[15]00*     matches “10”, “50”, “100”, “500”, “1000”, “5000”. Here the first 0 is literal and the second is modified by * (see the table above)
.*          matches any number (including zero) of any character
<.*>        matches HTML tags (note that .* is greedy, so on one line the match runs from the first < to the last >)
" book "    matches book with a preceding and a following space
" books* "  matches book or books surrounded by spaces, but not “book.”, “book?”, etc.
" book.* "  matches book preceded by a space and followed by any number of characters (or none), then a space
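
Two of these are easy to try on the command line. The following is only a sketch with invented input: it checks which numbers [15]00* accepts, and shows how greedy <.*> behaves on a line that contains two tags.

```shell
# [15]00* : a 1 or a 5, one literal 0, then zero or more further 0s.
# -x makes grep match the whole line, so 15 and 1001 are rejected.
printf '%s\n' 10 50 100 5000 15 1001 | grep -x '[15]00*'
# prints 10, 50, 100 and 5000

# <.*> is greedy: it matches from the first < to the last > on the line,
# so the text between the two tags is removed as well.
echo '<b>bold</b> text' | sed 's/<.*>//g'       # prints " text"

# <[^>]*> stops at the first >, so each tag is matched on its own.
echo '<b>bold</b> text' | sed 's/<[^>]*>//g'    # prints "bold text"
```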

Note that regular expressions come in several different flavours, which can be confusing and frustrating. This is a good summary. There are DFA (Deterministic Finite Automaton) based engines and NFA (Non-deterministic Finite Automaton) based engines:

  • NFA-based engines can “go back” (backtrack) in the regex; they are used in Perl, Python, vim, sed and GNU grep.
  • DFA-based engines cannot “go back” in the regex; they are used in awk and BSD grep.

BRE = Basic Regular Expressions; ERE = Extended Regular Expressions, which add repetition and alternation on top of BRE; PCRE = Perl Compatible Regular Expressions.

Detail   | BRE                        | ERE               | PCRE
GNU grep | grep (default), or grep -G | egrep, or grep -E | grep -P
BSD grep | grep                       | egrep             |
GNU sed  | sed                        | sed -r            | NA. Just use perl
BSD sed  | sed                        | sed -E            | NA. Just use perl

The best way to check is in the BSD and GNU manuals.
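
A quick way to feel the difference between the basic and extended flavours is to feed the same pattern to both. This is a small sketch with made-up input; the helper name sample is hypothetical, and sed -E assumes BSD sed or a reasonably recent GNU sed.

```shell
# Hypothetical sample: a run of a's, a literal "a+", and a line without a.
sample() { printf '%s\n' 'aaa' 'a+' 'b'; }

# Basic syntax: + is an ordinary character, so only the literal "a+" matches.
sample | grep 'a+'           # a+

# Extended syntax: + means "one or more", so both lines containing an a match.
sample | grep -E 'a+'        # aaa, a+

# The same applies to sed: with -E the + is a quantifier.
sample | sed -E 's/a+/X/'    # X, X+, b
```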

Here are several examples of sed and awk I came across at work:

Find and remove duplicate lines:

awk '!x[$0]++' input_file.txt > output_file.txt
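
How the one-liner works, sketched on invented input: x is an array indexed by the whole line, and the post-increment returns the old count, so the pattern is true only the first time a line shows up.

```shell
# x[$0]++ evaluates to 0 (false) the first time a line is seen and to a
# positive count afterwards, so !x[$0]++ is true only on first sight.
# A true pattern with no action prints the line.
printf '%s\n' apple pear apple fig pear | awk '!x[$0]++'
# prints apple, pear, fig (first occurrences, original order kept)
```

Unlike sort -u, this keeps the original line order, at the cost of holding every distinct line in memory.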

Remove whitespace at the beginning and end of each line (this also squeezes runs of whitespace inside the line into a single space):

awk '{$1=$1}1' input_file.txt > output_file.txt
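
One thing to be aware of, shown here on a made-up line: assigning to a field makes awk rebuild the record, so whitespace between fields is normalised too. If the inner spacing matters, a sed variant along these lines may be a better fit.

```shell
# Assigning to $1 forces awk to rebuild $0 from its fields, joined by a
# single space, so inner runs of blanks collapse as well.
printf '  a   b  \n' | awk '{$1=$1}1'                              # prints "a b"

# sed alternative that strips only leading and trailing whitespace.
printf '  a   b  \n' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'  # prints "a   b"
```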

Print with multiple delimiters (;, , and |):

awk -F '[;,|]' '{print $1, $3, $5}'
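
A short sketch on an invented record: the argument to -F is a regex, so a bracket expression lets any of the three characters split fields. The OFS setting in the second command is an addition for illustration, not part of the original recipe.

```shell
# Inside [...] the | is a literal character, not alternation.
printf 'a;b,c|d;e\n' | awk -F '[;,|]' '{print $1, $3, $5}'
# prints "a c e"

# The comma in print inserts OFS (a space by default); set it to change
# the output delimiter, here to a tab.
printf 'a;b,c|d;e\n' | awk -F '[;,|]' -v OFS='\t' '{print $1, $3, $5}'
```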

Print with calculation between columns

awk '{res=$1-$2;print res,$0}'
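
For example, with two made-up rows of numbers, the difference of the first two columns is printed in front of each original line:

```shell
# res holds column 1 minus column 2; $0 is the unmodified input line.
printf '10 3\n5 8\n' | awk '{res=$1-$2; print res, $0}'
# prints:
# 7 10 3
# -3 5 8
```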

Print rows conditionally

awk '$1>20{print;}'
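
A caveat worth knowing, sketched on invented data with a header row: when the first field does not look like a number, awk falls back to a string comparison, so a text header can slip through. Adding 0 forces a numeric comparison.

```shell
# "count" is not numeric, so it is compared as a string against "20",
# and letters sort after digits: the header line is printed too.
printf 'count\n5\n25\n' | awk '$1>20'      # count, 25

# $1+0 converts the field to a number first ("count" becomes 0).
printf 'count\n5\n25\n' | awk '$1+0>20'    # 25
```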

Prefix each line of a file

awk '$0="PREFIX|"$0' input.txt > prefix.input.txt
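
The one-liner relies on the assignment itself acting as the pattern: the new value of $0 is a non-empty string, which counts as true, so every line is printed. A sketch on made-up input, next to two more explicit spellings that should give the same result:

```shell
# Assignment used as a pattern: always true because the result is non-empty.
printf 'one\ntwo\n' | awk '$0="PREFIX|"$0'

# The same thing written with an explicit action.
printf 'one\ntwo\n' | awk '{print "PREFIX|" $0}'

# And with sed: insert the prefix at the start of each line.
printf 'one\ntwo\n' | sed 's/^/PREFIX|/'
# each command prints:
# PREFIX|one
# PREFIX|two
```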

Replace the string original with new in a file (GNU sed syntax; BSD sed needs an argument after -i, e.g. -i ''):

sed -i 's/original/new/g' file.txt
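
Since in-place editing overwrites the file, it can be worth keeping a backup. A minimal sketch on a throwaway file (created with mktemp, contents made up); the .bak suffix is an arbitrary choice:

```shell
# Work on a temporary file so nothing real is modified.
f=$(mktemp)
printf 'original text, original idea\n' > "$f"

# -i.bak edits in place and keeps the untouched file as "$f.bak".
# Attaching the suffix directly to -i is accepted by GNU and BSD sed alike.
sed -i.bak 's/original/new/g' "$f"

result=$(cat "$f")        # new text, new idea
backup=$(cat "$f.bak")    # original text, original idea
echo "$result"

# Clean up the temporary files.
rm -f "$f" "$f.bak"
```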

Remove multiple patterns (\| is a GNU extension to basic regular expressions):

sed 's/pattern1\|pattern2\|pattern3//g'
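
If the script has to run on other sed implementations too, two alternatives are sketched here on an invented string, with foo, bar and baz standing in for the three patterns: extended syntax with a plain |, or one substitution per pattern.

```shell
# Extended regex: | is alternation without a backslash.
echo 'xx-foo-bar-baz-yy' | sed -E 's/foo|bar|baz//g'
# prints "xx----yy"

# Fully portable: one s command per pattern.
echo 'xx-foo-bar-baz-yy' | sed -e 's/foo//g' -e 's/bar//g' -e 's/baz//g'
# prints "xx----yy"
```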

Delete only the first match on each line:

sed 's/pattern//'
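
Without a flag, s works on the first match of every line, not of the whole file. A small sketch on made-up input, comparing no flag, the g flag and a numeric flag:

```shell
# No flag: only the first "-" on each line is removed.
printf 'a-b-c\nd-e-f\n' | sed 's/-//'     # ab-c, de-f

# g flag: every "-" on each line is removed.
printf 'a-b-c\nd-e-f\n' | sed 's/-//g'    # abc, def

# Numeric flag: only the second "-" on each line is removed.
printf 'a-b-c\nd-e-f\n' | sed 's/-//2'    # a-bc, d-ef
```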

Remove blank lines:

sed '/^$/d' input_file.txt > output_file.txt
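
The /^$/ pattern only catches completely empty lines. A sketch, using made-up sample input, of how lines that contain nothing but spaces or tabs can be removed as well:

```shell
# Input: "a", an empty line, a line with two spaces, "b".
# /^$/ deletes the empty line but keeps the whitespace-only one.
printf 'a\n\n  \nb\n' | sed '/^$/d'

# /^[[:space:]]*$/ also deletes lines made up only of whitespace.
printf 'a\n\n  \nb\n' | sed '/^[[:space:]]*$/d'    # a, b
```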

Copy lines 100 to 500 of the input file to the output file:

sed -n '100,500p' input.log > output.log
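
On a large log, sed keeps reading to the end of the file even after the range has been printed. A scaled-down sketch (lines 3 to 5 of ten generated lines; the helper nums is hypothetical) that adds a q command to stop early, plus an awk equivalent:

```shell
# Stand-in for a log file: the numbers 1 to 10, one per line.
nums() { awk 'BEGIN { for (i = 1; i <= 10; i++) print i }'; }

# Print lines 3 to 5 only.
nums | sed -n '3,5p'           # 3, 4, 5

# Same output, but quit at line 5 instead of reading the rest.
nums | sed -n '3,5p;5q'        # 3, 4, 5

# awk equivalent using the record number NR.
nums | awk 'NR>=3 && NR<=5'    # 3, 4, 5
```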

Merge every three lines:

sed 'N;N;s/\n/ /g'
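
Each N appends one more input line to the line sed is already holding, so two N commands gather a group of three. A sketch on six made-up lines (the helper nums6 is hypothetical), together with a paste alternative:

```shell
# Stand-in input: six lines.
nums6() { printf '%s\n' 1 2 3 4 5 6; }

# Current line + two N commands = three lines in the pattern space;
# the s command then turns the embedded newlines into spaces.
nums6 | sed 'N;N;s/\n/ /g'
# prints:
# 1 2 3
# 4 5 6

# paste reads standard input three times per output line.
nums6 | paste -d ' ' - - -
# prints the same two lines
```

When the number of lines is not a multiple of three, sed implementations differ in how they treat the leftover lines, so paste may be the safer choice there.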