Mastering Regex in Linux: A Comprehensive Guide

Regular expressions (regex) in Linux are powerful tools used for pattern matching within text. They allow users to search, match, and manipulate text using specific patterns of characters. Regular expressions are integral to many commands like grep, sed, awk, and vi. Linux uses two primary types of regular expressions: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE), which differ slightly in their syntax and available operators.

In this detailed explanation, we’ll cover the concept of regular expressions, common regex commands, metacharacters, and practical examples.

Basic and Extended Regular Expressions

  • Basic Regular Expressions (BRE): This is the default syntax for tools such as grep and sed, and it has fewer unescaped operators. For example, grep uses BRE unless told otherwise.
  • Extended Regular Expressions (ERE): EREs make operators such as +, ?, and | available without escaping. grep -E uses ERE (the older egrep command is equivalent but deprecated in recent GNU grep releases), as do sed -E (or sed -r in GNU sed) and awk.

Although the two types of regular expressions differ, most of the same principles apply to both. The major difference lies in how metacharacters (special symbols used for pattern matching) are interpreted.
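To make the difference concrete, here is a minimal sketch that runs the same pattern through both syntaxes. The sample file contents are made up for illustration, and GNU grep is assumed.

```shell
# Hypothetical sample data: one line with a literal plus sign, one without.
sample=$(mktemp)
printf '%s\n' 'a+b' 'aaab' > "$sample"

# BRE (plain grep): an unescaped + is an ordinary character.
bre_result=$(grep 'a+b' "$sample")

# ERE (grep -E): + means "one or more of the preceding item".
ere_result=$(grep -E 'a+b' "$sample")

echo "BRE matched: $bre_result"   # a+b
echo "ERE matched: $ere_result"   # aaab
rm -f "$sample"
```

The pattern text is identical in both calls; only the interpretation of + changes.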

Metacharacters in Regular Expressions

Regular expressions are built using metacharacters. Metacharacters have special meanings and allow for complex pattern matching. Below is a breakdown of the most commonly used metacharacters:

1. Dot (.)

The dot matches any single character except for a newline (\n).

  • Example:
  grep 'h.t' file.txt

This will match “hat”, “hit”, “hot”, etc., in file.txt.
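A quick way to see exactly what the dot consumed is grep's -o option, which prints only the matched text (available in GNU and BSD grep, though not required by POSIX). The sample lines below are hypothetical.

```shell
# Hypothetical sample data for the dot metacharacter.
sample=$(mktemp)
printf '%s\n' 'hat hit hot' 'ht' 'h t' > "$sample"

# -o prints each match on its own line instead of the whole line.
dot_matches=$(grep -o 'h.t' "$sample")

echo "$dot_matches"
rm -f "$sample"
```

Note that “ht” is not matched because the dot needs exactly one character, while “h t” is matched because a space counts as a character.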

2. Caret (^)

The caret matches the start of a line.

  • Example:
  grep '^Error' file.txt

This will match any line that starts with “Error”.

3. Dollar Sign ($)

The dollar sign matches the end of a line.

  • Example:
  grep 'log$' file.txt

This will match lines that end with the characters “log”, including lines ending in longer words such as “syslog”.
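The two anchors can be tried side by side, and combined to pin a pattern to both ends of a line. This is a small sketch with invented sample lines.

```shell
# Hypothetical log-like sample data.
sample=$(mktemp)
printf '%s\n' 'Error: disk full' 'No Error here' 'see syslog' 'log rotated' > "$sample"

starts=$(grep '^Error' "$sample")        # only the line beginning with Error
ends=$(grep 'log$' "$sample")            # only the line whose last characters are log
both=$(grep -c '^Error.*full$' "$sample") # count of lines anchored at both ends

echo "starts: $starts"
echo "ends:   $ends"
echo "both:   $both"
rm -f "$sample"
```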

4. Asterisk (*)

The asterisk matches zero or more occurrences of the preceding character.

  • Example:
  grep 'go*gle' file.txt

This will match “ggle”, “gogle”, “google”, “gooogle”, etc.

5. Plus (+) (Extended Regex Only)

The plus matches one or more occurrences of the preceding character.

  • Example (with grep -E for extended regex):
  grep -E 'go+gle' file.txt

This will match “gogle”, “google”, “gooogle”, etc., but not “ggle”.

6. Question Mark (?) (Extended Regex Only)

The question mark matches zero or one occurrence of the preceding character. It makes the preceding character optional.

  • Example:
  grep -E 'colou?r' file.txt

This will match both “color” and “colour”.
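The three quantifiers *, +, and ? are easiest to tell apart when run against the same input. The following sketch counts matching lines for each one; the three sample words are assumptions chosen to show the differences.

```shell
# Hypothetical sample data: zero, one, and two o's between the g's.
sample=$(mktemp)
printf '%s\n' 'ggle' 'gogle' 'google' > "$sample"

star=$(grep -c 'go*gle' "$sample")      # zero or more o: all three lines
plus=$(grep -Ec 'go+gle' "$sample")     # one or more o: gogle, google
opt=$(grep -Ec 'go?gle' "$sample")      # zero or one o: ggle, gogle

echo "star=$star plus=$plus opt=$opt"   # star=3 plus=2 opt=2
rm -f "$sample"
```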

7. Square Brackets ([])

Square brackets match any one of the enclosed characters. You can also specify ranges of characters using hyphens.

  • Examples:
  • grep '[aeiou]' file.txt: Matches any line that contains a lowercase vowel.
  • grep '[0-9]' file.txt: Matches any line containing a digit.
  • grep '[a-z]' file.txt: Matches any line containing a lowercase letter.
  • grep '[A-Za-z]' file.txt: Matches any line containing an uppercase or lowercase letter.
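The bracket expressions above can be checked against a small file. The sketch below also uses the POSIX character class [[:digit:]] as an alternative to [0-9], and sets LC_ALL=C for the range example because range behavior can depend on the locale. The sample data is hypothetical.

```shell
# Hypothetical sample data: lowercase, uppercase, digits, punctuation.
sample=$(mktemp)
printf '%s\n' 'abc' 'ABC' '123' '---' > "$sample"

vowel_lines=$(grep '[aeiou]' "$sample")          # abc (matching is case-sensitive)
digit_lines=$(grep '[[:digit:]]' "$sample")      # 123
upper_lines=$(LC_ALL=C grep '[A-Z]' "$sample")   # ABC

echo "$vowel_lines / $digit_lines / $upper_lines"
rm -f "$sample"
```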

8. Negated Character Class ([^])

If you place a caret (^) inside square brackets at the beginning, it negates the character class, meaning it will match any character except those specified.

  • Example:
  grep '[^aeiou]' file.txt

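This will match any line that contains at least one character that is not a lowercase vowel, which includes consonants, digits, spaces, and punctuation.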
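Because grep works line by line, a negated class selects lines that contain at least one character outside the set. A short sketch with made-up words shows this, along with one way to select lines made only of vowels.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'aeiou' 'audio' 'rhythm' > "$sample"

# Lines with at least one character that is not a lowercase vowel.
has_non_vowel=$(grep '[^aeiou]' "$sample")

# Lines consisting only of lowercase vowels (anchored, ERE for +).
only_vowels=$(grep -E '^[aeiou]+$' "$sample")

echo "has non-vowel: $has_non_vowel"
echo "only vowels:   $only_vowels"
rm -f "$sample"
```

“audio” lands in the first group because of the single “d”.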

9. Curly Braces ({})

Curly braces specify repetitions of the preceding character or group.

  • Examples:
  • grep -E 'a{3}' file.txt: Matches lines containing “aaa”.
  • grep -E 'a{2,4}' file.txt: Matches a run of 2 to 4 “a” characters (“aa”, “aaa”, or “aaaa”). Because the pattern is unanchored, a line with a longer run still matches.
  • grep -E 'a{2,}' file.txt: Matches a run of 2 or more “a” characters (“aa”, “aaa”, “aaaa”, etc.).
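When the repetition count has to be exact for the whole line, the interval needs anchors around it. The sketch below contrasts the unanchored and anchored forms on hypothetical input.

```shell
# Hypothetical sample data: runs of 1, 2, 3, and 5 a's.
sample=$(mktemp)
printf '%s\n' 'a' 'aa' 'aaa' 'aaaaa' > "$sample"

unanchored=$(grep -Ec 'a{2,4}' "$sample")   # 3: aaaaa contains a run of 2 to 4
anchored=$(grep -Ec '^a{2,4}$' "$sample")   # 2: only aa and aaa
exact=$(grep -E '^a{3}$' "$sample")         # aaa

echo "unanchored=$unanchored anchored=$anchored exact=$exact"
rm -f "$sample"
```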

10. Pipe (|)

The pipe operator performs a logical OR between patterns. It’s used to match one pattern or another.

  • Example (with grep -E for extended regex):
  grep -E 'error|failure' file.txt

This will match lines containing either “error” or “failure”.
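Alternation is case-sensitive like any other pattern, so it is often paired with -i when scanning logs. A minimal sketch, with invented log lines:

```shell
# Hypothetical log lines.
sample=$(mktemp)
printf '%s\n' 'disk error' 'login failure' 'all good' 'ERROR: oops' > "$sample"

either=$(grep -Ec 'error|failure' "$sample")      # 2: the uppercase ERROR is missed
either_ci=$(grep -Eic 'error|failure' "$sample")  # 3: -i ignores case

echo "case-sensitive=$either case-insensitive=$either_ci"
rm -f "$sample"
```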

11. Parentheses (())

Parentheses are used for grouping expressions. They group patterns that you want to treat as a single unit.

  • Example:
  grep -E '(ab|cd){2}' file.txt

This matches two repetitions of either “ab” or “cd”, like “abab”, “cdcd”, “abcd”, or “cdab”.
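Grouping matters because a quantifier otherwise applies only to the single item before it. The following sketch compares the grouped pattern with an ungrouped variant on hypothetical input.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'abab' 'abcd' 'ab' 'abd' > "$sample"

# Grouped: two repetitions of (ab or cd).
grouped=$(grep -E '(ab|cd){2}' "$sample")

# Ungrouped: {2} binds only to d, so this means "ab" OR "cdd".
ungrouped=$(grep -Ec 'ab|cd{2}' "$sample")

echo "grouped: $grouped"
echo "ungrouped line count: $ungrouped"   # 4, since every line contains ab
rm -f "$sample"
```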

12. Backslash (\)

The backslash is used to escape a metacharacter, treating it as a literal character rather than a special one.

  • Example:
  grep '\.' file.txt

This matches a literal dot (.) instead of using the dot as a metacharacter.
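Forgetting the backslash is a common source of false positives. The sketch below shows the difference on invented version strings, and also uses grep -F, which treats the whole pattern as a fixed string so no escaping is needed.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'notes.txt' 'notes_txt' 'version 1x0' 'version 1.0' > "$sample"

unescaped=$(grep -c '1.0' "$sample")    # 2: the dot also matches the x in 1x0
escaped=$(grep -c '1\.0' "$sample")     # 1: only a literal dot
fixed=$(grep -Fc '1.0' "$sample")       # 1: -F disables regex metacharacters

echo "unescaped=$unescaped escaped=$escaped fixed=$fixed"
rm -f "$sample"
```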

13. Word Boundaries (\b)

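\b represents a word boundary, ensuring that the pattern matches at the beginning or end of a word. It is a GNU extension rather than part of the POSIX regex syntax, so it works in GNU grep but is not guaranteed elsewhere.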

  • Example:
  grep '\berror\b' file.txt

This matches the word “error” but not words like “errors” or “supererror”.
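For whole-word searches, grep's -w option gives a similar result without writing the boundaries by hand. This sketch assumes GNU grep (both \b and -w are outside strict POSIX) and uses made-up sample lines.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'error' 'errors found' 'supererror' 'an error occurred' > "$sample"

plain=$(grep -c 'error' "$sample")          # 4: substring match
bounded=$(grep -c '\berror\b' "$sample")    # 2: whole word only
word_opt=$(grep -wc 'error' "$sample")      # 2: same idea using -w

echo "plain=$plain bounded=$bounded word_opt=$word_opt"
rm -f "$sample"
```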

Types of Regular Expressions

1. Basic Regular Expressions (BRE)

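In BRE, the characters +, ?, |, {}, and () are literal by default and must be escaped with a backslash (\) to act as operators. POSIX defines only \{ \} and \( \) for BRE; \+, \?, and \| are GNU extensions that work in GNU grep and sed but may not work in other implementations.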

  • Examples:
  • To match one or more occurrences of “a”, you use:
    grep 'a\+' file.txt
  • To match any line that starts with “Hello” and ends with a number:
    grep '^Hello.*[0-9]$' file.txt
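If a script has to stay within BRE, “one or more” can be written in a few equivalent ways. The sketch below compares the GNU-style escaped plus with two forms built from standard BRE pieces (an interval and a repeated item); the sample data is hypothetical and GNU grep is assumed for the first command.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'b' 'ab' 'aaab' > "$sample"

gnu_plus=$(grep -c 'a\+b' "$sample")          # escaped plus (GNU extension)
interval=$(grep -c 'a\{1,\}b' "$sample")      # BRE interval: one or more a
star_form=$(grep -c 'aa*b' "$sample")         # one a followed by zero or more a

echo "gnu_plus=$gnu_plus interval=$interval star_form=$star_form"   # all 2
rm -f "$sample"
```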

2. Extended Regular Expressions (ERE)

ERE, supported by grep -E, egrep, sed -E, and awk, offers more advanced pattern matching and does not require escaping metacharacters like +, ?, |, and {}.

  • Examples:
  • Match one or more occurrences of “a” without escaping:
    grep -E 'a+' file.txt
  • Match any line that contains “abc” followed by either “123” or “456”:
    grep -E 'abc(123|456)' file.txt
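The same grouping and alternation can be expressed in either syntax; only the escaping changes. The following sketch, which assumes GNU grep for the BRE form and uses invented input, shows both returning the same lines.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'abc123' 'abc456' 'abc789' > "$sample"

ere=$(grep -E 'abc(123|456)' "$sample")      # ERE: operators unescaped
bre=$(grep 'abc\(123\|456\)' "$sample")      # BRE: operators escaped (GNU)

echo "ERE: $ere"
echo "BRE: $bre"
rm -f "$sample"
```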

Practical Examples of Regular Expressions

1. Matching Lines That Start with a Word

grep '^The' file.txt

This matches lines that start with the word “The”.

2. Find Lines Ending with a Number

grep '[0-9]$' file.txt

This matches lines that end with a digit (0-9).

3. Search for IP Addresses

grep -Eo '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' logfile.txt

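This extracts strings that look like IPv4 addresses (four groups of digits separated by dots) from logfile.txt. The pattern does not check that each group is in the range 0-255.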
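The pattern above accepts any run of digits in each position. One possible way to tighten it is to spell out the 0-255 range per group, as in the sketch below. The octet variable and the candidate addresses are assumptions for illustration, and the input is one candidate per line so the pattern can be anchored.

```shell
# Hypothetical candidate addresses, one per line.
sample=$(mktemp)
printf '%s\n' '192.168.1.10' '999.1.1.1' '10.0.0.255' '256.0.0.1' > "$sample"

# Illustrative ERE for one group in the range 0-255.
octet='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'

loose=$(grep -Ec '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' "$sample")   # 4: digits only
strict=$(grep -E "^(${octet}\.){3}${octet}\$" "$sample")         # range-checked

echo "loose count: $loose"
echo "strict matches:"
echo "$strict"
rm -f "$sample"
```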

4. Find Empty Lines in a File

grep '^$' file.txt

This finds all empty lines in file.txt.
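Lines that contain only spaces or tabs look empty but are not matched by '^$'. A small sketch on hypothetical input shows the difference, plus -n to report line numbers and -v to drop blank lines.

```shell
# Hypothetical sample data: line 2 is empty, line 3 holds three spaces.
sample=$(mktemp)
printf 'first\n\n   \nlast\n' > "$sample"

empty=$(grep -c '^$' "$sample")                    # 1: truly empty lines
blank=$(grep -c '^[[:space:]]*$' "$sample")        # 2: empty or whitespace-only
empty_at=$(grep -n '^$' "$sample")                 # 2:  (line number, then the empty line)
non_blank=$(grep -v '^[[:space:]]*$' "$sample")    # first and last

echo "empty=$empty blank=$blank empty_at=$empty_at"
echo "$non_blank"
rm -f "$sample"
```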

5. Find Lines That Contain Only Numbers

grep '^[0-9]\+$' file.txt

This matches lines that consist of only digits (one or more digits).
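The same check can be written with ERE, or with grep -x, which requires the pattern to match the entire line so the anchors can be dropped. A minimal sketch with invented input:

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' '12345' '12a45' '007' > "$sample"

with_anchors=$(grep -E '^[0-9]+$' "$sample")   # 12345 and 007
with_x=$(grep -Ex '[0-9]+' "$sample")          # same result using -x

echo "$with_anchors"
rm -f "$sample"
```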

6. Matching a Phone Number Pattern

To match a U.S. phone number in the format (123) 456-7890, you can use:

grep -E '\([0-9]{3}\) [0-9]{3}-[0-9]{4}' file.txt
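To pull out just the phone numbers rather than whole lines, -o can be added to the same pattern. The sample lines here are hypothetical, and -o assumes GNU or BSD grep.

```shell
# Hypothetical sample data: only the first line uses the (123) 456-7890 format.
sample=$(mktemp)
printf '%s\n' 'Call (123) 456-7890 today' 'Call 123-456-7890' '(12) 345-6789' > "$sample"

phones=$(grep -Eo '\([0-9]{3}\) [0-9]{3}-[0-9]{4}' "$sample")

echo "$phones"   # (123) 456-7890
rm -f "$sample"
```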

7. Extracting Words with awk and Regex

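gawk '/\yLinux\y/' file.txt

This finds lines containing “Linux” as a whole word. In awk, \b inside a regex means a backspace character rather than a word boundary; GNU awk (gawk) uses \y for word boundaries instead, and other awk implementations may have no equivalent.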
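awk can also apply a regular expression to a single field with the ~ operator instead of to the whole line. The sketch below prints the second field of every line whose first field is entirely numeric; the two-column sample data is an assumption for illustration.

```shell
# Hypothetical two-column sample data.
sample=$(mktemp)
printf '%s\n' '42 Linux' 'x7 Unix' '7 BSD' > "$sample"

# $1 ~ /regex/ tests only the first field; awk regexes use ERE syntax.
names=$(awk '$1 ~ /^[0-9]+$/ { print $2 }' "$sample")

echo "$names"
rm -f "$sample"
```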

Advanced Usage with sed, awk, and Other Tools

Regular expressions are used extensively with other Linux commands like sed (stream editor) and awk (text processing tool).
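As a quick illustration of regex outside grep, the sketch below runs two sed substitutions on inline input: a plain global replacement, and an ERE substitution (sed -E) that swaps two captured groups using \1 and \2. The input strings are made up.

```shell
# Global replacement of a fixed word.
replaced=$(printf '%s\n' 'foo and foo' | sed 's/foo/bar/g')

# ERE capture groups: swap the numbers on either side of the hyphen.
swapped=$(printf '%s\n' '10-20' | sed -E 's/([0-9]+)-([0-9]+)/\2-\1/')

echo "$replaced"   # bar and bar
echo "$swapped"    # 20-10
```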

1. Using Regex in sed

  • Replace the word “foo” with “bar” in a file:
  sed 's/foo/bar/g' file.txt
  • Delete lines containing

