1. What are Regular Expressions?

  • Regular Expressions (Regex) are patterns used to match and manipulate text.
  • They are a powerful tool for searching, extracting, and replacing text based on specific patterns.
  • Python provides the re module for working with regular expressions.

2. Basic Regex Syntax

1. Literal Characters

  • Match exact characters in the text.
  • Example: The regex cat matches the string "cat".
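
A minimal sketch of literal matching, using a made-up sample sentence. It also shows that matching is case-sensitive unless a flag says otherwise.

```python
import re

# A literal pattern matches the same characters wherever they occur.
text = "The cat sat on the mat."
found = re.search(r"cat", text)
print(found.group())  # cat
print(found.start())  # 4 (index where the match begins)

# Matching is case-sensitive by default, so "Cat" is not found here.
missing = re.search(r"Cat", text)
print(missing)  # None
```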

2. Metacharacters

  • Special characters with specific meanings in regex:
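    • . : Matches any single character except a newline (unless re.DOTALL is set).
    • ^ : Matches the start of the string (and the start of each line when re.MULTILINE is set).
    • $ : Matches the end of the string, or just before a trailing newline.
    • * : Matches 0 or more repetitions of the preceding element (a character, a character class, or a group).
    • + : Matches 1 or more repetitions of the preceding element.
    • ? : Matches 0 or 1 repetition of the preceding element.
    • {m,n} : Matches between m and n repetitions (inclusive) of the preceding element.
    • [] : Matches any single character listed within the brackets (e.g., [abc] or the range [a-z]).
    • | : Acts as an OR operator between alternatives (e.g., cat|dog).
    • () : Groups patterns together and captures the text they match.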

Examples:

  • a.b matches "aab" and "acb", but not "ab" (the dot must consume exactly one character).
  • ^abc matches "abc" at the start of a string.
  • xyz$ matches "xyz" at the end of a string.
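
The metacharacters above can be tried out with a short sketch like the following. The sample strings are invented for illustration, and re.fullmatch (which requires the whole string to match) is used here only to make each True/False result unambiguous.

```python
import re

# re.fullmatch succeeds only if the entire string matches the pattern.
dot_hits = [s for s in ["aab", "acb", "ab"] if re.fullmatch(r"a.b", s)]
print(dot_hits)  # ['aab', 'acb']

# Quantifiers: ? (0 or 1), + (1 or more), {m,n} (between m and n).
optional_u = bool(re.fullmatch(r"colou?r", "color"))
needs_one_o = bool(re.fullmatch(r"go+gle", "ggle"))
too_many = bool(re.fullmatch(r"[0-9]{2,4}", "12345"))
print(optional_u, needs_one_o, too_many)  # True False False

# [] picks one character from a set; | and () choose between alternatives.
vowels = re.findall(r"[aeiou]", "regex")
pets = bool(re.fullmatch(r"(cat|dog)s?", "dogs"))
print(vowels, pets)  # ['e', 'e'] True
```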

3. Special Sequences

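  • \d : Matches any digit (0-9; with str patterns, other Unicode decimal digits too unless re.ASCII is set).
  • \D : Matches any non-digit.
  • \w : Matches any word character (a-z, A-Z, 0-9, _; with str patterns, Unicode letters and digits too unless re.ASCII is set).
  • \W : Matches any non-word character.
  • \s : Matches any whitespace character (e.g., space, tab, newline, carriage return).
  • \S : Matches any non-whitespace character.
  • \b : Matches a word boundary (the position between a word character and a non-word character or the edge of the string).
  • \B : Matches any position that is not a word boundary.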

Examples:

  • \d{3} matches exactly 3 consecutive digits (e.g., "123").
  • \w+ matches one or more word characters (e.g., "hello").
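
As a rough illustration of the special sequences, the sketch below runs a few of them over an invented sample string. Note that _ counts as a word character, so "unit_7" stays in one piece for \w+.

```python
import re

text = "Order 66 shipped to unit_7 on 2024-01-15."

digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)
print(digits)  # ['66', '7', '2024', '01', '15']
print(words)   # ['Order', '66', 'shipped', 'to', 'unit_7', 'on', '2024', '01', '15']

# \b keeps "cat" from matching inside a longer word.
whole_words = re.findall(r"\bcat\b", "cat concat cats")
print(whole_words)  # ['cat']
```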

3. Using the re Module

1. re.match()

  • Checks if the regex matches at the beginning of the string.
  • Returns a match object if found, otherwise None.

Example:

import re
result = re.match(r"hello", "hello world")
print(result.group())  # Output: hello

2. re.search()

  • Scans through the string and stops at the first location where the regex matches.
  • Returns a match object if found, otherwise None.

Example:

import re
result = re.search(r"world", "hello world")
print(result.group())  # Output: world
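
The difference between re.match() and re.search() is easiest to see side by side. The sketch below also shows one common way to guard against a None result before calling .group(); the fallback message is just a placeholder.

```python
import re

text = "hello world"

# re.match only tries position 0; re.search scans forward through the string.
at_start = re.match(r"world", text)
anywhere = re.search(r"world", text)
print(at_start)          # None
print(anywhere.start())  # 6

# Calling .group() on None raises AttributeError, so check the result first.
if at_start:
    message = at_start.group()
else:
    message = "no match at the start"
print(message)  # no match at the start
```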

3. re.findall()

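  • Returns all non-overlapping matches of the regex in the string as a list of strings.
  • If the pattern contains capturing groups, the list holds the group text (or tuples of groups) instead of the whole match.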

Example:

import re
result = re.findall(r"\d+", "There are 3 apples and 5 oranges.")
print(result)  # Output: ['3', '5']
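
One detail worth seeing in practice is how capturing groups change what re.findall() returns. A small sketch, reusing the sample sentence from the example above:

```python
import re

text = "There are 3 apples and 5 oranges."

# No groups: a list of whole matches.
whole = re.findall(r"\d+ \w+", text)
print(whole)  # ['3 apples', '5 oranges']

# One group: a list containing only that group's text.
one_group = re.findall(r"(\d+) \w+", text)
print(one_group)  # ['3', '5']

# Two or more groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\d+) (\w+)", text)
print(two_groups)  # [('3', 'apples'), ('5', 'oranges')]
```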

4. re.finditer()

  • Returns an iterator yielding match objects for all matches.

Example:

import re
matches = re.finditer(r"\d+", "There are 3 apples and 5 oranges.")
for match in matches:
    print(match.group())  # Prints 3, then 5 (one per line)
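
Because re.finditer() yields match objects rather than plain strings, it is a reasonable choice when the position of each match matters. A minimal sketch with the same sample sentence:

```python
import re

text = "There are 3 apples and 5 oranges."

# Each match object knows its text and where it starts and ends.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]
print(positions)  # [('3', 10, 11), ('5', 23, 24)]
```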

5. re.sub()

  • Replaces all occurrences of the regex pattern in the string with a replacement string.

Example:

import re
result = re.sub(r"\d+", "X", "There are 3 apples and 5 oranges.")
print(result)  # Output: There are X apples and X oranges.
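
re.sub() accepts more than a fixed replacement string. The sketch below shows three variations: the count argument, backreferences such as \1 in the replacement, and a replacement function. The helper name double is made up for this illustration.

```python
import re

text = "There are 3 apples and 5 oranges."

# count limits how many matches are replaced.
first_only = re.sub(r"\d+", "X", text, count=1)
print(first_only)  # There are X apples and 5 oranges.

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{2})-(\d{2})", r"\2/\1", "12-31")
print(swapped)  # 31/12

# A function receives each match object and returns the replacement text.
def double(m):
    return str(int(m.group()) * 2)

doubled = re.sub(r"\d+", double, text)
print(doubled)  # There are 6 apples and 10 oranges.
```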

6. re.split()

  • Splits the string by the occurrences of the regex pattern.

Example:

import re
result = re.split(r"\s+", "Split this sentence.")
print(result)  # Output: ['Split', 'this', 'sentence.']
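
A few variations on re.split() that often come up, sketched with invented inputs: splitting on several delimiters at once, limiting the number of splits with maxsplit, and keeping the delimiters by wrapping the pattern in a capturing group.

```python
import re

# A character class splits on any of several delimiters.
mixed = re.split(r"[,;]\s*", "a, b;c,  d")
print(mixed)  # ['a', 'b', 'c', 'd']

# maxsplit limits the number of splits; the remainder stays in one piece.
limited = re.split(r"\s+", "one two three four", maxsplit=2)
print(limited)  # ['one', 'two', 'three four']

# A capturing group in the pattern keeps the delimiters in the result.
kept = re.split(r"(-)", "2023-12-31")
print(kept)  # ['2023', '-', '12', '-', '31']
```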

4. Regex Groups

  • Use parentheses () to create groups in a regex.
  • Groups allow you to extract specific parts of a match.

Example:

import re
result = re.search(r"(\d{2})-(\d{2})-(\d{4})", "Date: 12-31-2023")
print(result.group(1))  # Output: 12 (month)
print(result.group(2))  # Output: 31 (day)
print(result.group(3))  # Output: 2023 (year)
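
Beyond group(n), a match object offers a few related accessors. The following sketch uses the same pattern and input as the example above; the variable names first and second are neutral placeholders.

```python
import re

m = re.search(r"(\d{2})-(\d{2})-(\d{4})", "Date: 12-31-2023")

# group(0) (or group() with no argument) is the entire match.
print(m.group(0))  # 12-31-2023

# groups() returns every captured group as a tuple, handy for unpacking.
first, second, year = m.groups()
print(m.groups())  # ('12', '31', '2023')

# span(n) gives the (start, end) indices of group n in the input string.
print(m.span(3))  # (12, 16)
```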

5. Named Groups

  • Assign names to groups using (?P<name>...) syntax.

Example:

import re
result = re.search(r"(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})", "Date: 12-31-2023")
print(result.group("month")) # Output: 12
print(result.group("day"))   # Output: 31
print(result.group("year"))  # Output: 2023
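
Named groups can also be read all at once with groupdict() and reused in a replacement via \g<name>. A small sketch, assuming an ISO-style date (year-month-day) in an invented sentence:

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
text = "Released on 2023-12-31."
m = re.search(pattern, text)

# groupdict() maps each group name to the text it captured.
parts = m.groupdict()
print(parts)  # {'year': '2023', 'month': '12', 'day': '31'}

# Named groups are still numbered, so both lookups return the same text.
same = m.group(1) == m.group("year")
print(same)  # True

# \g<name> refers to a named group inside a replacement string.
reordered = re.sub(pattern, r"\g<day>/\g<month>/\g<year>", text)
print(reordered)  # Released on 31/12/2023.
```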

6. Additional Examples

  • Matching Names:

    import re
    names = ["Raj", "Ram", "Anand", "Bala", "Karthik"]
    pattern = r"^R\w+"  # Names starting with 'R'
    matches = [name for name in names if re.match(pattern, name)]
    print(matches)  # Output: ['Raj', 'Ram']
    
  • Extracting Phone Numbers:

    import re
    text = "Contact Raj at 123-456-7890 or Bala at 987-654-3210."
    phone_numbers = re.findall(r"\d{3}-\d{3}-\d{4}", text)
    print(phone_numbers)  # Output: ['123-456-7890', '987-654-3210']
    
  • Replacing Text:

    import re
    text = "Hello Raj, how are you Raj?"
    new_text = re.sub(r"Raj", "Ram", text)
    print(new_text)  # Output: Hello Ram, how are you Ram?
    
  • Splitting Text:

    import re
    text = "Karthik,Suresh,Sathish"
    names = re.split(r",", text)
    print(names)  # Output: ['Karthik', 'Suresh', 'Sathish']
    

7. Best Practices

  • Use raw strings (r"...") for regex patterns so that backslash sequences such as \d and \b reach the regex engine without extra escaping.
  • Test regex patterns using tools like regex101.com.
  • Use comments and verbose mode (re.VERBOSE) for complex regex patterns.
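
To see why raw strings matter, the following sketch compares a normal string literal with a raw one for a pattern that uses \b. The sample text is made up for illustration.

```python
import re

# In a normal string literal, "\b" is the backspace character (one character).
# In a raw string, r"\b" is a backslash followed by "b" (two characters),
# which the regex engine reads as a word boundary.
plain_len = len("\b")
raw_len = len(r"\b")
print(plain_len, raw_len)  # 1 2

plain_hits = re.findall("\bcat\b", "a cat sat")
raw_hits = re.findall(r"\bcat\b", "a cat sat")
print(plain_hits)  # []
print(raw_hits)    # ['cat']

# Without a raw string, each backslash must be doubled to get the same pattern.
equivalent = "\\bcat\\b" == r"\bcat\b"
print(equivalent)  # True
```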

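Example (verbose mode):

import re
# In verbose mode, whitespace in the pattern is ignored and # starts a comment.
pattern = re.compile(r"""
    \b       # Word boundary
    \d{3}    # 3 digits
    -        # Hyphen
    \d{3}    # 3 digits
    -        # Hyphen
    \d{4}    # 4 digits
    \b       # Word boundary
""", re.VERBOSE)
print(pattern.findall("Call 123-456-7890 today."))  # Output: ['123-456-7890']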