A primer on Regular Expressions in PowerShell: Learning through SAS code analysis
5th November 2025
Many already realise that regular expressions are a powerful tool for finding patterns in text, though they can be complex in their own way. Thus, this primer uses a real-world example (scanning SAS code) to introduce the fundamental concepts you need to work with regex in PowerShell. Since the focus here is on understanding how patterns are built and applied, you do not need to know SAS to follow along.
Getting Started with Pattern Matching
Usefully, PowerShell gives you several ways to test whether text matches a pattern, and the simplest is the -match operator:
$text = "proc sort data=mydata;"
$text -match "proc sort" # Returns True
When -match finds a match, it returns True. PowerShell's -match operator is case-insensitive by default, so "PROC SORT" would also match. If you need case-sensitive matching, use -cmatch instead.
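As a small sketch of the difference, the snippet below runs both operators against the same made-up line and then reads the matched text back from the automatic $Matches variable, which -match fills in on success. The sample string and variable names are illustrative only.

```powershell
$text = 'PROC SORT data=mydata;'

# -match ignores case by default; -cmatch does not.
$insensitive = $text -match 'proc sort'     # True
$sensitive   = $text -cmatch 'proc sort'    # False

# After a successful -match, $Matches[0] holds the text that matched.
if ($text -match 'data=\w+') {
    $firstHit = $Matches[0]                 # 'data=mydata'
}
```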
For more detailed results, Select-String is useful:
$text | Select-String -Pattern "proc sort"
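This returns a match object that records the matching line and where in it the pattern was found. When you search files, it also reports the file name and line number, and adding -Context includes the surrounding lines.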
Building Your First Patterns
Literal Matches
The simplest patterns match exactly what you type. If you want to find the text proc datasets, you write:
$pattern = "proc datasets"
This will match those exact words in sequence.
Special Characters and Escaping
Some characters have special meaning in regex. The dot (.) matches any single character, so if you wish to match a literal dot, you need to escape it with a backslash:
$pattern = "first\." # Matches "first." literally
Without the backslash, first. would match "first" followed by any character (for example, "firstly", "first!", "first2").
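When the literal text comes from somewhere else (a variable, say), the .NET method [regex]::Escape can add the backslashes for you. A minimal sketch, using made-up test strings:

```powershell
$literal = 'first.'
$escaped = [regex]::Escape($literal)          # 'first\.'

$unescapedHit = 'firstly'   -match $literal   # True: the dot matches "l"
$escapedHit   = 'firstly'   -match $escaped   # False: a real dot is required
$realDotHit   = 'first.obs' -match $escaped   # True
```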
Alternation
The pipe symbol (|) lets you match one thing or another:
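$pattern = '(delete;$|output;$)'
This matches either delete; or output; at the end of a line. The dollar sign ($) is an anchor that means "end of line". The pattern is written in single quotes here so that PowerShell never tries to read the $ as the start of a variable name.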
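To see the alternation and the anchor working together, this sketch filters three invented SAS-like lines. Only the lines that actually end in delete; or output; should survive.

```powershell
$pattern = '(delete;$|output;$)'

$lines = @(
    'if x then delete;',
    'output;  /* keep */',        # output; is not at the end of the line
    'if last.id then output;'
)

# @() keeps the result an array even if only one line matches.
$hits = @($lines | Where-Object { $_ -match $pattern })
```

Here $hits holds the first and third lines; the second is skipped because a comment follows output;.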
Anchors: Controlling Where Matches Occur
Anchors do not match characters. Instead, they specify positions in the text.
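- ^ matches the start of a line
- $ matches the end of a line
- \b matches a word boundary (the position between a word character and a non-word character)
Strictly, ^ and $ match at the start and end of the whole input string unless the (?m) modifier is set; when you test one line at a time, as the examples here do, the two amount to the same thing.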
Here is a pattern that finds proc sort only at the start of a line:
$pattern = "^\s*proc\s+sort"
Breaking this down:
- ^ anchors to the start of the line
- \s* matches zero or more whitespace characters
- proc matches the literal text
- \s+ matches one or more whitespace characters
- sort matches the literal text
The \s is a shorthand for any whitespace character (spaces, tabs, newlines). The * means "zero or more" and + means "one or more".
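A quick way to check the breakdown is to try the pattern on a few lines. The test strings below are invented for illustration; the last one uses \b on its own to show what a word boundary rules out.

```powershell
$pattern = '^\s*proc\s+sort'

$atStart  = 'proc sort data=a;' -match $pattern   # True
$indented = "`t  proc   sort;"  -match $pattern   # True: leading tab and spaces are allowed
$midLine  = '* old: proc sort;' -match $pattern   # False: not at the start of the line

# \b requires a boundary before "proc", so "reproc" does not qualify.
$wordBound = 'reproc sort' -match '\bproc\s+sort' # False
```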
Quantifiers: How Many Times Something Appears
Quantifiers control repetition:
- * means zero or more
- + means one or more
- ? means zero or one (optional)
- {n} means exactly n times
- {n,} means n or more times
- {n,m} means between n and m times
By default, quantifiers are greedy (they match as much as possible). Adding ? after a quantifier makes it reluctant (it matches as little as possible):
$pattern = "%(m|v).*?(?=[,(;\s])"
Here .*? matches any characters, but as few as possible before the next part of the pattern matches. This is useful when you want to stop at the first occurrence of something rather than continuing to the last.
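The difference is easiest to see side by side. This sketch uses a simplified pair of patterns (not the one above) on a made-up string, so that the only thing changing is the ? after the quantifier.

```powershell
$text = 'a=%mvar1, b=%mvar2, c=3'

# Greedy: .* runs to the end of the string, then backs up to the LAST comma.
$greedy = [regex]::Match($text, '%m.*,').Value    # '%mvar1, b=%mvar2,'

# Reluctant: .*? grows one character at a time and stops at the FIRST comma.
$lazy   = [regex]::Match($text, '%m.*?,').Value   # '%mvar1,'
```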
Groups and Capturing
Parentheses create groups, which serve two purposes: they let you apply quantifiers to multiple characters, and they capture the matched text for later use:
$pattern = "(first\.|last\.)"
This creates a group that matches either first. or last.. The group captures whichever one matches so you can extract it later.
Non-capturing groups, written as (?:...), group without capturing. This is useful when you need grouping for structure but do not need to extract the matched text:
$pattern = "(?:raw\.|sdtm\.|adam\.)"
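The effect on what you can extract shows up in the automatic $Matches variable, where entry 0 is the whole match and the numbered entries are the captured groups. A small sketch, assuming an invented SAS-like line and an extra (\w+) group added purely for illustration:

```powershell
$text = 'if first.subjid then count = 0;'

# Capturing group for the prefix, then a second group for the variable name.
$null = $text -match '(first\.|last\.)(\w+)'
$capPrefix = $Matches[1]        # 'first.'
$capName   = $Matches[2]        # 'subjid'

# Non-capturing prefix: the name is now the first (and only) captured group.
$null = $text -match '(?:first\.|last\.)(\w+)'
$ncName   = $Matches[1]         # 'subjid'
$ncGroups = $Matches.Count      # 2 entries: the whole match plus one group
```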
Lookaheads: Matching Without Consuming
Lookaheads are powerful but can be confusing at first. They check what comes next without including it in the match.
A positive lookahead (?=...) succeeds if the pattern inside matches what comes next:
$pattern = "%(m|v).*?(?=[,(;\s])"
This matches %m or %v followed by any characters, stopping just before a comma, parenthesis, semicolon or whitespace. The delimiter is not included in the match.
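To confirm that the delimiter stays out of the match, the sketch below collects every match from a made-up line. Note that [regex]::Matches is case-sensitive unless told otherwise, which does not matter for this lower-case sample.

```powershell
$pattern = '%(m|v).*?(?=[,(;\s])'
$text    = 'call %mymacro(x); y = %vlist;'

# Collect just the matched text from each Match object.
$values = @([regex]::Matches($text, $pattern) | ForEach-Object { $_.Value })
# $values is '%mymacro', '%vlist': neither the "(" nor the ";" is included.
```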
A negative lookahead (?!...) succeeds if the pattern inside does not match what comes next:
$pattern = "(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)"
This is more complex. It matches library prefixes like raw., sdtm., adam. or r_, but only if:
- The text at the current position is not r_options
- The matched prefix is not followed by options
This prevents false matches on text like r_options whilst still allowing r_something_else.
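The behaviour can be checked against three invented lines, one for each case described above:

```powershell
$pattern = '(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)'

$rawHit     = 'set raw.dm;'           -match $pattern   # True: ordinary library prefix
$optionsHit = 'r_options = 1;'        -match $pattern   # False: excluded by the lookaheads
$otherHit   = 'set r_something_else;' -match $pattern   # True: r_ not followed by options
```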
Inline Modifiers
You can change how patterns behave by placing modifiers at the start:
- (?i) makes the pattern case-insensitive
- (?-i) makes the pattern case-sensitive
$pattern = "(?i)^\s*proc\s+sort"
This matches proc sort, PROC SORT, Proc Sort or any other case variation.
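Inline modifiers matter most when you call the .NET regex methods directly, because those are case-sensitive by default, unlike -match. A brief sketch with a made-up upper-case line:

```powershell
$line = 'PROC SORT data=a;'

# .NET methods are case-sensitive unless the pattern (or an option) says otherwise.
$plain  = [regex]::IsMatch($line, '^\s*proc\s+sort')       # False
$inline = [regex]::IsMatch($line, '(?i)^\s*proc\s+sort')   # True

# (?-i) switches case sensitivity back on, even under the case-insensitive -match.
$forcedCase = $line -match '(?-i)^\s*proc\s+sort'          # False
```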
A Complete Example: Finding PROC SORT Without DATA=
Let us build up a practical pattern step by step. The goal is to find PROC SORT statements that are missing a DATA= option.
Start with the basic match:
$pattern = "proc sort"
Add case-insensitivity and line-start flexibility:
$pattern = "(?i)^\s*proc\s+sort"
Add a word boundary to avoid matching proc sorting:
$pattern = "(?i)^\s*proc\s+sort\b"
Now add a negative lookahead to exclude lines that contain data=:
$pattern = "(?i)^\s*proc\s+sort\b(?![^;\n]*\bdata\s*=\s*)"
This negative lookahead checks the rest of the line (up to a semicolon or newline) and fails if it finds data followed by optional spaces, an equals sign and more optional spaces.
Finally, match the rest of the line:
$pattern = "(?i)^\s*proc\s+sort\b(?![^;\n]*\bdata\s*=\s*).*"
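As a check on the finished pattern, this sketch runs it over four invented lines. Only the first should be flagged: the next two carry a DATA= option and the last is not PROC SORT at all.

```powershell
$pattern = '(?i)^\s*proc\s+sort\b(?![^;\n]*\bdata\s*=\s*).*'

$lines = @(
    'proc sort; by id; run;',        # no data= before the semicolon
    'PROC SORT data=dm out=dm2;',    # has data=
    'proc sort out=x data = dm;',    # has data = later in the statement
    'proc sorting;'                  # rejected by the word boundary
)

$flagged = @($lines | Where-Object { $_ -match $pattern })
```

Because the lookahead stops at the first semicolon, a data= that appears in a later statement on the same line does not count as the option for this PROC SORT.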
Working with Multiple Patterns
Real-world scanning often involves checking many patterns. PowerShell arrays make this straightforward:
$matchStrings = @(
    "%(m|v).*?(?=[,(;\s])",
    "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])",
    "(first\.|last\.)",
    "proc datasets"
)
$text = "Use %mvar in raw.dataset with first.flag"
foreach ($pattern in $matchStrings) {
    if ($text -match $pattern) {
        Write-Host "Match found: $pattern"
    }
}
Finding All Matches in a String
The -match operator stops at the first match, which it records in the automatic $Matches variable. To find all occurrences, use [regex]::Matches:
$text = "first.x and last.y and first.z"
$pattern = "(first\.|last\.)"
# Avoid calling this $matches: that name belongs to the automatic variable -match fills in
$found = [regex]::Matches($text, $pattern)
foreach ($m in $found) {
    Write-Host "Found: $($m.Value) at position $($m.Index)"
}
This returns a collection of match objects, each containing details about what was found and where.
Replacing Text
The -replace operator applies a pattern and substitutes matching text:
$text = "proc datasets; run;"
$text -replace "proc datasets", "proc sql"
# Result: "proc sql; run;"
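You can use captured groups in the replacement by referring to them as $1, $2 and so on:
$text = "raw.demographics"
# Single quotes stop PowerShell expanding $1 before the regex engine sees it
$text -replace "(raw\.|sdtm\.|adam\.)", 'lib_$1'
# Result: "lib_raw.demographics"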
Validating Patterns Before Use
Before running patterns against large files, validate that they are correct:
$matchStrings = @(
    "%(m|v).*?(?=[,(;\s])",
    "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"
)
foreach ($pattern in $matchStrings) {
    try {
        [regex]::new($pattern) | Out-Null
        Write-Host "Valid: $pattern"
    }
    catch {
        Write-Host "Invalid: $pattern - $($_.Exception.Message)"
    }
}
This catches malformed patterns (for example, unmatched parentheses or invalid syntax) before they cause problems in your scanning code.
Scanning Files Line by Line
A typical workflow reads a file and checks each line against your patterns:
$matchStrings = @(
    "proc datasets",
    "(first\.|last\.)",
    "%(m|v).*?(?=[,(;\s])"
)
$code = Get-Content "script.sas"
foreach ($line in $code) {
    foreach ($pattern in $matchStrings) {
        if ($line -match $pattern) {
            Write-Warning "Pattern '$pattern' found in: $line"
        }
    }
}
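Select-String can also read the file itself and report line numbers, which is one possible alternative to the nested loops. The sketch below writes a small throwaway file (the file name and its contents are invented for this example) so that it can run on its own.

```powershell
# Create a small sample file in the temp folder (hypothetical content).
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'regex_primer_demo.sas'
Set-Content -Path $path -Value @(
    'proc datasets lib=work;',
    'data a; set b;',
    'if first.id then n = 0;'
)

# -Pattern accepts several patterns; a line is reported once if any of them matches.
$hits = Select-String -Path $path -Pattern 'proc datasets', '(first\.|last\.)'

# Each hit carries the line number and the text of the matching line.
$report = @($hits | ForEach-Object { '{0}: {1}' -f $_.LineNumber, $_.Line })

Remove-Item $path
```

The trade-off is that a line matching several patterns is listed once, so the explicit loops remain the better choice when you need to know exactly which pattern fired.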
Counting Pattern Occurrences
To understand which patterns appear most often:
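$results = @{}
foreach ($pattern in $matchStrings) {
    # @() guarantees an array, so .Count is reliable with zero or one matching line
    $count = @($code | Select-String -Pattern $pattern).Count
    $results[$pattern] = $count
}
$results | Format-Table
This builds a table showing how many lines matched each pattern across the entire file. Select-String reports each matching line once, however many times the pattern occurs on it.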
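If you need every occurrence instead of every matching line, the -AllMatches switch makes Select-String record all matches on each line. A minimal sketch, using three invented lines in place of a real file:

```powershell
$code    = @('x = first.a + last.a;', 'y = first.b;', 'run;')
$pattern = '(first\.|last\.)'

# Lines that contain at least one match.
$lineCount = @($code | Select-String -Pattern $pattern).Count            # 2

# Every individual match, including repeats on the same line.
$hitCount = @(
    $code |
        Select-String -Pattern $pattern -AllMatches |
        ForEach-Object { $_.Matches }
).Count                                                                   # 3
```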
Practical Tips
Start simple. Build patterns incrementally. Test each addition to ensure it behaves as expected.
Use verbose mode for complex patterns. PowerShell supports (?x) which allows whitespace and comments inside patterns:
$pattern = @"
(?x) # Enable verbose mode
^ # Start of line
\s* # Optional whitespace
proc # Literal "proc"
\s+ # Required whitespace
sort # Literal "sort"
\b # Word boundary
"@
Test against known examples. Create a small set of test strings that should and should not match:
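$shouldMatch = @("proc sort;", " PROC SORT data=x;")
$shouldNotMatch = @("proc sorting", "# proc sort")
foreach ($test in $shouldMatch) {
    if ($test -notmatch $pattern) {
        Write-Warning "Failed to match: $test"
    }
}
# Check the negative cases too
foreach ($test in $shouldNotMatch) {
    if ($test -match $pattern) {
        Write-Warning "Unexpectedly matched: $test"
    }
}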
Document your patterns. Regular expressions can be cryptic. Add comments explaining what each pattern does and why it exists:
# Match macro variables starting with %m or %v, stopping at delimiters
$pattern1 = "%(m|v).*?(?=[,(;\s])"
# Match library prefixes (raw., sdtm., adam.) before delimiters
$pattern2 = "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"
Two Approaches to the Same Problem
The same scan can be written with two different arrays of patterns. One is extended with negative lookaheads to handle ambiguous cases. The other is simplified for cleaner codebases. Understanding why both exist teaches an important lesson: regex is not one-size-fits-all.
The extended version handles edge cases:
$pattern = "(?i)(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)\w*?(?=[,(;\s])"
The simplified version assumes those edge cases do not occur:
$pattern = "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"
Choose the approach that matches your data. If you know your text follows strict conventions, simpler patterns are easier to maintain. If you face ambiguity, add the precision you need (but no more).
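One way to make that choice concrete is to run both patterns over the same samples and compare. The four strings below are invented to probe the edge cases; they are not taken from any real codebase.

```powershell
$extended = '(?i)(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)\w*?(?=[,(;\s])'
$simple   = '(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])'

$samples = 'set raw.dm;', 'set r_temp;', 'x = r_options;', 'set straw.dm;'

# One row per sample, recording whether each pattern matched.
$comparison = foreach ($s in $samples) {
    [pscustomobject]@{
        Text     = $s
        Extended = $s -match $extended
        Simple   = $s -match $simple
    }
}

$comparison | Format-Table
```

Both patterns find raw.dm. Only the extended one finds r_temp, neither matches r_options, and only the simplified one is fooled by straw.dm, because it has no word boundary in front of the prefix.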
Moving Forward
This primer has introduced the building blocks: literals, special characters, anchors, quantifiers, groups, lookaheads and modifiers. You have seen how to apply patterns in PowerShell using -match, Select-String and [regex]::Matches. You have also learnt to validate patterns, scan files and count occurrences.
The best way to learn is to experiment. Take a simple text file and try to match patterns in it. Build patterns incrementally. When something does not work as expected, break the pattern down into smaller pieces and test each part separately.
Regular expressions are not intuitive at first, but they become clear with practice. The examples here are drawn from SAS code analysis, yet the techniques apply broadly. Whether you are scanning logs, parsing configuration files or extracting data from reports, the principles remain the same: understand what you want to match, build the pattern step by step, test thoroughly and document your work.