regular expression | Technology Tales

A primer on Regular Expressions in PowerShell: Learning through SAS code analysis

5th November 2025

Many already realise that regular expressions are a powerful tool for finding patterns in text, though they can be complex in their own way. Thus, this primer uses a real-world example (scanning SAS code) to introduce the fundamental concepts you need to work with regex in PowerShell. Since the focus here is on understanding how patterns are built and applied, you do not need to know SAS to follow along.

Getting Started with Pattern Matching

Usefully, PowerShell gives you several ways to test whether text matches a pattern, and the simplest is the -match operator:

$text = "proc sort data=mydata;"
$text -match "proc sort"  # Returns True

When -match finds a match, it returns True. PowerShell's -match operator is case-insensitive by default, so "PROC SORT" would also match. If you need case-sensitive matching, use -cmatch instead.
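The difference between the two operators can be seen in a short sketch, which also shows the automatic $Matches variable that -match fills in on success. The variable names here are only illustrative:

```powershell
$text = "proc sort data=mydata;"

# -match ignores case by default; -cmatch does not
$insensitive = $text -match "PROC SORT"
$sensitive   = $text -cmatch "PROC SORT"

# After a successful -match, $Matches[0] holds the text that was matched
$null = $text -match "proc\s+\w+"
$firstMatch = $Matches[0]

Write-Host "-match: $insensitive, -cmatch: $sensitive, matched text: $firstMatch"
```

Since $Matches is only updated when a match succeeds, it is safest to read it straight after the test that set it.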

For more detailed results, Select-String is useful:

$text | Select-String -Pattern "proc sort"

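This returns a MatchInfo object that holds the matching line and details of where the match occurred; when you search files, it also records the file name and line number.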

Building Your First Patterns

Literal Matches

The simplest patterns match exactly what you type. If you want to find the text proc datasets, you write:

$pattern = "proc datasets"

This will match those exact words in sequence.

Special Characters and Escaping

Some characters have special meaning in regex. The dot (.) matches any single character, so if you wish to match a literal dot, you need to escape it with a backslash:

$pattern = "first\."  # Matches "first." literally

Without the backslash, first. would match "first" followed by any character (for example, "firstly", "first!", "first2").
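When the text to be matched comes from elsewhere, the .NET method [regex]::Escape can add the backslashes for you. The following is a minimal sketch of the difference that escaping makes, with illustrative variable names:

```powershell
$literal = "first."
$escaped = [regex]::Escape($literal)   # Adds a backslash before the dot

# Unescaped, the dot matches the "l" in "firstly"
$unescapedHit = "firstly" -match $literal

# Escaped, a real dot is required, so "firstly" no longer matches
$escapedHit = "firstly" -match $escaped

# A line that does contain "first." still matches
$exactHit = "if first.subjid then" -match $escaped

Write-Host "Escaped pattern: $escaped"
```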

Alternation

The pipe symbol (|) lets you match one thing or another:

$pattern = "(delete;$|output;$)"

This matches either delete; or output; at the end of a line. The dollar sign ($) is an anchor that means "end of line".

Anchors: Controlling Where Matches Occur

Anchors do not match characters. Instead, they specify positions in the text.

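^ matches the start of a line (without the (?m) modifier, that means the start of the whole string being tested)
$ matches the end of a line (without the (?m) modifier, that means the end of the whole string being tested)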
\b matches a word boundary (the position between a word character and a non-word character)

Here is a pattern that finds proc sort only at the start of a line:

$pattern = "^\s*proc\s+sort"

Breaking this down:

^ anchors to the start of the line
\s* matches zero or more whitespace characters
proc matches the literal text
\s+ matches one or more whitespace characters
sort matches the literal text

The \s is a shorthand for any whitespace character (spaces, tabs, newlines). The * means "zero or more" and + means "one or more".
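To see the anchor at work, the pattern above can be tried against a few sample lines. The samples below are made up for illustration; each is tested as a single string, so ^ refers to the start of that string:

```powershell
$pattern = "^\s*proc\s+sort"

$samples = @(
    "proc sort;",
    "    proc   sort data=x;",
    "run; proc sort;"
)

# Record whether each sample matches the anchored pattern
$anchorResults = foreach ($sample in $samples) {
    [pscustomobject]@{
        Text    = $sample
        Matched = ($sample -match $pattern)
    }
}

$anchorResults | Format-Table -AutoSize
```

The third sample is rejected because other text precedes proc, so the match cannot begin at the start of the string.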

Quantifiers: How Many Times Something Appears

Quantifiers control repetition:

* means zero or more
+ means one or more
? means zero or one (optional)
{n} means exactly n times
{n,} means n or more times
{n,m} means between n and m times

By default, quantifiers are greedy (they match as much as possible). Adding ? after a quantifier makes it reluctant (it matches as little as possible):

$pattern = "%(m|v).*?(?=[,(;\s])"

Here .*? matches any characters, but as few as possible before the next part of the pattern matches. This is useful when you want to stop at the first occurrence of something rather than continuing to the last.
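The contrast between greedy and reluctant matching is easiest to see side by side. This sketch uses a made-up line containing two %let statements and the .NET [regex]::Match method, which returns the first match:

```powershell
$line = "%let x=1; %let y=2;"

# Greedy: .* runs to the last semicolon on the line
$greedy = [regex]::Match($line, "%let.*;").Value

# Reluctant: .*? stops at the first semicolon
$lazy = [regex]::Match($line, "%let.*?;").Value

Write-Host "Greedy:    $greedy"
Write-Host "Reluctant: $lazy"
```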

Groups and Capturing

Parentheses create groups, which serve two purposes: they let you apply quantifiers to multiple characters, and they capture the matched text for later use:

$pattern = "(first\.|last\.)"

This creates a group that matches either first. or last.. The group captures whichever one matches so you can extract it later.

Non-capturing groups, written as (?:...), group without capturing. This is useful when you need grouping for structure but do not need to extract the matched text:

$pattern = "(?:raw\.|sdtm\.|adam\.)"
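After a successful -match, captured groups are available by number in the automatic $Matches variable. The sketch below adds a second group, (\w+), purely for illustration, to show how switching the first group to a non-capturing one changes the numbering:

```powershell
$line = "if first.subjid then output;"

# Two capturing groups: the prefix and the variable name after it
if ($line -match "(first\.|last\.)(\w+)") {
    $prefix   = $Matches[1]
    $variable = $Matches[2]
}

# With a non-capturing prefix group, the variable name becomes group 1
if ($line -match "(?:first\.|last\.)(\w+)") {
    $onlyVariable = $Matches[1]
}

Write-Host "Prefix: $prefix, variable: $variable, non-capturing result: $onlyVariable"
```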

Lookaheads: Matching Without Consuming

Lookaheads are powerful but can be confusing at first. They check what comes next without including it in the match.

A positive lookahead (?=...) succeeds if the pattern inside matches what comes next:

$pattern = "%(m|v).*?(?=[,(;\s])"

This matches %m or %v followed by any characters, stopping just before a comma, parenthesis, semicolon or whitespace. The delimiter is not included in the match.
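A quick way to confirm that the delimiter is left out is to inspect the match and the character that follows it. The sample line here is invented for the purpose:

```powershell
$pattern = "%(m|v).*?(?=[,(;\s])"
$line = "call %mvar(x);"

$m = [regex]::Match($line, $pattern)

# The matched text stops just before the delimiter
$macroName = $m.Value

# The character immediately after the match is the delimiter itself
$nextChar = $line[$m.Index + $m.Length]

Write-Host "Matched '$macroName'; next character is '$nextChar'"
```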

A negative lookahead (?!...) succeeds if the pattern inside does not match what comes next:

$pattern = "(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)"

This is more complex. It matches library prefixes like raw., sdtm., adam. or r_, but only if:

The text at the current position is not r_options
The matched prefix is not followed by options

This prevents false matches on text like r_options whilst still allowing r_something_else.
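The effect of the two negative lookaheads can be checked with a handful of made-up lines, as in this sketch:

```powershell
$pattern = "(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)"

# r_options is rejected by the lookaheads
$blocked = "set r_options;" -match $pattern

# Other names starting with r_ are still accepted
$allowed = "set r_data;" -match $pattern

# Ordinary library prefixes are unaffected
$library = "set sdtm.dm;" -match $pattern

Write-Host "r_options: $blocked, r_data: $allowed, sdtm.dm: $library"
```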

Inline Modifiers

You can change how patterns behave by placing modifiers at the start:

(?i) makes the pattern case-insensitive
(?-i) makes the pattern case-sensitive

$pattern = "(?i)^\s*proc\s+sort"

This matches proc sort, PROC SORT, Proc Sort or any other case variation.
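Inline modifiers matter most when the same pattern is used in different places, because the -match operator ignores case by default while the .NET [regex] methods do not. The sketch below illustrates both directions:

```powershell
$text = "PROC SORT data=x;"

# .NET methods are case-sensitive unless told otherwise
$dotNetDefault = [regex]::IsMatch($text, "^\s*proc\s+sort")

# (?i) makes the same call ignore case
$dotNetInline = [regex]::IsMatch($text, "(?i)^\s*proc\s+sort")

# (?-i) switches off the case-insensitivity that -match applies by default
$operatorStrict = $text -match "(?-i)^\s*proc\s+sort"

Write-Host "Default: $dotNetDefault, with (?i): $dotNetInline, -match with (?-i): $operatorStrict"
```

Putting (?i) in the pattern itself means that it behaves the same way wherever it is used.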

A Complete Example: Finding PROC SORT Without DATA=

Let us build up a practical pattern step by step. The goal is to find PROC SORT statements that are missing a DATA= option.

Start with the basic match:

$pattern = "proc sort"

Add case-insensitivity and line-start flexibility:

$pattern = "(?i)^\s*proc\s+sort"

Add a word boundary to avoid matching proc sorting:

$pattern = "(?i)^\s*proc\s+sort\b"

Now add a negative lookahead to exclude lines that contain data=:

$pattern = "(?i)^\s*proc\s+sort\b(?![^;\n]*\bdata\s*=\s*)"

This negative lookahead checks the rest of the line (up to a semicolon or newline) and fails if it finds data followed by optional spaces, an equals sign and more optional spaces.

Finally, match the rest of the line:

$pattern = "(?i)^\s*proc\s+sort\b(?![^;\n]*\bdata\s*=\s*).*"
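To check that the finished pattern behaves as intended, it can be run against a few statements with and without DATA=. The sample lines below are invented for the test:

```powershell
$pattern = "(?i)^\s*proc\s+sort\b(?![^;\n]*\bdata\s*=\s*).*"

$lines = @(
    "proc sort data=mydata; by id;",
    "proc sort; by id;",
    "  PROC SORT out=sorted;",
    "proc sort out=sorted data = mydata;"
)

# Keep only the statements that lack a DATA= option
$missingData = @($lines | Where-Object { $_ -match $pattern })

$missingData | ForEach-Object { Write-Host "No DATA= option: $_" }
```

Only the second and third lines are reported, since the other two contain data= before the first semicolon.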

Working with Multiple Patterns

Real-world scanning often involves checking many patterns. PowerShell arrays make this straightforward:

$matchStrings = @(
    "%(m|v).*?(?=[,(;\s])",
    "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])",
    "(first\.|last\.)",
    "proc datasets"
)

$text = "Use %mvar in raw.dataset with first.flag"

foreach ($pattern in $matchStrings) {
    if ($text -match $pattern) {
        Write-Host "Match found: $pattern"
    }
}

Finding All Matches in a String

The -match operator stops at the first match, which it stores in the automatic $Matches variable. To find all occurrences, use [regex]::Matches:

$text = "first.x and last.y and first.z"
$pattern = "(first\.|last\.)"
# Avoid the name $matches here: it is the automatic variable that -match fills in
$allMatches = [regex]::Matches($text, $pattern)

foreach ($match in $allMatches) {
    Write-Host "Found: $($match.Value) at position $($match.Index)"
}

This returns a collection of match objects, each containing details about what was found and where.
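One difference from -match is worth noting: [regex]::Matches is case-sensitive unless you say otherwise. A minimal sketch, using a made-up string with mixed case, shows how the IgnoreCase option changes the result:

```powershell
$text = "FIRST.x and last.y"
$pattern = "(first\.|last\.)"

# Case-sensitive by default, so FIRST. is missed
$caseSensitive = [regex]::Matches($text, $pattern)

# Passing the IgnoreCase option finds both
$ignoreCase = [regex]::Matches(
    $text,
    $pattern,
    [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
)

Write-Host "Case-sensitive: $($caseSensitive.Count), ignoring case: $($ignoreCase.Count)"
```

Adding (?i) to the start of the pattern has the same effect as the option.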

Replacing Text

The -replace operator applies a pattern and substitutes matching text:

$text = "proc datasets; run;"
$text -replace "proc datasets", "proc sql"
# Result: "proc sql; run;"

The pattern can use alternation so that any one of several prefixes is replaced:

$text = "raw.demographics"
$text -replace "(raw\.|sdtm\.|adam\.)", "lib."
# Result: "lib.demographics"
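Captured text can also be reused in the replacement by referring to groups as $1, $2 and so on. The following sketch rearranges a library reference in an invented way, simply to show the mechanism. Note the single quotes: inside double quotes, PowerShell would try to expand $1 as a variable before the regex engine ever saw it.

```powershell
$text = "set raw.demographics;"

# Group 1 is the library name and group 2 is the dataset name
$renamed = $text -replace '(raw|sdtm|adam)\.(\w+)', 'lib.${1}_$2'

Write-Host $renamed
```

The ${1} form is used because an underscore follows the group reference; the braces make it unambiguous where the group number ends.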

Validating Patterns Before Use

Before running patterns against large files, validate that they are correct:

$matchStrings = @(
    "%(m|v).*?(?=[,(;\s])",
    "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"
)

foreach ($pattern in $matchStrings) {
    try {
        [regex]::new($pattern) | Out-Null
        Write-Host "Valid: $pattern"
    }
    catch {
        Write-Host "Invalid: $pattern - $($_.Exception.Message)"
    }
}

This catches malformed patterns (for example, unmatched parentheses or invalid syntax) before they cause problems in your scanning code.
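The same check can be wrapped in a small helper that returns True or False. Test-RegexPattern is a hypothetical name chosen for this sketch and is not a built-in cmdlet:

```powershell
# Hypothetical helper: returns $true if the pattern compiles, $false otherwise
function Test-RegexPattern {
    param([string]$Pattern)

    try {
        $null = [regex]::new($Pattern)
        return $true
    }
    catch {
        return $false
    }
}

$good = Test-RegexPattern "(first\.|last\.)"
$bad  = Test-RegexPattern "(first\.|last\."   # Missing closing parenthesis

Write-Host "Well-formed pattern: $good, malformed pattern: $bad"
```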

Scanning Files Line by Line

A typical workflow reads a file and checks each line against your patterns:

$matchStrings = @(
    "proc datasets",
    "(first\.|last\.)",
    "%(m|v).*?(?=[,(;\s])"
)

$code = Get-Content "script.sas"

foreach ($line in $code) {
    foreach ($pattern in $matchStrings) {
        if ($line -match $pattern) {
            Write-Warning "Pattern '$pattern' found in: $line"
        }
    }
}
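Where line numbers are wanted, Select-String can do the scanning itself, since its -Pattern parameter accepts an array. This is a sketch of that alternative; the file name and its contents are invented, and the file is written to the temporary folder so that the example is self-contained:

```powershell
$matchStrings = @("proc datasets", "(first\.|last\.)")

# Hypothetical sample file created only for this demonstration
$path = Join-Path ([System.IO.Path]::GetTempPath()) "regex_primer_demo.sas"
Set-Content -Path $path -Value @(
    "data a; set b;",
    "if first.id then n = 0;",
    "proc datasets lib=work; quit;"
)

# Each result records the line number and the text of the matching line
$hits = @(Select-String -Path $path -Pattern $matchStrings)

foreach ($hit in $hits) {
    Write-Host ("Line {0}: {1}" -f $hit.LineNumber, $hit.Line)
}

Remove-Item $path
```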

Counting Pattern Occurrences

To understand which patterns appear most often:

$results = @{}

foreach ($pattern in $matchStrings) {
    $count = ($code | Select-String -Pattern $pattern).Count
    $results[$pattern] = $count
}

$results | Format-Table

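This builds a table showing how many lines each pattern matched across the entire file. Because Select-String reports each matching line once, a line that contains a pattern twice still counts as one.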
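If every individual occurrence needs counting, one option is to total the results of [regex]::Matches for each line. The sketch below uses a few invented lines in place of a file and compares the two kinds of count:

```powershell
$code = @(
    "if first.x or last.x then",
    "if first.y then",
    "run;"
)
$pattern = "(first\.|last\.)"

# Number of lines that contain at least one match
$lineCount = @($code | Select-String -Pattern $pattern).Count

# Number of individual matches, ignoring case as Select-String does
$occurrenceCount = 0
foreach ($line in $code) {
    $occurrenceCount += [regex]::Matches(
        $line,
        $pattern,
        [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
    ).Count
}

Write-Host "Matching lines: $lineCount, total occurrences: $occurrenceCount"
```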

Practical Tips

Start simple. Build patterns incrementally. Test each addition to ensure it behaves as expected.

Use verbose mode for complex patterns. PowerShell supports (?x) which allows whitespace and comments inside patterns:

$pattern = @"
(?x)        # Enable verbose mode
^           # Start of line
\s*         # Optional whitespace
proc        # Literal "proc"
\s+         # Required whitespace
sort        # Literal "sort"
\b          # Word boundary
"@

Test against known examples. Create a small set of test strings that should and should not match:

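$shouldMatch = @("proc sort;", "  PROC SORT data=x;")
$shouldNotMatch = @("proc sorting", "# proc sort")

# Every string in the first list must match
foreach ($test in $shouldMatch) {
    if ($test -notmatch $pattern) {
        Write-Warning "Failed to match: $test"
    }
}

# No string in the second list may match
foreach ($test in $shouldNotMatch) {
    if ($test -match $pattern) {
        Write-Warning "Unexpectedly matched: $test"
    }
}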

Document your patterns. Regular expressions can be cryptic. Add comments explaining what each pattern does and why it exists:

# Match macro variables starting with %m or %v, stopping at delimiters
$pattern1 = "%(m|v).*?(?=[,(;\s])"

# Match library prefixes (raw., sdtm., adam.) before delimiters
$pattern2 = "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"

Two Approaches to the Same Problem

The library-prefix pattern has appeared in two forms in this primer. One is extended with negative lookaheads to handle ambiguous cases, while the other is simplified for cleaner codebases. Understanding why both exist teaches an important lesson: regex is not one-size-fits-all.

The extended version handles edge cases:

$pattern = "(?i)(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)\w*?(?=[,(;\s])"

The simplified version assumes those edge cases do not occur:

$pattern = "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"

Choose the approach that matches your data. If you know your text follows strict conventions, simpler patterns are easier to maintain. If you face ambiguity, add the precision you need (but no more).
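Running both versions over the same few lines makes the trade-off concrete. The sample lines in this sketch are invented:

```powershell
$extended   = "(?i)(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)\w*?(?=[,(;\s])"
$simplified = "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"

$samples = @("set raw.dm;", "set r_temp;", "options r_options;")

# Record how each version of the pattern treats each sample
$comparison = foreach ($sample in $samples) {
    [pscustomobject]@{
        Text       = $sample
        Extended   = ($sample -match $extended)
        Simplified = ($sample -match $simplified)
    }
}

$comparison | Format-Table -AutoSize
```

Both versions find raw.dm and both ignore r_options, but only the extended one picks up r_temp, because the simplified pattern has no r_ alternative at all.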

Moving Forward

This primer has introduced the building blocks: literals, special characters, anchors, quantifiers, groups, lookaheads and modifiers. You have seen how to apply patterns in PowerShell using -match, Select-String and [regex]::Matches. You have also learnt to validate patterns, scan files and count occurrences.

The best way to learn is to experiment. Take a simple text file and try to match patterns in it. Build patterns incrementally. When something does not work as expected, break the pattern down into smaller pieces and test each part separately.

Regular expressions are not intuitive at first, but they become clear with practice. The examples here are drawn from SAS code analysis, yet the techniques apply broadly. Whether you are scanning logs, parsing configuration files or extracting data from reports, the principles remain the same: understand what you want to match, build the pattern step by step, test thoroughly and document your work.

Searching file contents using PowerShell

25th October 2018

Having made plenty of use of grep on the Linux/UNIX command line and findstr on the legacy Windows command line, I wondered if PowerShell could be used to search the contents of files for a text string. Usefully, this turns out to be the case, though the native command differs from the ones that I had used before. The form of the command is given below:

Select-String -Path <filename search expression> -Pattern "<search expression>" > <output file>

While you can have the output appear on the screen, it always seems easier to send it to a file for subsequent use, and that is what I am doing above. The input to the -Path switch can be a filename or a wildcard expression, while that to the -Pattern can be a text string enclosed in quotes or a regular expression. Given that it works well once you know what to do, here is an example:

Select-String -Path *.sas -Pattern "proc report" > c:\temp\search.txt

The search.txt file then includes both the file information and the text that has been found for the sake of checking that you have what you want. What you do next is up to you.
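If the results are to be processed further rather than read, one option is to keep them as objects and pick out the properties of interest. The sketch below assumes a throwaway folder containing two made-up SAS files, which it creates and removes itself:

```powershell
# Hypothetical demonstration folder and files
$dir = Join-Path ([System.IO.Path]::GetTempPath()) "select_string_demo"
New-Item -ItemType Directory -Path $dir -Force | Out-Null
Set-Content -Path (Join-Path $dir "one.sas") -Value "proc report data=a;", "run;"
Set-Content -Path (Join-Path $dir "two.sas") -Value "proc print data=b;", "proc report data=b;"

# Keep only the file name, line number and matching line from each result
$found = @(
    Select-String -Path (Join-Path $dir "*.sas") -Pattern "proc report" |
        Select-Object Filename, LineNumber, Line
)

$found | Format-Table -AutoSize

Remove-Item $dir -Recurse
```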

Understanding Perl binding operators for pattern matching

20th May 2009

While this piece is as much an aide-mémoire for myself as anything else, putting it here seems worthwhile if it answers questions for others. The binding operators, =~ and !~, come in handy when you are framing conditional statements in Perl using regular expressions, for example, testing whether $x =~ /\d+/ or not. The =~ variant is also used for changing strings using the s/[pattern1]/[pattern2]/ regular expression construct (here, s stands for "substitute"). What has brought this to mind is that I wanted to ensure that something was done for strings that did not contain a certain pattern, and that's where the !~ binding operator came in useful; ^~ might have come to mind for some reason, but it wasn't what I needed.

Tidying dynamic URLs

15th June 2007

A few years back, I came across a very nice article discussing how you would make a dynamic URL more palatable to a search engine, and I made good use of its content for my online photo gallery. The premise was that URLs that look like the one below are no help to search engines indexing a website. Though this is received wisdom in some quarters, it doesn't seem to have done much to stall the rise of WordPress as a blogging platform.

http://www.mywebsite.com/serversidescript.php?id=394

That said, WordPress does offer a friendlier URL display option too, which you can see in use on this blog; they look a little like the example URL that you see below, and the approach is equally valid for both Perl and PHP. Since I have been using the same approach for the Perl scripts powering my online photo gallery, I now want to apply the same thinking to a gallery written in PHP:

http://www.mywebsite.com/serversidescript.pl/id/394

The way that both expressions work is that a web server will chop pieces from a URL until it reaches a physical file. For a query URL, the extra information after the question mark is retained in its QUERY_STRING variable, while extraneous directory path information is passed in the variable PATH_INFO. For both Perl and PHP, these are extracted from the entries in an array; for Perl, this is the %ENV hash and $_SERVER is the PHP equivalent. Thus, $ENV{QUERY_STRING} and $_SERVER['QUERY_STRING'] trap what comes after the ? while $ENV{PATH_INFO} and $_SERVER['PATH_INFO'] pick up the extra information following the file name (/id/394 in the example). From there on, the usual rules apply regarding cleaning of any input, but changing from one to another should not be too arduous.

HennessyBlog theme update

12th February 2007

Over the weekend, I have been updating the theme on my other blog, HennessyBlog. It has been a task that projected me onto a learning curve with the WordPress 2.1 codebase. Thus, I have collected what I encountered, so I know that it’s out there on the web for you (and me) to use and peruse. It took some digging to get to know some of what you find below. Since any function used to power WordPress takes some finding, I need to find one place on the web where the code for WordPress is more fully documented. The sites presenting tutorials on how to use WordPress are more often than not geared towards non-techies rather than code cutters like myself. Then again, they might be waiting for someone to do it for them…

The changes made are as follows:

Tweaks to the interface

These are subtle, with the addition of navigation controls to the sidebar and the change in location of the post metadata being the most obvious enhancements. There is also “decoration” with solid and dashed lines (using CSS border attributes rather than the deprecated hr tagset), along with standards compliance links.

Standards compliance

Adding standards compliance links does mean that you’d better check that all is in order; it was then that I discovered that there was work to be done. There is an issue with the WordPress wpautop function (it lives in the formatting.php file) in that it sometimes doesn’t add closing tags. Finding out that it was this function that is implicated took a trip to the WordPress.org website; while a good rummage in the wp-includes folder does a lot, it can’t achieve everything.

Like many things in the WordPress code, the wpautop function isn’t half buried. The the_content function (see template-functions-post.php) used to output blog entries calls the get_the_content function (also in template-functions-post.php) to extract the data from MySQL. The add_filter function (in plugin.php) associates the wpautop function and others with the get_the_content function to add the p tags to the output.

To return to the non-ideal behaviour that caused me to start out on the above quest, an example is where you have an img tag enclosed by div tags. The required substitution involves the use of regular expressions that work most of the time but get confused here. So adding a hack to the wpautop function was needed to change the code so that the p end tag got inserted. I’ll be keeping an eye out for any more scenarios like this that slip through the net and for any side effects. Otherwise, compliance is just making sure that all those img tags have their alt attributes completed.

Tweaks to navigation code

Most of my time has been spent on tweaking the PHP code supporting the navigation. Because different functions were being called in different places, I wanted to harmonise things. To accomplish this, I created new functions in the functions.php file for my theme and needed to resolve a number of issues along the way. Not least among these were regular expressions used for subsetting with the preg_match function; these were not Perl-compliant to my eyes, as would be implied by the choice of function. Though I have since found that PCREs in PHP use a more pragmatic syntax, there appeared to be issues with the expressions that were being used. These seemed to behave OK in their native environment but fell out of favour within the environs of my theme. Being acquainted with Perl, I went for a more familiar expression style and the issue has been resolved.

Along the way, I broke the RSS feed. This was on my off-line test blog so no one, apart from myself, that is, would have noticed. After a bit of searching, I realised that some stray white-space from the end of a PHP file (wp-config.php being a favourite culprit), after the PHP end tag in the script file as it happens, was finding its way into the feed and causing things to fall over. Feed readers don’t take too kindly to the idea of the XML declaration not making an appearance on the first line of the file. Some confusion was caused by the refusal of Firefox to refresh things as it should before I realised that a forced refresh of the feed display was needed. Sometimes, it takes a while for an addled brain to think of these kinds of things.