Technology Tales

Adventures in consumer and enterprise technology

A primer on Regular Expressions in PowerShell: Learning through SAS code analysis

5th November 2025

Regular expressions are a powerful tool for finding patterns in text, though they can be complex in their own way. This primer uses a real-world example (scanning SAS code) to introduce the fundamental concepts you need to work with regex in PowerShell. Since the focus here is on understanding how patterns are built and applied, you do not need to know SAS to follow along.

Getting Started with Pattern Matching

Usefully, PowerShell gives you several ways to test whether text matches a pattern, and the simplest is the -match operator:

$text = "proc sort data=mydata;"
$text -match "proc sort"  # Returns True

When -match finds a match, it returns True. PowerShell's -match operator is case-insensitive by default, so "PROC SORT" would also match. If you need case-sensitive matching, use -cmatch instead.
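
As a quick illustration of the difference (the sample string is invented for this purpose):

$text = "PROC SORT data=mydata;"
$text -match "proc sort"     # True: -match ignores case
$text -cmatch "proc sort"    # False: -cmatch is case-sensitive
$text -cmatch "PROC SORT"    # True: the case now agrees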

For more detailed results, Select-String is useful:

$text | Select-String -Pattern "proc sort"

This shows you where the match occurred and provides context around it.

Building Your First Patterns

Literal Matches

The simplest patterns match exactly what you type. If you want to find the text proc datasets, you write:

$pattern = "proc datasets"

This will match those exact words in sequence.

Special Characters and Escaping

Some characters have special meaning in regex. The dot (.) matches any single character, so if you wish to match a literal dot, you need to escape it with a backslash:

$pattern = "first\."  # Matches "first." literally

Without the backslash, first. would match "first" followed by any character (for example, "firstly", "first!", "first2").

Alternation

The pipe symbol (|) lets you match one thing or another:

$pattern = "(delete;$|output;$)"

This matches either delete; or output; at the end of a line. The dollar sign ($) is an anchor that means "end of line".

Anchors: Controlling Where Matches Occur

Anchors do not match characters. Instead, they specify positions in the text.

  • ^ matches the start of a line
  • $ matches the end of a line
  • \b matches a word boundary (the position between a word character and a non-word character)

Here is a pattern that finds proc sort only at the start of a line:

$pattern = "^\s*proc\s+sort"

Breaking this down:

  • ^ anchors to the start of the line
  • \s* matches zero or more whitespace characters
  • proc matches the literal text
  • \s+ matches one or more whitespace characters
  • sort matches the literal text

The \s is a shorthand for any whitespace character (spaces, tabs, newlines). The * means "zero or more" and + means "one or more".
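
The word boundary deserves a quick demonstration too; these sample strings are invented for illustration:

"proc sort;" -match "sort\b"      # True: ";" is not a word character
"proc sorting" -match "sort\b"    # False: "sort" runs straight into "ing"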

Quantifiers: How Many Times Something Appears

Quantifiers control repetition:

  • * means zero or more
  • + means one or more
  • ? means zero or one (optional)
  • {n} means exactly n times
  • {n,} means n or more times
  • {n,m} means between n and m times

By default, quantifiers are greedy (they match as much as possible). Adding ? after a quantifier makes it reluctant (it matches as little as possible):

$pattern = "%(m|v).*?(?=[,(;\s])"

Here .*? matches any characters, but as few as possible before the next part of the pattern matches. This is useful when you want to stop at the first occurrence of something rather than continuing to the last.
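
The difference is easiest to see side by side. This sketch compares greedy and reluctant versions of a similar pattern against an invented sample string:

$text = "%mvar, %mother,"
[regex]::Match($text, "%m.*(?=,)").Value     # Greedy: "%mvar, %mother"
[regex]::Match($text, "%m.*?(?=,)").Value    # Reluctant: "%mvar"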

Groups and Capturing

Parentheses create groups, which serve two purposes: they let you apply quantifiers to multiple characters, and they capture the matched text for later use:

$pattern = "(first\.|last\.)"

This creates a group that matches either first. or last.. The group captures whichever one matches so you can extract it later.
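
Once a match succeeds, PowerShell stores the captures in the automatic $Matches variable, indexed by group number. Here is a sketch using an invented line of SAS-like code and a slightly richer pattern with two groups:

$text = "if first.subject then count = 0;"
if ($text -match "(first|last)\.(\w+)") {
    $Matches[0]    # Whole match: "first.subject"
    $Matches[1]    # Group 1: "first"
    $Matches[2]    # Group 2: "subject"
}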

Non-capturing groups, written as (?:...), group without capturing. This is useful when you need grouping for structure but do not need to extract the matched text:

$pattern = "(?:raw\.|sdtm\.|adam\.)"

Lookaheads: Matching Without Consuming

Lookaheads are powerful but can be confusing at first. They check what comes next without including it in the match.

A positive lookahead (?=...) succeeds if the pattern inside matches what comes next:

$pattern = "%(m|v).*?(?=[,(;\s])"

This matches %m or %v followed by any characters, stopping just before a comma, parenthesis, semicolon or whitespace. The delimiter is not included in the match.
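
You can confirm that the delimiter stays outside the match by inspecting the matched value (the sample string is invented):

$text = "%mvar, %vflag;"
[regex]::Match($text, "%(m|v).*?(?=[,(;\s])").Value    # Returns "%mvar"; the comma is not part of the match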

A negative lookahead (?!...) succeeds if the pattern inside does not match what comes next:

$pattern = "(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)"

This is more complex. It matches library prefixes like raw., sdtm., adam. or r_, but only if:

  • The text at the current position is not r_options
  • The matched prefix is not followed by options

This prevents false matches on text like r_options whilst still allowing r_something_else.

Inline Modifiers

You can change how patterns behave by placing modifiers at the start:

  • (?i) makes the pattern case-insensitive
  • (?-i) makes the pattern case-sensitive

$pattern = "(?i)^\s*proc\s+sort"

This matches proc sort, PROC SORT, Proc Sort or any other case variation.

A Complete Example: Finding PROC SORT Without DATA=

Let us build up a practical pattern step by step. The goal is to find PROC SORT statements that are missing a DATA= option.

Start with the basic match:

$pattern = "proc sort"

Add case-insensitivity and line-start flexibility:

$pattern = "(?i)^\s*proc\s+sort"

Add a word boundary to avoid matching proc sorting:

$pattern = "(?i)^\s*proc\s+sort\b"

Now add a negative lookahead to exclude lines that contain data=:

$pattern = "(?i)^\s*proc\s+sort\b(?![^;\n]*\bdata\s*=\s*)"

This negative lookahead checks the rest of the line (up to a semicolon or newline) and fails if it finds data followed by optional spaces, an equals sign and more optional spaces.

Finally, match the rest of the line:

$pattern = "(?i)^\s*proc\s+sort\b(?![^;\n]*\bdata\s*=\s*).*"
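
A few quick checks confirm the behaviour; these test lines are made up rather than taken from real SAS code:

"proc sort;" -match $pattern                # True: no data= option
"  PROC SORT out=sorted;" -match $pattern   # True: data= is still missing
"proc sort data=mydata;" -match $pattern    # False: data= is present
"proc sorting" -match $pattern              # False: the word boundary fails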

Working with Multiple Patterns

Real-world scanning often involves checking many patterns. PowerShell arrays make this straightforward:

$matchStrings = @(
    "%(m|v).*?(?=[,(;\s])",
    "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])",
    "(first\.|last\.)",
    "proc datasets"
)

$text = "Use %mvar in raw.dataset with first.flag"

foreach ($pattern in $matchStrings) {
    if ($text -match $pattern) {
        Write-Host "Match found: $pattern"
    }
}

Finding All Matches in a String

The -match operator only tells you whether a pattern matches. To find all occurrences, use [regex]::Matches:

$text = "first.x and last.y and first.z"
$pattern = "(first\.|last\.)"
$found = [regex]::Matches($text, $pattern)

foreach ($match in $found) {
    Write-Host "Found: $($match.Value) at position $($match.Index)"
}

This returns a collection of match objects, each containing details about what was found and where. Note that the results are assigned to $found rather than $matches, because PowerShell reserves the automatic $Matches variable for the results of the -match operator.

Replacing Text

The -replace operator applies a pattern and substitutes matching text:

$text = "proc datasets; run;"
$text -replace "proc datasets", "proc sql"
# Result: "proc sql; run;"

You can use captured groups in the replacement:

$text = "raw.demographics"
$text -replace "(raw\.|sdtm\.|adam\.)", "lib."
# Result: "lib.demographics"
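
The example above replaces the captured text with a literal. You can also refer back to what a group captured using $1, $2 and so on. Note the single quotes around the replacement string, which stop PowerShell expanding $1 as a variable before the regex engine sees it:

$text = "raw.demographics"
$text -replace "(raw|sdtm|adam)\.(\w+)", '$2 came from the $1 library'
# Result: "demographics came from the raw library"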

Validating Patterns Before Use

Before running patterns against large files, validate that they are correct:

$matchStrings = @(
    "%(m|v).*?(?=[,(;\s])",
    "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"
)

foreach ($pattern in $matchStrings) {
    try {
        [regex]::new($pattern) | Out-Null
        Write-Host "Valid: $pattern"
    }
    catch {
        Write-Host "Invalid: $pattern - $($_.Exception.Message)"
    }
}

This catches malformed patterns (for example, unmatched parentheses or invalid syntax) before they cause problems in your scanning code.

Scanning Files Line by Line

A typical workflow reads a file and checks each line against your patterns:

$matchStrings = @(
    "proc datasets",
    "(first\.|last\.)",
    "%(m|v).*?(?=[,(;\s])"
)

$code = Get-Content "script.sas"

foreach ($line in $code) {
    foreach ($pattern in $matchStrings) {
        if ($line -match $pattern) {
            Write-Warning "Pattern '$pattern' found in: $line"
        }
    }
}

Counting Pattern Occurrences

To understand which patterns appear most often:

$results = @{}

foreach ($pattern in $matchStrings) {
    $count = ($code | Select-String -Pattern $pattern).Count
    $results[$pattern] = $count
}

$results | Format-Table

This builds a table showing how many times each pattern matched across the entire file.

Practical Tips

Start simple. Build patterns incrementally. Test each addition to ensure it behaves as expected.

Use verbose mode for complex patterns. PowerShell supports (?x) which allows whitespace and comments inside patterns:

$pattern = @"
(?x)        # Enable verbose mode
^           # Start of line
\s*         # Optional whitespace
proc        # Literal "proc"
\s+         # Required whitespace
sort        # Literal "sort"
\b          # Word boundary
"@

Test against known examples. Create a small set of test strings that should and should not match:

$shouldMatch = @("proc sort;", "  PROC SORT data=x;")
$shouldNotMatch = @("proc sorting", "* proc sort;")

foreach ($test in $shouldMatch) {
    if ($test -notmatch $pattern) {
        Write-Warning "Failed to match: $test"
    }
}

Document your patterns. Regular expressions can be cryptic. Add comments explaining what each pattern does and why it exists:

# Match macro variables starting with %m or %v, stopping at delimiters
$pattern1 = "%(m|v).*?(?=[,(;\s])"

# Match library prefixes (raw., sdtm., adam.) before delimiters
$pattern2 = "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"

Two Approaches to the Same Problem

The scanning exercise that inspired this primer uses two arrays of patterns. One is extended with negative lookaheads to handle ambiguous cases. The other is simplified for cleaner codebases. Understanding why both exist teaches an important lesson: regex is not one-size-fits-all.

The extended version handles edge cases:

$pattern = "(?i)(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)\w*?(?=[,(;\s])"

The simplified version assumes those edge cases do not occur:

$pattern = "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"

Choose the approach that matches your data. If you know your text follows strict conventions, simpler patterns are easier to maintain. If you face ambiguity, add the precision you need (but no more).

Moving Forward

This primer has introduced the building blocks: literals, special characters, anchors, quantifiers, groups, lookaheads and modifiers. You have seen how to apply patterns in PowerShell using -match, Select-String and [regex]::Matches. You have also learnt to validate patterns, scan files and count occurrences.

The best way to learn is to experiment. Take a simple text file and try to match patterns in it. Build patterns incrementally. When something does not work as expected, break the pattern down into smaller pieces and test each part separately.

Regular expressions are not intuitive at first, but they become clear with practice. The examples here are drawn from SAS code analysis, yet the techniques apply broadly. Whether you are scanning logs, parsing configuration files or extracting data from reports, the principles remain the same: understand what you want to match, build the pattern step by step, test thoroughly and document your work.

Rendering Markdown in WordPress without plugins by using Parsedown

4th November 2025

Much of what is generated using GenAI as articles is output as Markdown, meaning that you need to convert the content when using it in a WordPress website. Naturally, this kind of thing should be done with care to ensure that you are the creator and that it is not all the work of a machine; orchestration is fine, but regurgitation does not add that much. Fact checking is another need as well.

Writing plain Markdown has secured its own following as well, with WordPress plugins switching over the editor to facilitate such a mode of editing. When I tried Markup Markdown, I found it restrictive when it came to working with images within the text, and it needed a workaround for getting links to open in new browser tabs as well. Thus, I got rid of it, only to realise that it had never converted any Markdown as I had expected; it merely rendered it at post or page display time. Rather than attempting to update the affected text, I decided to see if another solution could be found.

This took me to Parsedown, which proved handy for accomplishing what I needed once I had everything set in place. First, that meant cloning its GitHub repo onto the web server. Next, I created a directory called includes under my theme's directory and copied Parsedown.php into it. When all was done, I ensured that file and directory ownership were assigned to www-data to avoid execution issues.

Then, I could set to updating the functions.php file. The first line to get added there included the parser file:

require_once get_template_directory() . '/includes/Parsedown.php';

After that, I found that I needed to disable the WordPress rendering machinery because that got in the way of Markdown rendering:

remove_filter('the_content', 'wpautop');
remove_filter('the_content', 'wptexturize');

The last step was to add a filter that parses the Markdown and passes its output to WordPress rendering to do the rest as usual. This was a simple affair until I needed to deal with code snippets in pre and code tags. Hopefully, the included comments tell you much of what is happening. A possible exception is $matches[0], which is an array of entire <pre>...</pre> blocks including the containing tags, with $i => $block iterating over that array as index => value pairs.

add_filter('the_content', function($content) {
    // Prepare a store for placeholders
    $placeholders = [];

    // 1. Extract pre blocks (including nested code) and replace with safe placeholders
    preg_match_all('/<pre\b[^>]*>.*?<\/pre>/si', $content, $pre_matches);
    foreach ($pre_matches[0] as $i => $block) {
        $key = "§PREBLOCK{$i}§";
        $placeholders[$key] = $block;
        $content = str_replace($block, $key, $content);
    }

    // 2. Extract standalone code blocks (not inside pre)
    preg_match_all('/<code\b[^>]*>.*?<\/code>/si', $content, $code_matches);
    foreach ($code_matches[0] as $i => $block) {
        $key = "§CODEBLOCK{$i}§";
        $placeholders[$key] = $block;
        $content = str_replace($block, $key, $content);
    }

    // 3. Run Parsedown on the remaining content
    $Parsedown = new Parsedown();
    $content = $Parsedown->text($content);

    // 4. Restore both pre and code placeholders
    foreach ($placeholders as $key => $block) {
        $content = str_replace($key, $block, $content);
    }

    // 5. Apply paragraph formatting
    return wpautop($content);
}, 12);

All of this avoided dealing with extra plugins to produce the required result. Handily, I still use the Classic Editor, which makes this work a lot more easily. There still is a Markdown import plugin that I am tempted to remove as well to streamline things. That can wait, though. It is best not to add any more plugins anyway, not least to avoid clashes between them and what is now in the theme.

Carrying colour coding across multi-line custom log messages in SAS

16th February 2022

While custom error messages are good to add to SAS macros, you can get inconsistent colouration of the message text in multi-line messages. That was something that I had overlooked until I recently came across a solution: using a hyphen at the end of the ERROR/WARNING/NOTE prefix instead of the more usual colon. Any prefix ending with a hyphen is not included in the log text, and the colouration ignores the carriage return that ordinarily would change the text colour to black. The simple macro below demonstrates the effect.

Macro Code:

%macro test;
%put ERROR: this is a test;
%put ERROR- this is another test;
%put WARNING: this is a test;
%put WARNING- this is another test;
%put NOTE: this is a test;
%put NOTE- this is another test;
%mend test;

%test

Log Output:

ERROR: this is a test
       this is another test
WARNING: this is a test
         this is another test
NOTE: this is a test
      this is another test

Using multi-line commenting in Perl to inactivate blocks of code during testing

26th December 2019

Recently, I needed to inactivate blocks of code in a Perl script while doing some testing. Since this is something that I often do in other computing languages, I sought the same in Perl. To accomplish that, I needed to use the POD methodology. This meant enclosing the code as follows.

=start

<< Code to be inactivated by inclusion in a comment >>

=cut

While the =start line could use any word after the equals sign, it seems that =cut is required to close the multi-line comment. If this were actual programming documentation, then the comment block should include some meaningful text for use with perldoc. However, that was not a concern here, because the commenting statements would be removed afterwards anyway. It also is good practice not to leave commented-out code in a production script or program, to avoid any later confusion.

In my case, this facility allowed me to isolate the code that I had to alter and test before putting everything back as needed. It also saved time since I did not need to individually comment out every executable line because multiple lines could be inactivated at a time.

Fixing an update error in OpenMediaVault 4.0

10th June 2019

For a time, I found that executing the command omv-update in OpenMediaVault 4.0 produced the following Python errors, among other more benign messages:

Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0xb7099d64>
Traceback (most recent call last):
File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0xb7099d64>
Traceback (most recent call last):
File "/usr/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable

Not wanting a failed update, I decided that I needed to investigate this and found that /usr/lib/python3.5/weakref.py required the following updates to lines 109 and 117, respectively:

def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):

_atomic_removal(d, wr.key)

To be clear, the line beginning with "def" is how line 109 should appear, while the line beginning with "_atomic_removal" is how line 117 should appear. Once the required edits were made and the file closed, re-running omv-update revealed that the problem was fixed, and that is how things remain at the time of writing.

Using NOT IN operator type functionality in SAS Macro

9th November 2018

For as long as I have been programming with SAS, there has been the ability to test whether a variable has one of a list of values, in data step IF clauses or in WHERE clauses in both the data step and most, if not all, procedures. It was only within the last decade that its Macro language got similar functionality, with one caveat that I recently uncovered: you cannot have a NOT IN construct. To get that, you need to go about things differently.

In the example below, you see the NOT operator being placed before the IN operator component that is enclosed in parentheses. If this is not done, SAS produces the error messages that caused me to look at SAS Usage Note 31322. Once I followed that approach, I was able to do what I wanted without resorting to older, more long-winded coding practices.

options minoperator;

%macro inop(x);
    %if not (&x in (a b c)) %then %do;
        %put Value is not included;
    %end;
    %else %do;
        %put Value is included;
    %end;
%mend inop;

%inop(a);

While running the above code should produce a result similar to another featured here in an earlier post, the logic is reversed. There are times when such an approach is needed; one is where a few possibilities are to be excluded from a larger number of them. Since programming often involves inventive thinking, this may be one of those occasions.

Using the IN operator in SAS Macro programming

8th October 2012

This useful addition came in SAS 9.2, and I am amazed that it isn't enabled by default. To enable it, you need to set the MINOPERATOR option, unless someone has done that for you in the SAS AUTOEXEC or another configuration program. Thus, the safety-first approach is to have code like the following:

options minoperator;

%macro inop(x);
    %if &x in (a b c) %then %do;
        %put Value is included;
    %end;
    %else %do;
        %put Value not included;
    %end;
%mend inop;

%inop(a);

Also, the default delimiter is the space, so if you need to change that, then the MINDELIMITER option needs setting. Adjusting the above code so that the delimiter now is the comma character gives us the following:

options minoperator mindelimiter=",";

%macro inop(x);
    %if &x in (a b c) %then %do;
        %put Value is included;
    %end;
    %else %do;
        %put Value not included;
    %end;
%mend inop;

%inop(a);

Without any of the above, the only approach is to have the following, and that is what we had to do for SAS versions before 9.2:

%macro inop(x);
    %if &x=a or &x=b or &x=c %then %do;
        %put Value is included;
    %end;
    %else %do;
        %put Value not included;
    %end;
%mend inop;

%inop(a);

While it may be clunky, it does work and remains a fallback in newer versions of SAS. That said, having the IN operator available makes writing SAS Macro code that little bit more swish, so it's a good thing to know.

%sysfunc and missing spaces

10th June 2009

Recently, I was trying something like this and noted some odd behaviour:

data _null_;
    file fileref;
    put "text %sysfunc(pathname(work)) more text";
run;

This is the kind of thing that I was getting:

text c:\sasworkmore text

In other words, the space after %sysfunc was being ignored and, since I was creating and executing a Windows batch file using SAS 8.2, the command line action wasn't doing what was expected. Though the fix was simple, I reckoned that I'd share what I saw anyway, in case it helped anyone else:

data _null_;
    file fileref;
    x="text %sysfunc(pathname(work))"||" more text";
    put x;
run;

AND & OR, a cautionary tale

27th March 2009

The inspiration for this post was a situation where having the string "OR" or "AND" as an input to a piece of SAS Macro code broke a program that I had written. Here is a simplified example of what I was doing:

%macro test;
    %let doms=GE GT NE LT LE AND OR;
    %let lv_count=1;
    %do %while (%scan(&doms,&lv_count,' ') ne );
        %put &lv_count;
        %let lv_count=%eval(&lv_count+1);
    %end;
%mend test;

%test;

The loop proceeds well until the string "AND" is met, and "OR" has the same effect. The result is that the following messages appear in the log:

ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: %scan(&doms,&lv_count,' ') ne
ERROR: The condition in the %DO %WHILE loop, , yielded an invalid or missing value, . The macro will stop executing.
ERROR: The macro TEST will stop executing.

Both AND and OR (case doesn't matter, but I am sticking with upper case for the sake of clarity) seem to act as reserved words in a macro DO WHILE loop, while equality mnemonics like GE cause no problem; perhaps the fact that an equality operator is already in the expression helps. Regardless, the fix is simple:

%macro test;
    %let doms=GE GT NE LT LE AND OR;
    %let lv_count=1;
    %do %while ("%scan(&doms,&lv_count,' ')" ne "");
        %put &lv_count;
        %let lv_count=%eval(&lv_count+1);
    %end;
%mend test;

%test;

Now none of the strings extracted from the macro variable &DOMS will appear as bare words and confuse the SAS Macro processor. You do have to make sure that you are testing for the null string ("" or ''), though, or you'll send your program into an infinite loop; that is always a potential problem with DO WHILE loops, so they need to be used with care. All in all, an odd-looking message gets an easy solution without recourse to macro quoting functions like %NRSTR or %SUPERQ.

SAS Macro and Dataline/Cards Statements in Data Step

28th October 2008

Recently, I tried code like this in a SAS macro:

data sections;
    infile datalines dlm=",";
    input graph_table_number $15. text_line @1 @;
    datalines;
    "11.1           ,Section 11.1",
    "11.2           ,Section 11.2",
    "11.3           ,Section 11.3"
    ;
run;

While it works in its own right, including it as part of a macro yielded this type of result:

ERROR: The macro X generated CARDS (data lines) for the DATA step, which could cause incorrect results.  The DATA step and the macro will stop executing.

A bit of googling landed me on SAS-L, where I spotted a solution like this one that didn't involve throwing everything out:

filename temp temp;

data _null_;
    file temp;
    put;
run;
data sections;
    length graph_table_number $15 text_line $100;
    infile temp dlm=",";
    input @;
    do _infile_=
    "11.1           ,Section 11.1",
    "11.2           ,Section 11.2",
    "11.3           ,Section 11.3"
    ;
        input graph_table_number $15. text_line @1 @;
        output;
    end;
run;

filename temp clear;

The filename statement and the ensuing data step create a dummy file in the SAS work area that gets cleared at the end of every session. That seems to fool the macro engine into thinking that input is from a file and not from the CARDS/DATALINES method, to which it takes grave exception. The trailing @'s hold an input record for the execution of the next INPUT statement within the same iteration of the DATA step, so that the automatic variable _infile_ can be fed into the input process in a do block, with the output statement ensuring that all records from the input buffer reach the data set being created.

While this method does work, I would like to know the underlying reason why SAS Macro won't play well with in-line data entry using DATALINES or CARDS statements in a data step, particularly when it allows other methods such as SQL insert statements or standard variable assignment in a data step. I find it such curious behaviour that I remain on the lookout for an explanation.
