String-searching algorithm | Technology Tales

Using the LIKE operator in PROC SQL WHERE clauses in SAS

26^th November 2025

Recently, I was working in SAS and decided to trying picking out datasets and variables from its dictionary tables, eventually picking out the maximum length of a variable type for assigning the length of a new variable. This could have been done using a long-established technique:

proc sql;
    select distinct memname into :dsns separated by '#'
        from dictionary.tables
            where lowcase(libname) = 'work'
            and index(lowcase(memname), "r_") = 1
            and index(lowcase(memname), "visit") = 0;
quit;

The result is that it creates a macro variable containing a delimited list of work datasets with names beginning with r_ and not containing the string visit. As well as using the index function to find the placing of one string within another, I have seen the count function used for similar purposes, albeit without the placement specificity. Since the =: operator which looks for a search string at the start of a larger is not something that works in SQL (data step is more than fine), you cannot do something like this:

proc print data = sashelp.vmember noobs;
    where lowcase(libname) = 'work'
    and lowcase(memname) =: "r_";
run;

While the contains operator works similarly to the count function when it comes to search text positioning, yet another option is the like operator, and that is shown in the example below:

proc sql;
    select distinct memname into :dsns separated by '#'
        from dictionary.tables
            where lowcase(libname) = 'work'
            and lowcase(memname) like 'r\_%' escape '\'
            and lowcase(memname) not like '%visit%';
quit;

Here, % and _ are placeholder characters, with the first matching zero or more characters and the second matching one character. Thus, the underscore in r_ needs escaping to look for that pattern (otherwise, it will look for the letter r at the start of a string and a single character after it) and a backslash character (\) will cover that duty. To ensure that it does what you want, adding escape '\' after the expression tells SAS what is happening.

Another thing to watch is that the percent (%) character needs a form escaping from the SAS Macro language processor, and placing the search term in single quotes attends to that. That means that %visit% does not cause any errors when you are looking for visit within a dataset name and using negation (the not operator) to exclude that possibility from the search results. However, using _%visit% might be a better pattern for finding visit at the end of a name, though.

Should you wish to play around with the above to see what happens for your own learning, try using something like this to give you a few test datasets:

data r_test r_visit visit;
    set sashelp.class;
run;

Otherwise, feel free to add your own test cases to cement the ideas even further. All too often, we look up something, deploy it and then forget about, especially when AI is involved. Nevertheless, the fastest way to write code can be to use what is embedded in your memory.

A primer on Regular Expressions in PowerShell: Learning through SAS code analysis

5^th November 2025

Many already realise that regular expressions are a powerful tool for finding patterns in text, though they can be complex in their own way. Thus, this primer uses a real-world example (scanning SAS code) to introduce the fundamental concepts you need to work with regex in PowerShell. Since the focus here is on understanding how patterns are built and applied, you do not need to know SAS to follow along.

Getting Started with Pattern Matching

Usefully, PowerShell gives you several ways to test whether text matches a pattern, and the simplest is the -match operator:

$text = "proc sort data=mydata;"
$text -match "proc sort"  # Returns True

When -match finds a match, it returns True. PowerShell's -match operator is case-insensitive by default, so "PROC SORT" would also match. If you need case-sensitive matching, use -cmatch instead.

For more detailed results, Select-String is useful:

$text | Select-String -Pattern "proc sort"

This shows you where the match occurred and provides context around it.

Building Your First Patterns

Literal Matches

The simplest patterns match exactly what you type. If you want to find the text proc datasets, you write:

$pattern = "proc datasets"

This will match those exact words in sequence.

Special Characters and Escaping

Some characters have special meaning in regex. The dot (.) matches any single character, so if you wish to match a literal dot, you need to escape it with a backslash:

$pattern = "first\."  # Matches "first." literally

Without the backslash, first. would match "first" followed by any character (for example, "firstly", "first!", "first2").

Alternation

The pipe symbol (|) lets you match one thing or another:

$pattern = "(delete;$|output;$)"

This matches either delete; or output; at the end of a line. The dollar sign ($) is an anchor that means "end of line".

Anchors: Controlling Where Matches Occur

Anchors do not match characters. Instead, they specify positions in the text.

^ matches the start of a line
$ matches the end of a line
\b matches a word boundary (the position between a word character and a non-word character)

Here is a pattern that finds proc sort only at the start of a line:

$pattern = "^\s*proc\s+sort"

Breaking this down:

^ anchors to the start of the line
\s* matches zero or more whitespace characters
proc matches the literal text
\s+ matches one or more whitespace characters
sort matches the literal text

The \s is a shorthand for any whitespace character (spaces, tabs, newlines). The * means "zero or more" and + means "one or more".

Quantifiers: How Many Times Something Appears

Quantifiers control repetition:

* means zero or more
+ means one or more
? means zero or one (optional)
{n} means exactly n times
{n,} means n or more times
{n,m} means between n and m times

By default, quantifiers are greedy (they match as much as possible). Adding ? after a quantifier makes it reluctant (it matches as little as possible):

$pattern = "%(m|v).*?(?=[,(;\s])"

Here .*? matches any characters, but as few as possible before the next part of the pattern matches. This is useful when you want to stop at the first occurrence of something rather than continuing to the last.

Groups and Capturing

Parentheses create groups, which serve two purposes: they let you apply quantifiers to multiple characters, and they capture the matched text for later use:

$pattern = "(first\.|last\.)"

This creates a group that matches either first. or last.. The group captures whichever one matches so you can extract it later.

Non-capturing groups, written as (?:...), group without capturing. This is useful when you need grouping for structure but do not need to extract the matched text:

$pattern = "(?:raw\.|sdtm\.|adam\.)"

Lookaheads: Matching Without Consuming

Lookaheads are powerful but can be confusing at first. They check what comes next without including it in the match.

A positive lookahead (?=...) succeeds if the pattern inside matches what comes next:

$pattern = "%(m|v).*?(?=[,(;\s])"

This matches %m or %v followed by any characters, stopping just before a comma, parenthesis, semicolon or whitespace. The delimiter is not included in the match.

A negative lookahead (?!...) succeeds if the pattern inside does not match what comes next:

$pattern = "(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)"

This is more complex. It matches library prefixes like raw., sdtm., adam. or r_, but only if:

The text at the current position is not r_options
The matched prefix is not followed by options

This prevents false matches on text like r_options whilst still allowing r_something_else.

Inline Modifiers

You can change how patterns behave by placing modifiers at the start:

(?i) makes the pattern case-insensitive
(?-i) makes the pattern case-sensitive

$pattern = "(?i)^\s*proc\s+sort"

This matches proc sort, PROC SORT, Proc Sort or any other case variation.

A Complete Example: Finding PROC SORT Without DATA=

Let us build up a practical pattern step by step. The goal is to find PROC SORT statements that are missing a DATA= option.

Start with the basic match:

$pattern = "proc sort"

Add case-insensitivity and line-start flexibility:

$pattern = "(?i)^\s*proc\s+sort"

Add a word boundary to avoid matching proc sorting:

$pattern = "(?i)^\s*proc\s+sort\b"

Now add a negative lookahead to exclude lines that contain data=:

$pattern = "(?i)^\s*proc\s+sort\b(?![^;\n]*\bdata\s*=\s*)"

This negative lookahead checks the rest of the line (up to a semicolon or newline) and fails if it finds data followed by optional spaces, an equals sign and more optional spaces.

Finally, match the rest of the line:

$pattern = "(?i)^\s*proc\s+sort\b(?![^;\n]*\bdata\s*=\s*).*"

Working with Multiple Patterns

Real-world scanning often involves checking many patterns. PowerShell arrays make this straightforward:

$matchStrings = @(
    "%(m|v).*?(?=[,(;\s])",
    "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])",
    "(first\.|last\.)",
    "proc datasets"
)

$text = "Use %mvar in raw.dataset with first.flag"

foreach ($pattern in $matchStrings) {
    if ($text -match $pattern) {
        Write-Host "Match found: $pattern"
    }
}

Finding All Matches in a String

The -match operator only tells you whether a pattern matches. To find all occurrences, use [regex]::Matches:

$text = "first.x and last.y and first.z"
$pattern = "(first\.|last\.)"
$matches = [regex]::Matches($text, $pattern)

foreach ($match in $matches) {
    Write-Host "Found: $($match.Value) at position $($match.Index)"
}

This returns a collection of match objects, each containing details about what was found and where.

Replacing Text

The -replace operator applies a pattern and substitutes matching text:

$text = "proc datasets; run;"
$text -replace "proc datasets", "proc sql"
# Result: "proc sql; run;"

You can use captured groups in the replacement:

$text = "raw.demographics"
$text -replace "(raw\.|sdtm\.|adam\.)", "lib."
# Result: "lib.demographics"

Validating Patterns Before Use

Before running patterns against large files, validate that they are correct:

$matchStrings = @(
    "%(m|v).*?(?=[,(;\s])",
    "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"
)

foreach ($pattern in $matchStrings) {
    try {
        [regex]::new($pattern) | Out-Null
        Write-Host "Valid: $pattern"
    }
    catch {
        Write-Host "Invalid: $pattern - $($_.Exception.Message)"
    }
}

This catches malformed patterns (for example, unmatched parentheses or invalid syntax) before they cause problems in your scanning code.

Scanning Files Line by Line

A typical workflow reads a file and checks each line against your patterns:

$matchStrings = @(
    "proc datasets",
    "(first\.|last\.)",
    "%(m|v).*?(?=[,(;\s])"
)

$code = Get-Content "script.sas"

foreach ($line in $code) {
    foreach ($pattern in $matchStrings) {
        if ($line -match $pattern) {
            Write-Warning "Pattern '$pattern' found in: $line"
        }
    }
}

Counting Pattern Occurrences

To understand which patterns appear most often:

$results = @{}

foreach ($pattern in $matchStrings) {
    $count = ($code | Select-String -Pattern $pattern).Count
    $results[$pattern] = $count
}

$results | Format-Table

This builds a table showing how many times each pattern matched across the entire file.

Practical Tips

Start simple. Build patterns incrementally. Test each addition to ensure it behaves as expected.

Use verbose mode for complex patterns. PowerShell supports (?x) which allows whitespace and comments inside patterns:

$pattern = @"
(?x)        # Enable verbose mode
^           # Start of line
\s*         # Optional whitespace
proc        # Literal "proc"
\s+         # Required whitespace
sort        # Literal "sort"
\b          # Word boundary
"@

Test against known examples. Create a small set of test strings that should and should not match:

$shouldMatch = @("proc sort;", "  PROC SORT data=x;")
$shouldNotMatch = @("proc sorting", "# proc sort")

foreach ($test in $shouldMatch) {
    if ($test -notmatch $pattern) {
        Write-Warning "Failed to match: $test"
    }
}

Document your patterns. Regular expressions can be cryptic. Add comments explaining what each pattern does and why it exists:

# Match macro variables starting with %m or %v, stopping at delimiters
$pattern1 = "%(m|v).*?(?=[,(;\s])"

# Match library prefixes (raw., sdtm., adam.) before delimiters
$pattern2 = "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"

Two Approaches to the Same Problem

The document you started with presents two arrays of patterns. One is extended with negative lookaheads to handle ambiguous cases. The other is simplified for cleaner codebases. Understanding why both exist teaches an important lesson: regex is not one-size-fits-all.

The extended version handles edge cases:

$pattern = "(?i)(?!(?:r_options))\b(raw\.|sdtm\.|adam\.|r_)(?!options)\w*?(?=[,(;\s])"

The simplified version assumes those edge cases do not occur:

$pattern = "(raw\.|sdtm\.|adam\.).*?(?=[,(;\s])"

Choose the approach that matches your data. If you know your text follows strict conventions, simpler patterns are easier to maintain. If you face ambiguity, add the precision you need (but no more).

Moving Forward

This primer has introduced the building blocks: literals, special characters, anchors, quantifiers, groups, lookaheads and modifiers. You have seen how to apply patterns in PowerShell using -match, Select-String and [regex]::Matches. You have also learnt to validate patterns, scan files and count occurrences.

The best way to learn is to experiment. Take a simple text file and try to match patterns in it. Build patterns incrementally. When something does not work as expected, break the pattern down into smaller pieces and test each part separately.

Regular expressions are not intuitive at first, but they become clear with practice. The examples here are drawn from SAS code analysis, yet the techniques apply broadly. Whether you are scanning logs, parsing configuration files or extracting data from reports, the principles remain the same: understand what you want to match, build the pattern step by step, test thoroughly and document your work.