Technology Tales

Notes drawn from experiences in consumer and enterprise technology

TOPIC: JULIA

Why SAS, R and Python can report different percentiles for the same data

2nd February 2026

Quantiles look straightforward on the surface. Ask for the median, the 75th percentile or the 95th percentile and most people expect one clear answer. Yet small differences between software packages often reveal that quantiles are not defined in only one way. When the same data are analysed in SAS, R or Python, the reported percentile can differ, particularly for small samples or for data sets with large gaps between adjacent values.

That difference is not necessarily a bug, and it is not a sign that one platform is wrong. It reflects the fact that sample quantiles are estimates of population quantiles, and statisticians have proposed several valid ways to construct those estimates. For everyday work with large samples, the distinction often fades into the background because the values tend to be close. For smaller samples, the choice of definition can matter enough to alter a reported result, a chart or a downstream calculation.

The Problem With the Empirical CDF

A useful starting point is understanding why multiple definitions exist at all. A sample quantile is an estimate of an unknown population quantile. Many approaches base that estimate on the empirical cumulative distribution function (ECDF), which approximates the cumulative distribution function (CDF) for the population. As Rick Wicklin explains in his 22nd May 2017 article on The DO Loop, the ECDF is a step function with a jump discontinuity at each unique data value. For that reason, the inverse ECDF does not exist and quantiles are not uniquely defined, which is precisely why different conventions have developed.

In high school, most people learn that when a sorted sample has an even number of observations, the median is the average of the two middle values. The default quantile definition in SAS extends that familiar rule to other quantiles. If the sample size is N and the q-th quantile is requested, then when Nq is an integer j, the result is the average of the two adjacent data values x[j] and x[j+1]. When Nq is not an integer, the result is the single data value x[j+1], where j = floor(Nq). Averaging at the integer points and rounding up in between are not the only choices available, and that is where the definitions diverge.
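
To make this concrete, here is a small Julia sketch of that rule, which is Type 2 in the taxonomy discussed below; the function and its name are my own illustration rather than code from Wicklin's articles:

function quantile_type2(x::AbstractVector, p::Real)
    # ECDF with averaging: Hyndman and Fan Type 2, QNTLDEF=5 in SAS
    s = sort(x)
    n = length(s)
    np = n * p
    j = floor(Int, np)
    if np == j                  # N*p is an integer: average the adjacent values
        j <= 0 && return s[1]
        j >= n && return s[n]
        return (s[j] + s[j+1]) / 2
    else                        # N*p is fractional: round up
        return s[j+1]
    end
end

quantile_type2(1:10, 0.5)   # 5.5, the familiar even-sample median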

The Hyndman and Fan Taxonomy

According to Hyndman and Fan ("Sample Quantiles in Statistical Packages," TAS, 1996), there are nine definitions of sample quantiles that commonly appear in statistical software packages. Three of those definitions are based on rounding and six are based on linear interpolation. All nine result in valid estimates.

As Wicklin describes in his 24th May 2017 article comparing all nine definitions, the nine methods share a common general structure. For a sample of N sorted observations and a target probability p, the estimate uses two adjacent data values x[j] and x[j+1]. A fractional quantity determines an interpolation parameter λ, and each definition has a parameter m that governs how interpolation between adjacent data points is handled. In general terms, the estimate takes the form q = (1 − λ)x[j] + λx[j+1], where λ and j depend on the values of p, N and the method-specific parameter m. The practical consideration at the extremes is that when p is very small or very close to 1, most definitions fall back to returning x[1] or x[N] respectively.

Default Methods Across Platforms

It is a misnomer to refer to one approach as "the SAS method" and another as "the R method." As Wicklin notes in his 26th July 2021 article comparing SAS, R and Python defaults, SAS supports five different quantile definitions through the PCTLDEF= option in PROC UNIVARIATE or the QNTLDEF= option in other procedures, and all nine can be computed via SAS/IML. R likewise supports all nine through the type parameter in its quantile function. The confusion arises not from limited capability, but from the defaults that most users accept without much thought.

By default, SAS uses Hyndman and Fan's Type 2 method (QNTLDEF=5 in SAS procedure syntax). R uses Type 7 by default, and that same Type 7 method is also the default in Julia and in the Python packages SciPy and NumPy. A comparison between SAS and Python therefore often becomes the same comparison as between SAS and R.

A Worked Example

The contrast between Type 2 and Type 7 is especially clear on a small data set. Wicklin uses the sample {0, 1, 1, 1, 2, 2, 2, 4, 5, 8} throughout both his 2017 and 2021 articles: ten observations, six unique values, and a particularly large gap between the two highest values, 5 and 8. That gap is deliberately chosen because the differences between quantile definitions are most visible when the sample is small and when adjacent ordered values are far apart.

The Type 2 method (SAS default) uses the ECDF to estimate population quantiles, so a quantile is always an observed data value or the average of two adjacent data values. The Type 7 method (R default) uses a piecewise-linear estimate of the CDF. Because the inverse of that piecewise-linear estimate is continuous, a small change in the probability level produces a small change in the estimated quantile, a property that is absent from the ECDF-based methods.

Where the Methods Agree and Where They Part Company

For the 0.5 quantile (the median), both methods return 2. A horizontal line at 0.5 crosses both CDF estimates at the same point, so there is no disagreement. This is one reason the issue can be easy to miss: some commonly reported percentiles coincide across definitions.

The 0.75 quantile tells a different story. Under Type 2, a horizontal line at 0.75 crosses the empirical CDF at 4, which is a data value. Under Type 7, the estimate is 3.5, which is neither a data value nor the average of adjacent values; it emerges from the piecewise-linear interpolation rule. The 0.95 quantile shows the sharpest divergence: Type 2 returns 8 (the maximum data value), while Type 7 returns 6.65, a value between the two largest observations.
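
Those values are easy to verify in Julia, whose Statistics library uses Type 7 by default; the comparison below also relies on the quantile_type2 sketch given earlier, so it is an illustration rather than anything from the articles:

using Statistics

x = [0, 1, 1, 1, 2, 2, 2, 4, 5, 8]
for p in (0.5, 0.75, 0.95)
    t2 = quantile_type2(x, p)
    t7 = round(quantile(x, p); digits=2)   # rounded to hide floating-point noise
    println(p, ": Type 2 = ", t2, ", Type 7 = ", t7)
end
# 0.5: Type 2 = 2.0, Type 7 = 2.0
# 0.75: Type 2 = 4, Type 7 = 3.5
# 0.95: Type 2 = 8, Type 7 = 6.65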

Those differences are not errors. They are consequences of the assumptions built into each estimator. The default in SAS always returns a data value or the average of adjacent data values, whereas the default in R can return any value in the range of the data.

The Five Definitions Available in SAS Procedures

For users who stay within base SAS procedures, that same 22nd May 2017 article sets out the five available definitions clearly. QNTLDEF=1 and QNTLDEF=4 are piecewise-linear interpolation methods, whilst QNTLDEF=2, QNTLDEF=3 and QNTLDEF=5 are discrete rounding methods. The default is QNTLDEF=5. For the discrete definitions, SAS returns either a data value or the average of adjacent data values; the interpolation methods can return any value between observed data values.

The differences between the definitions are most apparent when there are large gaps between adjacent data values. Using the same ten-point data set, for the 0.45 quantile, different definitions return 1, 1.5, 1.95 or 2. For the 0.901 quantile, the round-down method (QNTLDEF=2) gives 5, the round-up method (QNTLDEF=3) gives 8, the backward interpolation method (QNTLDEF=1) gives 5.03 and the forward interpolation method (QNTLDEF=4) gives 7.733. These are not trivial discrepancies on a small sample.
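
Outside SAS, the two interpolated figures can be checked with Julia's Statistics.quantile, whose alpha and beta keywords select the continuous Hyndman and Fan types. Matching QNTLDEF=1 to Type 4 and QNTLDEF=4 to Type 6 is my own reading of the articles, though the arithmetic agrees:

using Statistics

x = [0, 1, 1, 1, 2, 2, 2, 4, 5, 8]
quantile(x, 0.901; alpha=0, beta=1)   # ≈ 5.03: Type 4, matching QNTLDEF=1
quantile(x, 0.901; alpha=0, beta=0)   # ≈ 7.733: Type 6, matching QNTLDEF=4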

The Four Remaining Definitions and the General Formula

The 24th May 2017 comparison article goes further, showing how SAS/IML can be used to compute the four Hyndman and Fan definitions that are not natively supported in SAS procedures. Each of the nine methods is an instance of the same general formula involving the parameter m. The four non-native methods each require their own specific value (or expression) for m, plus a small boundary value c that governs the behaviour at the extreme ends of the probability scale.

Wicklin also overlays the default methods for SAS (Type 2) and R (Type 7) graphically on the ten-point data set, showing that the SAS default produces a discrete step pattern whilst the R default traces a smoother piecewise-linear curve. He then repeats the comparison on a sample of 100 observations from a uniform distribution and finds that the two methods are almost indistinguishable at that scale, illustrating why many analysts work comfortably with defaults most of the time.

A SAS/IML Function to Match R's Default

For analysts who need cross-platform consistency, that same 26th July 2021 article provides a simplified SAS/IML function that reproduces the Type 7 default from R, Julia, SciPy and NumPy. The function converts the input to a column vector, handles missing values and the degenerate case of a single observation, then sorts the data and applies the Type 7 rule. The index into the sorted data would be j = floor(N*p + m) with m = 1 − p, the interpolation fraction is g = N*p + m − j, and the estimate is (1 − g)x[j] + gx[j+1] for all p < 1, with x[N] returned when p = 1. This gives SAS users a practical route to reproduce the default quantiles from other platforms without switching software.
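
The same rule is straightforward to transcribe into Julia; this rendering of the formula is mine, and its output agrees with the default of Statistics.quantile:

function quantile_type7(x::AbstractVector, p::Real)
    # Type 7: m = 1 - p, so the index is h = N*p + 1 - p = (N - 1)*p + 1
    s = sort(x)
    n = length(s)
    p >= 1 && return s[n]   # boundary case at the top of the probability scale
    h = n * p + (1 - p)
    j = floor(Int, h)
    g = h - j               # interpolation fraction between s[j] and s[j+1]
    return (1 - g) * s[j] + g * s[j+1]
end

quantile_type7([0, 1, 1, 1, 2, 2, 2, 4, 5, 8], 0.95)   # ≈ 6.65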

If SAS/IML is unavailable, Wicklin suggests using PCTLDEF=1 in PROC UNIVARIATE (or QNTLDEF=1 in PROC MEANS) as the next best option. This produces the Type 4 method, which is not the same as Type 7 but does use interpolation rather than a purely discrete rule, so it avoids the jumpy behaviour of the ECDF-based defaults.

A Wider Point About Conventions in Statistical Software

The comments on the 2021 article make clear that quantiles are not an isolated example. Conventions differ across platforms in ARIMA sign conventions, whether likelihood constants are included in reported values, the definition of the multivariate autocovariance function and the sign convention and constant term used in discrete Fourier transforms. Quantiles are simply a particularly visible instance of a broader pattern where results can differ even when each platform is behaving correctly.

One question from the same comment thread is also worth noting: SQL's percent_rank formula, defined as (rank − 1) / (total_rows − 1), does not estimate a quantile. As Wicklin clarifies in his reply, it estimates the empirical distribution function for observed data values. Both concepts involve percentiles and rankings, but they address different problems. One maps values to cumulative proportions; the other maps cumulative probabilities to estimated values.

Does the Definition of a Sample Quantile Actually Matter?

The answer from all three articles is balanced. Yes, it matters in principle, and it is noticeably important for small samples, in extreme tails and wherever there are wide gaps in the ordered data. No, it often matters very little for larger samples (say, 100 or more observations), where the nine methods tend to produce results that are nearly indistinguishable. Wicklin's 100-observation comparison showed that the Type 2 and Type 7 estimates were so close that one set of points sat almost directly on top of the other.

That is why, as Wicklin notes, most analysts simply accept the default method of whichever software they are using. Even so, there are contexts where the definition should be stated explicitly. Regulatory work, reproducible research, published analyses and any cross-software validation all benefit from naming the method in use. Without that detail, two analysts can work correctly with the same data and still arrive at different percentile values.

Matching Quantile Definitions Across SAS, R and Python

The practical conclusion is clear. SAS defaults to Hyndman and Fan Type 2 (QNTLDEF=5), while R, Julia, SciPy and NumPy default to Type 7. SAS procedures natively support five of the nine definitions, and SAS/IML can be used to compute all nine, including a simplified function for the R default. For large data sets, the differences are typically negligible. For small data sets, particularly those with unevenly spaced observations, they can be large enough to change the story the numbers appear to tell. The solution is not to favour any particular platform, but to be explicit about the method wherever precision matters.

A more elegant way to read and combine data from multiple CSV files in Julia

24th October 2025

When I was compiling financial information for my accountant recently, I needed to read in a number of CSV files and combine their contents to enable further processing. This was all in a Julia script, and there was a time when I would use an explicit loop to do this combination. However, I came across a better way to accomplish this that I am sharing here with you now. First, you need to define a list of files like this:

files = ["5-2024.csv", "6-2024.csv", "7-2024.csv", "9-2024.csv", "10-2024.csv", "11-2024.csv", "12-2024.csv", "1-2025.csv", "2-2025.csv", "3-2025.csv", "4-2025.csv"]

While there are alternatives to the above, including globbing (using wildcards with a Julia package that supports them), I decided to keep things simple for myself. Now we come to the line that does all the heavy lifting:

df = vcat([CSV.read(dir * file, DataFrame, normalizenames = true, header = 5, skipto = 6; silencewarnings=true) for file in files]...)

Near the end, there is the list comprehension ([… for file in files]) that avoids the need for the explicit loop that I have used a few times in the past. This loops through each file in the list defined at the top, reading it from the folder held in the dir variable into a data frame, as specified by the DataFrame argument. The normalizenames option replaces spaces with underscores and cleans up any invalid characters. The header and skipto options tell Julia where to find the column headings and where to start reading the data, respectively. Then, the silencewarnings option suppresses any warnings about missing columns or inconsistent rows; clearly, a check on the data frame is needed to ensure that all is in order if you wish to go the same route as I did.

The splat (...) operator takes the resulting list of data frames and converts them into individual arguments passed to the vcat function, which vertically concatenates them to create the df data frame. Just like suppressing warnings about missing columns or inconsistent rows at read time, this involves trusting that the input files are all structured alike. Naturally, you need to do your own checks to ensure that is the case, as it was for me with what I had to do.
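
As an aside, the same combination can be achieved without splatting a long argument list by handing the list of data frames to reduce. This sketch assumes the same files list, dir variable and read options as above, with joinpath swapped in for string concatenation:

using CSV, DataFrames

# reduce applies vcat pairwise across the list of data frames
df = reduce(vcat, [CSV.read(joinpath(dir, file), DataFrame; normalizenames=true, header=5, skipto=6, silencewarnings=true) for file in files])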

Resolving Python UnicodeEncodeError messages issued while executing scripts using the Windows Command Line

14th March 2025

Recently, I got caught out by this message when summarising some text using Python and OpenAI's API while working within VS Code:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 56-57: character maps to <undefined>

There was no problem on Linux or macOS, but it was triggered on the Windows command line from within VS Code. Unlike the Julia or R REPLs, everything in Python gets executed from the console like this:

& "C:/Program Files/Python313/python.exe" script.py

The Windows command line shell operates with cp1252 character encoding, and that was tripping up code like the following:

with open("out.txt", "w") as file:
    file.write(new_text)

The cure was to specify the encoding of the output text as utf-8:

with open("out.txt", "w", encoding='utf-8') as file:
    file.write(new_text)

After that, all was well, and text was written to a file just as on the other operating systems. One other thing to note is that the use of backslashes in file paths is another gotcha. Adding an r before the quotes creates a raw string, which stops the backslashes being treated as escape sequences, much like doubling them up does. Using forward slashes is another option.

with open(r"c:\temp\out.txt", "w", encoding='utf-8') as file:
    file.write(new_text)

Finding human balance in an age of AI code generation

12th March 2025

Recently, I was asked how I felt about AI. Given that the other person was not an enthusiast, I picked on something that happened to me not so long ago. It involved both Perplexity and Google Gemini when I was trying to debug something: both produced too much code. The experience almost inspired a LinkedIn post, only for some of the thinking to go online here instead for now. Even so, a spot of brainstorming using an LLM sounds like a useful exercise.

Going back to the original question, it came up during a meeting about potential freelance work. Thus, I tapped into experiences with code generators over several decades. The first involved a metadata-driven tool that I developed; users reported that there was too much imperfect code to debug, with the added complexity that dealing with clinical study data brings. That challenge resurfaced with another bespoke tool that someone else developed, and I opted to make things simpler: produce some boilerplate code and let users take things from there. Later, someone else decided to have another go, seemingly with more success.

It is even more challenging when you are insufficiently familiar with the code that is being produced. That happened to me with shell scripting code from Google Gemini that was peppered with some Awk code. There was no alternative but to learn a bit more about the language from Tutorials Point and seek out an online book elsewhere. That did get me up to speed, and I will return to these when I am in need again.

Then, there was the time when I was trying to get a Julia script to deal with Google Drive needing permissions to be set. This set Google Gemini off adding more and more error-checking code with try/catch blocks. Since I did not have the issue at that point, I opted to halt and wait for its recurrence. When it did recur, I went for a simpler approach, especially with the gdrive CLI tool starting up a web server to complete the reactivation process. While there are times when shell scripting suits such things better than Julia, I added extra robustness and user-friendliness anyway.

During that second task, I was using VS Code with the GitHub Copilot plugin. There is a need to be careful, yet that can save time when it adds suggestions for you to include or reject. The latter may apply when it adds conditional logic that needs more checking, while simple code outputting useful text to the console can be approved. While that certainly is how I approach things for now, it brings up an increasingly relevant question for me.

How do we deal with all this code production? In an environment with myriads of unit tests and a great deal of automation, there may be more capacity for handling the output than mere human inspection and review, which can overwhelm the limitations of a human context window. A quick search revealed that there are automated tools for just this purpose, possibly with their own learning curves; otherwise, manual working could be a better option in some cases.

After all, we need to do our own thinking too. That was brought home to me during the Julia script editing. To come up with a solution, I had to step away from LLM output and think creatively about something simpler. There was a tension between the two needs during the exercise, which highlighted how important it is not to be distracted by all the new technology. Being an introvert, I need that solo space, yet now I have to step away from technology to get it, when technology itself was the refuge in the first place.

Anyone with a programming hobby has to limit all this input to avoid being overwhelmed; learning a programming language could involve stripping AI extensions out of a code editor, for instance. LLM output has its place, yet it has to stay at a human scale too. That perhaps is the genius of a chat interface, and now we have agentic AI as well. It is as if the technology curve never slackens, at least not until the current boom ends, possibly when things break because they go too far beyond us. All this acceleration is fine until we need to catch up with what is happening.

Clearing the Julia REPL during code development

23rd September 2024

During development, there are times when you need to clear the Julia REPL. It can become so laden with content that it becomes hard to perform debugging of your code. One way to accomplish this is issuing the CTRL + L keyboard shortcut while focus is within the REPL; you need to click on it first. Another is to issue the following in the REPL itself:

print("\033c")

Here \033 is the escape character written in octal format; it often begins terminal control sequences. The c character that follows tells the terminal to reset to its initial state. Printing this sequence is what does the clearance, and variations can be used to clear other kinds of console screens too, which makes it a more generic solution.

Dropping to an underlying shell using the ; character is another possibility. Then, you can use the clear or cls commands as needed; the latter is for Windows systems.

One last option is to define a Julia function for doing this:

function clear_console()
    run(`clear`)  # or `cls` for Windows
end

Calling the clear_console function then clears the screen programmatically, allowing for greater automation. The run function is the one that sends the command in backticks to the underlying shell for execution. Even using that run call on its own works too.
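
One caveat worth adding: on Windows, cls is a built-in of the cmd shell rather than a standalone program, so it needs to be invoked through cmd. A cross-platform variant could look like this; it is my adaptation rather than part of the original note:

function clear_console()
    if Sys.iswindows()
        run(`cmd /c cls`)   # cls is a cmd built-in, so go through cmd
    else
        run(`clear`)        # works in typical Unix terminals
    end
end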

Avoiding errors caused by missing Julia packages when running code on different computers

15th September 2024

As part of an ongoing move to multi-location working, I am sharing scripts and other artefacts via GitHub. This includes Julia programs that I have. That has led me to realise that a bit of added automation would help iron out any package dependencies that arise. Setting up things as projects could help, yet that feels a little too much effort for what I have. Thus, I have gone for adding extra code to check on and install any missing packages instead of having failures.

For adding those extra packages, I instate the Pkg package as follows:

import Pkg

While it is a bit hackish, I then declare a single array that lists the packages to be checked:

pkglits = ["HTTP", "JSON3", "DataFrames", "Dates", "XLSX"]

After that, there is a function that uses a try/catch construct to find whether a package exists or not, using the built-in @eval macro to attempt a using declaration:

tryusing(pkgsym) = try
    @eval using $pkgsym
    return true
catch e
    return false
end

The above function is called in a loop that both tests the existence of a package and, if missing, installs it:

for i in 1:length(pkglits)
    rslt = tryusing(Symbol(pkglits[i]))
    if rslt == false
        Pkg.add(pkglits[i])
    end
end

Once that has completed, the following line instates the packages required by later processing without error, which is what I sought:

using HTTP, JSON3, DataFrames, Dates, XLSX
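
As a lighter alternative, the presence of a package can be tested without loading it by using Base.find_package, which returns nothing when a package cannot be found in the current environment. This is my own variation on the approach rather than what the script above does:

import Pkg

for name in ["HTTP", "JSON3", "DataFrames", "Dates", "XLSX"]
    if Base.find_package(name) === nothing   # not installed in this environment
        Pkg.add(name)
    end
end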

A way to survey hours of daylight for locations of interest

9th September 2024

A few years back, I needed to get sunrise and sunset information for a location in Ireland. This was to help me plan visits to a rural location with a bus service going nearby, and I did not want to be waiting on the side of the road in the dark on my return journey. It ended up being a project that I undertook using the Julia programming language.

This had other uses too: one was the planning of trips to North America. This was how I learned that evenings in San Francisco were not as long as their counterparts in Ireland. Later, it had its uses in assessing the feasibility of seeing other parts of the Pacific Northwest during the month of August. Other matters meant that such designs never came to anything.

The Sunrise Sunset API was used to get the times for the start and end of daylight. That meant looping through the days of the year to get the information, but I needed to get the latitude and longitude information from elsewhere to fuel that process. While Google Maps has its uses for this, it is a manual and rather fiddly process. Sparing use of the Nominatim API from OpenStreetMap is what helped to increase the amount of automation and user-friendliness.

Accessing the APIs using Julia's HTTP package got me the data in JSON format, which I then converted into atomic vectors and tabular data. The end product is an Excel spreadsheet with all the times in UTC. A next step would be to use the solar noon information to shift things to the correct time zone. That can be done manually in Excel and its kind, but some more automation would make things smoother.
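
To give a flavour of the approach, here is a minimal sketch of a single request; the endpoint and parameters follow the public sunrise-sunset.org documentation, and the coordinates are illustrative rather than those from my project:

using HTTP, JSON3

lat, lng = 53.3498, -6.2603   # example coordinates for Dublin
url = "https://api.sunrise-sunset.org/json?lat=$lat&lng=$lng&date=2024-06-21&formatted=0"
resp = HTTP.get(url)
data = JSON3.read(String(resp.body))
println(data.results.sunrise, " to ", data.results.sunset)   # ISO 8601 times in UTC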

Upgrading Julia packages

23rd January 2024

Whenever a new version of Julia is released, I have certain actions to perform. With Julia 1.10, installing and updating it has become more automated thanks to shell scripting or the use of WINGET, depending on your operating system. Because my environment predates this, I found that the manual route still works best for me, and I will continue to do that.

Returning to what needs doing after an update, this includes updating Julia packages. In the REPL, that involves dropping to the Pkg subshell using the ] key if you want to avoid typing longer commands or filling your history with what matters less for everyday usage.

Previously, I often ran code only to find that a package was missing after updating Julia, so the add command was needed to reinstate it. That may raise its head again, but there is also the up command for upgrading all installed packages. This can be a time saver, since only a single command is needed for all packages rather than one command per package, as otherwise might be the case.
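
For those who prefer to script these actions, the same operations are available through the Pkg API without entering the subshell; a small sketch follows:

using Pkg

Pkg.add("DataFrames")   # reinstate a package found to be missing
Pkg.update()            # upgrade every installed package, like up in the subshell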

A look at the Julia programming language

19th November 2022

Several open-source computing languages get mentioned when talking about working with data. Among these are R and Python, but there are others; Julia is another one of these. It took a while before I got to check out Julia because I felt the need to get acquainted with R and Python beforehand. There are others like Lua to investigate too, but that can wait for now.

With the way that R is making incursions into clinical data reporting and analysis after decades when SAS was predominant, my explorations of Julia are inspired by a certain contrariness on my part. Alongside some small personal projects, there has been some reading in (digital) book form and online. Concerning the latter, there are useful tutorials like Introduction to Data Science: Learn Julia Programming, Maths & Data Science from Scratch or Julia Programming: a Hands-on Tutorial. As happens with R, there are online versions of published books available free of charge, including Interactive Visualization and Plotting with Julia. Video learning can help too, and Jane Herriman has recorded and shared useful beginner's guides on YouTube that start with the basics before heading on to more advanced subjects like multiple dispatch, broadcasting and metaprogramming.

This learning has been built on simple self-devised puzzles before moving on to anything more complex. That differs from my dalliance with R and Python, where I ventured into complexity first, not least because of testing them out on public COVID data. Eventually, I got around to doing that with Julia too, though my interest was beginning to wane by then, and Julia's abilities for creating multipage PDF files were limited enough that the PDF Toolkit was needed to help with this. Along the way, I have made use of such packages as CSV.jl, DataFrames.jl, DataFramesMeta, Plots, Gadfly.jl, XLSX.jl and JSON3.jl, among others. After that, there is PrettyTables.jl to try out, and anyone can look at the Beautiful Makie website to see what Makie can do. There are plenty of other packages for creating graphs, such as SpatialGraphs.jl, PGFPlotsX and GRUtils.jl. For formatting numbers, options include Format.jl and Humanize.jl.

So far, my primary usage has been with personal financial data together with automated processing and backup of photo files. The photo file processing has taken advantage of the ability to compile Julia scripts for added speed because just-in-time compilation always means there is a lag before the real work begins.

VS Code is my chosen editor for working with Julia scripts, since it has a plugin for the language. That adds the REPL, syntax highlighting, execution and data frame viewing capabilities that once were added to the now defunct Atom editor by its own plugin. While it would be nice to have a keyboard shortcut for script execution, the whole thing works well and is regularly updated.

Naturally, there have been a load of queries as I have gone along, and the Julia Documentation has been consulted, as well as Julia Discourse and Stack Overflow. The latter pair have become regular landing spots on many a Google search. One example followed a glitch that I encountered after a Julia upgrade: I asked a question about it and was directed to the XLSX.jl Migration Guides, where I got the information that I needed to fix my code so that it would run properly.

There is more learning to do as I continue to use Julia for various things. Once compiled, it does run fast, as promised. The syntax paradigm is akin to R and Python, but there are Julia-specific features too. If you have used the others, the learning curve is lessened, though not eliminated completely. This is not an object-oriented language as such, but its functional nature makes it familiar enough to get going with. In short, the project has come a long way since it started more than ten years ago. There is much here for the scientific programmer, but only time will tell whether it can usurp its older competitors. For now, I will remain interested in it.

Removing a Julia package using REPL or script commands

5th October 2022

While I have been programming with SAS for a few decades, and it remains a linchpin in the world of clinical development in the pharmaceutical industry, other technologies like R and Python are gaining a foothold. Two years ago, I started to look at those languages with personal projects being a great way of facilitating this. In addition, I got to hear of Julia and got to try that too. That journey continues since I have put it into use for importing and backing up photos, and there are other possible uses too.

Recently, I updated Julia to version 1.8.2 but ran into a problem with the DataArrays package that I had installed, so I decided to remove it, since it had only been added during experimentation. Though the Pkg package that is used for package management is documented, I had not got to that, which meant that some web searching ensued. It turns out that there are two ways of doing this. One uses the REPL: after pressing the ] key, the following command gets issued:

rm DataArrays

When all is done, pressing the delete or backspace key returns things to normal. Removal also works in a script as well as in the REPL, and the following line works in both instances:

using Pkg; Pkg.rm("DataArrays")

While the semicolon is used to separate two commands issued on the same line, they can be placed on different lines or issued separately just as well. Naturally, DataArrays is just an example here; you replace that with the name of whatever package you need to remove. Since we can get carried away when downloading packages, there are times when a clean-up is needed to remove redundant ones, so knowing how to clear any clutter is invaluable.
