Technology Tales

Notes drawn from experiences in consumer and enterprise technology

Why SAS, R and Python can report different percentiles for the same data

Published on 2nd February 2026 · Estimated Reading Time: 10 minutes

Quantiles look straightforward on the surface. Ask for the median, the 75th percentile or the 95th percentile and most people expect one clear answer. Yet small differences between software packages often reveal that quantiles are not defined in only one way. When the same data are analysed in SAS, R or Python, the reported percentile can differ, particularly for small samples or for data sets with large gaps between adjacent values.

That difference is not necessarily a bug, and it is not a sign that one platform is wrong. It reflects the fact that sample quantiles are estimates of population quantiles, and statisticians have proposed several valid ways to construct those estimates. For everyday work with large samples, the distinction often fades into the background because the values tend to be close. For smaller samples, the choice of definition can matter enough to alter a reported result, a chart or a downstream calculation.

The Problem With the Empirical CDF

A useful starting point is understanding why multiple definitions exist at all. A sample quantile is an estimate of an unknown population quantile. Many approaches base that estimate on the empirical cumulative distribution function (ECDF), which approximates the cumulative distribution function (CDF) for the population. As Rick Wicklin explains in his 22nd May 2017 article on The DO Loop, the ECDF is a step function with a jump discontinuity at each unique data value. For that reason, the inverse ECDF does not exist and quantiles are not uniquely defined, which is precisely why different conventions have developed.

In high school, most people learn that when a sorted sample has an even number of observations, the median is the average of the two middle values. The default quantile definition in SAS extends that familiar rule to other quantiles. If the sample size is N and the q-th quantile is requested, then when Nq is an integer j, the result is the average of the two adjacent data values x[j] and x[j+1]. When Nq is not an integer, the result is the single data value x[ceil(Nq)], the next observation above position Nq. Averaging is therefore reserved for the case where Nq lands exactly on an index, and the treatment of the non-integer case is where the definitions diverge.
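As a concrete sketch of that default rule in Python (my own illustrative function, not SAS's implementation):

```python
import math

def sas_default_quantile(data, q):
    """Hyndman-Fan Type 2 (SAS QNTLDEF=5): average two adjacent values
    when N*q lands exactly on an index, otherwise round up to x[ceil(N*q)]."""
    xs = sorted(data)
    n = len(xs)
    h = n * q                    # position on a 1-based scale
    j = math.floor(h)
    if j >= n:                   # q = 1: return the maximum
        return xs[-1]
    if j == 0:                   # q near 0: return the minimum
        return xs[0]
    if h == j:                   # N*q is an integer: average x[j] and x[j+1]
        return (xs[j - 1] + xs[j]) / 2
    return xs[j]                 # otherwise x[ceil(N*q)] (0-based xs[j])

print(sas_default_quantile([1, 2, 3, 4], 0.5))  # 2.5, the high-school median
```

For an even sample with no request landing on an index boundary, the function reduces to the familiar average-of-the-middle-two rule.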

The Hyndman and Fan Taxonomy

According to Hyndman and Fan ("Sample Quantiles in Statistical Packages," TAS, 1996), there are nine definitions of sample quantiles that commonly appear in statistical software packages. Three of those definitions are based on rounding and six are based on linear interpolation. All nine result in valid estimates.

As Wicklin describes in his 24th May 2017 article comparing all nine definitions, the nine methods share a common general structure. For a sample of N sorted observations and a target probability p, the estimate uses two adjacent data values x[j] and x[j+1]. Each definition specifies a parameter m, which may itself depend on p; writing Np + m = j + g, where j is an integer and 0 ≤ g < 1, the index j selects the data values and g determines an interpolation parameter λ (for the interpolation-based definitions λ = g, whilst the rounding-based definitions map g to 0, 1/2 or 1). The estimate then takes the form q = (1 − λ)x[j] + λx[j+1]. The practical consideration at the extremes is that when p is very small or very close to 1, most definitions fall back to returning x[1] or x[N] respectively.
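For the interpolation-based definitions, that shared structure can be written down directly. The sketch below is mine; the values of m shown for Types 4 to 7 follow the Hyndman and Fan parameterisation:

```python
import math

def hf_quantile(data, p, m):
    """Generic interpolation-based sample quantile:
    write N*p + m = j + g, then return (1-g)*x[j] + g*x[j+1] (1-based)."""
    xs = sorted(data)
    n = len(xs)
    h = n * p + m(p)
    j = math.floor(h)
    if j < 1:                 # fall back to the minimum at the low extreme
        return xs[0]
    if j >= n:                # and to the maximum at the high extreme
        return xs[-1]
    g = h - j
    return (1 - g) * xs[j - 1] + g * xs[j]

# m as a function of p for four interpolation-based types:
TYPE4 = lambda p: 0.0        # inverse of a piecewise-linear ECDF
TYPE5 = lambda p: 0.5        # Hazen
TYPE6 = lambda p: p          # Weibull plotting positions
TYPE7 = lambda p: 1 - p      # default in R, Julia, SciPy and NumPy

print(hf_quantile([1, 2, 3, 4], 0.5, TYPE7))  # 2.5
```

Swapping one lambda for another is all it takes to move between definitions, which is essentially what R's type parameter does.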

Default Methods Across Platforms

It is a misnomer to refer to one approach as "the SAS method" and another as "the R method." As Wicklin notes in his 26th July 2021 article comparing SAS, R and Python defaults, SAS supports five different quantile definitions through the PCTLDEF= option in PROC UNIVARIATE or the QNTLDEF= option in other procedures, and all nine can be computed via SAS/IML. R likewise supports all nine through the type parameter in its quantile function. The confusion arises not from limited capability, but from the defaults that most users accept without much thought.

By default, SAS uses Hyndman and Fan's Type 2 method (QNTLDEF=5 in SAS procedure syntax). R uses Type 7 by default, and that same Type 7 method is also the default in Julia and in the Python packages SciPy and NumPy. A comparison between SAS and Python therefore often becomes the same comparison as between SAS and R.

A Worked Example

The contrast between Type 2 and Type 7 is especially clear on a small data set. Wicklin uses the sample {0, 1, 1, 1, 2, 2, 2, 4, 5, 8} throughout both his 2017 and 2021 articles: ten observations, six unique values, and a particularly large gap between the two highest values, 5 and 8. That gap is deliberately chosen because the differences between quantile definitions are most visible when the sample is small and when adjacent ordered values are far apart.

The Type 2 method (SAS default) uses the ECDF to estimate population quantiles, so a quantile is always an observed data value or the average of two adjacent data values. The Type 7 method (R default) uses a piecewise-linear estimate of the CDF. Because the inverse of that piecewise-linear estimate is continuous, a small change in the probability level produces a small change in the estimated quantile, a property that is absent from the ECDF-based methods.

Where the Methods Agree and Where They Part Company

For the 0.5 quantile (the median), both methods return 2. A horizontal line at 0.5 crosses both CDF estimates at the same point, so there is no disagreement. This is one reason the issue can be easy to miss: some commonly reported percentiles coincide across definitions.

The 0.75 quantile tells a different story. Under Type 2, a horizontal line at 0.75 crosses the empirical CDF at 4, which is a data value. Under Type 7, the estimate is 3.5, which is neither a data value nor the average of adjacent values; it emerges from the piecewise-linear interpolation rule. The 0.95 quantile shows the sharpest divergence: Type 2 returns 8 (the maximum data value), while Type 7 returns 6.65, a value between the two largest observations.
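Assuming NumPy 1.22 or later, which exposes all nine Hyndman and Fan types through the method argument of np.quantile, the worked example can be reproduced directly: method="averaged_inverted_cdf" is Type 2 and the default method="linear" is Type 7.

```python
import numpy as np

x = [0, 1, 1, 1, 2, 2, 2, 4, 5, 8]

for p in (0.5, 0.75, 0.95):
    t2 = np.quantile(x, p, method="averaged_inverted_cdf")  # Type 2, SAS default
    t7 = np.quantile(x, p, method="linear")                 # Type 7, R/NumPy default
    print(f"p={p}: Type 2 = {t2}, Type 7 = {t7}")
```

The loop prints 2 for both methods at p=0.5, then diverges to 4 versus 3.5 at p=0.75 and to 8 versus 6.65 at p=0.95, matching the figures above.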

Those differences are not errors. They are consequences of the assumptions built into each estimator. The default in SAS always returns a data value or the average of adjacent data values, whereas the default in R can return any value in the range of the data.

The Five Definitions Available in SAS Procedures

For users who stay within base SAS procedures, that same 22nd May 2017 article sets out the five available definitions clearly. QNTLDEF=1 and QNTLDEF=4 are piecewise-linear interpolation methods, whilst QNTLDEF=2, QNTLDEF=3 and QNTLDEF=5 are discrete rounding methods. The default is QNTLDEF=5. For the discrete definitions, SAS returns either a data value or the average of adjacent data values; the interpolation methods can return any value between observed data values.

The differences between the definitions are most apparent when there are large gaps between adjacent data values. Using the same ten-point data set, for the 0.45 quantile, different definitions return 1, 1.5, 1.95 or 2. For the 0.901 quantile, the round-down method (QNTLDEF=2) gives 5, the round-up method (QNTLDEF=3) gives 8, the backward interpolation method (QNTLDEF=1) gives 5.03 and the forward interpolation method (QNTLDEF=4) gives 7.733. These are not trivial discrepancies on a small sample.

The Four Remaining Definitions and the General Formula

The 24th May 2017 comparison article goes further, showing how SAS/IML can be used to compute the four Hyndman and Fan definitions that are not natively supported in SAS procedures. Each of the nine methods is an instance of the same general formula involving the parameter m. The four non-native methods each require their own specific value (or expression) for m, plus a small boundary value c that governs the behaviour at the extreme ends of the probability scale.

Wicklin also overlays the default methods for SAS (Type 2) and R (Type 7) graphically on the ten-point data set, showing that the SAS default produces a discrete step pattern whilst the R default traces a smoother piecewise-linear curve. He then repeats the comparison on a sample of 100 observations from a uniform distribution and finds that the two methods are almost indistinguishable at that scale, illustrating why many analysts work comfortably with defaults most of the time.

A SAS/IML Function to Match R's Default

For analysts who need cross-platform consistency, that same 26th July 2021 article provides a simplified SAS/IML function that reproduces the Type 7 default from R, Julia, SciPy and NumPy. The function converts the input to a column vector, handles missing values and the degenerate case of a single observation, then sorts the data and applies the Type 7 rule. The index into the sorted data is j = floor(N*p + m) with m = 1 − p, the interpolation fraction is g = N*p + m − j, and the estimate is (1 − g)x[j] + gx[j+1] for all p < 1, with x[N] returned when p = 1. This gives SAS users a practical route to reproducing the default quantiles from other platforms without switching software.
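A plain-Python rendering of those steps (my own sketch of the logic described, not a translation of Wicklin's IML code) looks like this:

```python
import math

def quantile_type7(data, p):
    """Type 7 sample quantile, the default in R, Julia, SciPy and NumPy."""
    xs = sorted(v for v in data if v is not None)  # drop missing values
    n = len(xs)
    if n == 1:                  # degenerate case: a single observation
        return xs[0]
    if p >= 1:                  # p = 1 returns the maximum directly
        return xs[-1]
    h = n * p + (1 - p)         # j + g with m = 1 - p
    j = math.floor(h)
    g = h - j
    return (1 - g) * xs[j - 1] + g * xs[j]
```

Because h = (N − 1)p + 1 never exceeds N for p < 1, the index arithmetic needs no further guarding, which is what keeps the function short.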

If SAS/IML is unavailable, Wicklin suggests using PCTLDEF=1 in PROC UNIVARIATE (or QNTLDEF=1 in PROC MEANS) as the next best option. This produces the Type 4 method, which is not the same as Type 7 but does use interpolation rather than a purely discrete rule, so it avoids the jumpy behaviour of the ECDF-based defaults.

A Wider Point About Conventions in Statistical Software

The comments on the 2021 article make clear that quantiles are not an isolated example. Platforms also differ in the sign convention used for ARIMA model parameters, in whether likelihood constants are included in reported values, in the definition of the multivariate autocovariance function, and in the sign convention and constant term used in discrete Fourier transforms. Quantiles are simply a particularly visible instance of a broader pattern where results can differ even when each platform is behaving correctly.

One question from the same comment thread is also worth noting: SQL's percent_rank formula, defined as (rank − 1) / (total_rows − 1), does not estimate a quantile. As Wicklin clarifies in his reply, it estimates the empirical distribution function for observed data values. Both concepts involve percentiles and rankings, but they address different problems. One maps values to cumulative proportions; the other maps cumulative probabilities to estimated values.
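The distinction can be made concrete in a few lines. The function below is a plain-Python rendering of the SQL formula, with ties sharing the rank of their first occurrence, rather than any particular database's implementation:

```python
def percent_rank(data, value):
    """SQL-style percent_rank: (rank - 1) / (total_rows - 1),
    where rank is the 1-based position of the first occurrence."""
    xs = sorted(data)
    rank = xs.index(value) + 1
    return (rank - 1) / (len(xs) - 1)

x = [0, 1, 1, 1, 2, 2, 2, 4, 5, 8]
# Maps an observed value to a cumulative proportion...
print(percent_rank(x, 4))   # 7/9, roughly 0.778
# ...whereas a quantile maps a probability back to an estimated value.
```

Running it on the ten-point sample shows the direction of the mapping: percent_rank takes a data value in and returns a proportion, the opposite of what a quantile function does.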

Does the Definition of a Sample Quantile Actually Matter?

The answer from all three articles is balanced. Yes, it matters in principle, and it is noticeably important for small samples, in extreme tails and wherever there are wide gaps in the ordered data. No, it often matters very little for larger samples (say, 100 or more observations), where the nine methods tend to produce results that are nearly indistinguishable. Wicklin's 100-observation comparison showed that the Type 2 and Type 7 estimates were so close that one set of points sat almost directly on top of the other.

That is why, as Wicklin notes, most analysts simply accept the default method of whichever software they are using. Even so, there are contexts where the definition should be stated explicitly. Regulatory work, reproducible research, published analyses and any cross-software validation all benefit from naming the method in use. Without that detail, two analysts can work correctly with the same data and still arrive at different percentile values.

Matching Quantile Definitions Across SAS, R and Python

The practical conclusion is clear. SAS defaults to Hyndman and Fan Type 2 (QNTLDEF=5), while R, Julia, SciPy and NumPy default to Type 7. SAS procedures natively support five of the nine definitions, and SAS/IML can be used to compute all nine, including a simplified function for the R default. For large data sets, the differences are typically negligible. For small data sets, particularly those with unevenly spaced observations, they can be large enough to change the story the numbers appear to tell. The solution is not to favour any particular platform, but to be explicit about the method wherever precision matters.
