Some R functions for working with dates, strings and data frames
Working with data in R often comes down to a handful of recurring tasks: combining text, converting dates and times, reshaping tables and creating summaries that are easier to interpret. This article brings together several strands of base R and tidyverse-style practice, with a particular focus on string handling, date parsing, subsetting and simple time series smoothing. Taken together, these functions form part of the everyday toolkit for data cleaning and analysis, especially when imported data arrive in inconsistent formats.
String Building
At the simplest end of this toolkit is paste(), a base R function for concatenating character vectors. Its purpose is straightforward: it converts one or more R objects to character vectors and joins them together, separating terms with the string supplied in sep, which defaults to a space. If the inputs are vectors, concatenation happens term by term, so paste("A", 1:6, sep = "") yields "A1" through "A6", while paste(1:12) behaves much like as.character(1:12). There is also a collapse argument, which takes the resulting vector and combines its elements into a single string separated by the chosen delimiter, making paste() useful both for constructing values row by row and for creating one final display string from many parts.
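The behaviour described above can be checked directly at the console:

```r
paste("A", 1:6, sep = "")                 # "A1" "A2" "A3" "A4" "A5" "A6"
paste(1:3)                                # "1" "2" "3", much like as.character(1:3)
paste(c("x", "y", "z"), collapse = ", ")  # one string: "x, y, z"
```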
That basic string-building role becomes more important when dates and times are involved because imported date-time data often arrive as text split across multiple columns. A common example is having one column for a date and another for a time, then joining them with paste(dates, times) before parsing the result. In that sense, the paste() function often acts as a bridge between messy raw input and structured date-time objects. It is simple, but it appears repeatedly in data preparation pipelines.
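A minimal sketch of that bridge, using made-up column values rather than any real dataset:

```r
# hypothetical imported columns: date and time held as separate text fields
dates <- c("2021-03-01", "2021-03-02")
times <- c("08:30:00", "17:45:00")

# paste() joins them; strptime() then parses the combined string
dt <- strptime(paste(dates, times), format = "%Y-%m-%d %H:%M:%S")
class(dt)  # "POSIXlt" "POSIXt"
```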
Date-Time Conversion
For date-time conversion, base R provides strptime(), strftime() and format() methods for POSIXlt and POSIXct objects. These functions convert between character representations and R date-time classes, and they are central to understanding how R reads and prints times. strptime() takes character input and converts it to an object of class "POSIXlt", while strftime() and format() move in the other direction, turning date-time objects into character strings. The as.character() method for "POSIXt" classes fits into the same family, and the essential idea is that the date-time value and its textual representation are separate things, with the format string defining how R should interpret or display that representation.
Format strings rely on conversion specifications introduced with %, and many of these are standard across systems. %Y means a four-digit year with century, %y means a two-digit year, %m is a month, %d is the day of a month and %H:%M:%S captures hours, minutes and seconds in 24-hour time. %F is equivalent to %Y-%m-%d, which is the ISO 8601 date format. %b and %B represent abbreviated and complete month names, while %a and %A do the same for weekdays. Locale matters here because month names, weekday names, AM/PM indicators and some separators depend on the LC_TIME locale, meaning a date string like "1jan1960" may parse correctly in one locale and return NA in another unless the locale is set appropriately.
R's defaults generally follow ISO 8601 rules, so dates print as "2001-02-28" and times as "14:01:02", though R inserts a space between date and time by default. Several details matter in practice. strptime() processes input strings only as far as needed for the specified format, so trailing characters are ignored. Unspecified hours, minutes and seconds default to zero, and if no year, month or day is supplied then the current values are assumed, though if a month is given, the day must also be valid for that month. Invalid calendar dates such as "2010-02-30 08:00" produce results whose components are all NA.
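These parsing rules are easy to verify in a session:

```r
# trailing characters beyond the format are ignored
ok <- strptime("2001-02-28 and some extra text", format = "%Y-%m-%d")
ok$hour   # unspecified time components default to 0

# an impossible calendar date yields NA components
bad <- strptime("2010-02-30 08:00", format = "%Y-%m-%d %H:%M")
is.na(bad)  # TRUE
```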
Time Zones and Daylight Saving
Time zones add another layer of complexity. The tz argument specifies the time zone to use for conversion, with "" meaning the current time zone and "GMT" meaning UTC. Invalid values are often treated as UTC, though behaviour can be system-specific. The usetz argument controls whether a time zone abbreviation is appended to output, which is generally more reliable than %Z. %z represents a signed UTC offset such as -0800, and R supports it for input on all platforms. Even so, time zones can be awkward because daylight saving transitions create times that do not occur at all, or occur twice, and strptime() itself does not validate those cases, though conversion through as.POSIXct may do so.
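A short sketch of the tz, usetz and %z behaviour described above:

```r
# usetz appends the zone abbreviation to the printed form
x <- as.POSIXct("2021-07-01 12:00:00", tz = "UTC")
format(x, "%Y-%m-%d %H:%M", usetz = TRUE)   # "2021-07-01 12:00 UTC"

# %z reads a signed numeric UTC offset on input
strptime("2021-07-01 12:00 -0800", format = "%Y-%m-%d %H:%M %z", tz = "UTC")
```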
Two-Digit Years
Two-digit years are a notable source of confusion for analysts working with historical data. As described in the R date formats guide on R-bloggers, %y maps values 00 to 68 to the years 2000 to 2068 and 69 to 99 to 1969 to 1999, following the POSIX standard. A value such as "08/17/20" may therefore be interpreted as 2020 when the intended year is 1920. One practical workaround is to identify any parsed dates lying in the future and then rebuild them with a 19 prefix using format() and ifelse(). This approach is explicit and practical, though it depends on the assumptions of the data at hand.
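The workaround can be sketched as follows; the input values and the cutoff date are illustrative assumptions, not data from the guide:

```r
# hypothetical two-digit-year dates where anything "future" should be 1900s
raw    <- c("08/17/20", "05/02/65", "11/30/89")
parsed <- as.Date(raw, format = "%m/%d/%y")   # %y maps 65 to 2065, not 1965

# rebuild any date later than a chosen cutoff with a 19 prefix
cutoff <- as.Date("2024-01-01")               # assumption: the data end before 2024
fixed  <- as.Date(ifelse(parsed > cutoff,
                         format(parsed, "19%y-%m-%d"),
                         format(parsed)))
```

Because `ifelse()` returns a character vector here, the result is wrapped in `as.Date()` to restore the date class.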
Plain Dates
For plain dates, rather than full date-times, as.Date() is usually the entry point. Character dates can be imported by specifying the current format, such as %m/%d/%y for "05/27/84" or %B %d %Y for "May 27 1984". If no format is supplied, as.Date() first tries %Y-%m-%d and then %Y/%m/%d. Numeric dates are common when data come from Excel, and here the crucial issue is the origin date: Windows Excel uses an origin of "1899-12-30" for dates after 1900 because Excel incorrectly treated 1900 as a leap year (an error originally copied from Lotus 1-2-3 for compatibility), while Mac Excel traditionally uses "1904-01-01". Once the correct origin is supplied, as.Date() converts the serial numbers into standard R dates.
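Both routes can be sketched briefly; the serial numbers below are chosen so that each decodes to the same date:

```r
# character dates parsed with their current format
as.Date("05/27/84", format = "%m/%d/%y")     # "1984-05-27"
as.Date("May 27 1984", format = "%B %d %Y")  # needs an English LC_TIME locale

# Excel serial numbers converted with the right origin
as.Date(30829, origin = "1899-12-30")  # Windows Excel: "1984-05-27"
as.Date(29367, origin = "1904-01-01")  # classic Mac Excel: "1984-05-27"
```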
After import, format() can display dates in other ways without changing their underlying class. For example, format(betterDates, "%a %b %d") might yield values like "Sun May 27" and "Thu Jul 07". This distinction between storage and display is important because once R recognises values as dates, they can participate in date-aware operations such as mean(), min() and max(), and a vector of dates can have a meaningful mean date with the minimum and maximum identifying the earliest and latest observations.
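The storage-versus-display distinction looks like this in practice:

```r
d <- as.Date(c("1984-05-27", "2005-07-07"))

# display changes; the underlying Date class does not
format(d, "%a %b %d")   # weekday and month names depend on the locale

# date-aware summaries
mean(d); min(d); max(d)
```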
Extracting Columns and Manipulating Lists
These ideas about correct types and structure carry over into table manipulation. A data frame column often needs to be extracted as a vector before further processing, and there are several standard ways to do this, as covered in this guide from Statistics Globe. In base R, the $ operator gives a direct route, as in data$x1. Subsetting with data[, "x1"] yields the same result for a single column, and in the tidyverse, dplyr::pull(data, x1) serves the same purpose. All three approaches convert a column of a data frame into a standalone vector, and each is useful depending on the surrounding code style.
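All three routes produce the same vector; the tidyverse line assumes dplyr is installed:

```r
data <- data.frame(x1 = 1:5, x2 = letters[1:5])

v1 <- data$x1                 # dollar operator
v2 <- data[, "x1"]            # bracket subsetting; a single column drops to a vector
v3 <- dplyr::pull(data, x1)   # tidyverse equivalent

identical(v1, v2) && identical(v2, v3)  # TRUE
```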
List manipulation has similar patterns, detailed in this Statistics Globe tutorial on removing list elements. Removing elements from a list can be done by position with negative indexing, as in my_list[-2], or by assigning NULL to the relevant component, for example my_list_2[2] <- NULL. If names are more meaningful than positions, then subsetting with names(my_list) != "b" or names(my_list) %in% "b" == FALSE removes the named element instead. The same logic extends to multiple elements, whether by positions such as -c(2, 3) or names such as %in% c("b", "c") == FALSE. These are simple techniques, but they matter because lists are a common structure in R, especially when working with nested results.
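A compact sketch of the removal patterns on a toy list:

```r
my_list <- list(a = 1, b = "two", c = 3:5)

my_list[-2]                     # drop by position
my_list[names(my_list) != "b"]  # drop by name

my_list_2 <- my_list
my_list_2[2] <- NULL            # drop in place by NULL assignment

my_list[names(my_list) %in% c("b", "c") == FALSE]  # drop several by name
```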
Subsetting, Renaming and Reordering Data Frames
Data frames themselves can be subset in several ways, and the choice often depends on readability, as the five-method overview on R-bloggers demonstrates clearly. The bracket form example[x, y] remains the foundation, whether selecting rows and columns directly or omitting unwanted ones with negative indices. More expressive alternatives include which() together with %in%, the base subset() function and tidyverse verbs like filter() and select(). The point is not that one method is universally best, but that R offers both low-level precision and higher-level readability, depending on the task.
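The alternatives can be compared on a built-in dataset; the last line assumes dplyr is installed:

```r
# four routes to the same subset: rows where cyl is 4 or 6, columns mpg and cyl
s1 <- mtcars[mtcars$cyl %in% c(4, 6), c("mpg", "cyl")]
s2 <- mtcars[which(mtcars$cyl %in% c(4, 6)), c("mpg", "cyl")]
s3 <- subset(mtcars, cyl %in% c(4, 6), select = c(mpg, cyl))

# tidyverse equivalent
s4 <- dplyr::select(dplyr::filter(mtcars, cyl %in% c(4, 6)), mpg, cyl)
```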
Column names and column order also need regular attention. Renaming can be done with dplyr::rename(), as explained in this lesson from Datanovia, for instance changing Sepal.Length to sepal_length and Sepal.Width to sepal_width. In base R, the same effect comes from modifying names() or colnames(), either by matching specific names or by position. Reordering columns is just as direct, with a data frame rearranged by column indices such as my_data[, c(5, 4, 1, 2, 3)] or by an explicit character vector of names, as the STHDA guide on reordering columns illustrates. Both approaches are useful when preparing data for presentation or for functions that expect variables in a certain order.
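Both renaming styles and both reordering styles can be sketched on iris, assuming dplyr is installed:

```r
library(dplyr)

# tidyverse renaming: new_name = old_name
iris2 <- rename(iris,
                sepal_length = Sepal.Length,
                sepal_width  = Sepal.Width)

# base R equivalent, matching a specific name
iris3 <- iris
names(iris3)[names(iris3) == "Sepal.Length"] <- "sepal_length"

# reordering by column index or by an explicit character vector of names
iris4 <- iris[, c(5, 1, 2, 3, 4)]
iris5 <- iris[, c("Species", "Sepal.Length", "Sepal.Width",
                  "Petal.Length", "Petal.Width")]
```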
Sorting and Cumulative Calculations
Sorting and cumulative calculations fit naturally into this same preparatory workflow. To sort a data frame in base R, the DataCamp sorting reference demonstrates that order() is the key function: mtcars[order(mtcars$mpg), ] sorts ascending by mpg, while mtcars[order(mtcars$mpg, -mtcars$cyl), ] sorts by mpg ascending and cyl descending (the column must be referenced through the data frame unless it has been attached). For cumulative totals, cumsum() provides a running sum, as in calculating cumulative air miles from the airmiles dataset, an example covered in the Data Cornering guide to cumulative calculations. Within grouped data, dplyr::group_by() and mutate() can apply cumsum() separately to each group, and a related idea is cumulative count, which can be built by summing a column of ones within groups, or with data.table::rowid() to create a group index.
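The sorting and running-total patterns look like this; the grouped variant assumes dplyr is installed:

```r
# sort mtcars by mpg ascending, then by mpg ascending and cyl descending
head(mtcars[order(mtcars$mpg), ])
head(mtcars[order(mtcars$mpg, -mtcars$cyl), ])

# running total of the built-in airmiles series
cumulative_miles <- cumsum(airmiles)

# grouped running sums
library(dplyr)
grouped <- mtcars |>
  group_by(cyl) |>
  mutate(cum_mpg = cumsum(mpg)) |>
  ungroup()
```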
Time Series Smoothing
Time series smoothing introduces one further pattern: replacing noisy raw values with moving averages. As the Storybench rolling averages guide explains, the zoo::rollmean() function calculates rolling means over a window of width k, and examples using 3, 5, 7, 15 and 21-day windows on pandemic deaths and confirmed cases by state demonstrate the approach clearly. After arranging and grouping by state, mutate() adds variables such as death_03da, death_05da and death_07da. Because rollmean() is centred by default, each value is symmetrical around the observation of interest, and supplying fill = NA keeps the output the same length as the input, with NA values at the start and end where there are not enough surrounding observations. Odd values of k are usually preferred because they make the smoothing window balanced.
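A sketch of that grouped workflow, using a small made-up data frame in place of the guide's pandemic data (the object and column names here are illustrative); it assumes dplyr and zoo are installed:

```r
library(dplyr)
library(zoo)

# toy stand-in for the per-state pandemic data
covid_by_state <- expand.grid(
  state = c("A", "B"),
  date  = seq(as.Date("2020-04-01"), by = "day", length.out = 10)
)
covid_by_state$deaths <- rpois(nrow(covid_by_state), 5)

smoothed <- covid_by_state |>
  arrange(state, date) |>
  group_by(state) |>
  mutate(death_03da = rollmean(deaths, k = 3, fill = NA),
         death_07da = rollmean(deaths, k = 7, fill = NA)) |>
  ungroup()
```

With fill = NA, each smoothed column keeps the original length, with NAs at each group's edges where the centred window cannot be filled.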
The arithmetic is uncomplicated, but the interpretation is useful. A 3-day moving average for a given date is the mean of that day, the previous day and the following day, while a 7-day moving average uses three observations on either side. As the window widens, the line becomes smoother, but more short-term variation is lost. This trade-off is visible when comparing 3-day and 21-day averages: a shorter average tracks recent changes more closely, while a longer one suppresses noise and makes broader trends stand out. If a trailing rather than centred calculation is needed, rollmeanr() shifts the window to the right-hand end.
The same grouped workflow can be used to derive new daily values before smoothing. In the pandemic example, daily new confirmed cases are calculated from cumulative confirmed counts using dplyr::lag(), with each day's new cases equal to the current cumulative total minus the previous day's total. Grouping by state and date, summing confirmed counts and then subtracting the lagged value produces new_confirmed_cases, which can then be smoothed with rollmean() in the same way as deaths. Once these measures are available, reshaping with pivot_longer() allows raw values and rolling averages to be plotted together in ggplot2, making it easier to compare volatility against trend.
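The lag() step can be sketched on a toy cumulative series (again, the data and column names are illustrative rather than the guide's actual objects); it assumes dplyr is installed:

```r
library(dplyr)

# toy cumulative confirmed counts for one state
cum_df <- data.frame(
  state     = "A",
  date      = seq(as.Date("2020-04-01"), by = "day", length.out = 5),
  confirmed = c(10, 14, 21, 21, 30)
)

# each day's new cases = current cumulative total minus the previous day's
daily <- cum_df |>
  group_by(state) |>
  arrange(date, .by_group = TRUE) |>
  mutate(new_confirmed_cases = confirmed - lag(confirmed, default = 0)) |>
  ungroup()

daily$new_confirmed_cases  # 10 4 7 0 9
```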
How These R Data Manipulation Techniques Fit Together
What links all of these techniques is not just that they are common in R, but that they solve the mundane, essential problems of analysis. Data arrive as text when they should be dates, as cumulative counts when daily changes are needed, as broad tables when only a few columns matter, or as inconsistent names that get in the way of clear code. Functions such as paste(), strptime(), as.Date(), order(), cumsum(), rollmean(), rename(), select() and simple bracket subsetting are therefore less like isolated tricks and more like pieces of a coherent working practice. Knowing how they fit together makes it easier to move from raw input to reliable analysis, with fewer surprises along the way.