09:17, 19th May 2021
Python Data Wrangling Solutions — Dynamically Creating Variables When Slicing Data Frames
When working on data science projects, a significant portion of time is spent on data wrangling, and one common challenge is splitting a single dataframe into multiple dataframes based on the categorical values of a variable. While manual splitting is feasible for a few categories, it becomes impractical when dealing with tens or hundreds of distinct values.
A practical workaround in Python involves using dictionaries, where each key represents a unique category and its corresponding value holds the relevant dataframe slice. The process involves importing a dataset using Pandas, applying the groupby method to the chosen categorical column, converting the resulting object into a tuple to pair each category with its associated data, and finally converting that tuple into a dictionary. This approach replicates the outcome of creating individual variables for each category, with the dictionary keys serving in place of distinct variable names, allowing the sliced data to be accessed cleanly and efficiently regardless of how many categories exist.
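As a minimal sketch of the steps described above, assuming a small made-up dataframe with a hypothetical region column (neither the data nor the variable names come from the original article), the groupby, tuple and dictionary conversion might look like this:

```python
import pandas as pd

# Made-up example data; "region" is the categorical column to split on.
df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "sales": [10, 20, 30, 40],
})

# Iterating a groupby object yields (category, sub-dataframe) pairs.
# tuple() materialises those pairs and dict() turns each pair into
# a key/value entry.
pairs = tuple(df.groupby("region"))
frames_by_region = dict(pairs)

# Each slice is now reachable by its category name.
north = frames_by_region["north"]
print(sorted(frames_by_region))
print(north)
```

Each slice is retrieved by key, so frames_by_region["north"] stands in for what would otherwise be a separately named variable.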
13:45, 16th May 2021
% Macro Core - Production Ready Macros for SAS Application Developers
The Macro Core library is an open-source, MIT-licensed collection of production-quality SAS macros designed to reduce development time and effort for application developers working on the SAS platform. The library is organised into several folders, each targeting a specific platform or environment, including BASE for all platforms, META and METAX for SAS 9 environments, VIYA for SAS Viya, SERVER for the open-source SASjs REST API and XPLATFORM for macros that function across multiple server types.
It also incorporates LUA and FCMP components, allowing developers to embed LUA modules within SAS macros and generate compiled functions, respectively. The entire library can be downloaded and compiled with just two lines of SAS code, and installation involves updating the sasautos path to include the relevant folders.
Strict coding and documentation standards are enforced, including Doxygen-formatted headers, two-space indentation, lowercase filenames, one macro per file and clearly defined naming prefixes for each category. Dependencies must be declared explicitly in macro headers to support the SASjs command-line interface, which can extract and insert them automatically during project compilation.
13:44, 16th May 2021
How to Update All Python Packages
Keeping Python packages up to date with pip helps maintain the stability and security of an environment, and pinning versions in a requirements.txt file keeps that environment reproducible. Outdated packages can be identified with pip list --outdated and then upgraded in bulk using commands suited to the platform, such as a PowerShell pipeline on Windows or grep and awk on Linux.
Virtual environments require specific scripts or Pipenv commands for updates, while the ActiveState Platform offers an alternative method for managing dependencies and resolving conflicts, though its use is optional. The process highlights the importance of careful upgrades to avoid breaking dependencies, with considerations for both development and production environments.
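One way to script the same idea from Python itself is sketched below. It assumes pip is invoked as python -m pip and relies on the JSON output of pip list --outdated; the function names and the sample data are illustrative and do not come from the article. Nothing is upgraded unless upgrade_all() is called explicitly.

```python
import json
import subprocess
import sys


def parse_outdated(json_text):
    """Return the package names found in `pip list --outdated --format=json` output."""
    return [entry["name"] for entry in json.loads(json_text)]


def upgrade_command(package):
    """Build the pip command that upgrades a single package."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", package]


def upgrade_all():
    """Ask pip for outdated packages and upgrade them one at a time."""
    listing = subprocess.run(
        [sys.executable, "-m", "pip", "list", "--outdated", "--format=json"],
        capture_output=True, text=True, check=True,
    )
    for package in parse_outdated(listing.stdout):
        subprocess.run(upgrade_command(package), check=True)


# Made-up sample of pip's JSON output, used only to show the parsing step.
sample = '[{"name": "requests", "version": "2.25.0", "latest_version": "2.25.1"}]'
print(parse_outdated(sample))
```

Upgrading one package per command keeps a single failure from aborting the whole run, although it does not protect against an upgrade that breaks another package's pinned dependency.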
13:43, 16th May 2021
The os module in Python offers functions for interacting with the operating system, providing a portable way to access system-dependent functionality. The os.system() method executes a command string in a subshell by calling the Standard C system() function, and it shares that function's limitations.
It opens the relevant operating system shell to execute the command and sends any generated output to the interpreter's standard output stream. The method takes a single string parameter representing the command, and its return value depends on the operating system: on Unix it is the exit status of the process, encoded in the format used by wait(), while on Windows it is the value returned by the system shell after running the command, not the command's output. Examples include running system-specific commands such as retrieving the current date or launching applications like Notepad on Windows, demonstrating its utility for interacting with the underlying operating system through Python.
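A small sketch of os.system() in use follows. The helper name run is an assumption made for illustration, and the decoding step relies on os.waitstatus_to_exitcode(), which requires Python 3.9 or later. The commands are chosen to behave the same way under a Unix shell and the Windows command prompt.

```python
import os


def run(command):
    """Run a shell command and return its exit code on either platform."""
    status = os.system(command)
    if os.name == "nt":
        # Windows: the value is already the shell's exit code.
        return status
    # Unix: the value is a wait()-style status that needs decoding.
    return os.waitstatus_to_exitcode(status)


ok_code = run("echo hello from the shell")  # output goes to standard output
fail_code = run("exit 3")                   # the shell exits with code 3
print(ok_code, fail_code)
```

Because os.system() only reports a status and never returns the command's text output, the subprocess module is the usual choice when the output itself needs to be captured.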
13:42, 16th May 2021
Pandas Split strings into two List/Columns using str.split()
The Pandas str.split() method enables splitting string data in a DataFrame column using a specified delimiter, allowing results to be stored as lists within a Series or expanded into separate columns for structured analysis.
By setting the expand parameter to True, strings can be divided into multiple columns, as demonstrated by splitting full names into first and last names, while expand=False retains split values as lists.
Additional flexibility is achieved by combining str.split() with the apply() function for custom splitting logic, such as dynamically separating complex strings into distinct parts. This technique is particularly useful for reorganising textual data in preparation for further processing or analysis.
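The three variants mentioned above can be sketched as follows, assuming a made-up name column in which the first space separates the first name from the rest; the column names and data are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing"]})

# expand=False (the default): each row becomes a list of parts.
as_lists = df["name"].str.split(" ")

# expand=True: the parts are spread over separate columns.
# n=1 limits the split to the first space only.
df[["first", "last"]] = df["name"].str.split(" ", n=1, expand=True)

# apply() allows custom logic, here taking the initial of every part.
df["initials"] = df["name"].apply(
    lambda s: "".join(part[0] for part in s.split(" "))
)

print(as_lists.tolist())
print(df)
```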
13:41, 16th May 2021
SettingWithCopyWarning: How to Fix This Warning in Pandas
The SettingWithCopyWarning in Pandas arises from the ambiguity of whether operations on dataframes modify the original data or create copies. This warning was introduced to address silent failures in chained assignments, where changes to a subset of data might not propagate back to the source.
The root of the issue lies in the design of Pandas, which balances flexibility with the efficiency of NumPy's underlying array structures. When slices of a dataframe contain a single data type, they can be returned as views, which are memory-efficient but may lead to unintended side effects if modified.
Multi-type slices, however, require copies, which are safer but less efficient. Developers are advised to avoid chained indexing, which can obscure whether changes affect the original data or a copy. Instead, using .copy() explicitly ensures modifications are applied to a separate instance, or working directly on the original dataframe with loc or iloc maintains clarity.
Understanding this warning is crucial for reliable data manipulation, as it highlights the need for intentional coding practices. The evolution of Pandas, from its early reliance on the ix indexer to the preference for loc and iloc, reflects a broader effort to make indexing more predictable. While the warning may seem cumbersome, it serves as a safeguard against subtle bugs, reinforcing the importance of deliberate data handling in analysis workflows.
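The two recommended patterns can be illustrated with a short sketch. The dataframe and its column names are made up for the example, and the chained form that triggers the warning is shown only as a comment.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Leeds", "York", "Leeds"],
    "temp": [12, 15, 11],
})

# Chained indexing (avoid): the assignment may land on a temporary copy.
#   df[df["city"] == "Leeds"]["temp"] = 0

# Pattern 1: a single .loc call that selects rows and column together,
# so the assignment is applied to the original dataframe.
df.loc[df["city"] == "Leeds", "temp"] = 0

# Pattern 2: an explicit .copy() when an independent subset is wanted.
leeds = df[df["city"] == "Leeds"].copy()
leeds["temp"] = 99  # changes only the copy, not df

print(df)
print(leeds)
```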
13:17, 16th May 2021
17:02, 12th May 2021
SASCrunch
SASCrunch.com offers an online SAS programming training programme aimed at absolute beginners, with the goal of helping learners develop proficiency in SAS within 30 days. The platform takes a coding-oriented approach, encouraging users to learn through practice via more than 150 interactive tutorials, coding exercises and practical projects covering data reading, cleaning, manipulation, analysis and presentation. Training is delivered in an interactive format that allows learners to write and execute code alongside the course material on the same screen. The platform also offers dedicated preparation courses for the SAS Certified Specialist Exam, which now requires candidates to write and execute code during the examination itself, and includes over 300 practice exercises to support that preparation. Students at Duke University can access the base certification training programme at no cost.
15:25, 12th May 2021
On Microsoft Windows, the SAS Work library stores temporary files used during a SAS session and defaults to the system's TEMP directory. Start-up errors such as "ERROR: Invalid physical name for library WORK", "ERROR: Insufficient authorization to access WORK library" and "ERROR: Library WORK does not exist" typically indicate that the specified directory path is missing or that the user account lacks sufficient permissions to access it.
The location of the WORK library must be configured before a SAS session begins, as it cannot be reassigned once a session is active. To change it, the sasv9.cfg file for the relevant SAS version must be opened in a text editor such as Notepad, run as administrator, and the -WORK option modified to point to a new directory over which the user has full control permissions. Alternatively, the -WORK option can be added directly to the SAS application shortcut's target command.
On machines shared by multiple user accounts, incorporating the Windows USERNAME environment variable into the directory path, written as !USERNAME in the SAS configuration, ensures that each user account maintains a separate Work library. In either case, a restart of the SAS session is required for the new location to take effect, and assistance from an IT department may be necessary if permissions issues arise during the process.
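As an illustration only, the edited line in sasv9.cfg might look like the fragment below. D:\SASWork is a hypothetical location chosen for the example and is not taken from the original article; the same option and value could instead be appended to the shortcut's target command.

```
-WORK "D:\SASWork\!USERNAME"
```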
15:24, 12th May 2021
How do I locate the SAS temporary work directory?
There are at least two ways to locate the temporary work directory that SAS uses. In a Windows environment, this can be done by right-clicking the work icon in SAS and selecting "Property". Alternatively, SAS syntax can be used, either through the OPTIONS procedure or through the macro call %sysfunc(getoption(work)), the latter of which is particularly useful when working outside of Windows or when the directory path needs to be passed to a SAS programme. It is also possible to store the directory path in a macro variable for future use within a programme, using the %let statement in combination with %sysfunc.