Technology Tales

Notes drawn from experiences in consumer and enterprise technology

19:52, 17th March 2025

Unicode character encodings

When working with files in Python, data are initially read as binary bytes, which are then decoded into strings using a specified character encoding. Writing strings to files involves encoding them into bytes. The default encoding varies by operating system, with UTF-8 commonly used on Unix-based systems and CP1252 on Windows. Specifying the encoding explicitly when opening files is recommended to avoid issues like UnicodeDecodeError or mojibake, which occur when mismatched encodings lead to incorrect character interpretation. Proper encoding management ensures accurate data handling, especially when dealing with non-ASCII characters, and helps prevent errors that may arise from differences in default encodings across platforms.
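As a minimal sketch of the points above, the following writes a short string to a temporary file as UTF-8 and then reads it back three ways. The sample text and the temporary file are assumptions made for illustration only; the encodings named are standard Python codecs.

```python
import os
import tempfile

text = "café"  # sample string containing one non-ASCII character

# Create a scratch file so that the example does not touch real data.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Writing: the string is encoded into bytes using the stated encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading in binary mode shows the bytes that were actually stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading with the matching encoding recovers the original string.
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

    # Reading with a mismatched single-byte encoding succeeds but yields mojibake.
    with open(path, encoding="cp1252") as f:
        mojibake = f.read()

    # Reading with an encoding that cannot represent the bytes raises an error.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True
finally:
    os.remove(path)

print(raw)            # b'caf\xc3\xa9'
print(round_trip)     # café
print(mojibake)       # cafÃ©
print(decode_failed)  # True
```

Passing encoding explicitly to open() in both directions is what keeps the behaviour the same regardless of the platform default.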

14:56, 24th February 2025

Gerrit Code Review

Gerrit Code Review has its roots in Google's internal Mondrian tool, a proprietary peer-review system built on Perforce that became highly valued among Google engineers. Guido van Rossum later open-sourced elements of Mondrian as Rietveld, a similar but advisory tool designed for use with Subversion and hosted on Google App Engine.

When the Android Open Source Project adopted Git as its primary version control system, engineers familiar with Mondrian sought equivalent functionality for their new environment. Gerrit therefore began life as a set of patches to Rietveld before diverging significantly enough to warrant its own identity, taking its name from Dutch architect Gerrit Rietveld. The project underwent a substantial rewrite in version 2.x, shifting from Python on App Engine to Java running on a J2EE servlet container with a SQL database. It was revised again in version 3.x, which replaced the SQL database with NoteDb to store all metadata in Git and migrated the user interface from GWT to Polymer.

02:04, 16th February 2025

Apply Functions in R with Examples [apply(), sapply(), lapply(), tapply()]

The apply family of functions in R provides a more efficient alternative to traditional loops for performing operations on data structures such as lists, matrices, arrays and data frames. These functions, including apply(), lapply(), sapply() and tapply(), allow users to apply custom or built-in functions across elements of a dataset, often reducing code complexity and improving performance, particularly with large datasets.

The apply() function operates on matrices or arrays, applying operations row-wise, column-wise, or across both. By contrast, lapply() consistently returns a list by applying a function to each element of a vector, list, or data frame, whereas sapply() simplifies the output of lapply() to the most straightforward data structure available, such as a vector or matrix. Finally, tapply() computes summary statistics across groups defined by factors, making it useful for categorical data analysis. Examples demonstrate how these functions can calculate aggregates like means and sums, or apply transformations, with outputs varying in structure based on the function and input type.

17:21, 15th February 2025

After checking in with R again as part of getting client work on the go once more, I went about setting up R and RStudio on a new machine. It was when I tried to add packages that things did not proceed so smoothly. It turned out that there were system dependencies that were missing. The combination of a console showing red against black and a lot of output made the problem difficult to spot. Handily, AI had a use here, and Google Gemini is turning out to be very useful when I have some debugging to do. All got sorted on this occasion; it might help to harvest a list of packages, so I have them for future reference.

22:04, 6th February 2025

The January 2025 release of Visual Studio Code, version 1.97, brings numerous updates and enhancements aimed at improving coding efficiency and security for developers. Notable features include GitHub Copilot's Next Edit Suggestions, which predicts coding edits, and enhancements in workspace management, such as a repositionable Command Palette and enhanced log filtering capabilities. The update introduces extension publisher trust as a security feature, along with compound log views for better log analysis. Developers can now debug Python scripts without setup and benefit from advanced git blame functionalities and support for various source control actions. Accessibility features have also been refined, enhancing sound clarity and adding keyboard shortcuts for easier navigation. The update further includes support for customisable terminal settings, enhanced debug capabilities, and diverse improvements in documentation and syntax highlighting. Remote development is also enhanced with better SSH configuration, and contributions from the community have helped streamline the codebase and improve the development workflow.

14:22, 17th December 2024

Steve's Data Tips and Tricks provides a comprehensive guide to using the na.omit() function in R to manage missing values effectively in vectors, matrices, and data frames. Missing values, often represented as "NA", can arise from various issues such as data collection errors and incomplete surveys, which can adversely affect statistical calculations, model accuracy, and data visualisation. The guide explains the basic usage of the na.omit() function, its syntax, and how it can be applied to vectors and data frames for removing incomplete cases. It offers practical examples, advanced applications like conditional removal, and best practices, such as backing up original data and considering the implications of data removal. The guide addresses FAQs, highlighting that while na.omit() is effective, alternative methods exist for handling missing values, and ultimately emphasises the importance of documenting strategies for managing NA values in data analysis.

11:55, 25th November 2024

How to run R in Visual Studio Code

Setting up R in Visual Studio Code involves installing specific extensions, configuring settings and adjusting terminal preferences to enable features like GitHub Copilot integration, which offers enhanced AI-assisted coding compared to RStudio. While the process requires additional steps such as installing Python-based tools like radian and R packages, it provides access to features such as interactive data previews, colour pickers and customisable code snippets, making it a viable alternative for users seeking advanced AI capabilities or working with multiple programming languages. The experience, though more complex than RStudio’s streamlined setup, offers flexibility and tools that may appeal to developers prioritising Copilot’s functionality or hybrid workflows.

16:14, 18th November 2024

How to use SAS on a Mac

Although SAS does not produce a version of its software that runs natively on Apple Mac hardware, there are several supported methods that allow Mac users to access its features. Free cloud-based options include SAS OnDemand for Academics, SAS Viya for Learners and SAS Viya Workbench for Learners, all of which can be accessed through a supported browser such as Chrome or Firefox. Visual Studio Code, which runs well on a Mac, supports a SAS extension that enables remote connections to both SAS 9.4 and SAS Viya sessions.

Users can also run SAS for Windows on a Mac through virtualisation software such as Parallels or VMware, though this is not an officially tested configuration and SAS Technical Support will not assist with Mac-specific setup issues. It is worth noting that SAS 9.4 and its associated client applications are incompatible with ARM-based Mac chips, as they are built for Intel x64 architecture. SAS Analytics Pro, which is container-based and deployed using Docker, can be run on macOS and accessed via a local browser. Finally, those who require interactive statistical software that installs and runs natively on a Mac may wish to consider JMP, a separate product from SAS that is available for both Windows and Mac operating systems.

10:58, 8th November 2024

10 Python One-Liners That Will Boost Your Data Science Workflow

Python offers a wide range of tools and techniques that can streamline data science workflows, many of which can be written in a single line of code. The fillna method in pandas can be combined with conditional logic to fill missing numerical values with the column median and missing categorical values with the column mode, while highly correlated features can be removed using a one-line correlation filter. New columns based on multiple conditions can be generated using the apply method with lambda functions, and Python's built-in set type allows for quick identification of common or differing elements across datasets.
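The set comparison mentioned above can be sketched with nothing beyond the standard library. The identifier values below are invented purely for illustration.

```python
# Two hypothetical collections of record identifiers.
train_ids = {101, 102, 103, 104}
test_ids = {103, 104, 105}

common = train_ids & test_ids           # elements present in both sets
only_in_train = train_ids - test_ids    # elements in the first set only
in_one_not_both = train_ids ^ test_ids  # symmetric difference

print(sorted(common))           # [103, 104]
print(sorted(only_in_train))    # [101, 102]
print(sorted(in_one_not_both))  # [101, 102, 105]
```

Sets are unordered, so sorted() is used here only to make the printed output predictable.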

NumPy boolean masks provide a versatile way to filter arrays, and the Counter class from the collections module offers a rapid means of calculating value frequencies within a list. Regular expressions paired with map can extract numerical values from strings, nested lists can be flattened using the sum function, and two lists can be merged into a dictionary using zip and dict together. Finally, multiple dictionaries can be consolidated into one using dictionary unpacking, making it straightforward to aggregate structured data for further preprocessing and analysis.
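The standard-library one-liners described above might look something like the following sketch. All of the sample data and variable names are made up for illustration and are not taken from the original article.

```python
import re
from collections import Counter

# Value frequencies within a list.
labels = ["a", "b", "a", "c", "a", "b"]
frequencies = Counter(labels)

# Extracting numbers from strings with a regular expression and map.
readings = ["temp 21C", "temp 19C", "n/a"]
numbers = list(map(lambda s: [int(n) for n in re.findall(r"\d+", s)], readings))

# Flattening one level of nesting with sum and an empty list as the start value.
nested = [[1, 2], [3], [4, 5]]
flat = sum(nested, [])

# Merging two lists into a dictionary with zip and dict.
keys = ["name", "age"]
values = ["Ada", 36]
record = dict(zip(keys, values))

# Consolidating dictionaries with unpacking; later keys override earlier ones.
defaults = {"sep": ",", "header": True}
overrides = {"sep": ";"}
merged = {**defaults, **overrides}

print(frequencies["a"])  # 3
print(numbers)           # [[21], [19], []]
print(flat)              # [1, 2, 3, 4, 5]
print(record)            # {'name': 'Ada', 'age': 36}
print(merged)            # {'sep': ';', 'header': True}
```

The sum approach builds a new list at every step, so it suits short lists; for long ones, itertools.chain.from_iterable is likely to be the better choice.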

20:33, 2nd October 2024

How to Create Interactive Visualisations in R

Creating interactive visualisations in R enhances data exploration by allowing users to manipulate and analyse information dynamically. Packages such as DT, Plotly and Leaflet enable the development of interactive tables, charts and maps, offering features like sorting, filtering, zooming and hovering for detailed insights. For instance, DT facilitates sortable and searchable tables, Plotly transforms static plots into interactive scatter and bar charts and Leaflet generates maps with clickable markers displaying regional data.

These tools support tasks such as comparing life expectancy across continents, visualising population distributions, or examining GDP trends, thereby simplifying complex data analysis and improving the clarity of presentations. By leveraging these packages, users can create engaging visual outputs that aid in uncovering patterns and communicating findings effectively.
