Technology Tales

Notes drawn from experiences in consumer and enterprise technology

10:08, 28th May 2021

python-docx is a Python library designed for creating and modifying Microsoft Word (.docx) files. It offers a broad range of functionality, allowing developers to programmatically build documents that include headings, paragraphs, formatted runs of bold and italic content, images, tables, page breaks and various list styles. The library is supported by comprehensive documentation covering a user guide and a detailed API reference, the latter encompassing objects related to documents, styles, paragraphs, tables, sections, comments, shapes and shared utilities such as colour formatting and length handling. A contributor guide and a range of enumerations for alignment, styling and formatting options are also included.

10:07, 28th May 2021

TableOne

Developed by Tom J. Pollard and Alistair E. W. Johnson, TableOne is a Python package designed to simplify the process of generating summary measures from a dataset, making it particularly useful for researchers preparing publications.

10:07, 28th May 2021

5 Methods to Check for NaN values in Python

NaN, which stands for Not A Number, is a special floating-point value used to represent missing data in Python, and identifying a standalone NaN value can be more challenging than locating one within a larger data structure.

There are five commonly used methods for detecting NaN values in Python. Three rely on library functions: isna() from pandas, isnan() from numpy and isnan() from the standard math module, all of which return True when given a NaN value.

The remaining two methods exploit properties of NaN itself. First, NaN is the only floating-point value that is not equal to itself, so comparing a value with itself yields False only for NaN. Second, unlike every other floating-point number, NaN does not fall within the range from negative infinity to positive infinity, so any value that fails this range check can be identified as NaN.
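As a minimal sketch, the three checks that need nothing beyond the standard library could look like the following. The function names are illustrative and not taken from the original article; the pandas and numpy functions are left out only so that the example runs without third-party packages.

```python
import math


def is_nan_math(value):
    """Check with the standard library's math.isnan()."""
    return math.isnan(value)


def is_nan_self_inequality(value):
    """NaN is the only float that compares unequal to itself."""
    return value != value


def is_nan_range_check(value):
    """Every float except NaN lies between -inf and +inf inclusive."""
    return not (float("-inf") <= value <= float("inf"))


if __name__ == "__main__":
    samples = [float("nan"), 1.5, float("inf")]
    for check in (is_nan_math, is_nan_self_inequality, is_nan_range_check):
        # Print the result of each check for every sample value.
        print(check.__name__, [check(sample) for sample in samples])
```

The self-inequality and range checks assume the input is a float; other types that define their own comparison behaviour may not follow the same rules.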

10:05, 28th May 2021

Download Files with Python

Downloading files from online sources is a common task in web programming, essential for applications involving file sharing, data collection and retrieving website resources. There are several methods in Python for achieving this, including the urllib.request module's urlretrieve function, which simplifies downloading by requiring only a URL and a local file path.

However, urlretrieve is marked as a legacy interface in Python 3. The urllib2 module, which exists only in Python 2, offers similar functionality but requires additional steps to handle file data. The requests library provides a more modern and flexible solution, allowing for easy retrieval of file content and access to HTTP metadata such as status codes and headers.

Additionally, the wget module offers a straightforward one-line method for downloading files without manually opening the destination file. While the author prefers requests for its balance of simplicity and features, alternatives like urllib.request or urllib2 may be necessary depending on project constraints and Python version. Each method is demonstrated with example code, highlighting practical considerations for implementation.
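The standard-library route can be sketched as follows. This is not the article's own example code: the wrapper names are hypothetical, and the second function swaps in urllib.request.urlopen with shutil.copyfileobj as a non-legacy way of streaming the response to disk. The requests and wget options are third-party packages and are not shown.

```python
import shutil
import urllib.request


def download_with_urlretrieve(url, destination):
    """Download using the legacy one-call interface: a URL and a local path."""
    # urlretrieve returns (local_filename, headers); only the path is kept.
    local_path, _headers = urllib.request.urlretrieve(url, destination)
    return local_path


def download_with_urlopen(url, destination):
    """Download by streaming the response body into a file opened by hand."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out_file:
        # Copy in chunks so that large files are not held in memory at once.
        shutil.copyfileobj(response, out_file)
    return destination


# Example call (the URL and file name are placeholders, not real resources):
# download_with_urlopen("https://example.com/report.pdf", "report.pdf")
```

Both functions overwrite the destination file if it already exists, and neither adds a timeout or retries, which a real application would probably want.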

10:04, 28th May 2021

Dask is a Python library designed to facilitate parallel and distributed computing, offering scalable solutions for handling complex data processing tasks. It provides multiple APIs, including Futures for flexible task management, DataFrames for structured data analysis and Arrays and Bags for handling large datasets, enabling users to construct custom workflows and leverage powerful scaling techniques. Installation is straightforward through pip or conda and deployment options range from local setups to cloud and high-performance computing environments. Widely adopted across industries, Dask addresses challenges associated with large-scale data and intensive computations, supporting a variety of applications through extensive documentation, examples and community resources. Its design prioritises usability, performance and adaptability, making it a versatile tool for both individual and collaborative computational projects.

10:03, 28th May 2021

PyTables is a Python package designed to manage hierarchical datasets efficiently, leveraging the HDF5 library and NumPy for handling large volumes of data. It combines an object-oriented interface with performance-optimised C extensions generated via Cython, enabling fast and user-friendly interaction with extensive datasets while minimising memory and disk usage, particularly through on-the-fly compression. The project provides comprehensive documentation, migration guides and resources for users and is supported by the NumFOCUS organisation.

10:02, 28th May 2021

SAS Analysis Explorers is an online platform designed for users of SAS software to engage with a community of peers, access educational resources and participate in challenges that reward progress with tangible benefits. The initiative encourages skill development through tutorials, networking opportunities and interactions with industry experts, allowing participants to share insights and learn from others facing similar data-related challenges.

Users can earn points by completing tasks or milestones, which can be exchanged for items such as technology gadgets, books, or other merchandise. The platform is structured to facilitate exploration of SAS-related content, including updates, events and quizzes, and is accessible to individuals with varying levels of experience. It distinguishes itself from SAS Communities by focusing on user engagement and recognition, offering a space where participants can connect, collaborate and be acknowledged for their contributions to the SAS ecosystem.

09:50, 28th May 2021

How to Check if a File or Directory Exists in Python

Python offers several methods for checking whether a file or directory exists, each suited to different scenarios. The simplest approach requires no imported modules and works across Python 2 and 3, using a try-except block to attempt opening a file and catching an IOError if it is not found. Using the with keyword alongside this method ensures the file is properly closed after operations are completed.

For situations where a developer needs to verify a file's existence before performing actions such as copying or deleting, the os.path module offers useful functions including os.path.exists, os.path.isfile and os.path.isdir, all of which are compatible with both Python versions.

A more modern alternative is the pathlib module, part of the standard library in Python 3.4 and above, which takes an object-oriented approach and allows developers to work with file paths as Path objects rather than plain strings. For Python 2, it can be installed via pip.

A notable consideration when checking for file existence is the risk of race conditions: another process may create, change or delete the file in the time between the check and the subsequent operation, so the result of the check may no longer hold when it is acted upon.
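The three approaches described above can be sketched side by side. The function names are illustrative only, and the sketch targets Python 3, where IOError is an alias of OSError.

```python
import os.path
from pathlib import Path


def can_open(path):
    """Try to open the file; treat any I/O failure as 'not available'."""
    try:
        # The with block closes the file as soon as the check is done.
        with open(path):
            return True
    except IOError:
        # Also raised for directories and for files that cannot be read.
        return False


def check_with_os_path(path):
    """Report existence and type using os.path functions on a string path."""
    return {
        "exists": os.path.exists(path),
        "is_file": os.path.isfile(path),
        "is_dir": os.path.isdir(path),
    }


def check_with_pathlib(path):
    """Report the same facts using the methods of a Path object."""
    candidate = Path(path)
    return {
        "exists": candidate.exists(),
        "is_file": candidate.is_file(),
        "is_dir": candidate.is_dir(),
    }


if __name__ == "__main__":
    print(can_open(__file__))
    print(check_with_os_path(os.path.dirname(os.path.abspath(__file__))))
    print(check_with_pathlib(__file__))
```

Because of the race-condition risk, attempting the operation and handling the exception, as in can_open, is often the safer pattern when the file is about to be used anyway.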

09:49, 28th May 2021

Vaex: Pandas but 1000x faster

Vaex is a Python library designed to handle large datasets far more efficiently than Pandas, capable of processing up to one billion rows per second through memory mapping and lazy computations, meaning it avoids copying data unless explicitly instructed to do so. Unlike Pandas, which struggles with memory limitations and slow processing speeds on large datasets, Vaex can work with datasets as large as the available hard drive space.

It can be installed via pip or conda and supports reading from CSV and HDF5 file formats, with benchmarks showing it reads files dramatically faster than Pandas. The library offers a broad range of functionality including statistical operations such as correlation, covariance and groupby aggregations, as well as data cleaning tools for handling missing values and dropping columns. It also includes string operation methods, plotting capabilities for one and two-dimensional visualisations and a virtual columns feature that allows expressions to be stored and computed on the fly without consuming additional memory.

09:03, 27th May 2021

Sample 24820: Creating a Directory Listing Using SAS for Windows

Creating a directory listing using SAS for Windows allows users to document project structures by generating lists of files and folders, which can be annotated for clarity. This is achieved by invoking the DOS DIR command through a FILENAME statement with the pipe device type, enabling the processing of directory information within a data step.

The %DIRLISTWIN macro further enhances this process by filtering files based on size, date, or subdirectory inclusion and producing reports or datasets with details such as file paths, sizes, dates and owners. This tool is particularly useful for efficiently locating files within large directories or specific timeframes, though execution time varies depending on the complexity of the selected path.
