11:06, 28th May 2021
Dataframe Styling using Pandas
Pandas, the popular Python data analysis library, offers built-in styling functionality that allows analysts to transform plain dataframes into visually informative, presentation-ready tables. By using the Styler object, users can apply conditional formatting to colour-code values, such as highlighting positive figures in green and negative ones in red, as well as format how values are displayed, for example rendering decimals as percentages.
Beyond that, custom CSS can be applied to control font size, alignment and background colour for individual table elements, while additional options such as colour gradient backgrounds, maximum value highlighting and custom captions further enhance clarity. The process typically involves first wrangling data using SQL, then preparing and enriching a dataframe in Python before applying the desired visual styling, enabling data consumers to read underlying figures while also benefiting from visual cues that make insight quicker and easier to extract.
11:05, 28th May 2021
Python's built-in sort() method arranges the elements of a list in ascending order by default, though it accepts two optional keyword arguments that extend its functionality. The reverse argument, which defaults to False, can be set to True to sort elements in descending order. The key argument accepts a function that determines how comparisons are made during sorting, such as using the built-in len function to sort strings by their length rather than alphabetically. When sorting strings without a key argument, the method arranges them in dictionary order, and this behaviour is reversed when the reverse argument is set to True.
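The behaviour of both keyword arguments can be seen in a short sketch:

```python
words = ["banana", "fig", "cherry"]

# Default: ascending dictionary (lexicographic) order
alphabetical = sorted(words)        # ['banana', 'cherry', 'fig']

# key: compare by length rather than alphabetically;
# the sort is stable, so equal-length strings keep their original order
by_length = sorted(words, key=len)  # ['fig', 'banana', 'cherry']

# reverse=True: descending order; sort() works in place and returns None
words.sort(reverse=True)            # words is now ['fig', 'cherry', 'banana']
```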
11:05, 28th May 2021
Creating PDF Reports with Pandas, Jinja and WeasyPrint
The Pandas data manipulation library can output data in various formats, but combining multiple pieces of data into a single document requires additional tooling. One effective approach involves using Jinja templates to build an HTML document from multiple pandas DataFrames and summary statistics, then converting that HTML into a PDF using a rendering library.
The process begins by reading in sales data, generating pivot tables and descriptive statistics, and passing those outputs as variables into a Jinja template, which supports features such as loops, includes and filters for formatting values. Individual sections of the report, such as per-manager breakdowns and national summary statistics, can be structured across multiple pages using a CSS page-break directive. Once the template is rendered into an HTML string, it can be converted to a styled PDF document by applying a simple stylesheet, producing a readable, multipage report from what would otherwise require considerable manual effort to assemble.
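A stripped-down sketch of the template-rendering step, assuming jinja2 is installed and using invented sales figures (the variable names and template are illustrative, not the article's own):

```python
import pandas as pd
from jinja2 import Template

# Hypothetical sales data standing in for the article's input file
df = pd.DataFrame({"manager": ["Debra", "Fred"], "sales": [90000, 120000]})
pivot = df.groupby("manager")["sales"].sum()

# A minimal inline template; a real report would use {% include %},
# loops per manager and a CSS page-break directive between sections
template = Template("""
<h1>{{ title }}</h1>
{{ table }}
<p>Total sales: {{ total }}</p>
""")

html = template.render(
    title="Sales Report",
    table=pivot.to_frame().to_html(),
    total=f"{df['sales'].sum():,}",
)

# The final step would hand the HTML string to WeasyPrint:
# from weasyprint import HTML
# HTML(string=html).write_pdf("report.pdf")
```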
11:04, 28th May 2021
Embarrassingly parallel for loops in Python
Joblib is a Python library that simplifies the process of writing parallel for loops using multiprocessing, allowing computationally intensive tasks to be distributed across multiple CPUs by expressing loops as generator expressions over delayed function calls. It supports multiple parallelisation backends, with the default being the loky backend, which runs tasks in separate worker processes, though a thread-based threading backend is available for code that releases the Python Global Interpreter Lock.
The parallel_config() context manager allows users to configure backend settings such as the number of jobs, verbosity and memory mapping behaviour without hardcoding these choices into library code. Serialisation of Python objects between processes is handled by cloudpickle, which supports a broader range of objects than the standard pickle module but can be slower for large data structures, and alternative serialisation options are available for performance-sensitive use cases.
For large numerical arrays, joblib can automatically convert data to memory-mapped files to allow worker processes to share memory rather than duplicating it, reducing memory overhead significantly. To prevent over-subscription of CPU resources, joblib limits the number of threads that third-party libraries such as OpenBLAS and MKL can use within worker processes, and this limit can be adjusted programmatically using the inner_max_num_threads argument. Results from parallel calls can be returned as a list, an ordered generator or an unordered generator, offering flexibility in how and when output data are consumed and aggregated.
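The core pattern looks like this. The threading backend is chosen here only so the sketch runs anywhere; the default loky backend would run each task in a separate worker process instead:

```python
from joblib import Parallel, delayed

def square(x):
    return x ** 2

# delayed() wraps the call so it can be dispatched to a worker;
# Parallel consumes the generator expression and gathers the results
results = Parallel(n_jobs=2, backend="threading")(
    delayed(square)(i) for i in range(5)
)
# results == [0, 1, 4, 9, 16]
```

Passing `return_as="generator"` (available in recent joblib versions) yields results lazily instead of collecting them into a list.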
11:03, 28th May 2021
5 Tips for Writing Clean R Code – Leave Your Code Reviewer Comment-less
Writing clean, readable code is essential for effective collaboration and long-term project quality, particularly in R development. Developers should use comments sparingly and purposefully, ensuring they provide context rather than simply restating what the code already makes clear, and any code marked for future revision should be flagged with a TODO note that identifies the author and explains the reason. String concatenation is more readable when using glue or sprintf rather than overusing the paste function, and debugging print statements should always be removed before submitting code for review.

Loops require careful consideration, as vectorised alternatives through packages such as dplyr or purrr are often more efficient, and functions like seq_along are safer than relying on length-based iteration. When sharing code, absolute file paths should be avoided in favour of relative project paths, and sensitive credentials should be stored as environment variables rather than committed to a repository.

Finally, good general programming habits make a significant difference: using descriptive variable names in a consistent style, avoiding abbreviations, writing logical comparisons correctly, maintaining consistent spacing, following an agreed style guide, keeping code DRY by avoiding unnecessary repetition and using a linter to catch errors automatically.
11:01, 28th May 2021
Executing Shell Commands with Python
Python offers developers and system administrators a more scalable and maintainable alternative to shell scripts for automating routine tasks, providing several methods for executing shell commands. The simplest of these is the os.system() function, which runs a command stored in a string, prints any output to the console and returns an exit code indicating success or failure.
For greater control over input and output, the subprocess module is the recommended approach, with subprocess.run() accepting commands as a list of strings and offering options to suppress output, pass input directly to a command and raise exceptions automatically when errors occur. For situations requiring a programme to continue working while a shell command is still executing, subprocess.Popen provides additional flexibility, using methods such as poll() to check completion status and communicate() to handle input and output. The choice between these three methods depends on the complexity of the task at hand, with os.system() suiting simple commands, subprocess.run() offering finer control and subprocess.Popen being best suited to non-blocking operations.
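The three approaches can be compared in a short sketch (the commands are trivial placeholders, and a POSIX shell is assumed):

```python
import os
import subprocess

# Simplest option: os.system() runs the command and returns its exit code
exit_code = os.system("echo hello > /dev/null")

# subprocess.run(): capture output as text and raise on failure
result = subprocess.run(
    ["echo", "hello"],
    capture_output=True,
    text=True,
    check=True,  # raises CalledProcessError on a non-zero exit code
)

# subprocess.Popen(): non-blocking; the programme keeps working
# while the command runs, checking completion with poll()
proc = subprocess.Popen(["sleep", "0.1"])
while proc.poll() is None:
    pass  # other work could happen here instead of busy-waiting
```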
10:52, 28th May 2021
Group and Aggregate by One or More Columns in Pandas
Pandas, the Python data analysis library, offers SQL-like aggregation capabilities that allow users to group and summarise data by one or more columns. Using a sample dataset of baseball players containing their team, position and age, the groupby function can be applied to a single column, such as team name, to calculate aggregate statistics like mean, minimum and maximum values.
The agg function accepts a dictionary specifying which columns to aggregate and which functions to apply, though using multiple aggregation functions on a single column produces a multi-index structure that is best resolved by renaming the resulting columns and resetting the index. Grouping by multiple columns simultaneously is achieved by passing a list of column names to groupby rather than a single string, enabling more granular analysis such as breaking down player ages by both team and position at once.
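A small sketch of both patterns, using an invented roster in place of the article's dataset:

```python
import pandas as pd

# Hypothetical baseball roster standing in for the sample dataset
players = pd.DataFrame({
    "team": ["Red Sox", "Red Sox", "Yankees", "Yankees"],
    "position": ["Pitcher", "Catcher", "Pitcher", "Catcher"],
    "age": [24, 28, 31, 27],
})

# Single-column grouping with several aggregates on one column;
# the dictionary maps column -> list of functions
by_team = players.groupby("team").agg({"age": ["mean", "min", "max"]})

# Resolve the multi-index by renaming columns and resetting the index
by_team.columns = ["mean_age", "min_age", "max_age"]
by_team = by_team.reset_index()

# Grouping by multiple columns: pass a list rather than a single string
by_team_pos = players.groupby(["team", "position"])["age"].mean().reset_index()
```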
10:51, 28th May 2021
GroupBy in Pandas: Your Guide to Summarizing and Aggregating Data in Python
The Pandas GroupBy function is a widely used tool in Python's Pandas library that enables analysts and data scientists to organise data into groups based on specific criteria, apply operations to those groups and then combine the results into a meaningful output. Operating on what is known as the Split-Apply-Combine strategy, first introduced by Hadley Wickham in 2011, it breaks large datasets into smaller, more manageable parts before performing calculations and reassembling the findings.
A GroupBy object is created by specifying one or more columns to group by, and from there a range of aggregation functions such as sum, mean, count, median and standard deviation can be applied, either individually or simultaneously using the agg() function. Beyond aggregation, the function also supports transformation, which allows computations to be performed on entire groups before returning a combined dataframe, as well as filtration, which discards values that do not meet defined criteria.
Custom functions can also be applied to grouped results using the apply() method, offering considerable flexibility for handling complex analytical tasks. The ability to group by multiple columns simultaneously and rename aggregated outputs makes it particularly well suited to deriving nuanced insights from structured datasets.
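The three group operations described above — aggregation, transformation and filtration — can be contrasted in one sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "score": [10, 20, 5, 15, 25],
})

g = df.groupby("team")["score"]

# Aggregation: one value per group
totals = g.sum()  # A -> 30, B -> 45

# Transformation: a group-level computation broadcast back to every row
df["pct_of_team"] = g.transform(lambda s: s / s.sum())

# Filtration: discard whole groups that fail a condition
big_teams = df.groupby("team").filter(lambda grp: len(grp) >= 3)
```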
10:33, 28th May 2021
Pandas Groupby: Summarising, Aggregating, Grouping in Python
Python's Pandas library offers a powerful groupby() function that allows users to split large DataFrames into groups based on chosen variables and apply a range of summary statistics to each group. Using a dataset of 830 mobile phone usage records spanning five months, the groupby() function can be combined with agg() to calculate statistics such as sum, mean, min, max and count across grouped data.
The groupby() function returns a GroupBy object, and results can be returned as either a Pandas Series or DataFrame depending on how the operation is structured. Multiple statistics can be calculated per group simultaneously using the agg() function with a dictionary or list of instructions, and custom or lambda functions can also be applied.
Since Pandas version 0.25.0, named aggregations using simple tuples allow grouped columns to be renamed cleanly within a single operation, replacing the older nested dictionary approach that has since been deprecated. When multiple statistics produce a multi-index on columns, methods such as droplevel() or ravel() can be used to simplify and rename the resulting column headers.
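Named aggregation and multi-index flattening side by side, with invented usage records standing in for the article's dataset:

```python
import pandas as pd

usage = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb"],
    "duration": [10.0, 20.0, 15.0],
})

# Named aggregation (pandas >= 0.25): each keyword becomes the output
# column name, and its value is a (column, function) tuple
summary = usage.groupby("month").agg(
    total_duration=("duration", "sum"),
    mean_duration=("duration", "mean"),
)

# The older style produces a multi-index on columns...
multi = usage.groupby("month").agg({"duration": ["sum", "mean"]})

# ...which can be flattened by joining each level pair into one label
multi.columns = ["_".join(col) for col in multi.columns]
# -> duration_sum, duration_mean
```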
10:17, 28th May 2021
Comprehensive Guide to Grouping and Aggregating with Pandas
Grouping and aggregating data is one of the most fundamental analytical operations available in the Pandas library for Python, and the groupby function can be paired with one or more aggregation functions to summarise data quickly and efficiently.
Built-in aggregation options include basic mathematical functions such as sum, mean, median, minimum, maximum, standard deviation and variance, as well as counting functions where it is important to note that count excludes missing values while size does not. Aggregations can be applied using a list, a dictionary or a named aggregation approach, with the dictionary method generally considered the most robust.
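The count-versus-size distinction is easy to miss, so a tiny example with a deliberately missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "value": [1.0, np.nan, 2.0],
})

g = df.groupby("group")["value"]

counts = g.count()  # excludes missing values: a -> 1, b -> 1
sizes = g.size()    # includes missing values: a -> 2, b -> 1
```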
Beyond built-in functions, analysts can draw on external libraries such as scipy and numpy, or define their own custom functions using standard definitions, partial functions or lambda expressions, allowing for calculations such as percentile ranges, trimmed means and null value counts. Multiple aggregations across columns can be handled by combining groupby with apply, though this approach is slower and best used sparingly.
Results can be further manipulated through chained groupby operations to produce cumulative totals, and the hierarchical column indices that Pandas creates by default can be flattened into single-level labels for easier downstream analysis. For those needing subtotals, the third-party sidetable package provides a straightforward way to add them at multiple levels alongside a grand total.
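A running total within each group, as a minimal sketch with invented sales figures:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "month": [1, 2, 1, 2],
    "revenue": [100, 150, 80, 120],
})

# Chained operation: cumsum() after groupby() yields a cumulative
# total that restarts for each region
sales["cumulative"] = sales.groupby("region")["revenue"].cumsum()
# -> [100, 250, 80, 200]
```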