TOPIC: DEVOPS
When Operations and Machine Learning meet
Here's a scenario you'll recognise: your SRE team drowns in 1,000 alerts daily. 95% are false positives. Meanwhile, your data scientists built five ML models last quarter, and none have reached production. These problems are colliding, and solving each other. Machine learning is moving out of research labs and into the operations that keep your systems running. At the same time, DevOps practices are being adapted to get ML models into production reliably. Since this convergence has created three new disciplines (AIOps, MLOps and LLM observability), here is what you need to know.
Why Traditional Operations Can't Keep Up
Modern systems generate unprecedented volumes of operational data. Logs, metrics, traces, events and user interaction signals create a continuous stream that's too large and too fast for manual analysis.
Your monitoring system might send thousands of alerts per day, but most are noise. A CPU spike in one microservice cascades into downstream latency warnings, database connection errors and end-user timeouts, generating dozens or hundreds of alerts from a single root cause. Without intelligent correlation, engineers waste hours manually connecting the dots.
Meanwhile, machine learning models that could solve real business problems sit in notebooks, never making it to production. The gap between data science and operations is costly. Data scientists lack the infrastructure to deploy models reliably. Operations teams lack the tooling to monitor models that do make it live.
The complexity of cloud-native architectures, microservices and distributed systems has outpaced traditional approaches. Manual processes that worked for simpler systems simply cannot scale.
Three Emerging Practices Changing the Game
Three distinct but related practices have emerged to address these challenges. Each solves a specific problem whilst contributing to a broader transformation in how organisations build and run digital services.
AIOps: Intelligence for Your Operations
AIOps (Artificial Intelligence for IT Operations) applies machine learning to the work of IT operations. Originally coined by Gartner, AIOps platforms collect data from across your environment, analyse it in real-time and surface patterns, anomalies or likely incidents.
The key capability is event correlation. Instead of presenting 1,000 raw alerts, AIOps systems analyse metadata, timing, topological dependencies and historical patterns to collapse related events into a single coherent incident. What was 1,000 alerts becomes one actionable event with a causal chain attached.
Beyond detection, AIOps platforms can trigger automated responses to common problems, reducing time to remediation. Because they learn from historical data, they can offer predictive insights that shift operations away from constant firefighting.
Teams implementing AIOps report measurable improvements: 60-80% reduction in alert volume, 50-70% faster incident response and significant reductions in operational toil. The technology is maturing rapidly, with Gartner predicting that 60% of large enterprises will have adopted AIOps platforms by 2026.
MLOps: Getting Models into Production
Whilst AIOps uses ML to improve operations, MLOps (Machine Learning Operations) is about operationalising machine learning itself. Building a model is only a small part of making it useful. Models change, data changes, and performance degrades over time if the system isn't maintained.
MLOps is an engineering culture and practice that unifies ML development and ML operations. It extends DevOps by treating machine learning models and data assets as first-class citizens within the delivery lifecycle.
In practice, this means continuous integration and continuous delivery for machine learning. Changes to models and pipelines are tested and deployed in a controlled way. Model versioning tracks not just the model artefact, but also the datasets and hyperparameters that produced it. Monitoring in production watches for performance drift and decides when to retrain or roll back.
The MLOps market was valued at $2.2 billion in 2024 and is projected to reach $16.6 billion by 2030, reflecting rapid adoption across industries. Organisations that successfully implement MLOps report that up to 88% of ML initiatives that previously failed to reach production are now being deployed successfully.
A typical MLOps implementation looks like this: data scientists work in their preferred tools, but when they're ready to deploy, the model goes through automated testing, gets versioned alongside its training data and deploys with built-in monitoring for performance drift. If the model degrades, it can automatically retrain or roll back.
The SRE Automation Opportunity
Site Reliability Engineering, originally created at Google, applies software engineering principles to operations problems. It encompasses availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning. Rather than replacing AIOps, the likely outcome is convergence. Analytics, automation and reliability engineering become mutually reinforcing, with organisations adopting integrated approaches that combine intelligent monitoring, automated operations and proactive reliability practices.
What This Looks Like in the Real World
The difference between traditional operations and ML-powered operations shows up in everyday scenarios.
Before: An application starts responding slowly. Monitoring systems fire hundreds of alerts across different tools. An engineer spends two hours correlating logs, metrics and traces to identify that a database connection pool is exhausted. They manually scale the service, update documentation and hope to remember the fix next time.
After: The same slowdown triggers anomaly detection. The AIOps platform correlates signals across the stack, identifies the connection pool issue and surfaces it as a single incident with context. Either an automated remediation kicks in (scaling the pool based on learned patterns) or the engineer receives a notification with diagnosis complete and remediation steps suggested. Resolution time drops from hours to minutes.
Before: A data science team builds a pricing optimisation model. After three months of development, they hand a trained model to engineering. Engineering spends another month building deployment infrastructure, writing monitoring code and figuring out how to version the model. By the time it reaches production, the model is stale and performs poorly.
After: The same team works within an MLOps platform. Development happens in standard environments with experiment tracking. When ready, the data scientist triggers deployment through a single interface. The platform handles testing, versioning, deployment and monitoring. The model reaches production in days instead of months, and automatic retraining keeps it current.
These patterns extend across industries. Financial services firms use MLOps for fraud detection models that need continuous updating. E-commerce platforms use AIOps to manage complex microservices architectures. Healthcare organisations use both to ensure critical systems remain available whilst deploying diagnostic models safely.
The Tech Behind the Transformation (Optional Deep Dive)
If you want to understand why this convergence is happening now, it helps to know about transformers and vector embeddings. If you're more interested in implementation, skip to the next section.
The breakthrough that enabled modern AI came in 2017 with a paper titled "Attention Is All You Need". Ashish Vaswani and colleagues at Google introduced the transformer architecture, a neural network design that processes sequential data (like sentences) by computing relationships across the entire sequence at once, rather than step by step.
The key innovation is self-attention. Earlier models struggled with long sequences because they processed data sequentially and lost context. Self-attention allows a model to examine all parts of an input simultaneously, computing relationships between each token and every other token. This parallel processing is a major reason transformers scale well and perform strongly on large datasets.
Transformers underpin models like GPT and BERT. They enable applications from chatbots to content generation, code assistance to semantic search. For operations teams, transformer-based models power the natural language interfaces that let engineers query complex systems in plain English and the embedding models that enable semantic search across logs and documentation.
Vector embeddings represent concepts as dense vectors in high-dimensional space. Similar concepts have embeddings that are close together, whilst unrelated concepts are far apart. This lets models quantify meaning in a way that supports both understanding and generation.
In operations contexts, embeddings enable semantic search. Instead of searching logs for exact keyword matches, you can search for concepts. Query "authentication failures" and retrieve related events like "login rejected", "invalid credentials" or "session timeout", even if they don't contain your exact search terms.
Retrieval-Augmented Generation (RAG) combines these capabilities to make AI systems more accurate and current. A RAG system pairs a language model with a retrieval mechanism that fetches external information at query time. The model generates responses using both its internal knowledge and retrieved context.
This approach is particularly valuable for operations. A RAG-powered assistant can pull current runbook procedures, recent incident reports and configuration documentation to answer questions like "how do we handle database failover in the production environment?" with accurate, up-to-date information.
The technical stack supporting RAG implementations typically includes vector databases for similarity search. As of 2025, commonly deployed options include Pinecone, Milvus, Chroma, Faiss, Qdrant, Weaviate and several others, reflecting a fast-moving landscape that's becoming standard infrastructure for many AI implementations.
Where to Begin
Starting with ML-powered operations doesn't require a complete transformation. Begin with targeted improvements that address your most pressing problems.
If you're struggling with alert-fatigue...
Start with event correlation. Many AIOps platforms offer this as an entry point without requiring full platform adoption. Look for solutions that integrate with your existing monitoring tools and can demonstrate noise reduction in a proof of concept.
Focus on one high-volume service or team first. Success here provides both immediate relief and a template for broader rollout. Track metrics like alerts per day, time to acknowledge and time to resolution to demonstrate impact.
Tools worth considering include established platforms like Datadog, Dynatrace and ServiceNow, alongside newer entrants like PagerDuty AIOps and specialised incident response platforms like incident.io.
If you have ML models stuck in development...
Begin with MLOps fundamentals before investing in comprehensive platforms. Focus on model versioning first (track which code, data and hyperparameters produced each model). This single practice dramatically improves reproducibility and makes collaboration easier.
Next, automate deployment for one model. Choose a model that's already proven valuable but requires manual intervention to update. Build a pipeline that handles testing, deployment and basic monitoring. Use this as a template for other models.
Popular MLOps platforms include MLflow (open source), cloud provider offerings like AWS SageMaker, Google Vertex AI and Azure Machine Learning, and specialised platforms like Databricks and Weights & Biases.
If you're building with LLMs...
Implement observability from day one. LLM applications are different from traditional software. They're probabilistic, can be expensive to run, and their behaviour varies with prompts and context. You need to monitor performance (response times, throughput), quality (output consistency, appropriateness), bias, cost (token usage) and explainability.
Common pitfalls include underestimating costs, failing to implement proper prompt versioning, neglecting to monitor for model drift and not planning for the debugging challenges that come with non-deterministic systems.
The LLM observability space is evolving rapidly, with platforms like LangSmith, Arize AI, Honeycomb and others offering specialised tooling for monitoring generative AI applications in production.
Why This Matters Beyond the Tech
The convergence of ML and operations isn't just a technical shift. It requires cultural change, new skills and rethinking of traditional roles.
Teams need to understand not only deployment automation and infrastructure as code, but also concepts like attention mechanisms, vector embeddings and retrieval systems because these directly influence how AI-enabled services behave in production. They also need operational practices that can handle both deterministic systems and probabilistic ones, whilst maintaining reliability, compliance and cost control.
Data scientists are increasingly expected to understand production concerns like latency budgets, deployment strategies and operational monitoring. Operations engineers are expected to understand model behaviour, data drift and the basics of ML pipelines. The gap between these roles is narrowing.
Security and governance cannot be afterthoughts. As AI becomes embedded in tooling and operations become more automated, organisations need to integrate security testing throughout the development cycle, implement proper access controls and audit trails, and ensure models and automated systems operate within appropriate guardrails.
The organisations succeeding with these practices treat them as both a technical programme and an organisational transformation. They invest in training, establish cross-functional teams, create clear ownership and accountability, and build platforms that reduce cognitive load whilst enabling self-service.
Moving Forward
The convergence of machine learning and operations isn't a future trend, it's happening now. AIOps platforms are reducing alert noise and accelerating incident response. MLOps practices are getting models into production faster and keeping them performing well. The economic case for SRE automation is driving investment and innovation.
The organisations treating this as transformation rather than tooling adoption are seeing results: fewer outages, faster deployments, models that actually deliver value. They're not waiting for perfect solutions. They're starting with focused improvements, learning from what works and scaling gradually.
The question isn't whether to adopt these practices. It's whether you'll shape the change or scramble to catch up. Start with the problem that hurts most (alert fatigue, models stuck in development, reliability concerns) and build from there. The convergence of ML and operations offers practical solutions to real problems. The hard part is committing to the cultural and organisational changes that make the technology work.
Modernising SAS: The 4GL Apps and SASjs Ecosystem
Custom interfaces to the world's most powerful analytics platform are no longer a niche concern. In many organisations, SAS remains central to reporting, modelling and operational decision-making, yet the way users interact with that capability can vary widely. Some teams still rely on desktop applications, batch processes, shared drives and manual interventions, while others are moving towards web-based interfaces, stronger governance and a more modern development workflow. The material at sasapps.io points to an ecosystem built around precisely that transition, blending long-standing SAS expertise with open-source tooling and documented delivery methods.
The Company Behind the Ecosystem
At the centre of this transition is 4GL Apps. The company's positioning is straightforward: help organisations leverage their SAS investment through services, solutions and products that fit specific needs. Rather than replacing SAS, the aim is to extend it with custom interfaces and delivery approaches that are maintainable, transparent and based on standard frameworks. An emphasis on documentation appears throughout the site, suggesting that projects are intended either for handover to internal teams or for ongoing support under clearly defined packages.
That proposition matters because many SAS environments have grown over years, sometimes decades. In such settings, technical capability is rarely the issue. The challenge is more often how to expose that capability in ways that are usable, secure and sustainable. A powerful analytics platform can still be hampered by awkward user journeys, brittle desktop tooling or resource-heavy support arrangements, and the 4GL Apps model tries to address those practical concerns without discarding existing SAS infrastructure.
Services
The service offering gives a useful sense of how this approach is organised. One strand is SAS App Delivery, framed not merely as building applications, but also as building tools that make SAS app development faster. That detail points to an emphasis on repeatability rather than one-off implementation. Another strand is SAS App Support, aimed at organisations with existing SAS-powered applications but insufficient internal resource to keep them running. Fixed-price plans are offered to keep those interfaces active, which implies an attempt to make operational costs more predictable. A third service area is SASjs Enhancement, where new features can be added to SASjs at a discounted rate to support particular use cases.
Solutions
These services sit alongside a broader set of solutions. One is the creation of SAS-powered HTML5 applications, described as bespoke builds tailored to specific workflow and reporting requirements, using fully open-source tools, standard frameworks and full documentation. Clients are given a practical choice: maintain the application in-house or use a transparent support package. Another solution addresses end-user computing risk through data capture and control. Here, the approach enables business users to self-load VBA-driven Excel reporting tools into a preferred database while applying data quality checks at source, a four-eyes (or more) approval step at each stage and full audit traceability back to the original EUC artefact. A further solution is the modernisation of legacy AF/SCL desktop applications, with direct migration to SAS 9 or Viya in order to improve user experience, security and scalability while moving to a modern SAS stack supported by open-source technology.
That last area reveals a theme running through the whole ecosystem: modernisation does not necessarily mean abandoning what exists. In many SAS estates, AF/SCL applications remain deeply embedded in business processes, and replacing them outright can be costly and risky, especially when they encode years of operational logic. A migration path that preserves business function while improving maintainability and interface design will naturally appeal to teams that need progress without disruption.
Products
The product range fills out the picture further. Data Controller for SAS enables business users to make controlled changes to data in SAS. The SASjs Framework is a collection of open-source tools to accelerate SAS DevOps and the development of SAS-powered web applications. There is also an AF/SCL Kit, migration tooling for the rapid modernisation of monolithic AF/SCL applications. Together, these products form a stack covering interface delivery, governed data change and development workflow, and they suggest that the company's work is not limited to consultancy but includes reusable software assets with their own documentation and source code.
Data Controller: Governance and Audit
Data Controller receives the richest functional description in the ecosystem's documentation. It is intended for business owners in regulatory reporting environments and, more broadly, for any enterprise that needs to perform manual data uploads with validation, approval, security and control. The rationale is rooted in familiar SAS working practices. Users may place files on network drives for batch loading, update data directly using SAS code, open a dataset in Enterprise Guide and change a value, or ask a database administrator to run a script update. According to the product's own documentation, those approaches are less than ideal: every new piece of data may require a new programme, end users may need to have `modify` access to sensitive data locations, datasets can become locked, and change requests can slow the process.
Data Controller is presented as a response to those weaknesses. The goal is described as focusing on great user experience and auditor satisfaction, while saving years of development and testing compared with a custom-built alternative. It is a SAS-powered web application with real-time capabilities, where intraday concurrent updates are managed using a lock table and queuing mechanism. Updates are aborted if another user has changed the table since the approval difference was generated, which helps preserve consistency in multi-user environments. Authentication and authorisation rely on the existing SASLogon framework, and end users do not require direct access to the target tables.
The governance model is equally central. All data changes require one or more approvals before a table is updated, and the approver sees only the changes that will be applied to the target, including new, deleted and changed rows. The system supports loading tables of different types through SAS libname engines, with support for retained keys, SCD2 loads, bitemporal data and composite primary keys. Full audit history is a prominent feature: users can track every change to data, including who made it, when it was made, why it was made and what the actual change was, all accessible through a History page.
A particularly notable feature is that onboarding new tables requires zero code. Adding a table is a matter of configuration performed within the tool itself, without the need to define column types or lengths manually, as these are determined dynamically at runtime. Workflow extensibility is built in through configurable hook scripts that execute before and after each action, with examples such as running a data quality check after uploading a mapping table or running a model after changing a parameter. Taken together, those features position Data Controller less as a narrow upload utility and more as a governed operational layer for business-managed data change.
The application was designed to work on multiple devices and different screen types, combined with SAS scalability and security to provide flexibility and location independence when managing data. This suggests it is intended for practical day-to-day use by business teams rather than solely by technical specialists at a desktop workstation.
SASjs: DevOps for SAS
Underpinning much of the ecosystem is SASjs, described on its GitHub organisation page as "DevOps for SAS." It is designed to accelerate the development and deployment of solutions on all flavours of SAS, including Viya, EBI and Base. Everything in SASjs is MIT open-source and free for commercial use. The framework also explicitly underpins Data Controller for SAS, which connects the product and framework strands of the wider ecosystem. The GitHub organisation page notes that the SASjs project and its repositories are not affiliated with SAS Institute.
The resources page at sasjs.io lists the key GitHub repositories: the Macro Core library, the SASjs adapter for bidirectional SAS and JavaScript communication, the SASjs CLI, a minimal seed application and seed applications for React and Angular. Documentation sites cover the adapter, CLI, Macro Core library, SASjs Server and Data Controller. Useful external links from the same resources page include guides to building and deploying web applications with the SASjs CLI, scaffolding SAS projects with NPM and SASjs, extending Angular web applications on Viya and building a vanilla JavaScript application on SAS 9 or Viya. There is also mention of a Viya log parser, training resources, guides, FAQs and a glossary, pointing to an effort to support both implementation and adoption.
The SASjs CLI
The command-line tooling, documented at cli.sasjs.io, gives a clearer view of how SASjs approaches DevOps. The CLI is described as a Swiss-army knife with a flexible set of options and utilities for DevOps on SAS Viya, SAS 9 EBI and SASjs Server. Its core functions include creating a SAS Git repository in an opinionated way, compiling each service with all dependent macros, macro variables and pre- or post-code, building the master SAS deployment, deploying through local scripts and remote SAS programmes, running unit tests with coverage and generating a Doxygen documentation site with data lineage, homepage and project logo from the configuration file. There is also a feature for deploying a frontend as a streaming application, bypassing the need to access the SAS web server directly.
The full command set covers the project lifecycle. The CLI can add and authenticate targets, compile and build projects, deploy them to a SAS server location, generate documentation and manage contexts, folders and files. It can execute jobs, run arbitrary SAS code from the terminal, deploy a service pack and generate a snippets file for macro autocompletion in VS Code. It can also lint SAS code to identify common problems and run unit tests while collecting results in JSON or CSV format, together with logs. In effect, this brings SAS development considerably closer to the workflows commonly seen in mainstream software engineering, which may be especially valuable in organisations trying to standardise delivery practices across mixed technology estates.
Presentations and the Wider SAS Community
The slides.sasjs.io collection adds another dimension by showing that these ideas have been presented in conference and user group settings. Available decks cover DevOps for MSUG, SUGG and WUSS, SASjs for application development, SASjs Server, AF and AF/SCL modernisation, SASjs for PHUSE, testing and a legacy SAS apps presentation for FANS in January 2023. While slide decks alone do not prove adoption or outcomes, they do show a sustained effort to communicate methods and patterns to the broader SAS community, consistent with the open documentation and MIT licensing found throughout the ecosystem.
Building a Modern Layer Around an Established Platform
The most useful way to understand this ecosystem is not as a single product but as a layered approach. At one level, there are services for building and supporting applications. At another, there are packaged tools such as Data Controller and the AF/SCL Kit. Underneath both sits SASjs, providing open-source components and delivery practices intended to make SAS development more structured and scalable. The combination of bespoke SAS-powered HTML5 applications, governed data update tooling, AF/SCL migration support and open-source DevOps utilities points to a coherent effort to modernise how SAS is delivered and used, without severing ties to established platforms. SAS remains the analytical engine, but the interfaces, workflows and operational controls around it are updated to reflect current expectations in web application design, governance and DevOps practice.