Innovation Engineering
AI in Financial Services: Moving from Pilot to Production
Feb 10, 2026

Most financial institutions have run an AI pilot. Far fewer have taken one to production. Here is what separates the AI experiments that scale.


There is a graveyard that every bank and insurer in Southeast Asia maintains quietly. It contains AI proofs-of-concept — fraud detection models, credit scoring engines, document processing pipelines — that produced impressive results in a sandbox environment, received enthusiastic sign-off from the innovation committee, and then were never heard from again. The projects did not fail because the models were bad. They failed because the institutions discovered, too late, that getting a model into production is a fundamentally different problem from building one.

McKinsey’s 2023 State of AI report found that only 55% of organisations that had piloted AI had deployed at least one use case into production. In financial services, where regulatory requirements and operational risk tolerances are higher, the conversion rate is lower still. Our experience working with financial institutions across Malaysia and Southeast Asia corroborates this: the average institution we engage with has run three to five AI pilots, of which fewer than one, on average, has reached production scale.

The reasons are almost never about the model. The model is usually fine. The problems are in the infrastructure, the governance, and the organisation around the model.

Why Pilots Succeed and Production Systems Fail

The conditions that make a pilot succeed are almost precisely the conditions that production systems cannot rely on.

Data quality is the most common divergence. Pilots are built on curated datasets — historical transaction records cleaned by a data analyst, KYC documents selected to represent good quality scans, credit files pulled from a consistent source. Production systems receive data from dozens of upstream systems with varying quality, formatting conventions, and latency. A fraud detection model that achieves 94% accuracy on a curated historical dataset may perform significantly worse when it encounters the real-time data pipeline, with its missing fields, encoding inconsistencies, and occasional upstream outages.

Pilots are judged on model metrics; production systems are judged on business outcomes. An accuracy score of 92% sounds impressive in a pilot presentation. In production fraud detection, what matters is the cost of false positives (legitimate transactions incorrectly declined — a customer experience failure that drives churn) versus the cost of false negatives (fraudulent transactions that clear — a direct financial loss). Optimising for accuracy is not the same as optimising for the loss function that the business actually cares about. We have seen pilot models that looked excellent on standard metrics perform poorly in production because the business impact of each error type was not factored into model design.
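
As a sketch of what this looks like in practice, the decision threshold can be chosen against the business cost function directly rather than against accuracy; the costs, scores, and labels below are illustrative assumptions, not benchmarks:

```python
def expected_cost(scores, labels, threshold, cost_fp=5.0, cost_fn=120.0):
    """Total cost of decisions at a given threshold.

    cost_fp: cost of declining a legitimate transaction (customer friction).
    cost_fn: cost of a fraudulent transaction that clears (direct loss).
    """
    cost = 0.0
    for score, is_fraud in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_fraud:
            cost += cost_fp      # false positive
        elif not flagged and is_fraud:
            cost += cost_fn      # false negative
    return cost

def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold that minimises expected business cost."""
    return min(candidates, key=lambda t: expected_cost(scores, labels, t))

# Illustrative fraud scores and ground-truth labels for a handful of transactions.
scores = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]
labels = [False, False, True, False, True, True]
print(best_threshold(scores, labels, [i / 20 for i in range(1, 20)]))  # → 0.25
```

Because a missed fraud costs far more than a declined legitimate transaction here, the cost-optimal threshold sits lower than an accuracy-optimal one would.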

Model governance is undefined. In a pilot, the data scientist who built the model is also the person who monitors it, updates it, and decides when it needs retraining. In production, this person moves on to the next project. Who owns the model? Who retrains it when performance degrades? Who approves a model update before it goes live? Who is accountable when the model makes a decision that a customer disputes? These questions sound organisational rather than technical, but they have technical implications — the tooling for model versioning, approval workflows, and audit trails must be in place before the model goes live.

The notebook-to-production gap is real. Jupyter notebooks are excellent tools for exploration and prototyping. They are not production software. The transition from a well-written notebook to a production ML pipeline typically requires rewriting the feature engineering logic in a form that can run reliably at scale, building data validation and error handling, implementing logging and monitoring, and integrating with the institution’s deployment and operations toolchain. The effort involved is frequently underestimated by a factor of three to five.

Regulatory expectations have sharpened. Bank Negara Malaysia’s Financial Technology Innovation Strategy (FTIS) and the Monetary Authority of Singapore’s FEAT Principles (Fairness, Ethics, Accountability, Transparency) both require that AI systems used in consequential decisions — credit, insurance, fraud — be explainable and auditable. A black-box model that cannot produce a reason for its decision cannot be deployed in these contexts. This is not a future requirement. It is current regulatory expectation.

The Four Pillars of Production-Ready AI

Getting AI to production in a financial institution requires investment across four areas simultaneously. Teams that address only one or two of them reliably stall.

Data Infrastructure

The model is only as reliable as the data it receives. Production AI systems require a data infrastructure that most financial institutions do not have at the start of their AI journey.

A feature store is a centralised repository of computed features — the derived inputs to ML models — that serves both training and inference. Without one, the feature engineering logic that runs during training must be replicated exactly at inference time. This duplication is a source of subtle bugs: a feature computed slightly differently between training and serving produces a model that performs worse in production than on the test set. It is also operationally fragile — a change to one pipeline that is not reflected in the other breaks the model silently.
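
The skew problem is easiest to see in code. A minimal sketch, with hypothetical transaction fields, of keeping the feature logic in one shared module that both the training pipeline and the scoring service import:

```python
import math

def compute_features(txn: dict) -> dict:
    """Single source of truth for feature logic, imported by BOTH the
    training pipeline and the real-time scoring service. Field names
    are illustrative."""
    amount = float(txn.get("amount", 0.0))
    avg_30d = float(txn.get("avg_amount_30d") or 1.0)  # guard missing/zero
    return {
        "amount_log": math.log1p(amount),
        "amount_vs_30d_avg": amount / avg_30d,
        "is_foreign": int(txn.get("country") != txn.get("home_country")),
    }

# Training: rows = [compute_features(t) for t in historical_txns]
# Serving:  score = model.predict(compute_features(incoming_txn))
```

If a feature definition changes, both paths change together; there is no second copy to fall out of sync.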

Data versioning and lineage tracking are non-negotiable in a regulated environment. When a credit model makes a decision in 2026, regulators and auditors must be able to reconstruct exactly what data trained the model version that made that decision, what features were computed, and what version of the feature engineering logic was used. Without versioning and lineage, this reconstruction is impossible.
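
A minimal illustration of what a lineage record might capture at training time; the field names are ours, and production installations typically delegate this to a model registry rather than hand-rolling it:

```python
import hashlib
import json
from datetime import datetime, timezone

def dataset_fingerprint(rows: list) -> str:
    """Deterministic hash of the training data snapshot."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def lineage_record(model_version: str, rows: list, feature_logic_version: str) -> dict:
    """Everything needed to reconstruct, later, what trained this model."""
    return {
        "model_version": model_version,
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "dataset_sha256": dataset_fingerprint(rows),
        "feature_logic_version": feature_logic_version,
    }
```

Stored alongside every deployed model version, a record like this is what makes the 2026 reconstruction question answerable.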

We cannot overstate how much the quality of data infrastructure determines the ceiling for AI capability in an institution. The best model in the world cannot compensate for unreliable data.

Model Lifecycle Management

Production models are not static. They are deployed into an environment that changes: customer behaviour evolves, fraud patterns shift, macroeconomic conditions change. A model that performs well at deployment will degrade over time as the distribution of its inputs drifts away from what it was trained on. MLOps practices address this.

Automated retraining triggers monitor model performance metrics and data distribution statistics. When either crosses a threshold — a drop in the Gini coefficient for a credit model, or a shift in the statistical distribution of input features — a retraining job is triggered automatically, with the resulting model going through a validation and approval workflow before deployment.
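
A sketch of such a trigger, with illustrative thresholds; any retrain it fires would still pass through the validation and approval workflow before deployment:

```python
GINI_FLOOR = 0.45        # retrain if discriminatory power drops below this
DRIFT_THRESHOLD = 0.2    # retrain if the input drift statistic exceeds this

def should_retrain(current_gini: float, drift_score: float) -> tuple:
    """Return (trigger, reason) from the two monitored signals.
    Thresholds are illustrative and tuned per model in practice."""
    if current_gini < GINI_FLOOR:
        return True, f"gini {current_gini:.2f} below floor {GINI_FLOOR}"
    if drift_score > DRIFT_THRESHOLD:
        return True, f"input drift {drift_score:.2f} above {DRIFT_THRESHOLD}"
    return False, "within tolerance"
```

A scheduled job evaluates this against the latest monitoring window and enqueues a retraining run when either condition fires.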

Canary deployments for models mirror the software engineering practice: a new model version receives a small fraction of traffic (5–10%) and its outputs are compared to the incumbent before a full rollout. This allows performance degradation or unexpected behaviour to be caught before it affects all decisions.
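
One common way to implement the traffic split is deterministic hashing, so a given customer consistently hits the same model version during the canary; this sketch assumes a 10% canary fraction:

```python
import hashlib

CANARY_FRACTION = 0.10  # illustrative; typically 5-10% during a canary

def model_for(customer_id: str) -> str:
    """Stable assignment: the same customer always gets the same model."""
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < CANARY_FRACTION * 100 else "incumbent"
```

Stability matters: if assignment were random per request, the same customer could receive inconsistent decisions mid-rollout.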

Shadow scoring runs a new model version in parallel with the incumbent, recording its decisions without acting on them. This produces a rich dataset for comparing model versions before any production traffic is committed.
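
A sketch of the serving path under shadow scoring; the model callables are illustrative stand-ins, and the key property is that a shadow failure can never block the incumbent’s decision:

```python
import json
import logging

log = logging.getLogger("shadow")

def score_transaction(txn: dict, incumbent, candidate):
    """Act on the incumbent's decision; record the candidate's for comparison."""
    decision = incumbent(txn)            # this is what the business acts on
    try:
        shadow = candidate(txn)          # recorded, never acted on
        log.info(json.dumps({"txn_id": txn["id"],
                             "incumbent": decision, "shadow": shadow}))
    except Exception:                    # a shadow failure must never block serving
        log.exception("shadow scoring failed for txn %s", txn["id"])
    return decision
```

The logged pairs accumulate into the comparison dataset; once the candidate’s decisions look at least as good, it can graduate to a canary.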

Explainability

For credit decisioning, fraud investigation, and insurance underwriting, explainability is both a regulatory requirement and an operational necessity.

SHAP (SHapley Additive exPlanations) values are our most commonly used technique: they decompose a model’s prediction into the contribution of each feature, producing a consistent and mathematically grounded explanation. For a credit decision, this translates directly into a reason code: “Application declined due to high existing debt-to-income ratio and short employment tenure.” This is the level of specificity that regulators require and that customer-facing teams need to handle disputes.
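
For a linear scorecard, the SHAP decomposition reduces to each coefficient times the feature’s deviation from its mean, which makes the reason-code translation easy to sketch; the coefficients, means, and applicant below are invented for illustration, and tree models would use a SHAP library instead:

```python
# Illustrative linear credit scorecard: coefficients and population means.
COEFS = {"debt_to_income": -3.0, "employment_years": 0.5, "bureau_score": 0.01}
MEANS = {"debt_to_income": 0.3, "employment_years": 6.0, "bureau_score": 650}

def attributions(applicant: dict) -> dict:
    """Each feature's contribution to the score vs. the average applicant.
    For a linear model this matches the SHAP decomposition."""
    return {f: COEFS[f] * (applicant[f] - MEANS[f]) for f in COEFS}

def reason_codes(applicant: dict, top_n: int = 2) -> list:
    """The most negative contributions become the decline reasons."""
    contrib = attributions(applicant)
    negatives = sorted((v, f) for f, v in contrib.items() if v < 0)
    return [f for _, f in negatives[:top_n]]

applicant = {"debt_to_income": 0.55, "employment_years": 1.0, "bureau_score": 660}
print(reason_codes(applicant))  # → ['employment_years', 'debt_to_income']
```

Mapping those feature names to customer-facing language yields exactly the reason-code sentence above.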

Where explainability requirements are stringent, we sometimes recommend constrained model architectures — scorecard models or monotone gradient boosting — that trade some predictive accuracy for interpretability. In practice, the accuracy trade-off is smaller than most data scientists expect, and the operational and regulatory benefits are large.

Monitoring

Model monitoring is distinct from infrastructure monitoring. A model can be running correctly — receiving requests, returning responses, with no errors in the logs — while simultaneously producing wrong answers, because the population of customers sending requests has shifted in a way the model was not trained to handle.

We instrument production models with two classes of monitoring. Data drift detection tracks the statistical distribution of input features in production and alerts when they diverge from the training distribution beyond a defined threshold. Performance monitoring tracks business-relevant metrics — approval rates, fraud catch rates, default rates on approved credits — and alerts on anomalies. Both require a baseline established from the training period and an alerting mechanism connected to the on-call rotation.
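
One widely used drift statistic is the Population Stability Index (PSI); a minimal sketch over pre-binned feature distributions follows, using the common 0.2 rule-of-thumb alert threshold (a convention, not a universal standard):

```python
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """PSI between two pre-binned distributions (each list sums to ~1)."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)   # guard empty bins
        total += (a - e) * math.log(a / e)
    return total

# Illustrative: a feature binned into quartiles at training time...
baseline = [0.25, 0.25, 0.25, 0.25]
# ...versus the fractions observed in the last production window.
production = [0.10, 0.20, 0.30, 0.40]

drift = psi(baseline, production)   # ≈ 0.228
if drift > 0.2:
    print(f"ALERT: PSI {drift:.3f} exceeds threshold")
```

A job computes this per feature on a schedule and pages the on-call rotation when any feature crosses the threshold.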

A rollback procedure must be defined and tested before go-live. When a model is identified as degraded, the ability to revert to the previous version in minutes — not hours — is critical.

Real Use Cases That Make It to Production

Some AI use cases have a significantly better track record of reaching production than others. The common factor is not model complexity. It is data reliability.

Credit scoring for MSMEs has become one of the most successful production AI applications in Southeast Asian financial services. Traditional credit assessment struggles with MSMEs because most small businesses have thin or non-existent credit bureau files. Alternative data — 12 to 24 months of transaction history, payment behaviour on supplier invoices, cash flow patterns — can supplement bureau data effectively. This works in production when the data pipeline from the relevant bank accounts or accounting systems is reliable. The World Bank’s 2022 report on MSME finance estimated a financing gap of US$5 trillion globally, with Southeast Asian markets significantly underserved — AI-assisted credit assessment is one of the more credible paths to closing it.

Real-time fraud detection is one of the highest-stakes production AI use cases, with latency requirements that shape the entire architecture. A fraud scoring system that takes 500 milliseconds to respond cannot sit in the synchronous path of a payment transaction. Production systems require feature engineering to happen at data ingest — not at query time — so that the features needed for scoring are precomputed and retrievable in well under 100 milliseconds when a transaction arrives. Teams that discover this constraint after building their feature engineering at query time face a significant rearchitecting effort.
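
The ingest-time pattern can be sketched with a simple in-memory stand-in for a low-latency feature store; a real pipeline would use a key-value cache and would also expire entries that fall outside the rolling window:

```python
# account_id -> precomputed features, maintained on the ingest path
features = {}

def on_transaction_ingested(event: dict) -> None:
    """Runs as events arrive, ahead of any scoring request."""
    acct = features.setdefault(event["account_id"],
                               {"txn_count_24h": 0, "amount_sum_24h": 0.0})
    acct["txn_count_24h"] += 1
    acct["amount_sum_24h"] += event["amount"]

def score_request(account_id: str) -> dict:
    """Synchronous scoring path: a lookup, not an aggregation query."""
    return features.get(account_id,
                        {"txn_count_24h": 0, "amount_sum_24h": 0.0})
```

The expensive work happens asynchronously at ingest; the latency-critical path is reduced to a single read.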

Document processing for KYC — OCR plus entity extraction for identity documents — achieves high accuracy on structured, high-quality documents and degrades predictably on low-quality scans. Production systems require a human-in-the-loop fallback: documents that score below a confidence threshold are routed to a human reviewer rather than processed automatically. This is not a limitation to hide. It is a feature that makes the system trustworthy. Institutions that try to eliminate the human review step to reduce cost typically experience accuracy problems that create more cost in remediation.
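
The routing logic itself is simple; what matters is that the confidence threshold (0.90 here is an illustrative figure) is tuned against review capacity, and that the review rate is tracked as a first-class operational metric:

```python
REVIEW_THRESHOLD = 0.90  # illustrative; tuned against reviewer capacity

def route_document(doc_id: str, extraction_confidence: float) -> str:
    """Auto-process confident extractions; queue the rest for a human."""
    if extraction_confidence >= REVIEW_THRESHOLD:
        return "auto_process"
    return "human_review"

def review_rate(confidences: list) -> float:
    """Fraction routed to humans: a capacity-planning metric worth alerting on."""
    if not confidences:
        return 0.0
    routed = sum(1 for c in confidences if c < REVIEW_THRESHOLD)
    return routed / len(confidences)
```

A sudden rise in the review rate is itself a signal, often indicating a change in upstream document quality before accuracy metrics catch it.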

Collections propensity-to-pay modelling has one of the better production success rates of any financial services AI application. The use case — ranking delinquent accounts by likelihood of repayment in order to prioritise outreach — is relatively forgiving of model error. An incorrect ranking means a suboptimal collections sequence, not a credit loss or a regulatory incident. This lower error cost makes the operational bar to deployment more achievable, and the business value of even a moderate improvement in collections efficiency is meaningful.
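
A sketch of the core ranking step, with invented propensity scores standing in for model output:

```python
def prioritise(accounts: list, capacity: int) -> list:
    """Return the account ids to contact today, highest propensity first,
    capped at the day's outreach capacity."""
    ranked = sorted(accounts, key=lambda a: a["propensity"], reverse=True)
    return [a["id"] for a in ranked[:capacity]]

# Illustrative delinquent accounts with model-predicted repayment propensity.
accounts = [
    {"id": "A", "propensity": 0.32},
    {"id": "B", "propensity": 0.71},
    {"id": "C", "propensity": 0.55},
]
print(prioritise(accounts, capacity=2))  # → ['B', 'C']
```

A mis-ranked account here costs a suboptimal call order, not a credit loss, which is precisely why the deployment bar is lower.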

Building the Organisational Muscle

AI in financial services is not a data science problem with a technology solution. It is an organisational problem with a technology component.

The production use cases we have seen succeed share a common organisational characteristic: they had a named owner — not a project team, but a specific person — who was accountable for the model’s performance in production. This person typically sits in the business unit that benefits from the model (credit risk, fraud, operations), not in the technology or data function. They bring domain expertise that no data scientist working in isolation can replicate. They are present in model review meetings, they participate in incident response when the model behaves unexpectedly, and they own the decision to retrain or roll back.

The team composition for a production AI use case in a regulated financial institution is larger and more diverse than most pilots account for: data engineers to build and maintain the data pipeline, ML engineers to build the model lifecycle infrastructure, domain experts (credit analysts, fraud investigators) who provide the ground truth and validate model outputs, compliance representatives who ensure regulatory requirements are met, and IT operations who own the production environment. Pilots typically have data scientists and business stakeholders. Production deployments require all of the above.


Our strongest recommendation to institutions that have run multiple pilots without reaching production is to stop running new pilots and get one existing use case into production. The learnings from shipping one model — the data pipeline challenges, the operational tooling gaps, the governance processes that needed to be invented — are more valuable than any number of additional proof-of-concepts. Once the first use case is in production and stable, the second takes a fraction of the effort.

The governance layer that regulated institutions need is also non-trivial to build: model cards that document what a model does and does not do, version control for both model artefacts and training datasets, approval workflows for model changes with appropriate sign-off, and audit trails that satisfy regulatory requirements. None of this is optional, and all of it requires upfront investment that delivers no visible output until the moment a regulator or auditor asks a question that it can answer.
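
A model card can be as simple as a machine-readable record stored beside the artefact; the fields below are an illustrative minimum, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Minimal documentation of scope and limits, stored with the artefact."""
    name: str
    version: str
    intended_use: str
    out_of_scope: list            # uses the model must NOT be applied to
    training_data_ref: str        # lineage pointer, e.g. a dataset hash
    approved_by: list = field(default_factory=list)

card = ModelCard(
    name="msme-credit-score",
    version="2.3.0",
    intended_use="MSME credit limit decisions up to RM250k",
    out_of_scope=["consumer lending", "collections prioritisation"],
    training_data_ref="sha256:<training-set hash>",
)
```

Because it is structured data rather than a wiki page, the approval workflow can refuse to deploy any model version whose card is incomplete.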

The Gap Is Operational, Not Technological

The models that financial institutions are building are, by and large, good enough. The gap between AI pilots and production is not a technology gap — it is an operational and organisational one. Closing it requires the same discipline that makes any complex software system work: clear ownership, robust data infrastructure, rigorous testing, and the willingness to instrument everything.

The institutions that are building genuine AI capability in Southeast Asia are not the ones running the most pilots. They are the ones that have treated their first production deployment as a platform investment — building the data infrastructure, the MLOps tooling, and the governance processes that make every subsequent deployment faster and more reliable. That investment is front-loaded and invisible until it pays off. But it is the only path from the pilot graveyard to production.


Learn how Nematix’s Innovation Engineering services help financial institutions take AI from proof-of-concept to production.