Case Study
Credit Card Fraud Detection in Banking
A fraud detection project designed around a real banking constraint: identifying rare, high-impact events while operating within regulatory, operational, and customer experience limits.
284,807
Transactions
492
Fraud cases
0.172%
Fraud rate
99.82%
Naive accuracy
System workflow

The workflow was designed as a layered decision pipeline. Transaction data moved through feature engineering, model scoring, anomaly signals, and decision logic before being approved, flagged, or routed for further review. That structure mattered because the final outcome depended on the interaction of multiple controls rather than on a single model prediction.
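That layered routing can be sketched as a small decision function. Everything below is a hypothetical placeholder — the thresholds, field names, and rule are invented for illustration, not taken from the production system:

```python
# Illustrative decision layer combining a supervised model score, an anomaly
# score, and a hard rule check. All thresholds and field names are hypothetical.

def decide(transaction: dict, model_score: float, anomaly_score: float) -> str:
    # Hard rules short-circuit the models for clearly risky patterns.
    if transaction["amount"] > 10_000 and transaction["new_device"]:
        return "review"
    # A high-confidence fraud score declines the transaction outright.
    if model_score >= 0.9:
        return "decline"
    # Moderate model risk or unusual behaviour is flagged for analyst review.
    if model_score >= 0.5 or anomaly_score >= 0.8:
        return "flag"
    return "approve"

print(decide({"amount": 12_500, "new_device": True}, 0.2, 0.1))  # → review
```

The point of the structure is that no single signal decides alone: rules, model score, and anomaly score each get a chance to escalate a transaction.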
Project context
During my time working in banking, fraud detection was approached as a system problem rather than a standalone modelling exercise. The goal was not simply to classify transactions, but to help design a more reliable way of identifying rare fraudulent behaviour while fitting into existing operational and regulatory constraints.
Traditionally, fraud systems relied heavily on rules and manual review. Transactions were flagged using thresholds, heuristic checks, and known patterns, with analysts reviewing cases downstream. That approach worked up to a point, but it became harder to sustain as transaction volumes increased and fraud patterns became more adaptive. The result was a system that generated too much noise while still missing subtle fraud cases.
The central challenge was detecting extremely rare events. In representative benchmark datasets, fraud can account for less than 0.2% of all transactions. That creates a misleading situation in which a model can appear highly accurate while failing to detect the events that actually matter.
In real banking environments, the data itself introduces additional complexity. Transaction data contains sensitive personally identifiable information, which is typically masked, tokenised, or transformed before it is used for modelling. Instead of relying on raw identifiers, systems operate on behavioural features such as transaction velocity, merchant patterns, geographic shifts, and device signals. That requires data pipelines designed to balance privacy, regulatory compliance, and modelling effectiveness.
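As a toy illustration of such behavioural features, the sketch below derives a transaction-velocity signal from a tokenised log. The column names and data are invented for the example; real pipelines would compute many such signals at scale:

```python
import pandas as pd

# Hypothetical tokenised transaction log: masked card tokens, no raw PII.
tx = pd.DataFrame({
    "card_token": ["tA", "tA", "tA", "tB"],
    "ts": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:05",
                          "2024-01-01 10:07", "2024-01-01 11:00"]),
    "amount": [25.0, 30.0, 500.0, 12.0],
}).set_index("ts").sort_index()

# Transaction velocity: how many transactions the same card made in the
# trailing hour, including the current one.
tx["tx_last_hour"] = tx.groupby("card_token")["amount"].transform(
    lambda s: s.rolling("1h").count()
)

# Deviation from the card's typical spend is another behavioural signal.
tx["amount_vs_card_mean"] = (
    tx["amount"] / tx.groupby("card_token")["amount"].transform("mean")
)

print(tx["tx_last_hour"].tolist())  # → [1.0, 2.0, 3.0, 1.0]
```

Notice that the third transaction on card `tA` stands out twice: it is the third transaction within minutes, and its amount is far above the card's average.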
Problem definition
Fraud events are infrequent, but the cost of missing them is high. At the same time, incorrectly flagging legitimate transactions increases review volume, creates operational drag, and can directly affect the customer experience.
That creates a clear trade-off. Increasing detection sensitivity can improve fraud capture, but it also tends to raise false positives. Reducing false positives improves the experience for genuine users, but risks letting meaningful fraud slip through. The system therefore had to be calibrated around business impact, not just model output.
Approach
Our approach combined multiple techniques to address both the modelling challenge and the operational reality of the problem.
First, class imbalance was handled explicitly. Techniques such as SMOTE helped improve the model's exposure to minority-class patterns, while cost-sensitive learning ensured that missing a fraud event was penalised more heavily than misclassifying a legitimate transaction.
Second, in an actual banking environment, the feature layer goes much further because data is rarely consumed in raw form. It is curated, transformed, masked where necessary, and then represented through behavioural signals that are safer and more useful for modelling. That includes patterns such as transaction velocity, merchant-category shifts, abnormal location changes, channel behaviour, and device-level inconsistencies.
Third, we evaluated multiple modelling approaches rather than assuming a single model would be sufficient. Gradient boosting models were used for supervised classification, while anomaly detection techniques such as Isolation Forest provided an additional lens by identifying transactions that deviated from normal behaviour.
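A minimal sketch of that anomaly lens, using scikit-learn's IsolationForest on synthetic behavioural features (the data and the contamination setting are illustrative, not production values):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic behavioural features: mostly normal activity plus a few outliers.
normal = rng.normal(0, 1, size=(500, 4))
outliers = rng.normal(6, 1, size=(5, 4))
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)

# score_samples returns lower (more negative) values for anomalies;
# negating it gives a score where higher means more anomalous.
anomaly_score = -iso.score_samples(X)
print(anomaly_score[500:].mean() > anomaly_score[:500].mean())  # → True
```

Because the forest needs no fraud labels, it can surface novel patterns the supervised model has never seen — which is exactly why it earns a place alongside the classifier rather than replacing it.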
We also explored autoencoder-based approaches, using reconstruction error as an additional signal within the ensemble to help surface transactions that did not align with learned behavioural patterns.
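The reconstruction-error idea can be shown without a deep learning framework. In the sketch below, a small scikit-learn MLP trained to reproduce its own input stands in for the autoencoder; the data is synthetic and the architecture is purely illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_normal = rng.normal(0, 1, size=(400, 8))   # stand-in for normal behaviour
X_odd = rng.normal(5, 1, size=(5, 8))        # stand-in for unusual behaviour

scaler = StandardScaler().fit(X_normal)
Xs = scaler.transform(X_normal)

# A narrow hidden layer forces a compressed representation, so inputs that
# match learned behaviour reconstruct well and unusual inputs do not.
ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
ae.fit(Xs, Xs)

def reconstruction_error(model, X):
    return np.mean((model.predict(X) - X) ** 2, axis=1)

err_normal = reconstruction_error(ae, Xs)
err_odd = reconstruction_error(ae, scaler.transform(X_odd))
print(err_odd.mean() > err_normal.mean())
```

The reconstruction error itself then becomes just another score feeding the ensemble, not a standalone verdict.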
In practice, the strongest design was not a single model but an ensemble. Supervised model scores, anomaly signals, and rule-based checks were combined to produce a stronger decision signal, with human review still playing an important role for edge cases.
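A simplified sketch of how such signals might be blended. The weights and cut-offs here are invented for illustration; in practice they would be tuned against business cost, not chosen by hand:

```python
# Hypothetical combination of the three signals into one decision score.
def ensemble_score(supervised_prob: float, anomaly_score: float, rule_hits: int,
                   w_model: float = 0.6, w_anomaly: float = 0.3,
                   w_rules: float = 0.1) -> float:
    rule_signal = min(rule_hits, 3) / 3.0  # cap the rule contribution
    return (w_model * supervised_prob
            + w_anomaly * anomaly_score
            + w_rules * rule_signal)

score = ensemble_score(0.7, 0.9, 2)
# Route by score: approve, flag for analyst review, or decline.
decision = "decline" if score >= 0.8 else "flag" if score >= 0.4 else "approve"
print(round(score, 2), decision)  # → 0.76 flag
```

The design choice is that no single component can unilaterally approve a risky transaction: a strong anomaly signal or rule hit still pushes a borderline case into the analyst queue.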
Illustrative implementation snippets
The original enterprise implementation is not shared publicly. The examples below are simplified representations of the modelling logic, included to show the structure of the work rather than the exact production code.
Balancing the training data and fitting a boosted model
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# A stratified split keeps the rare fraud class represented in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SMOTE oversamples the minority class in the training data only,
# never in the held-out test set.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# scale_pos_weight adds a cost-sensitive penalty on top of the resampling,
# so a missed fraud case is punished more heavily than a false alarm.
model = XGBClassifier(
    scale_pos_weight=10,
    max_depth=5,
    n_estimators=300,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
)
model.fit(X_resampled, y_resampled)
Scaling and focusing on the strongest signals
from sklearn.preprocessing import RobustScaler

# RobustScaler centres on the median and scales by the interquartile range,
# so extreme transaction amounts do not dominate the feature.
scaler = RobustScaler()
X[["Amount", "Time"]] = scaler.fit_transform(X[["Amount", "Time"]])

# A reduced set of the strongest signals, alongside Amount and Time.
important_features = ["V14", "V12", "V17", "V10", "V4"]
X_model = X[important_features + ["Amount", "Time"]]
Evaluating the model with the right metrics
from sklearn.metrics import average_precision_score, f1_score, recall_score

# The decision threshold is a business-calibrated parameter;
# 0.5 here is only an illustrative default.
threshold = 0.5

fraud_prob = model.predict_proba(X_test)[:, 1]
fraud_pred = (fraud_prob >= threshold).astype(int)

# PR-AUC is threshold-free; recall and F1 depend on the chosen threshold.
pr_auc = average_precision_score(y_test, fraud_prob)
recall = recall_score(y_test, fraud_pred)
f1 = f1_score(y_test, fraud_pred)

print({
    "PR_AUC": pr_auc,
    "Recall": recall,
    "F1": f1,
})
Why accuracy fails
A model that predicts every transaction as legitimate achieves very high accuracy but detects no fraud. That is why fraud detection cannot be evaluated through conventional metrics alone.
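Using the figures quoted at the top of this case study, the always-legitimate baseline reproduces the naive accuracy directly:

```python
# 492 fraud cases among 284,807 transactions, as in the dataset above.
total, frauds = 284_807, 492

# A model that labels everything legitimate is right on every non-fraud row...
accuracy = (total - frauds) / total
# ...while catching exactly zero of the fraud cases.
recall = 0 / frauds

print(f"accuracy={accuracy:.4%}, recall={recall:.0%}")  # → accuracy=99.8273%, recall=0%
```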
More useful measures include precision-recall AUC, recall, and F1 score. Precision-recall AUC focuses attention on the minority fraud class. Recall matters because missed fraud carries direct financial and reputational cost. F1 becomes useful when balancing fraud capture against the operational burden created by false positives.
The threshold itself is not just a technical parameter. It is a business decision shaped by risk appetite, analyst capacity, customer experience, and the acceptable cost of being wrong.
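That trade-off can be made concrete with a threshold sweep. The scores below are synthetic (fraud scores drawn near 1, legitimate near 0) purely to show the mechanics:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
# Synthetic scored population: 980 legitimate, 20 fraudulent transactions.
y_true = np.array([0] * 980 + [1] * 20)
scores = np.concatenate([rng.beta(1, 8, size=980),   # legitimate: low scores
                         rng.beta(6, 2, size=20)])   # fraud: high scores

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    results[threshold] = (p, r)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold trades recall for precision: fewer analyst reviews, but more fraud slipping through. Choosing the operating point on that curve is the business decision.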
Real-world constraints
In production banking environments, the model is only one part of the system. Decisions need to be explainable, auditable, and aligned with regulatory expectations. Fraud systems operate under continuous scrutiny, and their outputs often need to be justified to internal risk teams, governance functions, and external regulators.
There is also a constant balance between risk and customer experience. A system that aggressively blocks transactions may reduce fraud, but it can also damage trust. A system that is too lenient may reduce friction while allowing material losses to pass through. The right operating point depends on business context, risk appetite, and operational capacity.
This is also where the broader system design matters. Scores need to fit into authorisation flows, rule engines, analyst queues, investigation processes, and monitoring layers. The quality of the outcome depends not only on the model, but on how the whole decision system is designed.
Impact
The value of this work was not in producing a single high-performing model in isolation. It was in strengthening the overall fraud detection approach for a rare-event setting by combining supervised learning, anomaly detection, and decision-layer controls.
In practical terms, that meant moving toward more reliable detection of rare fraud patterns while keeping operational false positives more manageable. Just as importantly, it helped frame fraud detection as a governed decision system rather than a narrow modelling problem.
Reflection
This kind of work shaped how I think about applied data science. The objective is rarely to optimise a metric in isolation. It is to help design systems that behave well under real constraints, including rare events, asymmetric risk, regulation, operational limits, and human consequences.
That perspective extends well beyond fraud. It applies to any domain where decisions matter more than predictions.