Gardens by the Night / Photographed by Myself

Adversarially Robust Sketch-Based Anomaly Detection System for Real-Time Data Streams.

I didn't know what image to use for this so I just used a photo I took last summer in Singapore at the famous Gardens by the Bay.

Thanks to Krzysztof Onak for guiding me through this project and teaching me so much about sublinear algorithms, the theory behind algorithms for big data, and their applications.

The title of this project is an absolute mouthful, I know, but projects of this nature require some amount of intellectual responsibility to make the title as specific and clear (isn't that ironic?) as possible. Let's be honest, pretty much every academic paper has that superabundance of technical jargon, and now I kind of see why they have to do that.

Introduction

See the GitHub repository here

The ability to detect anomalies in real-time in vast streams of data is a significant challenge and is crucial for timely decision-making in many domains, such as fraud detection in finance or intrusion detection in cybersecurity. Being able to respond to real-time threats requires systems that can analyse streaming data on-the-fly, identify patterns, and quickly raise alerts. This capability ensures organisations can promptly mitigate risks, protect their assets, and maintain the integrity of their operations.

Streaming data, by its nature, can be voluminous and fast. Designing algorithms that can keep pace with this data flow, providing accurate and timely insights without requiring vast amounts of memory, is a fascinating engineering challenge. This project will use a sketch-based approach to data processing and detection, which enables efficient memory-constrained data structures that are crucial in high-volume, big data environments where storing the stream in full (linear space) is impractical. These memory-efficient algorithms ensure that anomaly detection systems remain scalable even as data streams grow exponentially.

An important aspect of anomaly detection in real-time streaming data is ensuring adversarial robustness. Cyber attackers and malicious actors often attempt to disguise anomalies to evade detection by intentionally manipulating data streams. Thus, it is essential that the algorithms and models employed in anomaly detection can withstand and identify such adversarial strategies. This project will focus on one such technique, called algorithm switching, which seeks to withstand adversarial attacks by maintaining a number of independent copies of the underlying sketch and switching between them. By focusing on adversarial robustness, the project aims to deliver anomaly detection systems that remain reliable even under targeted attacks, safeguarding systems from sophisticated threats and ensuring the continuity and reliability of operations across various industries.

The versatile nature of anomaly detection techniques designed for streaming data means they can be applied across various industries, from overseeing IoT device networks to analysing user behaviour on web platforms. This project will primarily analyse transactional data in financial applications to help detect fraudulent activities swiftly, but the system as a whole could just as easily be applied to other industries, such as manufacturing, where real-time sensor data can identify potential issues before they escalate. By capitalising on this cross-industry applicability, the project aims to provide solutions that adjust to various settings and demands, ultimately enhancing security, efficiency, and reliability across different sectors.

Objective

The primary objective of this project is to focus on finding fraudulent anomalies in financial transactions by developing an anomaly detection system that is both adversarially robust and scalable. The system will be designed to withstand intentional data manipulations by leveraging adversarially robust streaming techniques, ensuring it can reliably identify fraudulent activity even in the face of sophisticated evasion tactics. Moreover, it will be built to handle massive streaming data flows with sub-linear space efficiency, allowing it to scale seamlessly in high-volume data streaming environments.

While this project will prioritise optimising the robustness and scalability of the system, assessing the efficacy of its anomaly detection logic is beyond its scope. It is quite possible that the efficacy of the system as a whole would benefit greatly from more sophisticated anomaly detection algorithms or models; however, the main focus remains the efficiency and security of the data processing structures underlying the system.

How this Project Differs

By combining different sketching algorithms, the system can leverage their complementary strengths, using Count-Min Sketch for frequency approximation, AMS sketches for second frequency moment (F2) estimation, and HyperLogLog for cardinality (distinct-element) tracking, to improve overall anomaly detection accuracy.

It also specifically focuses on robustness under adversarial conditions, where data might be intentionally manipulated to avoid detection.

System Design

Data Ingestion

Apache Kafka is an optimal choice for handling real-time data feeds in system designs that require low latency, high throughput, and scalability. Its distributed architecture enables Kafka to process massive data streams efficiently, with the capability to handle thousands of messages per second. By utilising Kafka, systems can maintain performance stability even as data volume grows, thanks to its scalable nature. Furthermore, Kafka’s design minimises latency, ensuring that data is processed and made available for consumption almost instantaneously. This is crucial for applications that depend on timely data processing, such as real-time analytics or monitoring systems. Additionally, Kafka’s robust ecosystem supports a wide range of integrations and tools, making it a versatile backbone for any data-driven architecture.
*For more specific information on how Kafka is implemented in this project, please see the project files on GitHub.

Data Processing

In the system design for data processing, we employ a combination of advanced sketching algorithms to enhance the accuracy and efficiency of anomaly detection. By leveraging the CountMin Sketch for frequency approximation, we can efficiently estimate the frequency of occurrences in data streams with high accuracy and minimal space requirements. Additionally, AMS sketches are utilised for F2 frequency estimation, providing a robust method for estimating the second frequency moment, which is crucial for understanding the variance and distribution of data streams. For tracking distinct elements within data streams, the HyperLogLog algorithm is integrated due to its exceptional ability to estimate cardinalities with high accuracy while using minimal memory. This synergistic use of different sketching algorithms allows the system to capitalise on their individual strengths, leading to significantly improved accuracy in detecting anomalies within large-scale data environments.
Again, this project focuses on the underlying data processing structures rather than the intricacies of the anomaly detection algorithms themselves. So, for simplicity's sake, the anomaly detection component employs a straightforward yet effective method based on quartile multiples.
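
To make the quartile-multiple idea concrete, here is a minimal sketch of what such a rule might look like in Python; the multiplier and the windowing are illustrative assumptions rather than the exact values used in the project.

```python
import numpy as np

def quartile_threshold(history, multiple=3.0):
    """Return an alert threshold as a multiple of the upper quartile.

    `history` is a window of recent sketch estimates (e.g. per-user
    transaction counts); the multiplier is a tunable assumption here,
    not the project's exact value.
    """
    q3 = np.percentile(history, 75)
    return multiple * q3

def is_anomalous(estimate, history, multiple=3.0):
    # Flag the new estimate if it exceeds the quartile-based threshold.
    return estimate > quartile_threshold(history, multiple)

# Example: a burst well above the recent upper quartile is flagged.
recent_counts = [4, 6, 5, 7, 6, 8, 5]
print(is_anomalous(40, recent_counts))  # True
```
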
For this project, we are applying this system specifically to the area of anomaly detection in the financial transaction space. We are employing simulated transactional data to navigate the challenges posed by the scarcity of confidential financial transactional information. Due to the sensitive nature of real financial data, obtaining access can be restrictive and fraught with privacy concerns. By creating simulated datasets, we can mimic the characteristics and behaviours of actual transactional data while adhering to privacy regulations and ethical standards. Although this simulated nature may not capture the complete accuracy and complexity of real-world financial transactions, it is sufficiently detailed to serve as an effective proof of concept. This approach allows us to test and refine our data processing and anomaly detection systems in a controlled environment, providing valuable insights and performance metrics that are indicative of how the system would perform with actual data. This method ensures that we can proceed with development and testing without compromising data security or privacy.

Implementation

Data Ingestion

Apache Kafka: In our system architecture, we integrate Apache Kafka with Confluent Cloud to streamline the processing of simulated data, leveraging Kafka's robust messaging capabilities in a managed cloud environment. This setup involves a single producer that generates and sends simulated data, mirroring the complexity and variability of real-world financial transactions. This producer is responsible for continuously publishing data streams to a Kafka topic, ensuring a consistent flow of data into the system. On the other side, we have a dedicated consumer constantly subscribed to this topic, listening for new messages as they arrive. The use of Confluent Cloud enhances this arrangement by providing a scalable, secure, and resilient infrastructure that manages Kafka clusters efficiently, reducing the operational overhead and complexity associated with self-managed Kafka setups. This configuration not only facilitates real-time data processing and analysis but also ensures high availability and fault tolerance, crucial for maintaining continuous data flow and processing without disruptions. This setup exemplifies a streamlined, cloud-based approach to data streaming that is both efficient and effective for handling large volumes of simulated transactional data.
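
As a rough illustration of this setup, here is a minimal sketch of a producer/consumer pair using the confluent-kafka client; the bootstrap server, credential placeholders, topic name, and consumer group are assumptions, and the project's actual configuration lives in the GitHub repository.

```python
import json
from confluent_kafka import Producer, Consumer

# Placeholder Confluent Cloud settings; the real endpoint, credentials,
# and topic name live in the project's configuration on GitHub.
CONF = {
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
}
TOPIC = "transactions"  # illustrative topic name

producer = Producer(CONF)

def publish(txn: dict) -> None:
    # Key each record by its transaction ID and send it as JSON.
    producer.produce(TOPIC, key=txn["transaction_id"], value=json.dumps(txn))
    producer.flush()

def consume_forever(handle) -> None:
    consumer = Consumer({**CONF, "group.id": "anomaly-detector",
                         "auto.offset.reset": "earliest"})
    consumer.subscribe([TOPIC])
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        handle(json.loads(msg.value()))  # hand each transaction to the sketches
```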

Data Processing

CountMinSketch: We use CountMin Sketch to track and aggregate several key attributes of transactions: the frequency of transactions associated with specific User IDs, the distribution of transaction amounts, the usage of different currencies, and the prevalence of transactions in distinct item categories. Additionally, we also monitor transaction amounts tied to specific locations.

Checking for: Abnormal changes in frequency
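
Below is a minimal, self-contained Count-Min Sketch in Python, included to illustrate the frequency tracking described above; the width/depth values and the blake2b-based hashing are illustrative choices, not the project's exact parameters.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: width/depth control the error/confidence
    trade-off; estimates can over-count but never under-count."""

    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Derive a per-row hash by salting the item with the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # The minimum across rows is the least-contaminated counter.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

# Example: count transactions per user ID and query an estimate.
cms = CountMinSketch()
for user in ["u1", "u2", "u1", "u3", "u1"]:
    cms.add(user)
print(cms.estimate("u1"))  # >= 3 (over-counts possible, under-counts not)
```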

AMS Sketch: We deploy the AMS Sketch to estimate the second frequency moment, or F2, of transactions, which provides crucial insights into the variance and spread of transaction values and frequencies across multiple dimensions. This sketch-based approach is integral for analysing the variance in transactional data, such as the skewness in transaction amounts for each user, indicating how unevenly transaction values are distributed. It also allows us to measure the variance in the number of transactions per given time intervals, such as hourly or daily, which is essential for understanding patterns of activity and potential anomalies.

Checking for: Abnormal changes in skewness
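
For illustration, here is a minimal AMS-style estimator for F2; it uses hash-derived ±1 signs as a stand-in for the 4-wise independent hash families the theory calls for, and the row/column counts are assumptions.

```python
import hashlib
import statistics

class AMSSketch:
    """Minimal AMS sketch: each counter accumulates +/-1-signed updates,
    and the square of a counter estimates F2; averaging within rows and
    taking the median across rows tightens the estimate."""

    def __init__(self, rows=5, cols=16):
        self.rows, self.cols = rows, cols
        self.counters = [[0] * cols for _ in range(rows)]

    def _sign(self, item: str, row: int, col: int) -> int:
        # Hash-derived +/-1 sign per (row, col) counter.
        digest = hashlib.blake2b(f"{row}:{col}:{item}".encode()).digest()
        return 1 if digest[0] % 2 == 0 else -1

    def add(self, item: str, count: int = 1) -> None:
        for r in range(self.rows):
            for c in range(self.cols):
                self.counters[r][c] += count * self._sign(item, r, c)

    def estimate_f2(self) -> float:
        row_means = [statistics.mean(z * z for z in row) for row in self.counters]
        return statistics.median(row_means)

# Example: F2 grows quadratically with a heavy hitter's frequency, so a
# sudden jump signals activity concentrating on a few keys.
ams = AMSSketch()
for user in ["u1"] * 50 + ["u2", "u3", "u4"]:
    ams.add(user)
print(ams.estimate_f2())  # roughly 2503 (= 50^2 + 3), up to estimation error
```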

HyperLogLog: The HyperLogLog algorithm plays a crucial role in efficiently tracking the cardinality of various transaction features. This data structure is employed to estimate the number of distinct elements in our data streams, such as unique User IDs, distinct transaction amounts, and different currencies used.

Checking for: Abnormal changes in cardinality i.e., the sudden creation of many users.
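
A small illustration of cardinality tracking, assuming the datasketch library's HyperLogLog is available; the precision parameter, field names, and spike rule here are illustrative, not the project's exact setup.

```python
from datasketch import HyperLogLog  # assumed dependency: pip install datasketch

# One HLL per feature whose distinct-value count we want to track.
hll_users = HyperLogLog(p=12)  # p trades memory for accuracy (illustrative value)

def observe_transaction(txn: dict) -> None:
    # Feed the user ID into the cardinality estimator.
    hll_users.update(txn["user_id"].encode("utf-8"))

def user_cardinality_spike(baseline: float, multiple: float = 3.0) -> bool:
    # Flag a sudden explosion of distinct user IDs relative to a baseline.
    return hll_users.count() > multiple * baseline

observe_transaction({"user_id": "u-1001"})
observe_transaction({"user_id": "u-1002"})
print(hll_users.count())  # approximately 2
```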

Adversarial Robustness

For adversarial robustness in the data processing system, we implement a technique known as "sketch/algorithm switching," specifically designed for insertion-only data streams. This method builds upon a foundational, albeit non-robust, algorithm (designated A) that provides a (1 + ε/20)-multiplicative approximation with probability 1 − δ/m². To tackle the limitations of A in adversarial environments, we run a series of independent copies of A, designated A1, A2, …, At, where t = O(ε⁻¹ log m). Because the tracked quantity can only grow in an insertion-only stream, the published estimate needs to increase by a (1 + ε/2) factor at most O(ε⁻¹ log m) times, so this many copies is enough to manage the error probability and ensure robust estimations without exhausting space resources prematurely.

During processing, each incoming stream item is independently passed to every copy Ai. We maintain a dynamic estimate that is updated conditionally: if the estimate from the currently focused copy Aindex exceeds the current estimate by more than a (1 + ε/2) factor, the system adopts this new value and moves to the next index. This conditional updating mitigates the risk of adversarial manipulation by frequently switching the estimation basis among multiple algorithmic instances, thus avoiding the pitfalls of any single instance being overly influenced or corrupted.

This technique offers a strategic balance between accuracy and computational overhead. While keeping multiple copies introduces a multiplicative space overhead of O(ε⁻¹ log m), it remains significantly more space-efficient than storing the entire input, especially under high-adversity conditions where robustness is critical. The algorithmic switching method therefore ensures that we maintain high-quality data fidelity and robustness without the prohibitive costs of traditional robust methods.
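
The following is a minimal sketch of the switching wrapper, under the assumption that each copy exposes add() and estimate() methods (e.g. a distinct-count or F2 sketch); the constant in the choice of t is illustrative.

```python
import math

class RobustSwitchingEstimator:
    """Sketch/algorithm switching for insertion-only streams (a minimal
    sketch of the idea, not the full analysis): keep t independent copies
    of a non-robust estimator, publish a frozen estimate, and only refresh
    it -- advancing to the next copy -- when the focused copy's value has
    grown by more than a (1 + eps/2) factor."""

    def __init__(self, make_copy, stream_length, eps):
        # t = O(eps^-1 * log m) copies; the constant here is an assumption.
        t = max(1, math.ceil((1.0 / eps) * math.log(max(stream_length, 2))))
        self.copies = [make_copy() for _ in range(t)]
        self.eps = eps
        self.index = 0
        self.current = 0.0

    def update(self, item) -> None:
        # Every copy sees every stream item.
        for copy in self.copies:
            copy.add(item)
        focused = self.copies[self.index].estimate()
        if focused > (1 + self.eps / 2) * max(self.current, 1e-12):
            # Adopt the new value and switch to the next fresh copy.
            self.current = focused
            self.index = min(self.index + 1, len(self.copies) - 1)

    def estimate(self) -> float:
        return self.current
```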

Testing

The financial transaction data was simulated through a Python script that generates both typical and anomalous financial transactions to be sent to a Kafka topic. This script selects random values from predefined lists of currencies and transaction categories, including everyday categories like groceries, utilities, and rent, as well as categories indicative of anomalous activity for later segments of data generation. The amounts for transactions are determined using a Pareto distribution, which is a common statistical method to model the distribution of wealth or, in this case, transaction amounts, making it a realistic representation of spending patterns where smaller transactions are more frequent but larger transactions are not uncommon.
For typical transactions, the script generates data points such as user IDs, transaction IDs, categories, amounts, and timestamps. It categorises the amount into intervals to facilitate analysis on various scales of transaction sizes. For the anomalous transactions, larger amounts and specific currencies considered less typical are used to simulate suspicious financial activities. Each transaction is produced to the Kafka topic with a unique key (the transaction ID) and a JSON-encoded value containing all the transaction details. The script includes a callback function to ensure successful delivery to the Kafka server, handling any errors in data transmission and confirming each message's successful delivery. This simulation approach provides a comprehensive dataset for testing financial systems or fraud detection algorithms, reflecting both normal user behaviour and potential fraudulent activities.
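
For illustration, here is a condensed sketch of what such a generator can look like; the field names, category/currency lists, and Pareto shape parameter are assumptions, and the full script is on GitHub.

```python
import json
import random
import time
import uuid

CURRENCIES = ["USD", "EUR", "GBP", "SGD"]          # illustrative lists; the
CATEGORIES = ["Groceries", "Utilities", "Rent"]    # script's full lists differ

def make_transaction(anomalous: bool = False) -> dict:
    # Pareto-distributed amounts: many small transactions, occasional large ones.
    amount = round(random.paretovariate(2.5) * 20, 2)
    if anomalous:
        amount *= 100                              # inflate suspicious amounts
        currency = random.choice(["XDR", "CHW"])   # less typical currencies
    else:
        currency = random.choice(CURRENCIES)
    return {
        "transaction_id": str(uuid.uuid4()),
        "user_id": f"user-{random.randint(1, 500)}",
        "category": random.choice(CATEGORIES),
        "currency": currency,
        "amount": amount,
        "timestamp": time.time(),
    }

def delivery_report(err, msg):
    # Callback passed to producer.produce(...) to confirm delivery or log errors.
    if err is not None:
        print(f"Delivery failed: {err}")

# Hypothetical usage with the producer/topic from the ingestion sketch:
# txn = make_transaction()
# producer.produce(TOPIC, key=txn["transaction_id"],
#                  value=json.dumps(txn), callback=delivery_report)
```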

Results

Once the simulated non-anomalous data was generated and the multiplicative thresholds were established, the algorithm was able to clearly detect all anomaly types: frequency (CMS), variance (AMS), and cardinality (HyperLogLog). In the generation of anomalous transactions, we flooded the system with an influx of new user IDs to check the HyperLogLog functionality; we flooded it with an excess of certain transaction categories (Groceries, Utilities, Rent, Mortgage) and currencies ('XDR', 'CHW'), along with abnormally high transaction amount totals, to check the functionality of the CMS; and we generated high transaction amount totals to check the variance functionality of the AMS sketch. Overall, the system was able to pick up the unusual features rather quickly.

Future Improvements

Besides the obvious flaws, such as using non-real-world financial transaction data and a dubious multiplicative-based anomaly detection threshold, further improvements can be made by making these thresholds adaptive or somewhat 'temporal' (i.e. certain times of day and certain months have higher spending amounts or more mortgage payments due, etc.); a rough sketch of this idea follows below. There is also the potential to layer these sketches for a more robust anomaly detection system and to add further sketches, such as t-digest, to efficiently estimate percentiles (a quantile sketch built on a variant of one-dimensional k-means clustering).
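
As a minimal sketch of what such a 'temporal' threshold might look like (hour-of-day buckets and the window size are illustrative choices):

```python
from collections import defaultdict
import numpy as np

# Keep a separate window of recent estimates per hour-of-day bucket so,
# e.g., evening spending or start-of-month mortgage payments get their
# own baseline instead of one global threshold.
windows = defaultdict(list)

def record(hour: int, estimate: float, max_window: int = 500) -> None:
    bucket = windows[hour]
    bucket.append(estimate)
    if len(bucket) > max_window:
        bucket.pop(0)

def temporal_threshold(hour: int, multiple: float = 3.0) -> float:
    bucket = windows[hour] or [0.0]
    return multiple * np.percentile(bucket, 75)
```
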
Overall, the current system serves as a good starting template for a highly scalable, efficient anomaly-detection data processing system, and with some slight tweaks it can be used across a variety of industries.

"The essence of greatness is the perception that virtue is perception that virtue is enough."
— Ralph Waldo Emerson