Anomaly Detector Project

This project fulfills the requirements for an anomaly-detection package (for the mythical e-commerce firm "Market-ter") that identifies and logs any user purchase found to be an anomaly (unusually high) in comparison to recent purchases made within that user's network of friends.

The specifications state that a purchase amount is an "anomaly" for a user if that amount exceeds the mean (average) of the most recent purchases among the user's network of friends by more than three times the standard deviation of all of those purchases.
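The anomaly test described above can be sketched as follows. Class and method names here are illustrative, not the package's actual API, and amounts are held as doubles for clarity (the real implementation works in integer pennies, per the scalability notes below):

```java
import java.util.Arrays;
import java.util.List;

/** Minimal sketch of the anomaly test; names and types are illustrative. */
public class AnomalyCheck {

    /** Returns true if amount exceeds mean + 3 * standard deviation
     *  of the recent purchases in the user's network. */
    public static boolean isAnomaly(double amount, List<Double> recentNetworkPurchases) {
        int n = recentNetworkPurchases.size();
        if (n < 2) {
            return false; // too few purchases to compute a meaningful deviation
        }
        double mean = recentNetworkPurchases.stream()
                .mapToDouble(Double::doubleValue).average().orElse(0);
        double variance = recentNetworkPurchases.stream()
                .mapToDouble(a -> (a - mean) * (a - mean)).sum() / n; // population variance
        double sd = Math.sqrt(variance);
        return amount > mean + 3 * sd;
    }
}
```

For example, against recent network purchases of 10, 12, 11, 9, and 13 dollars (mean 11, sd ≈ 1.41), a 100-dollar purchase is flagged while a 12.50 purchase is not.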


Javadocs documentation for this project: http://dvimont.github.io/insight-anomaly-detection/


Processing Overview

A batch run of the anomaly_detector package begins with execution of the #main method of the App class, which takes three String arguments representing the local paths for (1) the batch initialization file, (2) the stream log file (containing transactions for anomaly processing), and (3) the output ("flagged purchases") file.

An instance of the TransactionProcessor class is initialized by reading in and processing the contents of the batch initialization file, including the "degrees of separation" constraint (D) and the "threshold of purchases" constraint (T) from the first line of batch input. Following instantiation, one of the TransactionProcessor's #process methods is invoked to read in and process the stream log transactions; when an anomalous purchase is identified, it is written to the output ("flagged purchases") file.

For all transactions processed in the batch initialization phase and the stream log processing phase, each User involved in the transaction is retrieved via the static User#getOrCreateUser method, which (as the method name suggests) either retrieves an existing User object or creates a new one.
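The get-or-create pattern behind User#getOrCreateUser can be sketched like this (a simplified stand-in class, not the package's actual User implementation):

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch of the get-or-create pattern; the actual User class differs. */
public class UserRegistry {
    // Static registry of all known users, keyed by user id.
    private static final Map<String, UserRegistry> USER_MAP = new TreeMap<>();

    private final String id;

    private UserRegistry(String id) {
        this.id = id;
    }

    /** Returns the existing user for this id, or creates and registers a new one. */
    public static UserRegistry getOrCreateUser(String id) {
        return USER_MAP.computeIfAbsent(id, UserRegistry::new);
    }

    public String getId() {
        return id;
    }
}
```

Repeated calls with the same id return the same object, so every transaction touching a given user works against one shared User instance.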

Once User retrieval/creation is completed, each event is processed as follows:
  • "befriend" -- The user#befriend method is invoked for each user to establish the reciprocal relationship. (See note below regarding "extra" functionality added to the #befriend method.)
  • "unfriend" -- The user#unfriend method is invoked for each user to establish the reciprocal removal of relationship. (See note below regarding "extra" functionality added to the #unfriend method.)
  • "purchase" -- The user#addPurchase method is invoked to add the purchase to the user's PurchaseManager object, which adds the purchase to its internally-managed purchaseMap, subject to the "threshold of purchases" constraint: the submitted purchase will not be added if its timestamp precedes that of the earliest purchase in an already filled-to-threshold-capacity purchaseMap.
In the stream-log processing phase, one additional step is performed in "purchase" processing: before the User#addPurchase method is invoked, the User#getAnomalyData method is invoked with the purchase amount. The following sequence then transpires: the user's current network of friends is recursively assembled (based upon the user's friend connections and the "degrees of separation" system constraint), and a network-specific PurchaseManager object assembles a purchaseMap for the network (subject to the "threshold of purchases" system constraint). Calculations are then performed to determine whether the amount of the current purchase is an anomaly. If it is, a record is written by the TransactionProcessor to the output ("flagged purchases") file.
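The network-assembly step can be sketched as a bounded breadth-first expansion of the friendship graph (names here are illustrative; the package describes its own traversal as recursive, but the result is the same set of users within D degrees of separation):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of friend-network assembly up to D degrees of separation. */
public class NetworkAssembler {

    /** Breadth-first expansion from 'start', up to 'degrees' hops,
     *  excluding the starting user from the returned network. */
    public static Set<String> assembleNetwork(
            Map<String, Set<String>> friendGraph, String start, int degrees) {
        Set<String> network = new TreeSet<>();
        Set<String> frontier = new HashSet<>(Collections.singleton(start));
        Set<String> visited = new HashSet<>(frontier);
        for (int hop = 0; hop < degrees; hop++) {
            Set<String> next = new HashSet<>();
            for (String user : frontier) {
                for (String friend
                        : friendGraph.getOrDefault(user, Collections.<String>emptySet())) {
                    if (visited.add(friend)) { // skip users already reached by a shorter path
                        next.add(friend);
                        network.add(friend);
                    }
                }
            }
            frontier = next; // expand one more degree of separation
        }
        return network;
    }
}
```

With a chain 1–2–3–4 and D = 2, the network of user 1 is {2, 3}: user 4 sits three hops away and is excluded.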

Dependencies

To be executed, this package requires an environment in which Java 1.8 and Maven are installed.

This being a completely standard Maven-managed Java project, all dependencies are outlined in the project's pom.xml file, and Maven automatically manages these dependencies at build time. A build is automatically executed when either of the customized /run.sh or /insight_testsuite/run_tests.sh shell scripts is executed.

Command-line execution

While this package provides an API (documented in these Javadocs pages) intended to allow direct incorporation of its functionality into any existing Java 1.8-compliant system (via methods of the TransactionProcessor class), command-line execution may be done through submission of the following within this project's root directory. (Note that an example of the following is provided in the customized /run.sh script file.):
   mvn exec:java -Dexec.mainClass=org.commonvox.insight.anomaly_detector.App \
     -Dexec.args="./log_input/batch_log.json ./log_input/stream_log.json ./log_output/flagged_purchases.json"

The arguments on the second line above are mandatory, and should be tailored to represent the relative paths for three files:

  1. INPUT FILE containing batch transactions (used for initialization)
  2. INPUT FILE containing streaming transactions (analyzed for potential anomaly purchases)
  3. OUTPUT FILE (not necessarily an existing file) for flagged purchases (i.e. anomaly purchases)

Note that the transactions contained in all three files adhere to a proprietary JSON layout particular to Market-ter's systems, established via exemplars in the original specifications.


Customization of shell scripts was required

The original specifications provided two shell scripts, (1) run.sh and (2) /insight_testsuite/run_tests.sh, both of which required customization. Notably, the /insight_testsuite/run_tests.sh shell script required the addition of two lines, both corresponding to the standard pom.xml Maven project file.

WARNING: If the originally-provided run.sh and/or run_tests.sh scripts (which do not include inserted lines for Maven-style execution and to properly handle the Maven pom.xml file) are run against this package, a "BUILD FAILURE" will likely be encountered, with ERROR messages outputted potentially including:
"[ERROR] The goal you specified requires a project to execute but there is no POM in this directory ... Please verify you invoked Maven from the correct directory."

Naming conventions

The author of this package is a strong proponent of the dictums "call it what it is" and "don't give it another name if it's already got one". Thus, wherever possible, the names used in classes and methods of this package adhere to all explicit and implied naming conventions spelled out in the original specifications. Experience shows that a persistent adherence to such naming policies can yield dividends as a project grows to potentially involve more people and expand in complexity. In other words, solid and consistent naming conventions can help give a project like this an improved human and organizational "scalability".

Note on JSON format and output processing

The original specifications for the format of JSON input and output used in this package impose an extra constraint outside of established JSON standards: the original specifications require that the elements in each JSON object (the name:value pairs between curly brackets) be kept in a specific order. A relatively recent clarification document from the authorities overseeing JSON standards explicitly states that "[a JSON] object is an unordered collection of zero or more name/value pairs".

For those of us with extensive experience working with interoperability issues among homegrown toolkits and technologies produced by our corporate clients, it is neither surprising nor unusual to come across constraints such as the one imposed in this exercise: that the ordering of object-elements in a particular setting might need to be maintained in a non-standard way.

With that said, maintaining element order in an outputted JSON object does not come without a potential efficiency cost. In the solution presented here, a Java StringBuilder is used to (as efficiently as possible, yet rather crudely) "inject" the mean and sd elements at the end of the outputted "flagged_purchases" JSON objects. If this were a real-world engagement with a corporate client, one would inquire whether the applications that consume the "flagged_purchases" output are indeed hard-wired to require specifically-ordered JSON objects, or whether this constraint could be removed from the project specifications, allowing for more efficient output of "natural" (i.e. unordered) JSON object content.
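The StringBuilder-based injection amounts to something like the following sketch (the exact formatting of the real output records is not reproduced here; the element names "mean" and "sd" follow the description above):

```java
/** Sketch of injecting "mean" and "sd" elements at the end of an output
 *  JSON object to preserve the required element order. */
public class FlaggedPurchaseWriter {

    /** Inserts pre-formatted mean and sd values just before the closing brace. */
    public static String injectMeanAndSd(String purchaseJson, String mean, String sd) {
        StringBuilder sb = new StringBuilder(purchaseJson);
        int closingBrace = sb.lastIndexOf("}"); // position of the object's final brace
        sb.insert(closingBrace, ", \"mean\": \"" + mean + "\", \"sd\": \"" + sd + "\"");
        return sb.toString();
    }
}
```

Because the two extra elements are always appended last, the original ordering of the echoed purchase elements is untouched, at the cost of treating the JSON as raw text rather than as a parsed object.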

Validation of JSON input

While there was no explicit mention in the specifications regarding validation of input coming from JSON batch files or streams, it may well be that validation is needed (assuring things like valid timestamp formats, numeric and non-negative amounts in purchases, and whatever might constitute a valid user-id). On the other hand, the distributed architectures of an enterprise such as Market-ter might well provide for such validations "upstream" from this package's processes, obviating the need for the addition of validation logic (and its attendant overhead) in this package. The current implementation includes no validation for inputted JSON values, but validation logic could easily be added, if required. Additionally, a system-level variable or config parameter might be utilized to optionally activate or deactivate this new "layer" of validation processing.
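If such a validation layer were added, it might begin with simple per-field checks along these lines. This is a purely hypothetical sketch, not part of the current implementation, and the timestamp shape shown is an assumption about the input format:

```java
/** Hypothetical input-validation sketch; not part of the current implementation. */
public class InputValidator {

    /** Returns true if the amount string is a parseable, non-negative number. */
    public static boolean isValidAmount(String amount) {
        if (amount == null) {
            return false;
        }
        try {
            return Double.parseDouble(amount) >= 0;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Returns true if the timestamp matches an assumed
     *  "yyyy-MM-dd HH:mm:ss" shape (illustrative only). */
    public static boolean isValidTimestamp(String timestamp) {
        return timestamp != null
                && timestamp.matches("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}");
    }
}
```

As noted above, a config parameter could gate calls to such methods so the validation overhead is only paid when "upstream" systems cannot be trusted to have validated already.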

Choice of JSON parser

There are a number of reliable Java-based JSON parser packages available. In searching for an appropriate choice for this project, a brief research endeavor led to the selection of JSON.simple, which seems to offer both reliability and road-tested efficiency. Note that it would be extremely easy to swap out this JSON parser and replace it with any other (the beauty of open standards!).

Unit testing

The standard JUnit package is being employed for unit testing. A complete array of unit tests, covering all non-private methods in the User and PurchaseManager classes, can be found in the subdirectories of ./src/test/java/.

Note that this implementation is not synchronized

This initial prototype is intended for single-threaded use. In its current form, a TransactionProcessor is not intended to be concurrently accessed by multiple threads.

SCALABILITY issue 1: efficiency in maintenance of collections

In choosing which of the standard Java Collection classes to utilize in this package, the focus was on scalability: as the number of users, networks, and purchase-map entries potentially expands in the future, it was vital to secure the advantages of binary searches [O(log n) efficiency] over sequential searches [O(n) efficiency]. A quick glance at the "import" statements at the top of the User and PurchaseManager classes shows that only TreeSet and TreeMap implementations are currently in use, taking advantage of their "guaranteed log(n) time cost for basic operations".
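A TreeMap's sorted keys also make the "threshold of purchases" eviction described earlier cheap: the earliest purchase is always firstKey(). The following is an illustrative sketch, not the actual PurchaseManager (in particular, the real key format, which must be unique per purchase, is simplified here):

```java
import java.util.TreeMap;

/** Sketch of a threshold-bounded purchase map built on TreeMap's sorted keys. */
public class BoundedPurchaseMap {
    private final TreeMap<String, Integer> purchaseMap = new TreeMap<>();
    private final int threshold; // the "T" system constraint

    public BoundedPurchaseMap(int threshold) {
        this.threshold = threshold;
    }

    /** Adds a purchase unless it precedes the earliest entry of an already-full map. */
    public void addPurchase(String timestampKey, int amountInPennies) {
        if (purchaseMap.size() >= threshold) {
            if (timestampKey.compareTo(purchaseMap.firstKey()) <= 0) {
                return; // too old for a filled-to-threshold-capacity map
            }
            purchaseMap.remove(purchaseMap.firstKey()); // evict the earliest purchase
        }
        purchaseMap.put(timestampKey, amountInPennies);
    }

    public int size() {
        return purchaseMap.size();
    }

    public String earliestKey() {
        return purchaseMap.firstKey();
    }
}
```

Both the containment check (firstKey) and the insertion run in the TreeMap's guaranteed log(n) time.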

SCALABILITY issue 2: minimizing data conversions

Since it was clear that some data conversion would be necessary to fulfill the project requirements (most obviously the conversion of String-based amounts into a numeric format), a focus was applied to minimizing these conversions (to the extent reasonable and feasible), since such conversions have an impact on efficiency and scalability. The current policies for conversions are as follows:
  • all amounts read in from JSON streams will immediately be converted to numeric format and held and manipulated in memory in numeric format;
  • all timestamps and user IDs will never be converted, but instead kept in their original String format;
  • the System#nanoTime values used to establish unique purchase-transaction-keys are always held in memory in their original, long (numeric) format.

SCALABILITY issue 3: simplifying computations

In the first moments of design of this package, the knee-jerk selection of the BigDecimal class was made as the "container" of choice for amount data (converted from inputted JSON Strings); after all, both the standard Java documentation and a not-so-small army of Java gurus implore us to use BigDecimal objects when dealing with currency! But a little thought led to the realization that if amounts were held in memory as pennies instead of dollars, the complexity (and potential overhead, particularly in calculations) of the BigDecimal class could be avoided entirely. Thus, the Integer class (and primitive "int" instances, where appropriate) is used for holding all amounts in memory (in "penny" denominations) and for performing all computations, rounded to the nearest penny.
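The dollars-to-pennies conversion can be sketched as plain integer arithmetic (an illustrative helper, assuming non-negative amounts with at most two decimal places, as is typical for currency strings):

```java
/** Sketch of penny-based amount handling, avoiding BigDecimal. */
public class PennyMath {

    /** Converts a decimal-dollars string such as "16.83" into an int penny count.
     *  Assumes a non-negative amount with exactly two digits after any decimal point. */
    public static int toPennies(String amount) {
        int dot = amount.indexOf('.');
        if (dot < 0) {
            return Integer.parseInt(amount) * 100; // whole-dollar amount
        }
        int dollars = Integer.parseInt(amount.substring(0, dot));
        int cents = Integer.parseInt(amount.substring(dot + 1));
        return dollars * 100 + cents;
    }

    /** Formats a penny count back into a dollars string. */
    public static String toDollars(int pennies) {
        return String.format("%d.%02d", pennies / 100, pennies % 100);
    }
}
```

All subsequent mean and standard-deviation arithmetic can then operate on ints (or longs for large sums), with a single formatting step back to dollars at output time.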

SCALABILITY issue 4: readiness for distributed processing

While it was outside the explicit scope of this project, Java 1.8 makes Stream management an integral (and largely automatic) part of its architecture. Thus, while the TransactionProcessor class handles batch-style input of a "stream_log" file (in fulfillment of the original specification), it is also architected (though not yet tested) to handle real-time streaming input. It is easy to envision a suitably tested version of this package being used within Hadoop, HBase, or another distributed processing framework, directly accepting streaming data via the TransactionProcessor#processStreamInput method.

EXTRA DELIVERABLE: Prevention of out-of-sequence befriending and unfriending

During initial design of the User#befriend and User#unfriend methods, it became obvious that in a real-world implementation, just as purchase transactions might be received from their originating Internet sources out of sequence, so too might befriend and unfriend transactions be received out of sequence.

To handle this eventuality, extra functionality was added to this package seeking to assure that:
  • an unfriend transaction will be ignored if its submitted timestamp precedes the timestamp of the most recent (already-submitted) befriend transaction, and
  • a befriend transaction will be ignored if its submitted timestamp precedes the timestamp of the most recent (already-submitted) unfriend transaction.
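The two guards above can be sketched as a small piece of per-relationship state (names and the string-comparable timestamp format are illustrative, not the package's actual internals):

```java
/** Sketch of the out-of-sequence guard for befriend/unfriend events. */
public class FriendshipGuard {
    // Timestamps assumed lexicographically comparable, e.g. "2017-06-13 11:33:01".
    private String lastBefriendTimestamp = "";
    private String lastUnfriendTimestamp = "";
    private boolean friends = false;

    /** Applies a befriend event unless it precedes the latest unfriend already seen. */
    public boolean befriend(String timestamp) {
        if (timestamp.compareTo(lastUnfriendTimestamp) < 0) {
            return false; // stale event: a later unfriend has already been applied
        }
        lastBefriendTimestamp = timestamp;
        friends = true;
        return true;
    }

    /** Applies an unfriend event unless it precedes the latest befriend already seen. */
    public boolean unfriend(String timestamp) {
        if (timestamp.compareTo(lastBefriendTimestamp) < 0) {
            return false; // stale event: a later befriend has already been applied
        }
        lastUnfriendTimestamp = timestamp;
        friends = false;
        return true;
    }

    public boolean areFriends() {
        return friends;
    }
}
```

An unfriend event timestamped before the latest applied befriend is simply dropped, leaving the friendship intact, and vice versa.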

EXTRA DELIVERABLE: No overwriting of an already-existing "flagged_purchases" output file

It seems potentially dangerous to automatically overwrite output files from previous batch runs of this package. Thus, a newly commencing batch run automatically renames any existing "flagged_purchases" output file it finds (by appending a "." + string-timestamp suffix, followed by a final suffix of ".json", to the file name).
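The rename-instead-of-overwrite behavior amounts to something like this sketch (an illustrative helper; the actual suffix formatting may differ):

```java
import java.io.File;

/** Sketch of preserving an existing output file rather than overwriting it. */
public class OutputFileGuard {

    /** If 'output' exists, renames it by appending "." + timestampSuffix + ".json";
     *  returns the renamed File, or null if there was nothing to preserve
     *  (or the rename failed). */
    public static File preserveExisting(File output, String timestampSuffix) {
        if (!output.exists()) {
            return null; // nothing to preserve; the batch run can write freely
        }
        File renamed = new File(output.getPath() + "." + timestampSuffix + ".json");
        return output.renameTo(renamed) ? renamed : null;
    }
}
```

So a leftover flagged_purchases.json becomes, say, flagged_purchases.json.20170613113301.json before the new run opens its own output file.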

AN ADDITIONAL THOUGHT: Regarding outliers

While it was far outside of the scope of this exercise, as various sets of test data were being put through the anomaly-detection algorithms, it became apparent that the addition of a single extremely anomalous purchase could, in effect, shut down anomaly identification for any user within the network of the extreme purchaser. Until the extreme purchase ultimately cycled out of the "recent purchases" pool, almost nothing would be rated as an anomaly. This leads to a suspicion that any real-world system that attempted to accomplish similar goals of anomaly detection would need to employ more advanced algorithms to identify and remove "outlier" purchases from consideration in the anomaly-detection process.