Design Perspective
In this section, we discuss the design considerations taken in developing the LA as a data-processing and data-analysis platform for big-data and machine-learning processes.
The envisioned platform should allow self-managed data-process mining and should be able to distribute the processing load over the network by creating a scalable processing overlay. As a first step, we envision describing the complete process, namely the Machine Learning Process Request (MLPR) or Complex-Event Machine Learning Request (CEMLR), in a computer-readable manner. A CEML process is described according to its parts: Pre-Processing Rules and Feature Extraction Rules (Data Pre-Processing Phase), the Learning Description (Learning Phase), the Evaluation Description (Continuous Validation Phase), and the Actuation Rules (Deployment Phase).

The Pre-Processing Rules describe how the fragmented raw input data or data streams are processed and aggregated. The Feature Extraction Rules define how features are extracted from the pre-processed data. The Learning Description defines the selection of an algorithm, its parameters, and the feature space for constructing a model. The Evaluation Description is used to construct an Evaluator, which is attached to the model and is responsible for providing real-time performance metrics about the model and for deciding whether it reaches the expected scores. Finally, the Actuation Rules describe how the system actuates whenever the model reaches the expected performance scores.

All steps are performed in an Execution Pipeline Environment (EPE), or distributed across a set of interconnected EPEs over the network. The output of the platform is the smart actuation of the system based on the predictions of the trained model. Moreover, a real-time monitoring infrastructure enables tracking of the distributed process. Lastly, the platform allows the export of a trained CEMLR in order to redistribute and replicate the process elsewhere.
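To make the structure of a CEMLR concrete, the following Python sketch shows how such a request might look with all four parts filled in. The field names, rule syntax, and `covers_all_phases` helper are hypothetical illustrations, not the platform's actual schema:

```python
# Hypothetical sketch of a CEMLR; field names and rule syntax are illustrative.
cemlr = {
    "name": "demo-request",
    "pre_processing_rules": [
        # e.g. a CEP statement that aggregates the raw input stream
        "select avg(temperature) as avgTemp from SensorStream#time(30 sec)"
    ],
    "feature_extraction_rules": [
        "select avgTemp, hour(timestamp) as hourOfDay from AggregatedStream"
    ],
    "learning": {
        "algorithm": "LinearRegression",   # selected algorithm
        "parameters": {"learning_rate": 0.01},
        "feature_space": ["avgTemp", "hourOfDay"],
    },
    "evaluation": {
        "metric": "RMSE",
        "target": 0.5,                     # expected score checked by the Evaluator
    },
    "actuation_rules": [
        # fired only once the model reaches the expected performance
        "on Prediction(avgTemp > 30) publish Alert"
    ],
}

def covers_all_phases(request):
    """Check that a request describes every CEML phase."""
    required = {"pre_processing_rules", "feature_extraction_rules",
                "learning", "evaluation", "actuation_rules"}
    return required.issubset(request)
```

A request of this shape could be validated on submission, rejecting any CEMLR that omits one of the four phases.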
Phase | Pre-Processing | Learning | Continuous Validation | Deployment
---|---|---|---|---
Captured by | Pre-Processing & Feature Extraction Rules | Learning Description | Evaluation Description | Actuation Rules
Produces | Streams | Models | Evaluation State | Streams
Implemented by | Statements in CEP Engines | Models | Evaluators | Statements in CEP Engines
Table 1: Mapping of each phase to the description that captures it, what it produces, and how it is implemented
Figure 1: Design view of a single instance of the CEML execution system
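The Evaluator's role described above can be sketched in a few lines of Python. This is a minimal illustration under assumed names and an assumed accuracy metric, not the platform's actual API: it tracks a streaming metric and decides whether the model has reached its expected score, which is the condition gating the Actuation Rules.

```python
class Evaluator:
    """Minimal sketch: tracks running accuracy over a prediction stream
    and decides whether the attached model meets its expected score."""

    def __init__(self, target_accuracy):
        self.target = target_accuracy
        self.correct = 0
        self.total = 0

    def update(self, prediction, ground_truth):
        # Called for every labelled sample arriving on the stream.
        self.total += 1
        if prediction == ground_truth:
            self.correct += 1

    def accuracy(self):
        return self.correct / self.total if self.total else 0.0

    def ready_for_deployment(self):
        # Actuation Rules should only fire once this returns True.
        return self.total > 0 and self.accuracy() >= self.target

ev = Evaluator(target_accuracy=0.8)
for pred, truth in [(1, 1), (0, 0), (1, 0), (1, 1), (0, 0)]:
    ev.update(pred, truth)
```

Because the metric is updated continuously, the decision to deploy can be revisited as new labelled data arrives, which is the essence of the Continuous Validation Phase.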
The possibility of describing the processes and executing them in execution pipelines allows for a reallocation and distribution of the computations according to the available applications and resources. In particular, it allows splitting the processes and redistributing them among the available computational resources regardless of their physical location. In part, this enables applications to spread their processes along the communication path and to use the resources available there, while reducing costs. Nevertheless, spreading the processes inevitably adds complexity to the applications, and therefore a new set of APIs for managing and monitoring the processes is needed.
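A simple way to picture this redistribution is a placement function that maps pipeline stages onto the available EPE nodes. The round-robin policy and the node names below are assumptions for illustration; the platform would choose placements based on actual resources and location:

```python
# Sketch of process distribution: assign CEML pipeline stages to EPE nodes
# round-robin. The policy and node names are illustrative assumptions.
def distribute(stages, epe_nodes):
    """Return a mapping from pipeline stage to EPE node."""
    if not epe_nodes:
        raise ValueError("at least one EPE node is required")
    return {stage: epe_nodes[i % len(epe_nodes)]
            for i, stage in enumerate(stages)}

placement = distribute(
    ["pre-processing", "feature-extraction", "learning",
     "evaluation", "deployment"],
    ["edge-gateway", "fog-node", "cloud"],
)
```

Even this toy policy shows why management and monitoring APIs become necessary: once stages run on different nodes, something must track where each stage lives and how it is performing.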
Figure 2 Design perspective of multiple CEML execution units
This set must be available to the different applications and stakeholders, addressing their needs and fulfilling the different requirements to achieve the individual application goals. The APIs can be divided into the I/O, Management, Monitoring, and ML Process APIs, each interacting with a different part of the application process: the Application Environment, the Application Developer, the System Monitor, and the External Model Backends, respectively. Moreover, each API should address one type of user (System Integrator, Application Developer, System Administrator, or Data Scientist) and the corresponding goal: interconnecting the data and responses, developing the application, monitoring the status of the system, and integrating new algorithms into the system. The I/O API allows the input of raw streams to the EPE and the output of the processed information. The Management API is a CRUD API for CEMLRs. The Monitoring API provides tools to monitor the system performance and the machine-learning processes in real time. The ML Process API allows connecting to the execution process to add new ML models to the system.
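To illustrate the CRUD role of the Management API, the following in-memory Python sketch shows the four operations over stored CEMLRs. The class, method names, and storage are assumptions for illustration, not the platform's actual interface:

```python
class ManagementAPI:
    """In-memory sketch of the CRUD API for CEMLRs; illustrative only."""

    def __init__(self):
        self._store = {}

    def create(self, request_id, cemlr):
        if request_id in self._store:
            raise KeyError(f"{request_id} already exists")
        self._store[request_id] = cemlr

    def read(self, request_id):
        return self._store[request_id]

    def update(self, request_id, cemlr):
        if request_id not in self._store:
            raise KeyError(f"{request_id} not found")
        self._store[request_id] = cemlr

    def delete(self, request_id):
        del self._store[request_id]

# An Application Developer registers a request, then revises its algorithm.
api = ManagementAPI()
api.create("r1", {"learning": {"algorithm": "LinearRegression"}})
api.update("r1", {"learning": {"algorithm": "HoeffdingTree"}})
```

In a RESTful deployment these operations would naturally map to POST, GET, PUT, and DELETE on a CEMLR resource.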
Figure 3 Structure of the API
Table 2 Mapping between users and APIs
 | I/O API | Management API | Monitoring API | ML Process API
---|---|---|---|---
User | System Integrator | Application Developer | System Administrator | Data Scientist
Finally, the APIs should adapt to the deployment of the system, and the deployment should adapt to the application's needs. On the one hand, the degree of distribution of the system brings a different set of requirements for the APIs. On the other hand, distributing the system has both advantages and disadvantages. When data is processed closer to the data sources, latency, privacy and confidentiality concerns, and networking dependencies are reduced. When data is processed closer to the cloud, higher availability, more computational power, and more reliable energy sources are available. All of this must be taken into consideration when building new systems.
To build the platform, we will build on a set of state-of-the-art technologies and best practices from the Internet of Things. For data and metadata representation and management, we may consider standards such as the OGC SensorThings API. For the data streams, as well as for multicast APIs in highly distributed deployments, a pub/sub protocol such as MQTT is intended. For unicast requests, a RESTful API is considered. For ubiquitous data sources, lightweight JSON payloads could be used; in case of heavy data loads, Google's Protocol Buffers could be an alternative. On the system side, Docker-based technology, together with Docker Swarm and Prometheus, could provide the necessary tools for deployment and monitoring. On the computational side, Complex-Event Processing (CEP) or BPMN engines could serve as the Execution Pipeline Environment for the execution of rules, with the rules described in a CEP DSL or in BPMN as a description language. Additionally, the engines should be extended to support analysis frameworks such as TensorFlow or DeepLearning4J. Moreover, we will consider incorporating well-known developer environments for data scientists such as Weka, R, and Python tools. However, there is no standardized way to describe a machine-learning model, which requires fundamental research and development. Finally, security can be added using TLS for channel encryption, XACML for policy management, and SAML for authentication.
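As an example of the lightweight JSON payloads mentioned above, a SensorThings-style observation for the I/O API could be serialized as follows. The payload shape is an assumption loosely modeled on the OGC SensorThings Observation entity; the platform's concrete schema is not fixed here:

```python
import json

# Hypothetical observation payload, loosely following the OGC SensorThings
# Observation shape; the concrete schema used by the platform is an assumption.
observation = {
    "phenomenonTime": "2017-05-04T12:00:00Z",  # when the value was measured
    "result": 21.5,                            # the measured value
    "Datastream": {"@iot.id": 42},             # link to the owning datastream
}

payload = json.dumps(observation)   # what would be published, e.g. over MQTT
restored = json.loads(payload)      # what the EPE would parse on input
```

The same dictionary could instead be encoded with Protocol Buffers when payload size matters more than human readability.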
Originally written by José Ángel Carvajal Soto.