From 17e7afc7796c239d682a35a7bdf1c11de62f49f8 Mon Sep 17 00:00:00 2001
From: cb-unimi <67868247+cb-unimi@users.noreply.github.com>
Date: Mon, 15 Apr 2024 16:14:48 +0200
Subject: [PATCH] updated system model description

---
 system_model.tex | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/system_model.tex b/system_model.tex
index d7d0392..8a77995 100644
--- a/system_model.tex
+++ b/system_model.tex
@@ -1,25 +1,28 @@
 \section{System Model and Service Pipeline}\label{sec:requirements}
-Big data is highly dependent on cloud-edge computing, which makes extensive use of multitenancy.
+\st{Big data is highly dependent on cloud-edge computing, which makes extensive use of multitenancy.
 Multitenancy permits sharing one instance of infrastructures, platforms or applications by multiple tenants to optimize costs. This leads to common scenarios where a service provider offers subscription-based analytics capabilities in the cloud, or a single data lake is accessed by multiple customers.
 Big data pipelines then mix data and services which belong to various organizations, posing a serious risk of potential privacy and security violations.
-
-We propose a data governance framework tailored to contemporary data-driven pipelines, which aims to limit the privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements.
+We propose a data governance framework tailored to contemporary data-driven pipelines, which aims to limit the privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements.}
 In the remainder of this section, we present our system model (Section \ref{sec:systemmodel}) and our reference scenario (Section \ref{sec:service_definition}).

 \subsection{System Model}\label{sec:systemmodel}
-In today's data landscape, the coexistence of data quality and data privacy is critical to support high-value services and pipelines. The increase in data production, collection, and usage has led to a split in scientific research priorities.
+\st{In today's data landscape, the coexistence of data quality and data privacy is critical to support high-value services and pipelines. The increase in data production, collection, and usage has led to a split in scientific research priorities.
 %This has resulted in two main focus areas.
 First, researchers are exploring methods to optimize the usage of valuable data. Here, ensuring data quality is vital, and requires accuracy, reliability, and soundness for analytical purposes. Second, there is a need to prioritize data privacy and security. This involves safeguarding confidential information and complying with strict privacy regulations. These two research directions are happening at the same time, though there are not many solutions that find a good balance between them.
-Our approach seeks to harmonize these objectives by establishing a data governance framework that balances privacy and data quality. It implements a system model that is composed of the following parties:
+Our approach seeks to harmonize these objectives by establishing a data governance framework that balances privacy and data quality.
+}
+Our system model is derived from a generic big-data framework and is composed of the following parties:
 \begin{description}
-    \item[Service,] a software distributed by a \textbf{service provider} that performs a specific task according to access control privileges on data; %, a service can be tagged with some policies %, a service is characterized by two function: the service function and the policy function.
-    \item[Pipeline,] a sequence of connected services that collect, prepare, process, and analyze data in a structured and automated manner. We distinguish between a \textbf{pipeline template} that acts as a skeleton, specifying the structure of the pipeline and the (non-)functional requirements driving service selection and composition, and a \textbf{pipeline instance} instantiating the template with services according to the specified requirements;
+    \item[Service,] software distributed by a \textbf{service provider} that performs a specific task \st{according to access control privileges on data}; %, a service can be tagged with some policies %, a service is characterized by two function: the service function and the policy function.
+    \item[Pipeline,] a sequence of connected services that collect, prepare, process, and analyze data in a structured and automated manner. \st{We distinguish between a \textbf{pipeline template} that acts as a skeleton, specifying the structure of the pipeline and the (non-)functional requirements driving service selection and composition, and a \textbf{pipeline instance} instantiating the template with services according to the specified requirements};
     \item[Data Governance Policy,] a structured set of privacy guidelines, rules, and procedures regulating data access and protection;
-    \item[User] that executes an analytics pipeline on the data. We assume that the data target of the analytics pipeline are ready for analysis, that is, they underwent a preparatory phase addressing issues such as missing values, outliers, and formatting discrepancies. This ensures that the data are in an optimal state for subsequent analysis.
+    \item[User,] executing an analytics pipeline on the data. We assume that the data targeted by the analytics pipeline are ready for analysis, i.e., they underwent a preparatory phase addressing issues such as missing values, outliers, and formatting discrepancies. This ensures that the data are in an optimal state for subsequent analysis.
 \end{description}
+We distinguish between a \textbf{pipeline template}, which acts as a skeleton specifying the structure of the pipeline, i.e., the chosen sequence of desired services, as well as the functional and non-functional requirements driving service selection and composition, and a \textbf{pipeline instance}, which instantiates the template with services according to the specified requirements.
+The \user first selects a pipeline template among a set of functionally-equivalent templates according to its non-functional requirements, and then instantiates the template in a pipeline instance. To this end, for each component service in the template, it retrieves a set of candidate services that satisfy the functional requirements of the component service; candidate services are then filtered to obtain the list of compatible services that comply with the policies specified in the template.
+%The \user starts its analytics by first selecting a pipeline template among a set of functionally-equivalent templates.
 The template is selected according to the \user\ non-functional requirements and then instantiated in a pipeline instance. In particular, for each component service in the template, a real service is selected among a list of compatible services in the instance. Compatible services are functionally equivalent and comply with the privacy policies specified in the template.
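A minimal sketch of the selection-and-filtering flow described by the new text (template chosen by non-functional requirements, candidate services retrieved per component, compatible services kept by policy compliance), assuming simple set-based matching; all class, field, and function names below are hypothetical and not part of the paper or the patch.

from dataclasses import dataclass, field

@dataclass
class Service:
    provider: str                       # service provider distributing the service
    functional_profile: str             # task the service performs (e.g., "anonymize")
    declared_policies: set[str] = field(default_factory=set)

@dataclass
class TemplateNode:
    functional_requirement: str         # functional requirement of the component service
    policies: set[str] = field(default_factory=set)   # governance policies to satisfy

@dataclass
class PipelineTemplate:
    nodes: list[TemplateNode]           # chosen sequence of desired services
    non_functional_requirements: set[str] = field(default_factory=set)

def select_template(templates: list[PipelineTemplate],
                    user_nfrs: set[str]) -> PipelineTemplate:
    """Pick a functionally-equivalent template covering the user's
    non-functional requirements (first match, for simplicity)."""
    return next(t for t in templates
                if user_nfrs <= t.non_functional_requirements)

def instantiate(template: PipelineTemplate,
                registry: list[Service]) -> list[Service]:
    """Build a pipeline instance: for each template node, retrieve candidate
    services matching the functional requirement, then keep only compatible
    services complying with the node's policies."""
    instance = []
    for node in template.nodes:
        candidates = [s for s in registry
                      if s.functional_profile == node.functional_requirement]
        compatible = [s for s in candidates
                      if node.policies <= s.declared_policies]
        if not compatible:
            raise ValueError(f"no compatible service for {node.functional_requirement}")
        instance.append(compatible[0])  # ranking among compatible services is out of scope here
    return instance

The two filtering steps mirror the prose: functional matching yields the candidate services, and policy compliance narrows them to the compatible services used to instantiate the pipeline.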