system model
cb-unimi committed Apr 19, 2024
1 parent ddd622c commit caa6d9f
Showing 4 changed files with 43 additions and 12 deletions.
23 changes: 23 additions & 0 deletions introduction copia.tex
@@ -0,0 +1,23 @@
\section{Introduction}
The usage of multitenancy coupled with cloud infrastructure represents a paradigm shift in the big data scenario, redefining scalability and efficiency in data analytics. Multitenancy enables multiple users to share resources, such as computing power and storage, optimizing resource utilization and reducing operational costs. Leveraging cloud infrastructure further enhances flexibility and scalability.
%allowing organizations to dynamically allocate resources based on demand while ensuring seamless access to cutting-edge data analytics tools and services.
%
The flip side of multitenancy is the increased complexity of data governance: the shared model introduces unique security challenges, as tenants may have different security requirements, levels of access, and data sensitivity. Adequate measures such as encryption, access control mechanisms, and data anonymization techniques must be implemented to safeguard data against unauthorized access and ensure compliance with regulatory requirements such as GDPR or HIPAA.
%
As a consequence, achieving a balance between data protection and data quality is crucial, as the removal or alteration of personally identifiable information from datasets to safeguard individuals' privacy can compromise the accuracy of analytics results.

So far, research endeavors have largely concentrated on exploring these two issues separately: on one hand, the concept of data quality, encompassing accuracy, reliability, and suitability, has been investigated to understand its implications in analytical contexts. Although extensively studied, these investigations often prioritize enhancing the quality of source data rather than ensuring data quality throughout the entire processing pipeline, or the integrity of outcomes derived from data. On the other hand, there is a focus on data privacy and security, entailing the protection of confidential information and adherence to rigorous privacy regulations.
There are very few solutions that find a good balance between the two, since doing so requires a holistic approach that integrates technological solutions, organizational policies, and ongoing monitoring and adaptation to emerging threats and regulatory changes.

To this end, we propose a data governance framework tailored to contemporary data-driven pipelines, which aims to limit privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality while upholding privacy and security requirements.

The key contributions of this study are as follows:
\begin{enumerate}
\item each service in the pipeline is tagged with \textit{annotations} that specify data protection requirements, expressing transformations on data to enforce data protection, as well as functional specifications of the data manipulations carried out during service execution;
\item the annotated pipeline, called \textit{pipeline template}, acts as a skeleton, specifying the structure of the pipeline and both the functional and non-functional requirements;
\item the \textit{pipeline instance} is built by instantiating the template with services according to the specified requirements. Our service selection approach focuses on maximizing data quality by retaining the maximum amount of information when applying data protection transformations;
\item the composite selection problem is NP-hard; we present a parametric heuristic tailored to address its computational complexity and evaluate the performance and quality of the heuristic through experiments on a dataset (an illustrative sketch of such a heuristic follows this list).
\end{enumerate}
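To make the last contribution more concrete, the following sketch shows one plausible shape of a parametric selection heuristic: a window-based greedy search whose window size is the parameter trading computation time for solution quality. It is a minimal illustration under stated assumptions; the function names, the \texttt{quality} metric, and the windowing strategy are not taken from the paper.
\begin{verbatim}
from itertools import product

def instantiate_pipeline(stages, candidates, quality, window=2):
    # stages:     ordered vertices of the pipeline template
    # candidates: dict mapping each stage to its policy-compliant services
    #             (assumed non-empty for every stage)
    # quality:    function scoring a partial selection, e.g. information
    #             retained after the data-protection transformations
    # window:     parameter trading optimality for computation time
    selection = []
    for i in range(len(stages)):
        horizon = stages[i:i + window]
        best_combo, best_q = None, float("-inf")
        # Exhaustively score only the next `window` stages, not the whole pipeline.
        for combo in product(*(candidates[s] for s in horizon)):
            q = quality(selection + list(combo))
            if q > best_q:
                best_combo, best_q = list(combo), q
        # Commit only the first service of the best window, then slide forward.
        selection.append(best_combo[0])
    return selection
\end{verbatim}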

The rest of the paper is organized as follows. %In Section \ref{sec:requirements} In Section \ref{} In Section \ref{}
16 changes: 10 additions & 6 deletions introduction.tex
@@ -7,17 +7,21 @@ \section{Introduction}
As a consequence, achieving a balance between data protection and data quality is crucial, as the removal or alteration of personally identifiable information from datasets to safeguard individuals' privacy can compromise the accuracy of analytics results.

So far, research endeavors have largely concentrated on exploring these two issues separately: on one hand, the concept of data quality, encompassing accuracy, reliability, and suitability, has been investigated to understand its implications in analytical contexts. Although extensively studied, these investigations often prioritize enhancing the quality of source data rather than ensuring data quality throughout the entire processing pipeline, or the integrity of outcomes derived from data. On the other hand, there is a focus on data privacy and security, entailing the protection of confidential information and adherence to rigorous privacy regulations.
There are very few solutions that find a good balance between the two, since doing so requires a holistic approach that integrates technological solutions, organizational policies, and ongoing monitoring and adaptation to emerging threats and regulatory changes. A valid solution should implement robust access control mechanisms, ensuring that only authorized users can access specific datasets or analytical tools.
Additionally, data protection requirements should be identified at each stage of the data lifecycle, potentially incorporating techniques like data masking and anonymization to safeguard sensitive information by substituting it with realistic but fictional data, thereby preserving data privacy while enabling analysis. An ideal solution should prioritize data lineage, fostering a comprehensive understanding and optimization of data flows and transformations within complex analytical ecosystems.
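As a concrete illustration of the masking technique mentioned above, i.e., substituting sensitive values with realistic but fictional ones, a minimal sketch could look as follows; the field names and the pseudonymization scheme are illustrative assumptions, not part of the framework.
\begin{verbatim}
import hashlib

# Illustrative masking: replace direct identifiers with deterministic
# pseudonyms and coarsen quasi-identifiers, so records remain analysable
# without exposing the original values.
def mask_record(record, salt="tenant-specific-salt"):
    masked = dict(record)
    pseudonym = "user-" + hashlib.sha256(
        (salt + record["name"]).encode()).hexdigest()[:8]
    masked["name"] = pseudonym
    masked["email"] = pseudonym + "@example.org"         # realistic but fictional
    masked["zip_code"] = record["zip_code"][:3] + "**"   # generalization
    return masked

print(mask_record({"name": "Alice Rossi",
                   "email": "alice@acme.eu",
                   "zip_code": "20133"}))
\end{verbatim}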

To this end, we propose a data governance framework tailored to contemporary data-driven pipelines, which aims to limit privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality while upholding privacy and security requirements.

Each service in the pipeline is tagged with \textit{annotations} that specify data protection requirements, expressing transformations on data to enforce data protection, as well as functional specifications of the data manipulations carried out during service execution. There is a catalog of services among which a user can choose. Services may be functionally equivalent (i.e., they perform the same task) but have more or less restrictive security policies\st{, depending on the provider's attributes}. A user therefore has a strong interest in understanding how their service selection choices impact the quality of the final result. Our service selection approach focuses on maximizing data quality by retaining the maximum amount of information when applying data protection transformations.
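To make the trade-off concrete, the following sketch contrasts two hypothetical, functionally equivalent catalog entries whose policies differ in restrictiveness; the annotation fields and the quality scores are illustrative assumptions rather than actual catalog data.
\begin{verbatim}
# Two functionally equivalent anonymization services from a hypothetical
# catalog. Both satisfy the same functional requirement, but the
# transformation each policy mandates retains a different amount of
# information, hence a different quality of the final result.
catalog = [
    {"id": "s1", "task": "anonymize",
     "policy": "suppress direct identifiers",   # less restrictive
     "expected_quality": 0.92},
    {"id": "s2", "task": "anonymize",
     "policy": "k-anonymize with k=10",         # more restrictive
     "expected_quality": 0.71},
]

def pick(task, min_quality=0.0):
    eligible = [s for s in catalog
                if s["task"] == task and s["expected_quality"] >= min_quality]
    return max(eligible, key=lambda s: s["expected_quality"], default=None)

print(pick("anonymize")["id"])   # the user's choice drives final quality
\end{verbatim}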


The key contributions are as follows:
\begin{enumerate}
\item each service in the pipeline is tagged with \textit{annotations} that specify data protection requirements, expressing transformations on data to enforce data protection, as well as functional specifications of the data manipulations carried out during service execution;
\item the annotated pipeline, called \textit{pipeline template}, acts as a skeleton, specifying the structure of the pipeline and both the functional and non-functional requirements;
\item the \textit{pipeline instance} is built by instantiating the template with services according to the specified requirements. Our service selection approach focuses on maximizing data quality by retaining the maximum amount of information when applying data protection transformations;
\item service enrichment with descriptive metadata (an illustrative annotation sketch follows this list);
\item the composite selection problem is NP-hard; we present a parametric heuristic tailored to address its computational complexity and evaluate the performance and quality of the heuristic through experiments on a dataset.
\end{enumerate}
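As anticipated in the list above, the annotations and enrichment metadata attached to a template vertex might be represented as a simple structured record; the schema below is purely illustrative and not the framework's actual format.
\begin{verbatim}
# Illustrative annotation for one vertex of a pipeline template: a data
# protection requirement (the transformation to enforce) plus a functional
# specification of the manipulation the service performs. Field names are
# assumptions made for the example.
annotation = {
    "vertex": "v3",
    "data_protection": {
        "requirement": "remove direct identifiers before release",
        "transformation": "pseudonymization",
    },
    "functional": {
        "operation": "aggregate",
        "input_schema": ["patient_id", "diagnosis", "zip_code"],
        "output_schema": ["diagnosis", "zip_code", "count"],
    },
}
\end{verbatim}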

The rest of the paper is organized as follows. %In Section \ref{sec:requirements} In Section \ref{} In Section \ref{}

ecosystem of services
Binary file modified main.pdf
Binary file not shown.
16 changes: 10 additions & 6 deletions system_model.tex
@@ -1,17 +1,17 @@
\section{System Model and Service Pipeline}\label{sec:requirements}
\st{Big data is highly dependent on cloud-edge computing, which makes extensive use of multitenancy.
Multitenancy permits sharing one instance of infrastructures, platforms or applications by multiple tenants to optimize costs. This leads to common scenarios where a service provider offers subscription-based analytics capabilities in the cloud, or a single data lake is accessed by multiple customers. Big data pipelines then mix data and services which belong to various organizations, posing a serious risk of potential privacy and security violations.
We propose a data governance framework tailored to contemporary data-driven pipelines, which aims to limit the privacy and security risks. The primary objective of this framework is to facilitate the assembly of data processing services, with a central focus on the selection of those services that optimize data quality, while upholding privacy and security requirements.}

In the remainder of this section, we present our system model (Section \ref{sec:systemmodel}) and our reference scenario (Section \ref{sec:service_definition}).

\subsection{System Model}\label{sec:systemmodel}
\st{In today's data landscape, the coexistence of data quality and data privacy is critical to support high-value services and pipelines. The increase in data production, collection, and usage has led to a split in scientific research priorities.
%This has resulted in two main focus areas.
First, researchers are exploring methods to optimize the usage of valuable data. Here, ensuring data quality is vital, and requires accuracy, reliability, and soundness for analytical purposes.
Second, there is a need to prioritize data privacy and security. This involves safeguarding confidential information and complying with strict privacy regulations. These two research directions are happening at the same time, though there are not many solutions that find a good balance between them.

Our approach seeks to harmonize these objectives by establishing a data governance framework that balances privacy and data quality. }
Our system model is derived from a generic big-data framework and is composed of the following parties:
\begin{description}
\item[Service,] a software component distributed by a \textbf{service provider} that performs a specific task \st{according to access control privileges on data}; %, a service can be tagged with some policies %, a service is characterized by two functions: the service function and the policy function.
@@ -20,6 +20,10 @@ \subsection{System Model}\label{sec:systemmodel}
\item[User,] executing an analytics pipeline on the data. We assume that the data targeted by the analytics pipeline are ready for analysis, i.e., they underwent a preparatory phase addressing issues such as missing values, outliers, and formatting discrepancies. This ensures that the data are in an optimal state for subsequent analysis.
\end{description}

We distinguish between a \textbf{pipeline template}, which acts as a skeleton specifying the structure of the pipeline (i.e., the chosen sequence of desired services) together with the functional and non-functional requirements driving service selection and composition, and a \textbf{pipeline instance}, which instantiates the template with services according to the specified requirements.
The \user first selects a pipeline template among a set of functionally-equivalent templates according to its non-functional requirements. It then instantiates the template in a pipeline instance. To this aim, for each component service in the template, it retrieves a set of candidate services that satisfy the functional requirements of the component service. Candidate services are filtered to retrieve a list of compatible services that comply with the policies specified in the template.
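The selection flow just described, i.e., retrieving functionally matching candidates, filtering them against the policies specified in the template, and then building the instance, can be summarized by the following sketch; the function and field names are assumptions introduced for illustration.
\begin{verbatim}
# Illustrative instantiation of a pipeline template: for every component
# service, keep only candidates matching its functional requirement and
# complying with the vertex policies, then apply the selection strategy
# (e.g., the quality-maximizing heuristic) to choose one of them.
def instantiate(template, catalog, choose):
    instance = []
    for vertex in template:                      # ordered component services
        candidates = [s for s in catalog if s["task"] == vertex["task"]]
        compatible = [s for s in candidates
                      if s["policy"] in vertex["allowed_policies"]]
        if not compatible:
            raise ValueError("no compatible service for " + vertex["id"])
        instance.append(choose(compatible))
    return instance
\end{verbatim}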

