@@ -1,2 +1,23 @@
\section{Introduction}
The adoption of multitenancy on cloud infrastructure represents a paradigm shift in big data, redefining the scalability and efficiency of data analytics. Multitenancy lets multiple users share resources such as computing power and storage, which optimizes resource utilization and reduces operational costs. Cloud infrastructure further enhances flexibility and scalability.
%allowing organizations to dynamically allocate resources based on demand while ensuring seamless access to cutting-edge data analytics tools and services.
%
The flip side of multitenancy is the increased complexity of data governance: the shared model introduces unique security challenges, as tenants may have different security requirements, access levels, and data sensitivity. Adequate measures such as encryption, access control mechanisms, and data anonymization techniques must be implemented to safeguard data against unauthorized access and to ensure compliance with regulations such as GDPR or HIPAA.
%
As a consequence, balancing data protection and data quality is crucial: removing or altering personally identifiable information to safeguard individuals' privacy can compromise the accuracy of analytics results.

So far, most research efforts have addressed these two issues separately. On the one hand, data quality, encompassing accuracy, reliability, and suitability, has been investigated to understand its implications in analytical contexts. Although extensive, these investigations often prioritize enhancing the quality of source data rather than ensuring data quality throughout the entire processing pipeline or the integrity of the outcomes derived from data. On the other hand, work on data privacy and security focuses on protecting confidential information and adhering to rigorous privacy regulations.
Very few solutions strike a good balance between the two, since doing so requires a holistic approach that integrates technological solutions, organizational policies, and ongoing monitoring and adaptation to emerging threats and regulatory changes.

To this end, we propose a data governance framework tailored to contemporary data-driven pipelines and designed to limit privacy and security risks. The primary objective of the framework is to support the composition of data processing services, with a central focus on selecting those services that optimize data quality while upholding privacy and security requirements.

The key contributions of this study are as follows:
\begin{enumerate}
\item each service in the pipeline is tagged with \textit{annotations} that specify data protection requirements, expressing the transformations to apply to data to enforce protection, as well as functional specifications, expressing the data manipulations carried out during service execution;
\item the annotated pipeline, called \textit{pipeline template}, acts as a skeleton that specifies the structure of the pipeline along with its functional and non-functional requirements;
\item the \textit{pipeline instance} is built by instantiating the template with services that satisfy the specified requirements. Our service selection approach maximizes data quality by retaining as much information as possible when data protection transformations are applied;
\item the composite selection problem is NP-hard; we present a parametric heuristic tailored to address its computational complexity, and we evaluate the performance and quality of the algorithm through experiments on a dataset.
\end{enumerate}

The rest of the paper is organized as follows. In Section \ref{sec:requirements} In Section \ref{} In Section \ref{}