Commit

Claudio

cardagna committed Oct 23, 2024
1 parent 6d5aecb commit 4321f3d

Showing 2 changed files with 6 additions and 5 deletions.
8 changes: 4 additions & 4 deletions major review/introduction.tex
@@ -1,9 +1,9 @@
\section{Introduction}
The wide success and adoption of cloud infrastructures and their intrinsic multitenancy represent a paradigm shift in the big data scenario, redefining scalability and efficiency in data analytics. Multitenancy enables multiple users to share resources, such as computing power and storage, optimizing their utilization and reducing operational costs. Leveraging cloud infrastructure further enhances flexibility and scalability.
%
The flip side of multitenancy is the increased complexity in data governance: the shared model introduces unique security challenges, as tenants may have different security requirements, access levels, and data sensitivity. Adequate measures such as encryption, access control mechanisms, and data anonymization techniques must be implemented to safeguard data against unauthorized access and ensure compliance with regulatory requirements such as GDPR or HIPAA.
The wide success and adoption of cloud-edge infrastructures and their intrinsic multitenancy radically change the way in which distributed systems are developed, deployed, and executed, redefining IT scalability, flexibility, and efficiency. Multitenancy in fact enables multiple users to share resources, such as computing power, storage, and services, optimizing their utilization and reducing operational costs. In this context, the increasing ability to collect and manage huge volumes of data, coupled with a paradigm shift in service delivery models, has also significantly enhanced scalability and efficiency in data analytics. Data are treated as digital products, which are managed and analyzed by multiple services orchestrated in data pipelines.

The flip side of this scenario, where data pipelines orchestrate services selected at run time and delivered in the cloud-edge continuum, is the increased complexity of data governance. Data are shared and analyzed by multiple services owned by different providers. This shared and distributed model introduces unique security challenges, where the pipeline and the data owner may have different security requirements, access levels, and data sensitivity, which depend on the specific orchestrated services. The latter in fact have different profiles that affect the amount of data they can access and analyze according to the owner's requirements.
%
As a consequence, achieving a balance between data protection and data quality is crucial, as the removal or alteration of personally identifiable information from datasets to safeguard individuals' privacy can compromise the accuracy of analytics results.
Adequate measures such as encryption, access control mechanisms, and data anonymization techniques (e.g., k-anonymity, l-diversity, differential privacy) have been implemented to protect data against unauthorized access and ensure compliance with regulatory requirements such as GDPR or HIPAA. On the other hand, data quality is crucial and must be guaranteed, as the removal or alteration of personally identifiable information from datasets to safeguard individuals' privacy can compromise the accuracy of analytics results.
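As an illustrative aside (not part of the committed LaTeX), the k-anonymity notion named above can be sketched in a few lines: a table is k-anonymous when every combination of quasi-identifier values occurs at least k times. The field names and the `is_k_anonymous` helper below are hypothetical, chosen only for illustration:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "20100", "age": "30-39", "diagnosis": "flu"},
    {"zip": "20100", "age": "30-39", "diagnosis": "cold"},
    {"zip": "20133", "age": "40-49", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["zip", "age"], 2))  # False: the (20133, 40-49) group has only 1 record
```

This also makes the quality trade-off mentioned in the text concrete: forcing the third record into an existing group would require coarsening its zip and age values, degrading the dataset for analytics.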

So far, research endeavors have mainly concentrated on exploring these two issues separately: on one hand, \emph{data quality}, encompassing accuracy, reliability, and suitability, has been investigated to understand its implications in analytical contexts. Although extensively studied, these investigations often prioritize enhancing the quality of source data rather than ensuring data quality throughout the entire processing pipeline, or the integrity of outcomes derived from the data. On the other hand, research on \emph{data security and privacy} has focused on the protection of confidential information and adherence to rigorous privacy regulations.

3 changes: 2 additions & 1 deletion major review/main.tex
@@ -31,6 +31,7 @@
\usepackage{pifont}
\usepackage{epstopdf}
\usepackage{subcaption}
\colorlet{OurColor}{blue}
\graphicspath{{Images/}}

\theoremstyle{definition}
@@ -55,7 +56,7 @@
\maketitle

\begin{abstract}
~Today, the increasing ability of collecting and managing huge volume of data, coupled with a paradigm shift in service delivery models, has significantly enhanced scalability and efficiency in data analytics, particularly in multi-tenant environments. Data are today treated as digital products, which are managed and analyzed by multiple services orchestrated in data pipelines. This scenario calls for innovative solutions to data pipeline management that primarily seek to balance data quality and data protection. Departing from the state of the art that traditionally optimizes data protection and data quality as independent factors, we propose a framework that enhances service selection and composition in distributed data pipelines to the aim of maximizing data quality, while providing a minimum level of data protection. Our approach first retrieves a set of candidate services compatible with data protection requirements in the form of access control policies; it then selects the subset of compatible services, to be integrated within the data pipeline, which maximizes the overall data quality. Being our approach NP-hard, a sliding-window heuristic is defined and experimentally evaluated in terms of performance and quality with respect to the exhaustive approach. Our results demonstrate a significant reduction in computational overhead, while maintaining high data quality.
~Today, the increasing ability to collect and manage huge volumes of data, coupled with a paradigm shift in service delivery models, has significantly enhanced scalability and efficiency in data analytics, particularly in multi-tenant environments. Data are now treated as digital products, which are managed and analyzed by multiple services orchestrated in data pipelines. {\color{OurColor}This paradigm shift towards distributed systems built as service-based data pipelines has no counterpart in the definition of new data governance techniques that properly manage data across the pipeline lifecycle, calling for innovative solutions to data pipeline management that primarily seek to balance data quality and data protection. Departing from the state of the art, which traditionally targets single services and systems, and optimizes data protection and data quality as independent factors}, we propose a framework that enhances service selection and composition in distributed data pipelines with the aim of maximizing data quality, while providing a minimum level of data protection. Our approach first retrieves a set of candidate services compatible with data protection requirements in the form of access control policies; it then selects the subset of compatible services, to be integrated within the data pipeline, that maximizes the overall data quality. Since the selection problem is NP-hard, a sliding-window heuristic is defined and experimentally evaluated in terms of performance and quality against the exhaustive approach. Our results demonstrate a significant reduction in computational overhead, while maintaining high data quality.
\end{abstract}
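The two-step strategy the abstract describes, filtering candidate services by policy compatibility and then selecting per-stage services via a sliding window rather than an exhaustive search over the whole pipeline, might look roughly like the following sketch. The data layout (`compatible`/`quality` fields), the additive quality objective, and the commit-first-then-slide-by-one rule are assumptions for illustration, not the paper's actual algorithm:

```python
from itertools import product

def sliding_window_select(stages, window=2):
    """stages: one list of candidate services per pipeline stage, each candidate
    a dict with 'compatible' (passes the access control policies) and 'quality'.
    Returns one selected service per stage (assumes every stage keeps at least
    one compatible candidate)."""
    # Step 1: retain only services compatible with the data protection requirements
    candidates = [[s for s in stage if s["compatible"]] for stage in stages]

    selected = []
    for i in range(len(candidates)):
        window_stages = candidates[i:i + window]
        # Step 2: exhaustive search only *inside* the window, not over the pipeline
        best = max(product(*window_stages),
                   key=lambda combo: sum(s["quality"] for s in combo))
        selected.append(best[0])  # commit only the window's first choice, then slide
    return selected
```

The point of the window is the complexity bound: with n stages and at most m compatible candidates per stage, the exhaustive approach examines O(m^n) compositions, while the sketch above examines O(n * m^window).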

\tikzset{
