cardagna committed Oct 31, 2024
2 parents 671b628 + ec80d4b commit 9dec165
Showing 4 changed files with 25 additions and 27 deletions.
4 changes: 2 additions & 2 deletions major review/introduction.tex
@@ -2,7 +2,7 @@ \section{Introduction}
{\color{OurColor}The wide success and adoption of cloud-edge infrastructures and their intrinsic multitenancy radically change the way in which distributed systems are developed, deployed, and executed, redefining IT scalability, flexibility, and efficiency.} Multitenancy, in fact, enables multiple users to share resources, such as computing power, storage, and services, optimizing their utilization and reducing operational costs.

{\color{OurColor}The increasing ability to collect and manage huge volumes of data, coupled with a paradigm shift in service delivery models, has also significantly enhanced scalability and efficiency in data analytics. Data are treated as digital products, which are managed and analyzed by multiple services orchestrated in pipelines. This shift is fostering the emergence of new platforms and environments, such as data marketplaces and data spaces, where data in critical domains (e.g., law enforcement, healthcare, transportation) can be pooled and shared to maximize data quality and trustworthiness, and distributed data management systems supporting data storing, versioning, and sharing for complex analytics processes.}\footnote{\url{https://joinup.ec.europa.eu/collection/elise-europeanlocation-interoperability-solutions-e-government/glossary/term/data-marketplace}, \url{https://internationaldataspaces.org/}, \url{https://digitalstrategy.ec.europa.eu/en/library/staff-working-documentdata-spaces}}
@@ -25,7 +25,7 @@ \section{Introduction}
\end{enumerate}

Based on the aforementioned considerations, we propose a data governance framework for {\color{OurColor}service-based data pipelines}.
The primary objective of our framework is to support the selection of data processing services within the pipeline, with a central focus on the selection of those services that {\color{OurColor}maximize} data quality, while upholding security and privacy requirements.\footnote{{\color{OurColor}We note that the assembly of the selected services in an executable pipeline is out of the scope of this paper. However, our approach is agnostic to the specific executable environment.}}
To this aim, each element of the pipeline is \textit{annotated} with \emph{i)} data protection requirements expressing transformations on data and \emph{ii)} functional specifications on services expressing the data manipulations carried out during each service execution.
Though applicable to a generic scenario, our data governance approach starts from the assumption that maintaining a larger volume of data leads to higher data quality; as a consequence, its service selection algorithm focuses on maximizing data quality {\color{OurColor}in terms of data completeness} by retaining the maximum amount of information when applying data protection transformations.

2 changes: 1 addition & 1 deletion major review/main.tex
@@ -16,7 +16,7 @@
\usepackage{tikz}
\usetikzlibrary{shapes,patterns,calc,shapes.geometric,arrows,positioning,backgrounds}
\usepackage{float}
\usepackage{tabularx}
\usepackage{multirow}
\usepackage{xcolor}
\usepackage{url}
2 changes: 1 addition & 1 deletion major review/pipeline_template.tex
@@ -80,4 +80,4 @@ \subsection{Functional Annotations}\label{sec:funcannotation}
\item a transformation function \tf{d} (out of the scope of this work) that changes the domain of the data.
\end{enumerate*}

For simplicity but with no loss of generality, we assume that all candidate services meet functional annotation \F{} and that \TF{}= \tf{}. As a consequence, all candidate services apply the same transformation to the data during the pipeline execution.
44 changes: 21 additions & 23 deletions major review/related.tex
@@ -12,16 +12,16 @@ \subsection{Data quality and data protection}\label{sec:dataquality}

%Data quality is a widely studied research topic across various communities and perspectives, such as the database community or when evaluating privacy preserving data mining techniques. In the context of big data, data quality primarily refers to the extent to which big data meets the requirements and expectations of its intended use, encompassing various dimensions and characteristics to ensure the data is reliable, accurate, and valuable for analysis and decision-making. Specifically, accuracy denotes the correctness of the data, ensuring it accurately represents the real-world entities and events it models.
Data quality is a widely studied research topic across various communities and perspectives. In the context of (big) data pipelines, data quality primarily refers to the extent to which (big) data meet the requirements and expectations of their intended use, encompassing various dimensions and characteristics to ensure the data are reliable, accurate, and valuable for analysis and decision-making. Specifically, accuracy denotes the correctness of the data, ensuring they accurately represent the real-world entities and events they model.
{\color{OurColor}
With the increasing need to protect sensitive data, the notion of data quality has expanded to include a broader concept of accuracy, particularly in terms of the proximity of a sanitized value to the original value.
This shift has emphasized the need for metrics to assess the quality of data resulting from anonymization processes.
Differential privacy \cite{dwork2008differential}, $k$-anonymity \cite{k-anon}, and $l$-diversity \cite{l-diversity} are three distinct techniques for data anonymization, with different protection levels and effects on data quality. For example, differential privacy is highly effective in maintaining confidentiality, but the added noise can reduce data precision, impacting analytical accuracy, whereas $k$-anonymity and $l$-diversity generally maintain higher data quality than differential privacy, but might still be unable to protect against sophisticated attacks.
}
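The trade-off between protection and quality sketched above can be illustrated with a toy example (the data, parameters, and helper functions below are hypothetical and not the paper's implementation): a Laplace-mechanism noisy mean for differential privacy, and a 10-year age generalization checked for $k$-anonymity.

```python
import math
import random

random.seed(0)
ages = [23, 25, 31, 34, 35, 38, 41, 44, 47, 49]  # toy dataset

# Differential privacy: Laplace mechanism on a mean query.
def laplace(scale):
    # Inverse-CDF sampling of Laplace(0, scale).
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

epsilon = 1.0
# Sensitivity of the mean over a bounded domain; the range is treated
# as publicly known here, a simplification for the sketch.
sensitivity = (max(ages) - min(ages)) / len(ages)
noisy_mean = sum(ages) / len(ages) + laplace(sensitivity / epsilon)
# noisy_mean is accurate only up to the injected noise.

# k-anonymity: generalize exact ages into 10-year intervals.
def generalize(age, width=10):
    lo = (age // width) * width
    return (lo, lo + width - 1)

k = 2
groups = {}
for age in ages:
    groups.setdefault(generalize(age), []).append(age)
# Every equivalence class must contain at least k records.
is_k_anonymous = all(len(g) >= k for g in groups.values())
```

The DP query keeps the schema intact but perturbs the answer, while generalization keeps answers exact but coarsens the values: precisely the quality/protection trade-off discussed above.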
Various data quality metrics have been proposed in the literature, including generalized information loss (\textit{GenILoss}), the discernibility metric, minimal distortion, and average equivalence class size ($C_{AVG}$), which may either have broad applicability or be tailored to specific data scenarios \cite{Majeed2021AnonymizationTF,bookMetrics,reviewMetrics}. However, there is currently no metric that is widely accepted by the research community. The main challenge with data quality is its relative nature: its evaluation typically depends on the context in which the data are used and often involves both objective and subjective parameters \cite{dataAccuracy,dataQuality}.
%
A common consideration across all contexts is that accuracy is closely related to the information loss resulting from the anonymization strategy: the lower the information loss, the higher the data quality. In our scenario, we have opted for two generic metrics rooted in data loss assessment (i.e., data completeness): one quantitative and one qualitative. Nonetheless, our framework and heuristic are designed to be modular and flexible, accommodating the chosen metric.
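An information-loss metric of this kind can be sketched for a single numeric attribute (a simplified illustration: the cited \textit{GenILoss} also aggregates over attributes and records, and the domain bounds below are assumptions):

```python
def gen_iloss(intervals, domain_min, domain_max):
    # Average normalized width of the released intervals:
    # 0 = exact values retained, 1 = everything generalized to the full domain.
    width = domain_max - domain_min
    return sum(hi - lo for lo, hi in intervals) / (width * len(intervals))

def completeness(released_cells, total_cells):
    # Quantitative completeness: fraction of cells surviving sanitization.
    return released_cells / total_cells

# Two ages generalized to 10-year bins over an assumed domain [0, 99].
loss = gen_iloss([(20, 29), (30, 39)], 0, 99)
```

Lower `gen_iloss` (or higher `completeness`) means more retained information, matching the "lower loss, higher quality" principle stated above.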

{\color{OurColor} While existing techniques provide sound and effective solutions that guarantee data quality and data protection, they are often unsuited to scenarios aiming to maximize data quality while ensuring data protection, have limited expressiveness (e.g., the definition of $k$ when $k$-anonymity is used to protect data), and are not applicable to pipelines orchestrating services owned by different providers. Our solution fills the above gaps by providing a framework for service-based data pipelines that supports the selection of data processing services maximizing data quality, while upholding privacy and security requirements. Service selection is driven by highly expressive policies, where data transformations built on data protection techniques (e.g., $k$-anonymity) are applied to data before they are used in the pipeline.}
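The per-stage selection rule this objective implies can be sketched minimally as follows (service names, the policy encoding as string labels, and the completeness scores are all hypothetical, not the paper's model):

```python
# Hypothetical candidates for one pipeline stage; each advertises the
# data-protection transformations it applies and the completeness it retains.
candidates = [
    {"name": "s1", "transformations": {"k-anon(k=5)"}, "completeness": 0.90},
    {"name": "s2", "transformations": {"k-anon(k=10)"}, "completeness": 0.75},
    {"name": "s3", "transformations": set(), "completeness": 1.00},
]

def select(stage_candidates, required):
    # Keep services satisfying the stage's data protection requirement,
    # then pick the one that retains the most information.
    eligible = [s for s in stage_candidates if required <= s["transformations"]]
    return max(eligible, key=lambda s: s["completeness"], default=None)

best = select(candidates, {"k-anon(k=5)"})
```

Filtering enforces the protection requirement first; maximizing completeness only among the eligible services mirrors the paper's goal of upholding security and privacy while losing as little information as possible.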

%%%%%%%%%%%%%%%%%%
\subsection{Data quality and data protection in service-based pipelines}\label{sec:datagov}
@@ -40,31 +40,29 @@ \subsection{Data quality and data protection in service-based pipelines}\label{s

\begin{table}[t!]
\centering
\footnotesize{
\begin{tabularx}{\textwidth}{>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X}
\toprule
\textbf{Solution} & \textbf{F1} & \textbf{F2} & \textbf{F3} & \textbf{F4} \\
\midrule
\textbf{Microsoft Presidio \cite{microsoft_presidio}} & \cmark, can integrate within cloud-edge pipelines & \tmark, focuses on data redaction & \cmark, compatible with diverse techniques & \tmark, pre-built PII detectors with configurable policies \\

\textbf{Apache Ranger \cite{apache_ranger}} & \tmark, mostly limited to cloud settings & \xmark, provides access control rather than service optimization & \cmark, integrates with various techniques & \cmark, high expressiveness with fine-grained policy control \\

\textbf{Google Cloud DLP \cite{google_cloud_dlp}} & \cmark, primarily within Google Cloud & \tmark, focuses on redaction and anonymization & \cmark, works across data types & \tmark, flexible templates for data masking and redaction policies \\

\textbf{AWS Macie \cite{aws_macie}} & \tmark, suited for AWS cloud infrastructure & \tmark, prioritizes data protection & \cmark, AWS-centric & \tmark, supports predefined PII types but less customizable \\

\textbf{IBM Guardium \cite{ibm_guardium}} & \cmark, supports hybrid cloud and on-prem setups & \xmark, focuses on monitoring and access control & \cmark, adaptable to multiple frameworks & \cmark, extensive policy-based access control and monitoring \\

\textbf{Apache Sentry \cite{apache_sentry}} & \tmark, Hadoop ecosystems & \xmark, static access control & \xmark, closely tied to Hadoop & \tmark, supports column- and row-level access control \\

\textbf{Our paper} & \cmark, suitable for cloud-edge environments & \cmark, selection of services that optimize quality while ensuring protection & \cmark, data-protection techniques agnostic & \cmark, high expressiveness with fine-grained policy control \\
\bottomrule
\end{tabularx}
}
\caption{Comparative analysis with relevant existing approaches. Feature support is classified according to \cmark (fully supported), \tmark (partially supported or limited in scope), \xmark (not supported)}
\label{tab:comparative}
