
Commit

re-added major simulator
antongiacomo committed Oct 23, 2024
1 parent 4321f3d commit f87f16c
Showing 20 changed files with 110,658 additions and 8 deletions.
19 changes: 19 additions & 0 deletions major review/bib_on_BigDataAccessControl.bib
@@ -1,4 +1,23 @@
% Access Control
@misc{EuropeanParliament2016a,
  author = {{European Parliament} and {Council of the European Union}},
  title = {Regulation ({EU}) 2016/679 of the {European} {Parliament} and of the {Council}},
  titleaddon = {of 27 {April} 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing {Directive} 95/46/{EC} ({General} {Data} {Protection} {Regulation})},
  date = {2016-05-04},
  location = {OJ L 119, 4.5.2016, p. 1--88},
  url = {https://data.europa.eu/eli/reg/2016/679/oj},
  urldate = {2023-04-13},
  abstract = {The General Data Protection Regulation (2016/679, "GDPR") is a Regulation in European Union (EU) law on data protection and privacy in the EU and the European Economic Area (EEA).},
  keywords = {access consumer data data-processing freedom gdpr information justice law personal privacy protection security verification},
}
@misc{hipaa1996,
  author = {{U.S. Congress}},
  title = {Health Insurance Portability and Accountability Act of 1996},
  year = {1996},
  note = {Public Law 104-191, 110 Stat. 1936},
  url = {https://www.govinfo.gov/content/pkg/PLAW-104publ191/html/PLAW-104publ191.htm},
}

@inproceedings{Sandhu:ABAC:2018,
address = {New York, NY, USA},
16 changes: 8 additions & 8 deletions major review/introduction.tex
@@ -1,24 +1,24 @@
\section{Introduction}
The wide success and adoption of cloud-edge infrastructures and their intrinsic multitenancy radically change the way in which distributed systems are developed, deployed, and executed, redefining IT scalability, flexibility, and efficiency. Multitenancy, in fact, enables multiple users to share resources, such as computing power, storage, and services, optimizing their utilization and reducing operational costs. In this context, the increasing ability to collect and manage huge volumes of data, coupled with a paradigm shift in service delivery models, has also significantly enhanced scalability and efficiency in data analytics. Data are treated as digital products, which are managed and analyzed by multiple services orchestrated in data pipelines.

The flip side of this scenario, where data pipelines orchestrate services selected at run time and are delivered in the cloud-edge continuum, is the increased complexity of data governance. Data are shared and analyzed by multiple services owned by different providers. This shared and distributed model introduces unique security challenges, where the pipeline and the data owner may have different security requirements, access levels, and data sensitivity, which depend on the specific orchestrated services. The latter, in fact, have different profiles that affect the amount of data they can access and analyze according to the owner's requirements.
%
Adequate measures such as encryption, access control mechanisms, and data anonymization techniques (e.g., k-anonymity, l-diversity, differential privacy) have been implemented to protect data against unauthorized access and to ensure compliance with regulatory requirements such as GDPR~\cite{EuropeanParliament2016a} or HIPAA~\cite{hipaa1996}. At the same time, data quality is crucial and must be guaranteed, as the removal or alteration of personally identifiable information from datasets to safeguard individuals' privacy can compromise the accuracy of analytics results.
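
% Aside (editor's sketch): a minimal Swift illustration of the kind of transformation k-anonymity implies; all types and field names are hypothetical and not taken from the paper. Quasi-identifiers (ZIP code, age) are generalized, and any group still smaller than k is suppressed.
struct Record {
    let zipCode: String
    let age: Int
    let diagnosis: String
}

// Generalize quasi-identifiers: truncate ZIP codes, bucket ages into decades.
func generalize(_ r: Record) -> Record {
    Record(zipCode: String(r.zipCode.prefix(3)) + "**",
           age: (r.age / 10) * 10,
           diagnosis: r.diagnosis)
}

// Keep only records whose generalized quasi-identifier group has at least k members.
func kAnonymize(_ records: [Record], k: Int) -> [Record] {
    let generalized = records.map(generalize)
    let groups = Dictionary(grouping: generalized) { "\($0.zipCode)|\($0.age)" }
    return groups.values.filter { $0.count >= k }.flatMap { $0 }
}

let cohort = [Record(zipCode: "20133", age: 34, diagnosis: "flu"),
              Record(zipCode: "20139", age: 38, diagnosis: "asthma")]
print(kAnonymize(cohort, k: 2).count)  // 2: both generalize to ("201**", 30)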

So far, research endeavors have mainly concentrated on exploring these two issues separately: on one hand, \emph{data quality}, encompassing accuracy, reliability, and suitability, has been investigated to understand its implications in analytical contexts. Although extensively studied, these investigations often prioritize enhancing the quality of source data rather than ensuring data quality throughout the entire processing pipeline, or the integrity of outcomes derived from the data. On the other hand, work on \emph{data security and privacy} has focused on the protection of confidential information and adherence to rigorous privacy regulations.

A valid solution, however, requires a holistic approach that integrates technological solutions, organizational policies, and ongoing monitoring and adaptation to emerging threats and regulatory changes. The implementation of robust access control mechanisms, ensuring that only authorized users can access specific datasets or analytical tools, is a mandatory but only initial step. Additional requirements are emerging. First, data protection requirements should be identified at each stage of the data lifecycle, potentially integrating techniques like data masking and anonymization to safeguard sensitive information, thereby preserving data privacy while enabling data sharing and analysis. Second, data lineage should be prioritized, fostering a comprehensive understanding and optimization of data flows and transformations within complex analytical ecosystems.

When evaluating a solution meeting these criteria, the following questions naturally arise:
\begin{enumerate}
\item How does a robust data protection policy affect analytics?
\item When considering a (big data) pipeline, should data protection be implemented at each pipeline step rather than filtering all data at the outset?
\item In a scenario where a user has the option to choose among various candidate services, how might these choices affect the analytics?
\end{enumerate}

Based on the aforementioned considerations, we propose a data governance framework for modern data-driven pipelines, designed to mitigate privacy and security risks. The primary objective of this framework is to support the selection and assembly of data processing services within the pipeline, with a central focus on selecting those services that optimize data quality while upholding privacy and security requirements.
To this aim, each element of the pipeline is \textit{annotated} with \emph{i)} data protection requirements expressing transformations on data and \emph{ii)} functional specifications on services expressing the data manipulations carried out during each service execution.
Though applicable to a generic scenario, our data governance approach starts from the assumption that maintaining a larger volume of data leads to higher data quality; as a consequence, its service selection algorithm focuses on maximizing data quality by retaining the maximum amount of information when applying data protection transformations.
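
% Aside (editor's sketch): one greedy reading of this selection criterion, an assumption rather than the paper's actual algorithm. At each pipeline step, among the candidates that satisfy the step's data protection annotation, pick the service whose transformations retain the largest fraction of the data. All types and scores are hypothetical.
struct CandidateService {
    let name: String
    let satisfiesPolicy: Bool      // meets the step's data protection annotation
    let retainedFraction: Double   // share of information kept after its transformations
}

// Per step, choose the compliant candidate that preserves the most data;
// return nil if some step has no compliant candidate at all.
func selectServices(pipeline: [[CandidateService]]) -> [CandidateService]? {
    var chosen: [CandidateService] = []
    for candidates in pipeline {
        guard let best = candidates
            .filter({ $0.satisfiesPolicy })
            .max(by: { $0.retainedFraction < $1.retainedFraction })
        else { return nil }
        chosen.append(best)
    }
    return chosen
}
% The greedy form is only illustrative; when choices at one step constrain later steps, an exhaustive or dynamic-programming search over the composition would be needed.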

The primary contributions of the paper can be summarized as follows:
\begin{enumerate*}
11 changes: 11 additions & 0 deletions simulator-swift/.gitignore
@@ -0,0 +1,11 @@
.DS_Store
.build
/Packages
xcuserdata/
DerivedData/
.netrc
__pycache__
venv
.vscode
.build
.swiftpm
3 changes: 3 additions & 0 deletions simulator-swift/Dockerfile
@@ -0,0 +1,3 @@
# Base image providing the Swift toolchain
FROM swift:latest

# Install Python 3 tooling and MySQL client libraries alongside Swift
RUN apt-get update && apt-get install -y python3 python3-pip python3-venv python3-mysqldb libmysqlclient-dev
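
# For reference (not part of the commit), the image can be built and tried with
# standard Docker commands; the tag and build-context path are assumptions:
#   docker build -t simulator-swift ./simulator-swift
#   docker run --rm -it simulator-swift swift --version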
