diff --git a/bib_on_BigDataAccessControl.bib b/bib_on_BigDataAccessControl.bib
index 860c324..06448cb 100644
--- a/bib_on_BigDataAccessControl.bib
+++ b/bib_on_BigDataAccessControl.bib
@@ -712,7 +712,7 @@ @INPROCEEDINGS{7014544
 doi={10.4108/icst.collaboratecom.2014.257649}
}
- @InProceedings{10.1007/978-3-642-10665-1_11,
+ @InProceedings{dataProtection,
 author="Creese, Sadie
 and Hopkins, Paul
 and Pearson, Siani
@@ -760,4 +760,37 @@ @article{Majeed2021AnonymizationTF
 url={https://api.semanticscholar.org/CorpusID:231616865}
}
+@article{dataAccuracy,
+author = {Richard Y. Wang and Diane M. Strong},
+title = {Beyond Accuracy: What Data Quality Means to Data Consumers},
+journal = {Journal of Management Information Systems},
+volume = {12},
+number = {4},
+pages = {5--33},
+year = {1996},
+publisher = {Routledge},
+doi = {10.1080/07421222.1996.11518099},
+}
+
+@article{dataQuality,
+author = {Tayi, Giri Kumar and Ballou, Donald P.},
+title = {Examining Data Quality},
+year = {1998},
+issue_date = {Feb. 1998},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+volume = {41},
+number = {2},
+issn = {0001-0782},
+url = {https://doi.org/10.1145/269012.269021},
+doi = {10.1145/269012.269021},
+journal = {Commun. ACM},
+month = {feb},
+pages = {54--57},
+numpages = {4}
+}
\ No newline at end of file
diff --git a/pipeline_template.tex b/pipeline_template.tex
index bc11cb3..aa80098 100644
--- a/pipeline_template.tex
+++ b/pipeline_template.tex
@@ -103,25 +103,25 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition}
 A {\it policy p}$\in$\P{} is a 5-tuple $<$\textit{subj}, \textit{obj}, \textit{act}, \textit{env}, \textit{\TP}$>$ that specifies who (\emph{subject}) can access what (\emph{object}) with which action (\emph{action}), in a specific context (\emph{environment}) and under specific obligations (\emph{data transformation}).
 \end{definition}
 \vspace{0.5em}
- More in detail, \textit{subject subj} specifies a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, (classifier $=$ ``SVM") specifies a service providing a SVM classifier. We note that \textit{subj} can also specify conditions on the service owner (\textit{e.g.}, owner\_location $=$ ``EU") and the service user (\textit{e.g.}, service\_user\_role $=$ ``DOC Director").
+ In more detail, \textit{subject subj} specifies a service $s_i$ issuing an access request to perform an action on an object. It is a set \{$pc_i$\} of \emph{Policy Conditions} on the subject's attributes as defined in Definition \ref{def:policy_cond}. For instance, (classifier $=$ ``SVM") specifies a service providing an SVM classifier. We note that \textit{subj} can also specify conditions on the service owner (\textit{e.g.}, owner\_location $=$ ``EU") and the service user (\textit{e.g.}, service\_user\_role $=$ ``DOC Director").
 %\item
- \textit{Object obj} defines those data whose access is governed by the policy. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}.
+ \textit{Object obj} defines the data governed by the access policy. In this case, it is a set \{$pc_i$\} of \emph{Policy Conditions} on the object's attributes. %as defined in Definition \ref{def:policy_cond}.
%It can specify the \emph{type} of object, such as a file (e.g., a video, text file, image, etc.), an SQL or NoSQL database, a table, a column, a row, or a cell of a table, or any other characteristics of the data.
 For instance, \{(type $=$ ``dataset"), (region $=$ ``CT")\} refers to an object of type dataset and whose region is Connecticut.
 %\item
- \textit{Action act} defines those operations that can be performed within a big data environment, from traditional atomic operations on databases (e.g., CRUD operations) to coarser operations, such as an Apache Spark Direct Acyclic Graph (DAG), Hadoop MapReduce, an analytics function call, and an analytics pipeline.
+ \textit{Action act} specifies the operations that can be performed within a big data environment, from traditional atomic operations on databases (e.g., CRUD operations) to coarser operations, such as an Apache Spark Directed Acyclic Graph (DAG), Hadoop MapReduce, an analytics function call, and an analytics pipeline.
 %\item
- \textit{Environment env} defines a set of conditions on contextual attributes, such as time of the day, location, IP address, risk level, weather condition, holiday/workday, emergency. It is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, (\textit{e.g.}, time $=$ ``night") refers to a policy that is applicable only at night.
+ \textit{Environment env} defines a set of conditions on contextual attributes, such as time of day, location, IP address, risk level, weather condition, holiday/workday, and emergency. Also in this case, it is a set \{$pc_i$\} of \emph{Policy Conditions} as defined in Definition \ref{def:policy_cond}. For instance, (time $=$ ``night") refers to a policy that is applicable only at night.
 %\item
- \textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj} that must be enforced before any access to data is given. Transformations focus on data protection, as well as on compliance to regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to the dataset in \cref{tab:dataset}:
+ \textit{Data Transformation \TP} defines a set of security and privacy-aware transformations on \textit{obj} that must be enforced before any access to data is given. Transformations focus on data protection, as well as on compliance with regulations and standards, in addition to simple format conversions. For instance, let us define three transformations that can be applied to the dataset in \cref{tab:dataset}, each performing a different level of anonymization:
 \begin{enumerate*}[label=\roman*)]
- \item \emph{level0} (\tp{0}): no anonymization;
- \item \emph{level1} (\tp{1}): partial anonymization with only first and last name being anonymized;
- \item \emph{level2} (\tp{2}): full anonymization with first name, last name, identifier and age being anonymized.
+ \item level \emph{l0} (\tp{0}): no anonymization;
+ \item level \emph{l1} (\tp{1}): partial anonymization with only first and last name being anonymized;
+ \item level \emph{l2} (\tp{2}): full anonymization with first name, last name, identifier, and age being anonymized.
 \end{enumerate*}
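+As a purely illustrative instance (the action and the combination of conditions are hypothetical, assembled from the examples above), a policy allowing a partner-owned SVM service to read the Connecticut dataset at night under partial anonymization could be written as $<$\{(classifier $=$ ``SVM"), (owner\_location $=$ ``EU")\}, \{(type $=$ ``dataset"), (region $=$ ``CT")\}, ``read", \{(time $=$ ``night")\}, \{\tp{1}\}$>$: before access is granted, \tp{1} masks first and last name, whereas \tp{2} would additionally mask identifier and age.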
diff --git a/pipeline_template_example.tex b/pipeline_template_example.tex
index c0d07ec..42423c4 100644
--- a/pipeline_template_example.tex
+++ b/pipeline_template_example.tex
@@ -47,7 +47,7 @@
 The second stage consists of vertex \vi{4}, merging the three datasets obtained at the first stage. Data protection annotation \myLambda(\vi{4}) refers to policies \p{1} and \p{2}, which apply different data transformations depending on the relation between the dataset and the service owner.
 % 2nd NODE %
-If the service owner is also the dataset owner (\pone), the dataset is not anonymized (\tp{0}). If the service owner is a partner of the dataset owner (\ptwo), the dataset is anonymized at \emph{level1} (\tp{1}). If the service owner has no partner relationship with the dataset owner, no policy applies.
+If the service owner is also the dataset owner (i.e., \pone), the dataset is not anonymized (\tp{0}). If the service owner is a partner of the dataset owner (i.e., \ptwo), the dataset is anonymized at level \emph{l1} (\tp{1}). If the service owner has no partner relationship with the dataset owner, no policy applies.
 %if the service owner is neither the dataset owner nor a partner of the dataset owner (\pthree), the dataset is anonymized at level \emph{l2} (\tp{2}).
 Functional requirement \F{4} prescribes $n$ datasets as input and the merged dataset as output.
diff --git a/related.tex b/related.tex
index 7932bd4..bf0b17e 100644
--- a/related.tex
+++ b/related.tex
@@ -1,50 +1,34 @@
 \section{Related Work}\label{sec:related}
+
+Given the breadth of topics covered, related work is discussed by area to provide a more detailed and organized review.
+Section \ref{sec:dataquality} reviews work at the intersection of data protection and data quality, Section \ref{ssec:datagov} discusses security-aware data governance solutions, and the last subsection addresses service selection based on data quality.
+
 %%%%%%%%%%%%%%%%%%%%
-\subsection{Data quality}
+\subsection{Data protection and data quality}\label{sec:dataquality}
 %%%%%%%%%%%%%%%%%%%%
-Data quality is a largely studied research topic, from different communities.
+Data quality is a largely studied research topic, investigated by different communities from different perspectives.
+In the big data domain, big data quality mainly refers to the degree to which big data meets the requirements and expectations of its intended use. It encompasses various dimensions and characteristics that ensure the data are reliable, accurate, and valuable for analysis and decision-making. Key dimensions of big data quality include:
+Accuracy: the correctness of the data, ensuring they accurately represent the real-world entities and events they are supposed to model.
+Consistency: the uniformity of the data across different datasets and over time, ensuring that data do not contradict themselves.
+Validity: the conformance of the data to defined formats and constraints, ensuring they are suitable for the intended use.
-to perform the data mining tasks in
-a privacy-preserving way. These techniques for performing privacy-preserving
-data mining are drawn from a wide array of related topics such as data mining,
-cryptography and information hiding.
-The main problem with data quality is that its evaluation is relative [18], in
-that it usually depends on the context in which data are used.
-non c'è un unica metrica, ci sono diverse metriche
+With the rise of the need to protect sensitive data, the question of quality has also come to encompass a larger notion of accuracy, understood as the degree to which a sanitized value is still valid.
- Accuracy: it measures the proximity of a sanitized value to the original
-value.
+In Privacy-Preserving Data Mining (PPDM), by contrast, the problem is reversed: one wants to measure the quality of the data resulting from the anonymization process, and metrics that measure such quality are therefore very important in the evaluation of PPDM techniques.
-In our selection we tried to choose una che potesse andare bene. E soprattutto abbiamo cercato di fare delle considerazioni rispetto a tutto il ciclo di vita del dato
-citare nostro journal
-The main feature of the most PPDM algorithms is that they usually modify
-the database through insertion of false information or through the blocking of
-data values in order to hide sensitive information. Such perturbation techniques
-cause the decrease of the data quality. It is obvious that the more the changes
-are made to the database, the less the database reflects the domain of interest.
-Therefore, data quality metrics are very important in the evaluation of PPDM
-techniques. In existing works, several data quality metrics have been proposed that are
-either generic or data-use-specific. However, currently, there is no metric that
-is widely accepted by the research community.
-In evaluating the data quality after the privacy preserving process, it can be
-useful to assess both the quality of the data resulting from the PPDM process
-and the quality of the data mining results.
-The quality of the data themselves
-can be considered as a general measure evaluating the state of the individual
-items contained in the database after the enforcement of a privacy preserving
-technique. The quality of the data mining results evaluates the alteration in the
-information that is extracted from the database after the privacy preservation
-process, on the basis of the intended data use.
-The main problem with data quality is that its evaluation is relative [18], in
-that it usually depends on the context in which data are used.
-
+In existing works, several data quality metrics have been proposed that are
+either generic or data-use-specific \cite{Majeed2021AnonymizationTF}.
+% TODO: add examples and references from \cite{Majeed2021AnonymizationTF}
+However, currently, there is no metric that is widely accepted by the research community. The main problem with data quality is that its evaluation is relative \cite{dataAccuracy,dataQuality}, in that it usually depends on the context in which data are used. In the scientific literature, data quality is generally considered a multi-dimensional concept that in certain contexts involves
-both objective and subjective parameters [3, 34]. Among the various possible
+both objective and subjective parameters. Among the various possible
 parameters, the following ones are usually considered the most relevant:
 - Accuracy: it measures the proximity of a sanitized value to the original value.
@@ -61,6 +45,14 @@ \subsection{Data quality}
 accuracy.
+In our selection, we tried to choose a metric that fits this setting and, above all, we reasoned in terms of the whole data life cycle.
+% TODO: cite our journal article
+
+In evaluating the data quality after the privacy-preserving process, it can be
+useful to assess both the quality of the data resulting from the PPDM process and the quality of the data mining results.
+
 The database management research community mainly focused on increasing the quality of the source data rather than guaranteeing data quality along the whole processing pipeline or the quality of outcomes built on data.
In \cite{BigDataQaulitySurvey}, a survey on big data quality is presented, recalling the well-known grouping of big data quality dimensions into intrinsic, contextual, representational, and accessibility categories.
@@ -68,16 +60,19 @@ \subsection{Data quality}
 In this paper, we depart from the idea of assessing data quality at pre-processing time only, measuring it at each step of the big data pipeline.
 % encryption techniques
-\cite{8863330} \cite{Majeed2021AnonymizationTF}
+%\cite{8863330}
+Works such as \cite{dataProtection} are close in spirit to ours, in the sense that they recognize the problem and study it: privacy and analytics can work in tandem, but the mining outcome of a privacy-aware design suffers in terms of data quality.
+% TODO: the articles cited there are also summarized in a table; decide whether to discuss them here or in the next section.
- \cite{10.1007/978-981-15-0372-6_19}
+
+In our case, we selected two metrics, one of each type; the framework and the approach, however, remain valid for any other choice of metric.
 %%%%%%%%%%%%%%%%%%
-\subsection{Data protection}
+\subsection{Data protection and data governance}\label{ssec:datagov}
 %%%%%%%%%%%%%%%%%%
 Research on data governance and protection focuses on the definition of new approaches and techniques aimed at protecting the security and privacy of big data (e.g., the CIA triad), as well as managing their life cycle with security and privacy in mind. Often, the research community targets specific security and privacy problems, resulting in a proliferation of solutions and tools that are difficult to integrate into a coherent framework. Many solutions have been developed to protect the users' identity (e.g., anonymity \cite{wallace1999anonymity}, pseudonymity \cite{pfitzmann2001pseudonymity}, k-anonymity \cite{k-anon}), to guarantee data confidentiality and integrity (e.g., encryption \cite{thambiraja2012survey}, differential privacy \cite{hassan2019differential}, access control \cite{tolone2005access,servos2017current}), and to govern data sharing and analysis (e.g., data lineage \cite{woodruff1997supporting}, ETL/ELT ingestion \cite{vassiliadis2009survey}).
@@ -99,6 +94,6 @@ \subsection{Data protection}
 In \cite{balancingInMedicine}, the authors observe that centralized data storage has pitfalls, especially regarding data privacy, and draft an IT infrastructure that uses decentralized storage to preserve privacy while still enabling data transfer between participating hospitals. The infrastructure implements an independent information broker to ensure the anonymity of patients, while providing a way for researchers to request data and for hospitals to contribute data on an opt-in basis. Although not an entirely new approach, the emphasis on data privacy throughout the design is a novel aspect, providing a better balance between the need for big sample sizes and patient privacy.
 %%%%%%%%%%%%%%%%%%%
-\subsection{Service Selection}
+\subsection{Service Selection based on data quality}
 %%%%%%%%%%%%%%%%%%%
 % TODO: cite TWEB
diff --git a/system_model.tex b/system_model.tex
index f531134..2f942ad 100644
--- a/system_model.tex
+++ b/system_model.tex
@@ -7,7 +7,7 @@ \subsection{System Model}\label{sec:systemmodel}
 \item[Service,] software distributed by a \textbf{service provider} that performs a specific task;
 \item[Pipeline,] a sequence of connected services that collect, prepare, process, and analyze data in a structured and automated manner;
 \item[Data Governance Policy,] a structured set of privacy guidelines, rules, and procedures regulating data access and protection.
\textcolor{red}{In particular, each component service in the pipeline is annotated with data protection requirements and functional specifications.}
- \item[User,] executing an analytics pipeline on the data.
+ \item[User,] executing an analytics pipeline on the data. \textcolor{red}{We assume the user is authorized to perform this operation, either as the data owner or as a data processor with the owner's consent.}
 \textcolor{red}{
 \item[Dataset,] the data target of the analytics pipeline. We assume the data are ready for analysis, i.e., they underwent a preparatory phase addressing issues such as missing values, outliers, and formatting discrepancies. This ensures that the data are in an optimal state for subsequent analysis.}
 \end{description}
 %
@@ -15,7 +15,7 @@ \subsection{System Model}\label{sec:systemmodel}
 \begin{definition}[\pipeline]\label{def:pipeline}
 % A \pipeline is as a direct acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V. The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, two additional vertices \vi{c},\vi{m}$\in$\V$_{\timesOperator}$$\subset$\V\ for each alternative ($\timesOperator$) structure modeling the alternative execution (\emph{choice}) of operations and the retrieval (\emph{merge}) of the results, respectively, and one additional vertex \vi{f} $\in$\V$_{\plusOperator}$$\subset$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of operations.
 A \pipeline is a directed acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V.
- The graph has a root \vi{r}$\in$\V ($\bullet$), a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, an additional vertex \vi{f}$\in$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of services.
+ The graph has a root ($\bullet$) vertex \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, and an additional vertex \vi{f}$\in$\V\ for each parallel ($\plusOperator$) structure modeling the concurrent execution (\emph{fork}) of services.
 \end{definition}
 We note that \V$=$\{\vi{r},\vi{f}\}$\cup$\V$_S$, with vertices \vi{f} modeling branching for parallel structures, and root \vi{r} possibly representing the orchestrator. In addition, for simplicity but without loss of generality, alternative structures modeling the alternative execution of services are specified as alternative service pipelines, that is, there is no alternative structure in a single service pipeline.
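+\textcolor{red}{To make Definition \ref{def:pipeline} concrete, consider a minimal sketch consistent with the running example, in which service \vi{4} merges three datasets retrieved in parallel (the vertex names \vi{1}, \vi{2}, \vi{3} are illustrative): the graph has root \vi{r}, a fork vertex \vi{f}, and service vertices \V$_S=$\{\vi{1},\vi{2},\vi{3},\vi{4}\}, with \E$=$\{(\vi{r},\vi{f}), (\vi{f},\vi{1}), (\vi{f},\vi{2}), (\vi{f},\vi{3}), (\vi{1},\vi{4}), (\vi{2},\vi{4}), (\vi{3},\vi{4})\}.}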