diff --git a/Simulator/__pycache__/node.cpython-311.pyc b/Experiments/Simulator/__pycache__/node.cpython-311.pyc similarity index 100% rename from Simulator/__pycache__/node.cpython-311.pyc rename to Experiments/Simulator/__pycache__/node.cpython-311.pyc diff --git a/Simulator/__pycache__/nodeList.cpython-311.pyc b/Experiments/Simulator/__pycache__/nodeList.cpython-311.pyc similarity index 100% rename from Simulator/__pycache__/nodeList.cpython-311.pyc rename to Experiments/Simulator/__pycache__/nodeList.cpython-311.pyc diff --git a/Simulator/__pycache__/service.cpython-311.pyc b/Experiments/Simulator/__pycache__/service.cpython-311.pyc similarity index 100% rename from Simulator/__pycache__/service.cpython-311.pyc rename to Experiments/Simulator/__pycache__/service.cpython-311.pyc diff --git a/Simulator/main.py b/Experiments/Simulator/main.py similarity index 100% rename from Simulator/main.py rename to Experiments/Simulator/main.py diff --git a/Simulator/node.py b/Experiments/Simulator/node.py similarity index 100% rename from Simulator/node.py rename to Experiments/Simulator/node.py diff --git a/Simulator/nodeList.py b/Experiments/Simulator/nodeList.py similarity index 100% rename from Simulator/nodeList.py rename to Experiments/Simulator/nodeList.py diff --git a/Simulator/requirements.txt b/Experiments/Simulator/requirements.txt similarity index 100% rename from Simulator/requirements.txt rename to Experiments/Simulator/requirements.txt diff --git a/Simulator/service.py b/Experiments/Simulator/service.py similarity index 100% rename from Simulator/service.py rename to Experiments/Simulator/service.py diff --git a/pipeline_instance_example.tex b/pipeline_instance_example.tex index 42060f5..bc5e125 100644 --- a/pipeline_instance_example.tex +++ b/pipeline_instance_example.tex @@ -4,40 +4,6 @@ It includes three key stages in our reference scenario: data anonymization (\vi{1}), data enrichment (\vi{2}), and data aggregation (\vi{3}), each stage with its policy $p$. 
- \begin{table*} - \caption{Services and their quality metrics.} - \label{tab:services} - \centering - \begin{tabular}[t]{ccc} - \toprule - \textbf{Stage} & \textbf{Transformation} & \textbf{Service} \\ - \midrule - \vi{1} & $p_1$ & $s_1$ \\ - \vi{1} & $p_1$ & $s_2$ \\ - \vi{2} & $p_2$ & $s_3$ \\ - \vi{2} & $p_2$ & $s_4$ \\ - \vi{3} & $p_3$ & $s_5$ \\ - \vi{3} & $p_3$ & $s_6$ \\ - \bottomrule - \end{tabular} - \hspace{1em} - \begin{tabular}[t]{c|c} - \toprule - \textbf{Type} & \textbf{Transformation} \\ - \midrule - $\TF{\epsilon}$ & $Empty $ \\ - $\TF{a}$ & $Additive$ \\ - $\TF{t}$ & $Transformation$ \\ - $\TF{d}$ & $Domain Change$ \\ - \bottomrule - \end{tabular} - - \end{table*} - - The second stage \vi{1} is a preprocessing and cleaning serivce, - which implements - - The filtering algorithm then returns the set $S'=\{s_1,s_2\}$. The comparison algorithm is finally applied to $S'$ and returns a ranking of the services according to quality metrics, where $s_1$ is ranked first. $s_1$ is then selected and integrated in $\vii{1}\in \Vp$. diff --git a/pipeline_template.tex b/pipeline_template.tex index bdf74df..ca19e42 100644 --- a/pipeline_template.tex +++ b/pipeline_template.tex @@ -24,7 +24,8 @@ \subsection{Pipeline Template Definition}\label{sec:templatedefinition} \end{enumerate} \end{definition} -We note that, at this stage, the template is not yet linked to any services, nor it is possible to determine the policy modeling the specific data protection requirements. We also note that policies $p_j$$\in$\P{i} annotated with \myLambda(\vi{i}) are ORed, meaning that the access decision is positive if at least one policy $p_j$ is evaluated to \emph{true}. +We note that, at this stage, the template is not yet linked to any services, nor is it possible to determine the policy modeling the specific data protection requirements.
+We also note that policies $p_j$$\in$\P{i} annotated with \myLambda(\vi{i}) are ORed, meaning that the access decision is positive if at least one policy $p_j$ is evaluated to \emph{true}. %We also note that functional description $F_i$ includes the specific data transformation triggered as the result of a service execution. An example of pipeline template is depicted in \cref{fig:service_composition_template} diff --git a/pipeline_template_example.tex b/pipeline_template_example.tex index e43dece..88b9c50 100644 --- a/pipeline_template_example.tex +++ b/pipeline_template_example.tex @@ -1,50 +1,64 @@ \subsection{Example}\label{sec:example} +\newcommand{\pone}{$\langle service,owner=dataset.owner\rangle$} +\newcommand{\ptwo}{$\langle service,owner=partner(dataset.owner) \rangle$} +\newcommand{\pthree}{$\langle service, owner \neq dataset.owner \wedge owner \neq partner(dataset.owner)\rangle$} + In this section, we present an illustrative pipeline template, concentrating on the policy annotations. -The pipeline template consists of six stages, and each stage is noted with a policy. +The pipeline template consists of five stages, and each stage is annotated with a policy. All these policies are outlined in \cref{tab:anonymization}. -Additionally, \cref{tab:dataset} shows a sample of the dataset. -It is assumed that the Connecticut Prison (CTP) is the data owner, with partnerships with two other facilities, namely New York Prison and -New Hampshire Prison. +We recall that \cref{tab:dataset} shows a sample of the dataset.
+\hl{It is assumed that the Connecticut Prison (CTP) is the data owner, with partnerships with two other facilities, namely New York Prison and + New Hampshire Prison.}\hl{MOVE TO THE SYSTEM MODEL?} In the following, we refer to three different types of anonymization: \begin{enumerate*}[label=\roman*)] - \item \emph{none} (\tf{1}): no anonymization is performed; - \item \emph{light} (\tf{2}): the data is partially anonymized, only the first name and last name are anonymized; - \item \emph{full} (\tf{3}): the data is fully anonymized: first name, last name, identifier and age are anonymized. + \item \emph{level0} (\tf{0}): no anonymization is performed; + \item \emph{level1} (\tf{1}): the data is partially anonymized: only the first name and last name are anonymized; + \item \emph{level2} (\tf{2}): the data is fully anonymized: first name, last name, identifier, and age are anonymized. \end{enumerate*} + Let us consider the pipeline template \tChartFunction in \cref{sec:example}. % 1st NODE % -The first vertex is responsible for data anonymization and is associated with three policies (\p{1},\p{2},\p{3}). -During the node execution, the policies are assessed: -if the service profile matches with the data owner ($owner = ``CTP"$), \p{1} is satisfied and the data is not anonymized (\tf{1}); -if the service profile matches with a partner of the owner ($owner = ``CTP"$), \p{2} is satisfied and the data is partially anonymized (\tf{2}); -if the service profile doesn't match with a partner nor with the owner ($owner = ``CTP"$), \p{3} is satisfied and the data is fully anonymized (\tf{3}). +The first stage consists of three parallel vertices (\vi{1}, \vi{2}, \vi{3}) and focuses on data collection without applying any policies. +The functional requirement specifies a URI as input and the downloaded dataset as output.
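The three anonymization levels above can be sketched in Python, the language of the accompanying Simulator. This is a hypothetical illustration (the `anonymize` helper and the `"***"` masking token are not part of the paper); the column names follow the dataset sample.

```python
# Hypothetical sketch of the three anonymization levels (tf0, tf1, tf2).
# Column names follow the dataset sample table of the paper.
LEVELS = {
    0: set(),                                             # level0: nothing anonymized
    1: {"FIRST NAME", "LAST NAME"},                       # level1: partial
    2: {"FIRST NAME", "LAST NAME", "IDENTIFIER", "AGE"},  # level2: full
}

def anonymize(record: dict, level: int) -> dict:
    """Return a copy of the record with the level's columns masked."""
    masked = LEVELS[level]
    return {k: ("***" if k in masked else v) for k, v in record.items()}
```

A level0 call returns the record unchanged, while level2 masks all identifying columns.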
+ +The second stage incorporates a single vertex, which merges the three datasets obtained from the previous stage and is associated with three policies (\p{1},\p{2},\p{3}). +The policies are evaluated during node execution: +%if the service profile matches with the data owner ($owner = ``CTP"$), \p{1} is satisfied and the data is not anonymized (\tf{1}); +%if the service profile matches with a partner of the owner ($owner = ``CTP"$), \p{2} is satisfied and the data is partially anonymized (\tf{2}); +%if the service profile doesn't match with a partner nor with the owner ($owner = ``CTP"$), \p{3} is satisfied and the data is fully anonymized (\tf{3}). % 2nd NODE % -The second vertex is responsible for enriching the data. -The service downloads the dataset from partner facilities and enhances the dataset of the Connecticut facility. -The policies are consistent with those of the first stage (\p{1},\p{2},p{3}). -if the service is \hl{made} by the data owner ($\langle owner = ``CTP" \rangle$), the owner dataset remains unaltered (\tf{0}), whereas the partner dataset is partially anonymized . -if the service is \hl{made} by their partners ($\langle owner = ``CTP" \rangle$), the owner dataset is partially anonymized as well as the partner dataset. -if the service is \hl{made} by a third party ($\langle owner = ``CTP" \rangle$), the owner dataset is fully anonymized as well as the partner dataset. +%he second vertex is responsible for enriching the data. +%The service downloads the dataset from partner facilities and enhances the dataset of the Connecticut facility. + +If the service is provided by the data owner (\pone), that is, the service owner is the same as the dataset owner, the dataset is not anonymized (\tf{0}). +If the service is provided by a partner (\ptwo), that is, the service owner is a partner of the dataset owner, the dataset is level1 anonymized (\tf{1}).
+If the service is provided by a third party (\pthree), that is, the service owner is neither the dataset owner nor a partner of the dataset owner, the dataset is level2 anonymized (\tf{2}). +The functional requirement specifies $n$ datasets as input and the merged dataset as output. % 3rd NODE % -The third vertex, is responsible for data analysis and statistics, -it adopts policies analogous to the first stage. The logic remains consistent: -if the service profile matches with the data owner ($\langle owner = ``CTP" \rangle$), \p{1} is satisfied and the data computation is made on non anonymized data (\tf{1}); -if the service profile matches with a partner of the owner ($\langle owner = partner(``CTP") \rangle$), \p{2} is satisfied and the data computation is made on partially anonymized data (\tf{2}); -if the service profile doesn't match with a partner nor with the owner ($\langle owner = ``any" \rangle$), \p{3} is satisfied and the data computation is made on fully anonymized data (\tf{3}). +The third stage is responsible for both data analysis/statistics and machine learning tasks. +The stage is composed of two alternative vertices, \vi{4} and \vi{5}. + +The data analytics vertex adopts policies analogous to those of the second stage. The logic remains consistent: +if the service profile matches the data owner (\pone), \p{1} is satisfied and the computation is performed on level0 (non-anonymized) data (\tf{0}); +if the service profile matches a partner of the owner (\ptwo), \p{2} is satisfied and the computation is performed on level1 anonymized data (\tf{1}); +if the service profile matches neither the owner nor a partner (\pthree), \p{3} is satisfied and the computation is performed on level2 anonymized data (\tf{2}). +The functional requirement specifies a dataset as input and the computed statistics as output.
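The OR-ed evaluation of \p{1}--\p{3} amounts to deriving a transformation level from the relation between service owner and dataset owner. A minimal sketch, assuming a hypothetical partner map (the facility abbreviations are illustrative, not from the paper):

```python
# Hypothetical partner map; "NYP"/"NHP" are assumed abbreviations for the
# New York and New Hampshire partner facilities.
PARTNERS = {"CTP": {"NYP", "NHP"}}

def select_level(service_owner: str, dataset_owner: str) -> int:
    """OR-ed policy check: the first satisfied policy picks the level."""
    if service_owner == dataset_owner:                       # p1 -> tf0
        return 0
    if service_owner in PARTNERS.get(dataset_owner, set()):  # p2 -> tf1
        return 1
    return 2                                                 # p3 -> tf2
```

Exactly one branch applies to a given service profile, so the access decision is positive whenever any of the three policies matches.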
% 4th NODE % -The fourth vertex is responsible for machine learning tasks: -The policy guidelines recommend anonymizing all datasets to prevent personal identifiers from entering into the machine learning algorithm/model (\tf{3}). +The machine learning vertex always adopts level2 anonymization (\p{4}) to prevent personal identifiers from entering the machine learning algorithm/model (\tf{2}). +The functional requirement specifies a dataset as input and the trained model or an inference as output. % 5th NODE % -The fifth vertex manages data storage. +The fourth stage manages data storage. If the service is deployed within the facility itself ($\langle service,region=``FACILITY"\rangle$), \p{5} is satisfied, resulting in level1 anonymization (\tf{1}). Otherwise, if the service is in a partner region ($\langle service,region=``\{CT,NY,NH\}"\rangle$), the data undergo level2 anonymization (\tf{2}). +The functional requirement specifies some data as input and the URI of the stored data as output. % 6th NODE % -The sixth vertex is responsible for data visualization. -As stated in policy annotation \p{6}, if the user is member of the facility itself, the data are not anonymized (\tf{1}). -If the user is member of a partner facility, the data are partially anonymized (\tf{2}). -If the user is not member of the facility nor a partner, the data are fully anonymized (\tf{3}). +The fifth stage is responsible for data visualization. +As stated in the corresponding policy annotations, if the user is a member of the facility itself, the data are level0 anonymized (\tf{0}). +If the user is a member of a partner facility, the data are level1 anonymized (\tf{1}). +If the user is neither a member of the facility nor of a partner, the data are level2 anonymized (\tf{2}). +The functional requirement specifies a dataset as input and the visualization of the data as output. In summary, this section has delineated a comprehensive pipeline template.
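The five-field annotation \policy{subject}{object}{action}{environment}{transformation} used throughout the example could be encoded as follows. This is purely an illustrative sketch, not the paper's implementation; the field names and the attribute-matching rule are assumptions.

```python
from dataclasses import dataclass

# Hypothetical encoding of a policy annotation
# (subject, object, action, environment, transformation).
@dataclass
class Policy:
    subject: dict        # e.g. {"owner": "CTP"} or {"region": "FACILITY"}
    obj: str             # e.g. "dataset"
    action: str          # "READ" or "WRITE"
    environment: str     # "ANY" in all example rows
    transformation: int  # anonymization level (tf0, tf1, tf2)

    def matches(self, profile: dict) -> bool:
        """A profile satisfies the policy if it entails every subject attribute."""
        return all(profile.get(k) == v for k, v in self.subject.items())
```

Evaluating the OR-ed annotation of a vertex then reduces to finding the first `Policy` whose `matches` returns true and applying its `transformation`.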
@@ -57,52 +71,43 @@ \subsection{Example}\label{sec:example} \def\arraystretch{1.5} \begin{tabular}[t]{c|c|l} - \textbf{Vertex} & \textbf{Policy} & \policy{subject}{object}{action}{environment}{transformation} \\ \hline + \textbf{Vertex} & \textbf{Policy} & \policy{subject}{object}{action}{environment}{transformation} \\ \hline - \vi{1},\vi{2},\vi{3} & $\p{1}$ & \policy{$\langle service,owner=``CTP"\rangle$}{dataset}{READ}{ANY}{ \tf{1} } \\ - \vi{1},\vi{2},\vi{3} & $\p{2}$ & \policy{$\langle service,owner=partner(``CTP") \rangle$}{dataset}{READ}{ANY}{ \tf{2} } \\ - \vi{1},\vi{2},\vi{3} & $\p{3}$ & \policy{$\langle service,owner=``Any"$}{dataset}{READ}{ANY}{ \tf{3} } \\ - \vi{4} & $\p{4}$ & \policy{ANY}{dataset}{READ}{ANY}{ \tf{3} } \\ - \vi{5} & $\p{5}$ & \policy{$\langle service,region=``FACILITY"\rangle$}{dataset}{WRITE}{ANY}{ \tf{1} } \\ - \vi{5} & $\p{6}$ & \policy{$\langle service,region=``\{CT,NY,NH\}"\rangle$}{dataset}{WRITE}{ANY}{ \tf{2} } \\ - \vi{6} & $\p{7}$ & \policy{$\langle user,role= ``Connecticut Prison Officer"$}{dataset} {READ}{ANY}{ \tf{1} } \\ - \vi{6} & $\p{7}$ & \policy{$\langle user,role= ``Partener Prison Officer"$}{dataset} {READ}{ANY}{ \tf{2} } \\ - \vi{6} & $\p{8}$ & \policy{$\langle user,role= ``Any"$}{dataset} {READ}{ANY}{ \tf{3} } \\ + \vi{M} & $\p{1}$ & \policy{\pone}{dataset}{READ}{ANY}{ \tf{0} } \\ + \vi{M} & $\p{2}$ & \policy{\ptwo}{dataset}{READ}{ANY}{ \tf{1} } \\ + \vi{M} & $\p{3}$ & \policy{\pthree}{dataset}{READ}{ANY}{ \tf{2} } \\ + \vi{4} & $\p{4}$ & \policy{ANY}{dataset}{READ}{ANY}{ \tf{2} } \\ + \vi{6} & $\p{5}$ & \policy{$\langle service,region=``FACILITY"\rangle$}{dataset}{WRITE}{ANY}{ \tf{1} } \\ + \vi{6} & $\p{6}$ & \policy{$\langle service,region=``\{CT,NY,NH\}"\rangle$}{dataset}{WRITE}{ANY}{ \tf{2} } \\ + \vi{7} & $\p{7}$ & \policy{$\langle user,role= ``Connecticut Prison Officer"\rangle$}{dataset} {READ}{ANY}{ \tf{0} } \\ + \vi{7} & $\p{8}$ & \policy{$\langle user,role= ``Partner Prison Officer"\rangle$}{dataset} {READ}{ANY}{ \tf{1} } \\ + \vi{7} & $\p{9}$ & \policy{$\langle user,role= ``Any"\rangle$}{dataset} {READ}{ANY}{ \tf{2} } \\ \end{tabular} \begin{tabular}[t]{c|c|c} - \textbf{\tf{i}} & \textbf{Anonymization} & \textbf{Columns Anonymized} \\\hline - \tf{1} & none & $\varnothing$ \\ - \tf{2} & light & \{ FIRST NAME, LAST NAME \} \\ - \tf{3} & full & \{ FIRST NAME, LAST NAME, IDENTIFIER,AGE \} \\ + \textbf{\tf{i}} & \textbf{Level} & \textbf{Columns Anonymized} \\\hline + \tf{0} & level0 & $anon(\varnothing)$ \\ + \tf{1} & level1 & $anon(\{$FIRST NAME, LAST NAME$\})$ \\ + \tf{2} & level2 & $anon(\{$FIRST NAME, LAST NAME, IDENTIFIER, AGE$\})$ \\ \end{tabular} - \egroup + % % \begin{tabular}[t]{ccc} + % % \toprule + % % \textbf{Stage} & \textbf{Policy} & \textbf{Service} \\ + % % \midrule + % % \vi{1} & $p_1$ & $s_1$ \\ + % % \vi{1} & $p_1$ & $s_2$ \\ + % % \vi{2} & $p_2$ & $s_3$ \\ + % % \vi{2} & $p_2$ & $s_4$ \\ + % % \vi{3} & $p_3$ & $s_5$ \\ + % % \vi{3} & $p_3$ & $s_6$ \\ + % % \bottomrule + % % \end{tabular} + % % \hspace{1em} + + \egroup \end{table*} \vspace{2em} -\begin{table*}[!ht] - \caption{Dataset sample} - \label{tab:dataset} - \centering - \begin{adjustbox}{max totalsize={.99\linewidth}{\textheight},center} - \bgroup - \def\arraystretch{1.5} - \begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|} - \hline - \textbf{DOWNLOAD DATE} & \textbf{IDENTIFIER} & \textbf{FIRST NAME} & \textbf{LAST NAME} & \textbf{LAD} & \textbf{RACE} & \textbf{GENDER} & \textbf{AGE} & \textbf{BOND} & \textbf{OFFENSE} & \textbf{\dots} \\ \hline - 05/15/2020 & ZZHCZBZZ & ROBERT & PIERCE & 08/16/2018 & BLACK & M & 27 & 150000 & CRIMINAL POSS \dots & \dots \\ \hline - 05/15/2020 & ZZHZZRLR & KYLE & LESTER & 03/28/2019 & HISPANIC & M & 41 & 30100 & VIOLATION OF P\dots & \dots \\ \hline - 05/15/2020 & ZZSRJBEE & JASON & HAMMOND & 04/03/2020 & HISPANIC & M & 21 & 150000 & CRIMINAL ATTEM\dots & \dots \\ \hline - 05/15/2020 & ZZHBJLRZ & ERIC & TOWNSEND & 01/15/2020 & WHITE & M & 36 & 50500 & CRIM VIOL OF P\dots & \dots \\ \hline - 
05/15/2020 & ZZSRRCHH & MICHAEL & WHITE & 12/26/2018 & HISPANIC & M & 29 & 100000 & CRIMINAL ATTEM\dots & \dots \\ \hline - 05/15/2020 & ZZEJCZWW & JOHN & HARPER & 01/03/2020 & WHITE & M & 54 & 100000 & CRIM VIOL OF P\dots & \dots \\ \hline - 05/15/2020 & ZZHJBJBR & KENNETH & JUAREZ & 03/19/2020 & HISPANIC & M & 35 & 100000 & CRIM VIOL ST C\dots & \dots \\ \hline - 05/15/2020 & ZZESESZW & MICHAEL & SANTOS & 12/03/2018 & WHITE & M & 55 & 50000 & ASSAULT 2ND, V\dots & \dots \\ \hline - 05/15/2020 & ZZRCSHCZ & CHRISTOPHER & JONES & 05/13/2020 & BLACK & M & 43 & 10000 & INTERFERING WIT\dots & \dots \\ \hline - \end{tabular} - \egroup - \end{adjustbox} -\end{table*} \begin{figure}[ht!] \centering \begin{tikzpicture}[scale=0.85] diff --git a/system_model.tex b/system_model.tex index 5cb7f5e..bd7f596 100644 --- a/system_model.tex +++ b/system_model.tex @@ -33,61 +33,129 @@ \subsection{Service Pipeline and Reference Scenario}\label{sec:service_definitio We define a service pipeline as a graph defined as follows. % and depicted in \cref{fig:service_pipeline}. \begin{definition}[\pipeline]\label{def:pipeline} A \pipeline is a directed acyclic graph G(\V,\E), where \V\ is a set of vertices and \E\ is a set of edges connecting two vertices \vi{i},\vi{k}$\in$\V. - The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, two additional vertices \vi{c},\vi{m}$\in$\V$_{\timesOperator}$$\subset$\V\ for each alternative ($\timesOperator$) structure modeling the alternative execution (\emph{choice}) of operations and the retrieval (\emph{merge}) of the results, respectively, and two additional vertices \vi{f},\vi{j}$\in$\V$_{\plusOperator}$$\subset$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of operations and the integration (\emph{join}) of their results, respectively.
+ The graph has a root \vi{r}$\in$\V, a vertex \vi{i}$\in$\V$_S$ for each service $s_i$, two additional vertices \vi{c},\vi{m}$\in$\V$_{\timesOperator}$$\subset$\V\ for each alternative ($\timesOperator$) structure modeling the alternative execution (\emph{choice}) of operations and the retrieval (\emph{merge}) of the results, respectively, + and one additional vertex \vi{f} $\in$\V$_{\plusOperator}$$\subset$\V\ for each parallel ($\plusOperator$) structure modeling the contemporary execution (\emph{fork}) of operations. \end{definition} -We note that \{\vi{r}\}$\cup$\V$_S$$\cup$\V$_{\timesOperator}$$\cup$V$_{\plusOperator}$$=$\V, and \vi{c}, \vi{m}, \vi{f}, and \vi{j} model branching for alternative/parallel structures. +We note that \{\vi{r}\}$\cup$\V$_S$$\cup$\V$_{\timesOperator}$$\cup$\V$_{\plusOperator}$$=$\V, and \vi{c}, \vi{m}, and \vi{f} model branching for alternative/parallel structures. We also note that root \vi{r} possibly represents the orchestrator. % A service pipeline is as a direct acyclic graph G(\V,\E), where \V\ is a set of vertices, one for each service $s_i$ in the pipeline, \E\ is a set of edges connecting two services $s_i$ and $s_j$, and \myLambda\ is an annotation function that assigns a label \myLambda(\vi{i}), corresponding to a data transformation \F\ implemented by the service $s_i$, for each vertex \vi{i}$\in$\V. -Our reference scenario considers a service pipeline analyzing a dataset of individuals detained in Department of Correction facilities in the state of Connecticut while awaiting trial. In particular, the user, a member of the Connecticut Department of Correction (DOC), seeks to compare admission trends in Connecticut prisons with those in other US states. +Our reference scenario considers a service pipeline analyzing a dataset of individuals detained in Department of Correction facilities in the state of Connecticut while awaiting trial.
+In particular, the user, a member of the Connecticut Department of Correction (DOC), seeks to compare admission trends in Connecticut prisons with those in other US states. The user's preferences align with a predefined pipeline template that orchestrates the following sequence of operations: \begin{enumerate*}[label=(\roman*)] - \item \emph{Data preparation and protection}, including data cleaning and anonymization; - \item \emph{Data enrichment}, including the integration of data from other states; + \item \emph{Data fetching}, including the download of the datasets from other states; + \item \emph{Data preparation and protection}, including data merging, cleaning, and anonymization; \item \emph{Data analysis}, including statistical measures like averages, medians, and clustering-based statistics; \item \emph{Machine learning task}, including training and inference; \item \emph{Data storage}, including the storage of the results in the corresponding states. Specifically, one copy remains in Connecticut (where sensitive information in the source dataset is not protected), while two additional copies are distributed to New York and New Hampshire (with sensitive information from the source dataset being safeguarded). + \item \emph{Data visualization}, including the visualization of the results. \end{enumerate*} -\begin{enumerate*}[label=(\roman*)] - \item \emph{Data preparation and protection}, including data cleaning and anonymization; - \item \emph{Data enrichment}, including the integration of data from other states; - \item \emph{Data analysis}, including statistical measures like averages, medians, and clustering-based statistics; - \item \emph{Machine learning task}, including training and inference; - \item \emph{Data storage}, including the storage of the results in the corresponding states.
Specifically, one copy remains in Connecticut (where sensitive information in the source dataset is not protected), - while two additional copies are distributed to New York and New Hampshire (with sensitive information from the source dataset being safeguarded). -\end{enumerate*} We note that the template requires the execution of the entire service within a single country. If the data needs to be transmitted beyond the boundaries of Connecticut, data protection measures must be implemented. A visual representation of the flow is presented in Figure \ref{fig:reference_scenario}. +\begin{table*}[ht!] + \caption{Dataset sample} + \label{tab:dataset} + \centering + \begin{adjustbox}{max totalsize={.99\linewidth}{\textheight},center} + \bgroup + \def\arraystretch{1.5} + \begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|} + \hline + \textbf{DOWNLOAD DATE} & \textbf{IDENTIFIER} & \textbf{FIRST NAME} & \textbf{LAST NAME} & \textbf{LAD} & \textbf{RACE} & \textbf{GENDER} & \textbf{AGE} & \textbf{BOND} & \textbf{OFFENSE} & \textbf{\dots} \\ \hline + 05/15/2020 & ZZHCZBZZ & ROBERT & PIERCE & 08/16/2018 & BLACK & M & 27 & 150000 & CRIMINAL POSS \dots & \dots \\ \hline + 05/15/2020 & ZZHZZRLR & KYLE & LESTER & 03/28/2019 & HISPANIC & M & 41 & 30100 & VIOLATION OF P\dots & \dots \\ \hline + 05/15/2020 & ZZSRJBEE & JASON & HAMMOND & 04/03/2020 & HISPANIC & M & 21 & 150000 & CRIMINAL ATTEM\dots & \dots \\ \hline + 05/15/2020 & ZZHBJLRZ & ERIC & TOWNSEND & 01/15/2020 & WHITE & M & 36 & 50500 & CRIM VIOL OF P\dots & \dots \\ \hline + 05/15/2020 & ZZSRRCHH & MICHAEL & WHITE & 12/26/2018 & HISPANIC & M & 29 & 100000 & CRIMINAL ATTEM\dots & \dots \\ \hline + 05/15/2020 & ZZEJCZWW & JOHN & HARPER & 01/03/2020 & WHITE & M & 54 & 100000 & CRIM VIOL OF P\dots & \dots \\ \hline + 05/15/2020 & ZZHJBJBR & KENNETH & JUAREZ & 03/19/2020 & HISPANIC & M & 35 & 100000 & CRIM VIOL ST C\dots & \dots \\ \hline + 05/15/2020 & ZZESESZW & MICHAEL & SANTOS & 12/03/2018 & WHITE & M & 55 & 50000 & ASSAULT 2ND, 
V\dots & \dots \\ \hline + 05/15/2020 & ZZRCSHCZ & CHRISTOPHER & JONES & 05/13/2020 & BLACK & M & 43 & 10000 & INTERFERING WIT\dots & \dots \\ \hline + \end{tabular} + \egroup + \end{adjustbox} + +\end{table*} \begin{figure}[ht!] \centering - \begin{tikzpicture}[scale=0.9] + + \tikzset{ + do path picture/.style={% + path picture={% + \pgfpointdiff{\pgfpointanchor{path picture bounding box}{south west}}% + {\pgfpointanchor{path picture bounding box}{north east}}% + \pgfgetlastxy\x\y% + \tikzset{x=\x/2,y=\y/2}% + #1 + } + }, + cross/.style={do path picture={ + \draw [line cap=round] (-1,-1) -- (1,1) (-1,1) -- (1,-1); + }}, + plus/.style={do path picture={ + \draw [line cap=round] (-3/4,0) -- (3/4,0) (0,-3/4) -- (0,3/4); + }} + } + + + \begin{tikzpicture}[scale=0.9,y=-1cm] % Nodes - \node[draw ] (node1) at (0,8) {$\s{r}$}; - \node[draw] (node2) at (0,7){Data Preparation }; - \node[draw] (node25) at (0,6){Data Enrichment}; - \node[draw] (node3) at (0,5) {$\timesOperator$}; - \node[draw] (node4) at (-2,4) {Data Analysis}; - \node[draw] (node5) at (2,4) {Machine Learning}; - \node[draw] (node6) at (0,3) {$\timesOperator$}; - \node[draw] (node7) at (-2,2) {Data Storage}; - \node[draw] (node8) at (2,2) {Data Visualization}; - \node[draw] (node9) at (0,1) {$\timesOperator$}; - \draw[->] (node1) -- (node2); - \draw[->] (node2) -- (node25); - \draw[->] (node25) -- (node3); - \draw[->] (node3) -- (node4); - \draw[->] (node3) -- (node5); - \draw[->] (node5) -- (node6); - \draw[->] (node4) -- (node6); - \draw[->] (node6) -- (node7); - \draw[->] (node6) -- (node8); - \draw[->] (node8) -- (node9); - \draw[->] (node7) -- (node9); + + \node[draw, circle ] (root) at (0,0) {$\vi{r}$}; + \node[draw, circle, plus , below = 1em, minimum size=1.5em] (split) at (root.south) {}; + + \node[draw, circle,below =1em] (node2) at (split.south) {$\vi{2}$}; + + \node[draw, circle,left=1em] (node1) at (node2.west) {$\vi{1}$}; + \node[draw, circle,right=1em] (node3) at (node2.east) {$\vi{3}$}; + + 
\node[draw, circle,below=1em] (merge) at (node2.south) {$M$}; + + \node[draw, circle, cross , minimum size=1.5em,below=1em] (fork) at (merge.south) {}; + \node[draw, circle,below =1.5em, left=2em] (ml) at (fork.south) {$\vi{4}$}; + \node[draw, circle,below =1.5em, right=2em] (analysis) at (fork.south) {$\vi{5}$}; + \node[draw, circle, cross , minimum size=1.5em,below=3em] (join) at (fork.south) {}; + \node[draw, circle,below =1.5em] (storage) at (join.south) {$\vi{6}$}; + \node[draw, circle,below =1.5em] (visualization) at (storage.south) {$\vi{7}$}; + + % Labels + + \node[right=1em] at (node3.east) {Dataset fetch}; + \node[right=1em] at (merge.east) { $merge$}; + \node[right=1em] at (split.east) { $parallel$}; + \node[right=1em] at (analysis.east) { Data analysis}; + \node[left=1em] at (ml.west) { ML task}; + \node[right=1em] at (storage.east) { Storage}; + \node[right=1em] at (visualization.east) { Visualization}; + % Connection + + \draw[->] (root) -- (split); + \draw[->] (split) -- (node1); + \draw[->] (split) -- (node2); + \draw[->] (split) -- (node3); + + \draw[->] (node1) -- (merge); + \draw[->] (node2) -- (merge); + \draw[->] (node3) -- (merge); + + \draw[->] (fork) -- (ml); + \draw[->] (fork) -- (analysis); + \draw[->] (join) -- (storage); + \draw[->] (analysis) -- (join); + \draw[->] (ml) -- (join); + \draw[->] (merge) -- (fork); + \draw[->] (storage) -- (visualization); + % \draw[->] (node3) -- (node6); + % \draw[->] (node4) -- (node6); + % \draw[->] (node5) -- (node6); + % \draw[->] (node6) -- (node7); + \end{tikzpicture} \caption{Reference Scenario} \label{fig:reference_scenario} @@ -96,5 +164,10 @@ \subsection{Service Pipeline and Reference Scenario}\label{sec:service_definitio The adopted dataset\footnote{https://data.ct.gov/Public-Safety/Accused-Pre-Trial-Inmates-in-Correctional-Faciliti/b674-jy6w} exhibits a straightforward row-and-column structure.
Each row represents an inmate; each column includes the following attributes: date of download, a unique identifier, last entry date, race, gender, age of the individual, the bond value, offense, entry facility, and detainer. To serve the objectives of our study, we have extended this dataset by introducing randomly generated first and last names. +In \cref{tab:dataset}, a sample of the dataset is presented, showcasing a representative subset of the collected information. +% Download three datasets with no anonymization, then a merge node that anonymizes and cleans everything, +% alternative nodes for ML and analysis, then merge, storage, visualization +% add a final node +% add a node \ No newline at end of file
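The reference-scenario graph (root, parallel fetch, merge, alternative analysis/ML, join, storage, visualization) can be encoded as an adjacency list and checked to be a DAG. This is an illustrative sketch; the vertex names are hypothetical shorthands for the vertices in the figure.

```python
# Hypothetical adjacency-list encoding of the reference-scenario pipeline.
PIPELINE = {
    "root":   ["fork"],
    "fork":   ["v1", "v2", "v3"],   # parallel (+) dataset fetches
    "v1":     ["merge"], "v2": ["merge"], "v3": ["merge"],
    "merge":  ["choice"],
    "choice": ["ml", "analysis"],   # alternative (x) structure
    "ml":     ["join"], "analysis": ["join"],
    "join":   ["storage"],
    "storage": ["visualization"],
    "visualization": [],
}

def topological_order(graph: dict) -> list:
    """Kahn's algorithm; raises ValueError if the graph has a cycle."""
    indeg = {v: 0 for v in graph}
    for targets in graph.values():
        for t in targets:
            indeg[t] += 1
    frontier = [v for v, d in indeg.items() if d == 0]
    order = []
    while frontier:
        v = frontier.pop()
        order.append(v)
        for t in graph[v]:
            indeg[t] -= 1
            if indeg[t] == 0:
                frontier.append(t)
    if len(order) != len(graph):
        raise ValueError("not a DAG")
    return order
```

A successful topological sort confirms the acyclicity required by the pipeline definition, with the root first and the visualization vertex last.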