This repository has been archived by the owner on Aug 9, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
main.tex
106 lines (71 loc) · 12 KB
/
main.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
\documentclass[sigconf, nonacm]{acmart}
\usepackage{listings}
\newcommand\vldbauthors{\authors}
\newcommand\vldbtitle{\shorttitle}
\newcommand\vldbavailabilityurl{https://github.com/EdgarYepez/RepSchemaDiscovery}
% whether page numbers should be shown or not, use 'plain' for review versions, 'empty' for camera ready
\newcommand\vldbpagestyle{plain}
\begin{document}
\title{Experiment Replication: JSON Schema Discovery}
\author{Edgar Yepez}
\affiliation{
\institution{Universität Passau}
\city{Passau}
\state{Germany}
}
\email{[email protected]}
\maketitle
\ifdefempty{\vldbavailabilityurl}{}{
\vspace{.3cm}
\begingroup\small\noindent\raggedright\textbf{Artifact Availability:}\\
The source code, data, and/or other artifacts have been made available on Zenodo at DOI \url{10.5281/zenodo.10702962}.
\endgroup
}
\section{Introduction}
The current work introduces a setting for replication of an experiment from the paper ``An Approach for Schema Extraction of JSON and Extended JSON Document
Collections'' by Frozza A. et al~\cite{angelo2018}.
In their work, the authors presented an approach and code implementation to perform schema discovery over a collection of JSON documents. A JSON document is regarded as a document that complies with the JSON format for representing structured data. By that token, the JSON format defines a text syntax capable of representing a hierarchy of named values, known as fields, of different data types~\cite{mdn2024}. An important characteristic of the JSON format is that it does not enforce a specific structure over the names and data types of the values that form a document. This fact is denoted as schemaless, and allows for the storage of data without prior knowledge of its structure~\cite{angelo2018}. Therefore, the schema discovery task involves determining and describing the underlying structure of some JSON document.
In this regard, the authors of the aforementioned paper introduced a method to extract the schema form a collection of JSON documents by a four-step process. It starts by traversing the document collection and analyses the named values of each document in order to identify its individual schema, referred to as a raw schema. Next, after a sorting and organisation stage, it combines the identified raw schemas into a single schema that would represent the entire document collection. To that end, a tree data structure is employed, which aids in the summarisation of raw schemas by recording information on objects, arrays and other data types. The resulting single JSON schema complies with the Draft 6 for JSON schema specification~\cite{schemadraft6}.
In order to assess the quality and performance of the proposed method, the authors conducted several experiments focusing on different evaluation criteria. One of which focused on the quality of the mappings from JSON documents to JSON schemas, by determining an accuracy rate for both completeness and correctness.
%\subsection{Base quality experiments}
%In order to assess the quality and performance of the proposed method, the authors conducted three experiments focusing on different evaluation criteria.
%\begin{itemize}
% \item The first experiment focused on the quality of the mappings from JSON documents to JSON schemas, by determining an accuracy rate for both completeness and correctness.
% \item Next, an experiment was conducted to measure the processing time of the combined four steps of the proposed method. In this experiment, the final measurement assessment corresponded to the proportional relationship between the time for raw schema extraction over the total ruining time of the process.
% \item Finally, the third experiment focused on comparing the quality of the generated JSON schemas against related methods proposed by previous research works. In this experiment, no comparison was performed regarding running time between related works and the author's proposed method.
%\end{itemize}
\section{Setting}
The current work aims at replicating the mentioned quality experiment. Specifically, it is of interest to determine the accuracy rate for correctness and completeness of the mappings from JSON documents to JSON schemas. In their original experiment, the authors employed a collection of five JSON documents to assess the relevant criteria and thus determine the accuracy rate. However, the authors did not specifically mention the process of comparing the resulting JSON schemas against a ground-truth associated with the input JSON documents in order to compute the accuracy rate, nor did they mention the process of obtaining said ground-truth. Nonetheless, the authors claim to have accomplished an accuracy rate of 100\% across all expected data types in the input JSON documents.
In this sense, the hypothesis to be verified in the current work is that the accuracy rate of the mappings from JSON documents to JSON schemas is 100\% across all expected data types in the given input JSON documents. To that end, a different collection of JSON documents is employed alongside its corresponding pre-existing JSON schema which would serve as ground-truth. Next, the author's code implementation is executed over the entire document collection in order to extract the associated JSON schema. A deep-difference comparison is then performed between the extracted JSON schema and the pre-existing ground-truth. Finally, the criteria for a successful hypothesis confirmation corresponds to the output of the comparison stating equality, or no difference, between the ground-truth and the extracted JSON schema.
\section{Experiment}
The author's code implementation corresponded to a web application built on top of the Mongo, Express, Angular, and NodeJS development stack. For the current experiment setting, it was installed and configured on a mid-range consumer laptop with 16GB RAM and a 2.8GHz Intel Core i7 seventh generation processor. The employed dataset consisted of a collection of 900 JSON documents from Foursquare about check-ins in the city of Florence~\cite{emre2017}, which were loaded and stored in a local Mongo database instance. Alongside, the third-party online JSON Schema Tool~\cite{schemagen} was used in order to obtain the ground-truth JSON schema for the collection. The tool was given as input a randomly sampled JSON document out of the entire collection, and was configured to produce a JSON schema according to Draft 6. The obtained JSON schema was then manually modified to meet the format of the schemas generated by the author's code implementation, as the later included only a limited subset of the fields defined by Draft 6. For instance, Draft 6 defines a field named ``description'' that the schemas generated by the author's code implementation did not include. The modified JSON schema was lastly stored for further usage during comparison.
%All documents in the collection contained the following fields, each of which was mandatory.
%\begin{itemize}
% \item \verb|_id|: A unique identifier for the check-in.
% \item \verb|foursquareURL|: The public Foursquare URL for the check-in.
% \item \verb|venueId|: A unique identifier for the venue where the check-in was performed.
% \item \verb|timestamp|: The posix timestamp of the check-in.
% \item \verb|foursquareUserId|: A unique identifier for the associated user.
% \item \verb|timeZoneOffset|: The number of minutes offset from Greenwich time.
% \item \verb|swarmULR|: Is the public Swarm URL for the check-in.
%\end{itemize}
Over the entire JSON document collection, the author's code implementation was then executed in order to extract the associated JSON schema. To this end, a script was developed to automate the steps involved in running the author's code implementation. Specifically, the script first registered an account with custom username and password, and then attempted to log in using said account. Upon success, a new schema extraction job was initiated by providing the connection credentials of the local Mongo database where the JSON documents were stored. Next, the script periodically checked for the status of the job. Finally, once notified about job's completion, the script downloaded the resulting JSON schema and stored it in a local directory.
In order to assess the correctness and completeness of the generated JSON schema, a deep-difference comparison was performed against the previously obtained ground-truth. To this end, a second script loaded both the generated JSON schema and the ground-truth schema as objects into memory. Then, the script employed a third-party library specialised in deep-difference comparison~\cite{deepdiff} to perform an analysis over the loaded objects. The library was configured to consider additions, removals, and case-sensitive value changes across the objects' fields regardless of their order. Finally, the result of the comparison is stored into a file, which contains the number and description of the identified additions, removals, and value changes.
An accuracy rate of a 100\% for correctness and completeness of the generated JSON schema is considered when the output of the comparison states zero additions, zero removals, and zero value changes.
\section{Results}
The experiment was performed on the aforementioned JSON document collection, and the following results were obtained.
Listing \ref{lst:obj} shows the number of identified additions, removals, and value changes of the generated JSON schema with respect to the ground-truth schema.
\lstinputlisting[caption={Count of changes between the generated JSON schema and the ground-truth schema}, label={lst:obj}]{../experiment/firenze_checkins_summary.txt}
Listing \ref{lst:summary} shows a description of the affected fields as a JSON document. An empty JSON document represents that no fields were affected.
\lstinputlisting[caption={Description of affected fields}, label={lst:summary}]{../experiment/firenze_checkins_diff.json}
This results reveal that there is no difference between the generated JSON schema and the ground-truth schema, thus concluding that the accuracy rate of the mappings from JSON documents to JSON schemas is 100\% across all expected data types in the given input JSON documents.
\section{Discussion}
The conducted experiment showed a successful confirmation of the formulated hypothesis. However, two threats to its validity are identified.
On the one hand, the process to obtain the ground-truth schema is prone to bias introduction. Since this process involved a manually performed update phase, human errors could lead to removing or adding field descriptions that directly influence the experiment's result in a positive or negative way, at will of the experimenter. To overcome this issue, an unrelated third-party should validate the ground-truth schema.
On the other hand, the process to obtain the ground-truth schema is not completely suitable for the overall experiment. The author's proposed method for schema extraction focuses on a collection of JSON documents; however, the ground-truth schema in the current setting was based on a single document randomly sampled from the collection of documents. By this token, the ground-truth schema might potentially not reflect on all the fields present throughout the documents in the collection, which ultimately leads to an unfair assessment of the final results. To overcome this threat, the ground-truth schema should be determined through a process that considers the entire document collection.
\section{Conclusion}
The current work introduced a setting for replication of an experiment conducted by the authors of a proposal for JSON schema extraction from a collection of JSON documents. The experiment corresponded to assessing the correctness and completeness of mappings from JSON documents to JSON schemas. Despite the lack of specification on how the authors conducted their original experiment in the source paper, the replication performed in the current work was able to successfully confirm the authors' hypothesis. However two threats to its validity were identified, and the possible actions to overcome this threats were described.
\bibliographystyle{ACM-Reference-Format}
\bibliography{references}
\end{document}
\endinput