Commit
feat(fixtures): new blog post "Ten years of CERN Open Data portal"
1 parent 6db0662, commit c4d52a1
Showing 2 changed files with 133 additions and 0 deletions.
19 changes: 19 additions & 0 deletions
...ures/data/docs/ten-years-of-cern-open-data-portal/ten-years-of-cern-open-data-portal.json
@@ -0,0 +1,19 @@
[
  {
    "author": "CERN Open Data team",
    "body": {
      "content": "ten-years-of-cern-open-data-portal.md",
      "format": "md"
    },
    "date_published": "2024-12-10",
    "short_description": {
      "content": "The CERN Open Data portal celebrates its ten year birthday! Find out about its journey and today's challenges."
    },
    "featured": 1,
    "slug": "ten-years-of-cern-open-data-portal",
    "title": "Ten years of CERN Open Data portal",
    "type": {
      "primary": "News"
    }
  }
]
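As a rough illustration of how a fixture record like the one above could be sanity-checked before loading, here is a minimal Python sketch. The `check_record` helper and the `REQUIRED_KEYS` list are assumptions made for this example; they are not taken from the portal code base, and the real loader may validate records differently.

```python
import json
from datetime import date

# Fixture text adapted from the diff above (short_description omitted for brevity).
FIXTURE = """
[
  {
    "author": "CERN Open Data team",
    "body": {"content": "ten-years-of-cern-open-data-portal.md", "format": "md"},
    "date_published": "2024-12-10",
    "featured": 1,
    "slug": "ten-years-of-cern-open-data-portal",
    "title": "Ten years of CERN Open Data portal",
    "type": {"primary": "News"}
  }
]
"""

# Hypothetical list of required keys, chosen from the fields visible in the diff.
REQUIRED_KEYS = ("author", "body", "date_published", "slug", "title", "type")


def check_record(record):
    """Return a list of human-readable problems found in one fixture record."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    if problems:
        return problems
    try:
        date.fromisoformat(record["date_published"])
    except ValueError:
        problems.append("date_published is not an ISO date")
    body = record["body"]
    # The body file extension is expected to match the declared format.
    if not body.get("content", "").endswith("." + body.get("format", "")):
        problems.append("body.content extension does not match body.format")
    return problems


records = json.loads(FIXTURE)
all_problems = [check_record(r) for r in records]
print(all_problems)
```

An empty problem list for every record means the fixture passes these (illustrative) checks.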
114 changes: 114 additions & 0 deletions
...a/docs/ten-years-of-cern-open-data-portal/ten-years-of-cern-open-data-portal.md
@@ -0,0 +1,114 @@
Ten years ago, a handful of enthusiastic researchers from the open access
teams of the [ALICE](https://alice.cern), [ATLAS](https://atlas.cern/),
[CMS](https://cms.cern/) and [LHCb](https://lhcb.web.cern.ch/) collaborations
joined forces with a few software engineers from the CERN
[Department of Information
Technology](https://information-technology.web.cern.ch/) and information
specialists from the CERN [Scientific Information
Service](https://sis.web.cern.ch/) to build the CERN Open Data portal.

Under the umbrella of the [Data Preservation in High Energy
Physics](https://dphep.web.cern.ch/) collaboration, the work started in summer
2014 by devising a metadata schema that would neatly describe the open data
from the LHC experiments in both their technical and physics-oriented facets.
We had to design the web portal (based on the
[Invenio](https://inveniosoftware.org/) digital repository framework) to
provide pages for the general public, including visualisations of collision
events, all the while following the best data preservation practices to
describe, manage and disseminate the data for researchers. The data needed to
be searchable and downloadable in a way that would be both attractive to the
general public for educational purposes and usable by independent researchers
for referencing and independent theoretical analysis. (Some of the technical
challenges behind the CERN Open Data portal were described in a [later
interview](https://superuser.openinfra.org/articles/cern-open-data-portal/) for
the SuperUser magazine.)

The efforts culminated in the launch of the CERN Open Data portal on [November
20th
2014](https://home.web.cern.ch/news/news/accelerators/cern-makes-public-first-data-lhc-experiments).
The portal managed about 30 terabytes of open data from LHC experiments, a
ground-breaking service at the time. The [Reddit AskMeAnything
session](https://www.reddit.com/r/IAmA/comments/2nxwkb/a_few_days_ago_cern_launched_an_open_data_portal/)
organised alongside the release attracted considerable attention and many tens
of thousands of portal visitors, more than the total number of particle
physicists in the world.

Fast-forward ten years to the present time. The CERN Open Data portal now
disseminates more than 5 petabytes of open data, which is a whopping 200 times
more data than at launch. More particle physics experiments have joined the
open data portal, with [DELPHI](https://delphi-www.web.cern.ch/delphi-www/),
[OPERA](https://en.wikipedia.org/wiki/OPERA_experiment),
[PHENIX](https://www.phenix.bnl.gov/), and
[TOTEM](https://totem-experiment.web.cern.ch/) releasing data samples or even
full data collections. More experiments are in the pipeline, such as
[JADE](https://www.mpp.mpg.de/en/research/data-preservation/jade/). The CERN
Open Data portal is becoming a sort of "HEP Open Data" portal, covering not
only the LHC experiments but the particle physics domain at large, further
demonstrating the success of the original idea.

Looking back at the origins and the path travelled in the past ten years, any
sceptical concerns about whether these data would be understandable and usable
for independent theoretical research have been laid to rest. The [leading
publication](https://news.mit.edu/2017/first-open-access-data-large-collider-subatomic-particle-patterns-0929)
by Jesse Thaler's team at MIT analysing CMS open data showed that independent
theoretical publications are not only possible, but that they enrich the
collaboration's own research practices, with the CMS collaboration starting to
cite the independent theoretical work in its own publications. There are now
more than 70 research papers based on CMS open data, and the [number of
published papers is
growing](https://cms.cern/news/cms-celebrates-decade-open-data). Matt Strassler
published a series of blog posts [on the importance of open
data](https://profmattstrassler.com/2019/03/19/the-importance-and-challenges-of-open-data-at-the-large-hadron-collider/)
in this realm.

The independent usage of the released data for research has led to stronger
provenance information being published with each release, in order to provide
the physics context and auxiliary information about the data as accurately and
completely as possible. The data are being published together with analysis
examples demonstrating how to extract physics objects from the data and how to
use them in one's own analyses. The attention paid to data usage patterns and
to the further usability and reinterpretability of data [has
naturally led](https://www.nature.com/articles/s41567-018-0342-2) to sister
projects dedicated to facilitating [reproducible
analyses](https://www.reana.io) and [continuous
reuse](https://zenodo.org/records/10263204) of the data.

Besides independent theoretical research, the data are being used in numerous
masterclasses and education programs to train the next generation of
scientists, as well as by software engineers to benchmark software tools and
verify their suitability for the forthcoming high-luminosity experimental
data-taking era.

The bottom-up efforts on preparing and releasing open data were complemented by
top-down efforts and support for open science from the CERN laboratory
management. The efforts by CERN as the supporting host laboratory, together
with the LHC collaboration management boards as the data producers and owners,
paved the way towards the formal establishment of the [CERN Open Data
policy](https://cds.cern.ch/record/2745133/files/CERN-OPEN-2020-013.pdf) in
2020 and, two years later, the [CERN Open Science
Policy](https://cds.cern.ch/record/2835057/files/CERN-OPEN-2022-013.pdf). It is
under these auspices that the open data pilot efforts progressively took shape
into what they are today, whilst seeking the long-term sustainability of making
science open.

Looking into the future, there are clear challenges ahead. The growing number
of open data releases calls for more efficient data publishing workflows that
leverage the scientific data managers used in the collaborations, such as
[Rucio](https://rucio.cern.ch/) in ATLAS and CMS and
[DIRAC](https://dirac.readthedocs.io/) in LHCb. The vast quantities of
published data call for efficient "hot" and "cold" storage mechanisms behind
the portal in order to save on storage costs. All this content necessitates
efficient tape backups and on-demand data staging for users from the cold
storage, when necessary. Finally, the experimental collaborations plan to
release even more data during the LHC Run-3 phase, which calls for novel
approaches to open data publishing that go beyond the digital repository
domain, such as the nascent system allowing theorists to ask for custom LHCb
open data production via a dedicated [Ntupling
Wizard](https://arxiv.org/pdf/2302.14235v2) service.

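To make the "hot" and "cold" storage idea a little more concrete, here is a
minimal Python sketch. It is not the portal's actual implementation: it assumes
a hypothetical policy in which files that have not been accessed for a given
number of days are moved to cold (tape) storage and are staged back on demand.
All names (`TieredStore`, `demote_idle`, `request`), the file names and the
90-day threshold are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TieredStore:
    """Toy model of a two-tier store; all names and thresholds are hypothetical."""
    max_idle_days: int = 90
    hot: dict = field(default_factory=dict)   # file name -> day of last access
    cold: set = field(default_factory=set)    # file names kept only on tape

    def add(self, name, today):
        """New files start in hot storage."""
        self.hot[name] = today

    def demote_idle(self, today):
        """Move files idle for longer than max_idle_days to cold storage."""
        idle = [n for n, last in self.hot.items() if today - last > self.max_idle_days]
        for name in idle:
            del self.hot[name]
            self.cold.add(name)
        return sorted(idle)

    def request(self, name, today):
        """Serve a file, staging it back from cold storage first if needed."""
        if name in self.cold:
            self.cold.remove(name)
            staged = True
        elif name in self.hot:
            staged = False
        else:
            raise KeyError(name)
        self.hot[name] = today  # any access refreshes the idle timer
        return "staged from cold" if staged else "served from hot"


store = TieredStore(max_idle_days=90)
store.add("popular.root", today=0)
store.add("rare.root", today=0)
store.request("popular.root", today=80)   # recent access keeps this file hot
demoted = store.demote_idle(today=100)    # rare.root has been idle for 100 days
status = store.request("rare.root", today=101)
print(demoted, status)
```

In this toy model only rarely accessed files pay the staging latency, which is
the trade-off that makes cold storage cheaper to operate.
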
It has been a blast for software engineers, information specialists and
particle physicists to work together on fostering open and reproducible science
practices in particle physics.

Looking forward to working together in the next decade!