
# Data Pipelines

This directory contains the ETL (Extract, Transform, Load) scripts and related files for ingesting DQMIO files.

## Overview

The etl directory is responsible for managing the entire data pipeline of DIALS. It discovers raw DQMIO data through the Data Bookkeeping Service (DBS), indexes all relevant files in each workspace's data mart, and schedules ETL jobs in its Celery-backed job queue. Each job copies a file from the worldwide grid, extracts and transforms its data, and loads the result into the corresponding workspace's data tables.
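
To make the flow concrete, here is a minimal sketch of how such a job could look with Celery. This is not the actual DIALS implementation: the broker URL, the task name, and the helper functions (`copy_from_grid`, `extract_and_transform`, `load_into_workspace`) are assumptions for illustration only.

```python
# Hypothetical sketch of a Celery-backed ETL job; names are illustrative.
from celery import Celery

app = Celery("etl", broker="redis://localhost:6379/0")  # assumed broker


def copy_from_grid(grid_path: str) -> str:
    """Placeholder for fetching a DQMIO file from the worldwide grid."""
    return f"/tmp/{grid_path.rsplit('/', 1)[-1]}"


def extract_and_transform(local_path: str) -> list[dict]:
    """Placeholder for parsing Monitoring Elements out of the DQMIO file."""
    return [{"source": local_path}]


def load_into_workspace(workspace: str, rows: list[dict]) -> None:
    """Placeholder for inserting rows into the workspace's data tables."""
    print(f"loaded {len(rows)} rows into workspace {workspace!r}")


@app.task
def process_dqmio_file(grid_path: str, workspace: str) -> None:
    """One ETL job: copy the file, extract and transform, then load."""
    local_path = copy_from_grid(grid_path)
    rows = extract_and_transform(local_path)
    load_into_workspace(workspace, rows)


# Scheduling: for each file discovered via DBS, one job per target workspace
# would be enqueued, e.g.:
# process_dqmio_file.delay("/store/data/.../file.root", "tracker")
```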

## Workspaces

Workspaces? Data marts? What is this wizardry? Multiple groups in CMS analyze different kinds of data: while data-taking is in progress, both general and more specialized datasets are produced. Each group analyzes the Monitoring Elements (MEs) in a specific dataset to make sure its detector sub-system is working properly. From a data engineering standpoint, it therefore makes sense to create a separate database per group: members of each group get their own workspace, i.e. their own data mart, benefiting from the performance gains and data isolation that come with it.
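
As a hypothetical illustration of the workspace = data mart idea, one could imagine each workspace declared by which MEs it cares about and which tables it owns. The workspace names, ME patterns, and table names below are invented for the example, not taken from the DIALS configuration.

```python
# Invented example of a workspace-to-data-mart mapping.
WORKSPACES = {
    "tracker": {
        "me_patterns": ["PixelPhase1/*", "SiStrip/*"],  # MEs this group monitors
        "tables": ["tracker_th1", "tracker_th2"],       # its isolated data mart
    },
    "ecal": {
        "me_patterns": ["EcalBarrel/*", "EcalEndcap/*"],
        "tables": ["ecal_th1", "ecal_th2"],
    },
}
```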