Standard Industrial Classification (SIC) Utilities, initially developed for the Survey Assist API to complement the SIC Classification Library code.
SIC classification utilities used in the classification of industry. This repository contains core code used by the SIC Classification Library.
- Embeddings. Functionality for embedding SIC hierarchy data, managing vector stores, and performing similarity searches.
- Data Access. Functions to load CSV data files related to SIC.
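As a rough illustration of the Data Access idea above, the sketch below reads SIC rows from CSV text using only the standard library. The function name `load_sic_rows`, the column names `code` and `description`, and the sample rows are assumptions made for this example; they are not taken from this repository's API or data files.

```python
import csv
import io


def load_sic_rows(source):
    """Read CSV rows from a file-like object into a list of dicts.

    Hypothetical helper for illustration only. Values are kept as
    strings so that codes with leading zeros are not altered.
    """
    reader = csv.DictReader(source)
    return [
        {key: (value or "").strip() for key, value in row.items()}
        for row in reader
    ]


# Illustrative sample data; the real files may use different columns.
SAMPLE_CSV = (
    "code,description\n"
    "85200,Primary education\n"
    "01110,Growing of cereals\n"
)

rows = load_sic_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["code"], row["description"])
```

Reading codes as strings rather than integers is a deliberate choice here, since a numeric conversion would turn `01110` into `1110`.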
Ensure you have the following installed on your local machine:
- Python 3.12 (Recommended: use pyenv to manage versions)
- poetry (for dependency management)
- Colima (if running locally with containers)
- Terraform (for infrastructure management)
- Google Cloud SDK (gcloud) with appropriate permissions
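One possible way to confirm the tools listed above are on your PATH is a short standard-library check. This is a sketch and not part of the repository: the helper `find_missing_tools` and the executable names in `REQUIRED_TOOLS` are assumptions, and your installation may expose different command names.

```python
import shutil

# Assumed executable names for the prerequisites listed above.
REQUIRED_TOOLS = ["python3", "poetry", "colima", "terraform", "gcloud"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    depending on what is installed locally.
    """
    return [tool for tool in tools if which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All listed tools were found on PATH.")
```

The check only confirms that an executable exists; it does not verify versions or Google Cloud permissions.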
The Makefile defines a set of commonly used commands and workflows. Where possible, use the commands defined in the Makefile.
git clone https://github.com/ONSdigital/sic-classification-utils.git
cd sic-classification-utils
poetry install
Git hooks can be used to check code before each commit. To install them, run:
pre-commit install
Example source for the SIC embedding functionality is provided in sic_embedding_example.py. To run it:
poetry run python src/industrial_classification_utils/embed/sic_embedding_example.py
This outputs the results of a semantic search over the files in src/industrial_classification_utils/data/sic_index for the query "school teacher primary education".
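To give a feel for what a similarity search does, here is a minimal self-contained sketch. It is not the repository's implementation: it swaps the real embedding model and ChromaDB vector store for a toy bag-of-words vector and a plain cosine similarity, and the names `embed`, `cosine_similarity`, `search` and the entries in `INDEX_ENTRIES` are all invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents, top_k=2):
    """Return the top_k (score, document) pairs, highest score first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Illustrative entries only; not the contents of the sic_index data.
INDEX_ENTRIES = [
    "primary education school",
    "growing of cereals",
    "secondary education school",
]

results = search("school teacher primary education", INDEX_ENTRIES)
for score, entry in results:
    print(f"{score:.3f}  {entry}")
```

A real embedding model would also match related wording that shares no exact words with the query, which this word-overlap stand-in cannot do.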
docs - documentation as code using mkdocs
scripts - location of any supporting scripts (e.g. data cleansing)
src/industrial_classification_utils/data - example data and SIC classification data used for embeddings
src/industrial_classification_utils/embed - ChromaDB vector store and embedding code, including an example use of the store.
src/industrial_classification_utils/models - common data structures that need to be shared
src/industrial_classification_utils/utils - common utility functions, such as reading xls files for embeddings.
tests - PyTest unit tests for the code base; the aim is 80% coverage.
Placeholder
Code quality and static analysis will be enforced using isort, black, ruff, mypy and pylint. Security checking will be enhanced by running bandit.
To check code quality and only report errors, without auto-fixing them, run:
make check-python-nofix
To check the code quality and automatically fix errors where possible run:
make check-python
Documentation is available in the docs folder and can be viewed using mkdocs:
make run-docs
Pytest is used for testing alongside pytest-cov for coverage testing. /tests/conftest.py defines config used by the tests.
Unit tests for embedding functions are in /tests/test_embedding.py. Unit tests for utility functions are in /tests/test_sic_data_access.py. Each suite can be run separately:
make embed-tests
make utils-tests
All tests can be run using:
make all-tests
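For readers new to pytest, the sketch below shows the general shape of a test module that pytest would collect: plain functions whose names start with `test_` and that use bare `assert` statements. The helper `normalise_sic_code` is hypothetical and exists only to give the tests something to check; it is not a function from this repository.

```python
def normalise_sic_code(code):
    """Hypothetical helper: trim whitespace and left-pad to five digits."""
    return code.strip().zfill(5)


def test_normalise_sic_code_pads_leading_zero():
    # A code that lost its leading zero is restored to five characters.
    assert normalise_sic_code("1110") == "01110"


def test_normalise_sic_code_strips_whitespace():
    # Surrounding whitespace is removed and a full code is left unchanged.
    assert normalise_sic_code(" 85200 ") == "85200"
```

Saved in a file named like test_example.py under the tests folder, functions of this shape would be picked up by pytest's default discovery rules.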
Placeholder