SIC Classification Utils

Standard Industrial Classification (SIC) utilities, initially developed for the Survey Assist API, that complement the SIC Classification Library code.

Overview

Utilities for classifying industries under SIC. This repository contains core code used by the SIC Classification Library.

Features

Embeddings. Functionality for embedding SIC hierarchy data, managing vector stores, and performing similarity searches.
Data Access. Functions to load CSV data files related to SIC.
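
As a rough illustration of the Data Access feature, the sketch below reads SIC-style rows from CSV text using only the standard library. The column names (`sic_code`, `description`), the sample rows, and the `load_sic_rows` helper are assumptions made for this example; they are not the package's actual loader or file layout.

```python
import csv
import io

# Hypothetical sample data: the real files and their column names may differ.
SAMPLE_CSV = """sic_code,description
85200,Primary education
85310,General secondary education
01110,"Growing of cereals (except rice), leguminous crops and oil seeds"
"""


def load_sic_rows(csv_text: str) -> dict[str, str]:
    """Map each SIC code to its description, keeping codes as strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Codes stay as strings so leading zeros (e.g. "01110") are preserved.
    return {row["sic_code"].strip(): row["description"].strip() for row in reader}


sic_lookup = load_sic_rows(SAMPLE_CSV)
print(sic_lookup["85200"])
```

Keeping codes as strings rather than integers is a deliberate choice in this sketch: converting "01110" to a number would silently drop the leading zero.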

Prerequisites

Ensure you have the following installed on your local machine:

Python 3.12 (Recommended: use pyenv to manage versions)
poetry (for dependency management)
Colima (if running locally with containers)
Terraform (for infrastructure management)
Google Cloud SDK (gcloud) with appropriate permissions

Local Development Setup

The Makefile defines a set of commonly used commands and workflows. Where possible, use the commands defined in the Makefile.

Clone the repository

git clone https://github.com/ONSdigital/sic-classification-utils.git
cd sic-classification-utils

Install Dependencies

poetry install

Add Git Hooks

Git hooks can be used to check code before each commit. To install them, run:

pre-commit install

Run Locally

Example source for the SIC embedding functionality is in sic_embedding_example.py. To run it:

poetry run python src/industrial_classification_utils/embed/sic_embedding_example.py

This outputs the results of a semantic search over the files in src/industrial_classification_utils/data/sic_index, based on the query "school teacher primary education".
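
The repository's embedding code uses a ChromaDB vector store. To show the idea behind a similarity search without that dependency, the following minimal sketch ranks a few descriptions against the same query using bag-of-words vectors and cosine similarity. This is a stand-in technique for illustration, not the repository's embedding code; the descriptions and the `embed`, `cosine` and `search` helpers are hypothetical.

```python
import math
from collections import Counter

# Hypothetical index entries, chosen for illustration only.
DESCRIPTIONS = [
    "primary education school teacher",
    "secondary education school teacher",
    "growing of cereals and other crops",
]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, documents: list[str], top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top_k documents ranked by similarity to the query."""
    query_vec = embed(query)
    scored = [(doc, cosine(query_vec, embed(doc))) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


results = search("school teacher primary education", DESCRIPTIONS)
for doc, score in results:
    print(f"{score:.2f}  {doc}")
```

Word-overlap vectors only match exact terms. A learned embedding model would also be expected to place related wording close together, which is why a vector store is used in practice.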

Structure

docs - documentation as code using mkdocs

scripts - location of any supporting scripts (e.g. data cleansing)

src/industrial_classification_utils/data - example data and SIC classification data used for embeddings

src/industrial_classification_utils/embed - ChromaDB vector store and embedding code, includes an example use of the store.

src/industrial_classification_utils/models - common data structures that need to be shared

src/industrial_classification_utils/utils - common utility functions, such as reading xls files for embeddings.

tests - Pytest unit tests for the code base; the aim is 80% coverage.

GCP Setup

Placeholder

Code Quality

Code quality and static analysis will be enforced using isort, black, ruff, mypy and pylint. Security checking will be enhanced by running bandit.

To check the code quality and report any errors without auto-fixing them, run:

make check-python-nofix

To check the code quality and automatically fix errors where possible, run:

make check-python

Documentation

Documentation is available in the docs folder and can be viewed using mkdocs:

make run-docs

Testing

Pytest is used for testing alongside pytest-cov for coverage testing. /tests/conftest.py defines config used by the tests.

Unit tests for embedding functions are in /tests/test_embedding.py. Unit tests for utility functions are in /tests/test_sic_data_access.py. Each set can be run separately:

make embed-tests

make utils-tests

All tests can be run using:

make all-tests
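
For readers new to Pytest, a unit test is a plain function whose name starts with `test_` and that uses bare `assert` statements. The sketch below tests a hypothetical helper, `normalise_sic_code`, which is not part of this package and is defined here only so the example is self-contained.

```python
def normalise_sic_code(raw: str) -> str:
    """Hypothetical helper: strip whitespace and left-pad to five digits."""
    code = raw.strip()
    if not code.isdigit() or len(code) > 5:
        raise ValueError(f"not a valid SIC code: {raw!r}")
    return code.zfill(5)


def test_normalise_pads_leading_zeros():
    assert normalise_sic_code("1110") == "01110"


def test_normalise_strips_whitespace():
    assert normalise_sic_code(" 85200 ") == "85200"


def test_normalise_rejects_non_numeric():
    # Written without pytest imports so the sketch runs on its own.
    try:
        normalise_sic_code("abcde")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

In a real Pytest suite the last test would more commonly use `pytest.raises(ValueError)` as a context manager; the try/except form is used here only to avoid the import.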

Environment Variables

Placeholder
