Staging/main/0.10.0 (#943)
* feat: add dev to workflow for testing (#897)

* Reservoir sampling (#826)

* add code for reservoir sampling and insert sample_nrows options

* pre-commit fix

* add tests for reservoir sampling

* fixed mypy issues

* fix import to relative path

---------

Co-authored-by: Taylor Turner <[email protected]>
Co-authored-by: Richard Bann <[email protected]>

* [WIP] staging/dev/options (#909)

* New preset implementation and test (#867)

* memory optimization preset

trying again

trying again 3

trying again 4

accidentally pushed my updated makefile

* Wrote a catch for invalid presets, wrote a test for it, and debugged the new optimization preset

* Forgot to run pre-commit, fixed those issues

* black doing weird things

* made preset validation more maintainable by moving it to the constructor and getting rid of preset list

* RowStatisticsOptions: Add option (#865)

* RowStatisticsOptions: Add null row count

Added null_row_count as an option in RowStatisticsOptions. It toggles the functionality for row_has_null_ratio and row_is_null_ratio in _update_row_statistics.

* Unit test for RowStatisticsOptions

* Black formatting

* added a unit test for RowStatisticsOptions

* Deleted test cases that were written in the wrong file

* updated testing for null_count toggle in _update_row_statistics

* removed the RowStatisticsOptions from test_profiler_options imports

* add line

* Created toggle option for null_count

* RowStatisticsOptions: Add implementation

* Revert "RowStatisticsOptions: Add implementation"

This reverts commit 2da6a93.

* RowStatisticsOptions: Create option

* fixed pre-commit error

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Taylor Turner <[email protected]>

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Taylor Turner <[email protected]>

* fixed documentation

---------

Co-authored-by: Taylor Turner <[email protected]>
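
The null_count toggle described in the commits above gates the null-row ratios behind a boolean option. A miniature sketch of that pattern (class and field names follow the commit messages — `BooleanOption`, `RowStatisticsOptions`, `row_is_null_ratio` — but the real dataprofiler classes carry much more state):

```python
class BooleanOption:
    """Minimal on/off option, as referenced in the commit messages."""

    def __init__(self, is_enabled: bool = True) -> None:
        self.is_enabled = is_enabled


class RowStatisticsOptions(BooleanOption):
    """Toy options holder with the new null_count sub-toggle."""

    def __init__(self) -> None:
        super().__init__(is_enabled=True)
        self.null_count = BooleanOption(is_enabled=True)


def update_row_statistics(rows: list, options: RowStatisticsOptions) -> dict:
    """Compute the null-row ratios only when the null_count toggle is on."""
    stats: dict = {}
    if options.null_count.is_enabled:
        all_null = sum(1 for row in rows if all(v is None for v in row))
        has_null = sum(1 for row in rows if any(v is None for v in row))
        stats["row_is_null_ratio"] = all_null / len(rows)
        stats["row_has_null_ratio"] = has_null / len(rows)
    return stats
```

Disabling `options.null_count.is_enabled` skips both ratios entirely, which is the behavior the merge-conflict checks in the later commits guard against.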

* Preset test updated with new names and different toggles (#880)

* memory optimization preset

trying again

trying again 3

trying again 4

accidentally pushed my updated makefile

* trying

* trying

* black doing weird things

* trying

* made preset validation more maintainable by moving it to the constructor and getting rid of preset list

* Update to open-source in prep for wrapper changes for mem op preset

* updated preset toggles and preset name (mem op -> large data)

* updated tests to match

* continued name and test and toggle updates

* fix comments

* RowStatisticsOptions: Implementing option (#871)

* Implementing option

* Implementing option

* took out redundant if statement. added test case for when null_count is disabled.

* attempt to check for conflicts between profile merges

* added test to check if two profilers have null_count enabled before merging them together

* fixed typo and added a try/except to prevent a failing test

* No mocks needed. Fixed assertRaisesRegex error

* Changed variable names and added a new test to check the null_count when null_count is disabled.

* Changed name of test, moved tests to TestStructuredProfilerRowStatistics. Fixed position of if statement to prevent unnecessary code from running.

* added null_count test cases

* fixed indentation mistake

* fixed typo

* removed a useless commented-out line

* Updated test name

* update

---------

Co-authored-by: Liz Smith <[email protected]>
Co-authored-by: Richard Bann <[email protected]>

* Cms for categorical (#892)

* WIP cms implementation

* add heavy hitters implementation

* add heavy hitters implementation

* WIP: mypy issue

* WIP: mypy issue

* add cms bool and refactor options handler

* WIP: testing for CMS

* WIP: testing for CMS

* use new heavy_hitters_threshold, add test for it
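
The "CMS" in this PR is a count-min sketch used to find heavy hitters among categorical values. A self-contained toy version of the idea follows; the library itself relies on the `datasketches` dependency added to `.pre-commit-config.yaml` below, and every name here (including `heavy_hitters_threshold`) is illustrative rather than the real API:

```python
import random


class CountMinSketch:
    """Toy count-min sketch: depth hash rows of width counters each.

    Estimates are biased upward, never downward, so items whose
    estimate clears a threshold form a superset of the true heavy hitters.
    """

    def __init__(self, width: int = 256, depth: int = 4, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.width = width
        # One salt per row approximates depth independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.tables = [[0] * width for _ in range(depth)]

    def _buckets(self, item: str):
        for salt in self.salts:
            yield hash((salt, item)) % self.width

    def update(self, item: str) -> None:
        for row, bucket in enumerate(self._buckets(item)):
            self.tables[row][bucket] += 1

    def estimate(self, item: str) -> int:
        # The minimum across rows is the least-collided counter.
        return min(
            self.tables[row][bucket]
            for row, bucket in enumerate(self._buckets(item))
        )


def heavy_hitters(cms: CountMinSketch, candidates, heavy_hitters_threshold: int):
    """Keep candidates whose estimated count meets the threshold."""
    return {
        c: cms.estimate(c)
        for c in candidates
        if cms.estimate(c) >= heavy_hitters_threshold
    }
```

The appeal for categorical profiling is bounded memory: counts for arbitrarily many distinct values fit in a fixed `width × depth` table, at the cost of one-sided overestimates.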

* Reservoir sampling refactor (#910)

* refactored all but tests

* removed some superfluous tests

* moved variables around

* Staging/dev/profile serialization (#940)

* initial changes to categoricalColumn decoder (#818)

* Implemented decoding for numerical stats mixin and integer profiles (#844)

* hot fixes for encode and decode of numeric stats mixin and intcol profiler (#852)

* Float column profiler encode decode (#854)

* hot fixes for encode and decode of numeric stats mixin and intcol profiler

* cleaned up type checking and updated the NumericStatsMixin read-in helper to apply type conversions to more attributes

* Added docstring to the _load_stats_helper function

* Update dataprofiler/profilers/numerical_column_stats.py

Co-authored-by: Taylor Turner <[email protected]>

* Update dataprofiler/profilers/numerical_column_stats.py

* fix for nan values issue in pytesting

* Implementation of float profiler encode and decode process

---------

Co-authored-by: Taylor Turner <[email protected]>

* Json decode date time column (#861)

* more verbose error log with types for easy debug

* add load_from_dict to handle timestamps

* add json decode tests

* include DateTimeColumn class

* Added decoding for encoding of ordered column profiles (#864)

* Added ordered col test to ensure correct response to update when different ordering of values is introduced (#868)

* added decode text_column_profiler functionality and tests (#870)

* Created encoder for the datalabelercolumn (#869)

* feat: add test and compiler serialization (#884)

* [WIP] Adds tests validating serialization with Primitive type for compiler (#885)

* feat: add test and compiler serialization

* fix: move primitive tests to own class

* feat: add primitive col compiler save tests

* fix: float serializers asserts

* Adds deserialization for compilers and validates tests for Primitive; fixes numerical deserialization (#886)

* feat: add test and compiler serialization

* fix: move primitive tests to own class

* feat: add primitive col compiler save tests

* fix: float serializers asserts

* feat: add tests and allow primitive compiler to deserialize

* fix: bug in numeric stats deserial

* fix: missing `)` after conflict resolution

* Add Serialization and Deserialization Tests for Stats Compiler, plus refactors for order Typing (#887)

* fix: organize categorical and add get function

* refactor: reorganize tests and add stats test

* feat: order typing

* feat: add serial and deserial for stats compiler

* fix: bug when sample_size == 0

* ready datalabeler for deserialization and improvement on serialization for datalabeler (#879)

* Deserialization of datalabeler (#891)

* Added initial profiler decoding for datalabeler column (WIP)

* Initial implementation for deserialization of DataLabelerColumn

* Fix LSP violations (#840)

* Make profiler superclasses generic

Makes the superclasses BaseColumnProfiler, NumericStatsMixin, and
BaseCompiler generic, to avoid casting in subclass diff() methods and
violating LSP in principle.

* Add needed cast import

---------

Co-authored-by: Junho Lee <[email protected]>

* Encode Options (#875)

* encode testing

* encode dataLabeler testing

* encode structuredOptions testing

* cleaned up datalabeler test

* added text options

* [WIP] ColumnDataLabelerCompiler: serialize / deserialize (#888)

* formatting

* update formatting

* setting up full test suite for DataLabelerCompiler

* update isort

* updates to test -- still failing

* update

* Quick Test update (#893)

* update

* string in list

* formatting

* Decode options (#894)

* refactored options encode testing

* updated test name

* updated class names

* fixing test

* initial base option decode

* inital tests

* refactor: allow options to go through all (#902)

* refactor: allow options to go through all

* fix: bug

* StructuredColProfiler Encode / Decode  (#901)

* refactor: allow options to go through all

* fix: bug

* update

* update

* update

* updates

* update

* Fixes for Taylor's StructuredCol issue

* update

* update

* remove try/except

---------

Co-authored-by: Jeremy Goodsitt <[email protected]>
Co-authored-by: ksneab7 <[email protected]>

* fix: bug and add tests for structuredcolprofiler (#904)

* fix: bug and add tests

* fix: limit scipy requirements till problem understood and fixed

* Structured profiler encode decode (#903)

* refactor: allow options to go through all

* fix: bug in loading options

* update

* update

* Fixes for Taylor's StructuredCol issue

* Created load and save code for the StructuredProfiler

* intermediate commit for fixing structured profile

---------

Co-authored-by: Jeremy Goodsitt <[email protected]>
Co-authored-by: taylorfturner <[email protected]>

* [WIP] Added NoImplementationError for UnstructuredProfiler (#907)

* refactor: allow options to go through all

* fix: bug in loading options

* update

* update

* Fixes for Taylor's StructuredCol issue

* Created load and save code for the StructuredProfiler

* intermediate commit for fixing structured profile

* test fix

* mypy fixes for typing issues

* fix for the None case of the datalabeler in options

* Added mock of datalabeler to structured profile test

* Added tests for encoding of the Structured profiler

* Update dataprofiler/profilers/json_decoder.py

Co-authored-by: Michael Davis <[email protected]>

* Update dataprofiler/profilers/profile_builder.py

Co-authored-by: Michael Davis <[email protected]>

* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Michael Davis <[email protected]>

* Pr fixes

* Fixed typo in test

* Update dataprofiler/profilers/json_decoder.py

Co-authored-by: Taylor Turner <[email protected]>

* Update dataprofiler/profilers/profile_builder.py

Co-authored-by: Michael Davis <[email protected]>

* Update dataprofiler/tests/profilers/utils.py

Co-authored-by: Taylor Turner <[email protected]>

* Update dataprofiler/profilers/profile_builder.py

Co-authored-by: Michael Davis <[email protected]>

* Fixes for unneeded callout for _profile check

* small change

---------

Co-authored-by: Jeremy Goodsitt <[email protected]>
Co-authored-by: taylorfturner <[email protected]>
Co-authored-by: ksneab7 <[email protected]>
Co-authored-by: ksneab7 <[email protected]>

* Added testing for values for test_json_decode_after_update (#915)

* Reuse passed labeler (#924)

* refactor: loading labeler for reuse and abstract loading

* refactor: use for DataLabelerColumn as well

* fix: don't error if doesn't exist

* refactor: allow for config dict to be passed entire way

* fix: compiler tests

* fix: structCol tests

* fix: test

* BaseProfiler save() for json (#923)

* added save for top level and tests

* small refactor

* small fix

* refactor: use seed for sample for consistency (#927)

* refactor: use seed for sample for consistency

* fix: formatting and variables

* WIP top level load (#925)

* quick hot fix for input validation on save() save_method (#931)

* BaseProfiler: `load_method` hotfix (#932)

* added load_method

* updated tests
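
The `save_method`/`load_method` hotfixes above come down to validating the requested format before dispatching to a pickle or JSON path. A hypothetical stand-in class showing that validation shape (this is not the real `BaseProfiler` API, just the pattern the commits describe):

```python
import json
import pickle


class ToyProfiler:
    """Minimal stand-in for a profiler that saves as pickle or JSON."""

    def __init__(self, data: dict) -> None:
        self.data = data

    def save(self, filepath: str, save_method: str = "pickle") -> None:
        save_method = save_method.lower()
        if save_method not in ("pickle", "json"):
            # The input-validation hotfix: reject unknown formats early.
            raise ValueError(
                f"save_method must be 'pickle' or 'json', got {save_method!r}"
            )
        if save_method == "json":
            with open(filepath, "w") as fp:
                json.dump(self.data, fp)
        else:
            with open(filepath, "wb") as fp:
                pickle.dump(self.data, fp)

    @classmethod
    def load(cls, filepath: str, load_method: str = "pickle") -> "ToyProfiler":
        load_method = load_method.lower()
        if load_method not in ("pickle", "json"):
            raise ValueError(
                f"load_method must be 'pickle' or 'json', got {load_method!r}"
            )
        if load_method == "json":
            with open(filepath) as fp:
                return cls(json.load(fp))
        with open(filepath, "rb") as fp:
            return cls(pickle.load(fp))
```

The JSON path is what the serialization work in this release enables: a human-readable profile on disk instead of an opaque pickle.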

* fix: null_rep mat should calculate even if datetime (#933)

* Notebook Example save/load Profile (#930)

* update example data profiler demo save/load

* update notebook cells

* Update examples/data_profiler_demo.ipynb

* Update examples/data_profiler_demo.ipynb

* fix: order bug (#939)

* fix: typo on rebase

* fix: typing and bugs from rebase

* fix: options tests due to merge and loading new options

---------

Co-authored-by: Michael Davis <[email protected]>
Co-authored-by: ksneab7 <[email protected]>
Co-authored-by: Taylor Turner <[email protected]>
Co-authored-by: Tyler <[email protected]>
Co-authored-by: Junho Lee <[email protected]>
Co-authored-by: ksneab7 <[email protected]>

* Hotfix: fix post feature serialization merge (#942)

* fix: to use config instead of options

* fix: comment

* fix: maxdiff

* version bump (#944)

---------

Co-authored-by: JGSweets <[email protected]>
Co-authored-by: Rushabh Vinchhi <[email protected]>
Co-authored-by: Richard Bann <[email protected]>
Co-authored-by: Liz Smith <[email protected]>
Co-authored-by: Richard Bann <[email protected]>
Co-authored-by: Tyler <[email protected]>
Co-authored-by: Michael Davis <[email protected]>
Co-authored-by: ksneab7 <[email protected]>
Co-authored-by: Junho Lee <[email protected]>
Co-authored-by: ksneab7 <[email protected]>
11 people committed Jun 29, 2023
1 parent 2f94db1 commit 77ddb29
Showing 68 changed files with 5,068 additions and 556 deletions.
1 change: 1 addition & 0 deletions .github/workflows/test-python-package.yml
@@ -8,6 +8,7 @@ on:
branches:
- 'main'
- 'feature/**'
- 'dev'

jobs:
build:
3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -63,6 +63,7 @@ repos:
networkx>=2.5.1,
typing-extensions>=3.10.0.2,
HLL>=2.0.3,
datasketches>=4.1.0,

# requirements-dev.txt
check-manifest>=0.48,
@@ -109,7 +110,7 @@ repos:
additional_dependencies: ['h5py', 'wheel', 'future', 'numpy', 'pandas',
'python-dateutil', 'pytz', 'pyarrow', 'chardet', 'fastavro',
'python-snappy', 'charset-normalizer', 'psutil', 'scipy', 'requests',
- 'networkx','typing-extensions', 'HLL']
+ 'networkx','typing-extensions', 'HLL', 'datasketches']
# Pyupgrade - standardize and modernize Python syntax for newer versions of the language
- repo: https://github.com/asottile/pyupgrade
rev: v3.3.0
12 changes: 12 additions & 0 deletions dataprofiler/data_readers/csv_data.py
@@ -87,6 +87,7 @@ def __init__(
self._checked_header: bool = "header" in options and self._header != "auto"
self._default_delimiter: str = ","
self._default_quotechar: str = '"'
self._sample_nrows: Optional[int] = options.get("sample_nrows", None)

if data is not None:
self._load_data(data)
@@ -115,6 +116,11 @@ def header(self) -> Optional[Union[str, int]]:
"""Return header."""
return self._header

@property
def sample_nrows(self) -> Optional[int]:
"""Return sample_nrows."""
return self._sample_nrows

@property
def is_structured(self) -> bool:
"""Determine compatibility with StructuredProfiler."""
@@ -168,6 +174,10 @@ def _check_and_return_options(options: Optional[Dict]) -> Dict:
raise ValueError(
"'record_samples_per_line' must be an int " "more than 0"
)
if "sample_nrows" in options:
value = options["sample_nrows"]
if not isinstance(value, int) or value < 0:
raise ValueError("'sample_nrows' must be an int more than 0")
return options

@staticmethod
@@ -549,6 +559,7 @@ def _load_data_from_str(self, data_as_str: str) -> pd.DataFrame:
data_buffered,
self.delimiter,
cast(Optional[int], self.header),
self.sample_nrows,
self.selected_columns,
read_in_string=True,
)
@@ -595,6 +606,7 @@ def _load_data_from_file(self, input_file_path: str) -> pd.DataFrame:
input_file_path,
self.delimiter,
cast(Optional[int], self.header),
self.sample_nrows,
self.selected_columns,
read_in_string=True,
encoding=self.file_encoding,
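
The new `sample_nrows` check in `_check_and_return_options` above can be mirrored as a standalone function for illustration (this is not the library code; note the error text says "more than 0" while the check itself only rejects negatives and non-ints):

```python
def check_sample_nrows(options: dict) -> dict:
    """Standalone mirror of the 'sample_nrows' validation shown above.

    Hypothetical usage of the option itself would look like
    CSVData("data.csv", options={"sample_nrows": 1000}).
    """
    if "sample_nrows" in options:
        value = options["sample_nrows"]
        if not isinstance(value, int) or value < 0:
            raise ValueError("'sample_nrows' must be an int more than 0")
    return options
```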
115 changes: 112 additions & 3 deletions dataprofiler/data_readers/data_utils.py
@@ -1,9 +1,13 @@
"""Contains functions for data readers."""
import json
import os
import random
import re
import urllib
from collections import OrderedDict
from io import BytesIO, StringIO, TextIOWrapper
from itertools import islice
from math import floor, log, log1p
from typing import (
Any,
Dict,
@@ -24,7 +28,7 @@
from chardet.universaldetector import UniversalDetector
from typing_extensions import TypeGuard

from .. import dp_logging
from .. import dp_logging, settings
from .._typing import JSONType, Url
from .filepath_or_buffer import FileOrBufferHandler, is_stream_buffer # NOQA

@@ -268,10 +272,106 @@ def read_json(
return lines


def reservoir(file: TextIOWrapper, sample_nrows: int) -> list:
"""
Implement the mathematical logic of Reservoir sampling.
:param file: wrapper of the opened csv file
:type file: TextIOWrapper
:param sample_nrows: number of rows to sample
:type sample_nrows: int
:raises: ValueError()
:return: sampled values
:rtype: list
"""
# Copyright 2021 Oscar Benjamin
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# https://gist.github.com/oscarbenjamin/4c1b977181f34414a425f68589e895d1

iterator = iter(file)
values = list(islice(iterator, sample_nrows))

irange = range(len(values))
indices = dict(zip(irange, irange))

kinv = 1 / sample_nrows
W = 1.0
rng = random.Random(x=settings._seed)
if "DATAPROFILER_SEED" in os.environ and settings._seed is None:
seed = os.environ.get("DATAPROFILER_SEED")
if seed:
rng = random.Random(int(seed))

while True:
W *= rng.random() ** kinv
# random() < 1.0 but random() ** kinv might not be
# W == 1.0 implies "infinite" skips
if W == 1.0:
break
# skip is geometrically distributed with parameter W
skip = floor(log(rng.random()) / log1p(-W))
try:
newval = next(islice(iterator, skip, skip + 1))
except StopIteration:
break
# Append new, replace old with dummy, and keep track of order
remove_index = rng.randrange(sample_nrows)
values[indices[remove_index]] = str(None)
indices[remove_index] = len(values)
values.append(newval)

values = [values[indices[i]] for i in irange]
return values


def rsample(file_path: TextIOWrapper, sample_nrows: int, args: dict) -> StringIO:
"""
Implement Reservoir Sampling to sample n rows out of a total of M rows.
:param file_path: path of the csv file to be read in
:type file_path: TextIOWrapper
:param sample_nrows: number of rows being sampled
:type sample_nrows: int
:param args: options to read the csv file
:type args: dict
"""
header = args["header"]
result = []

if header is not None:
result = [[next(file_path) for i in range(header + 1)][-1]]
args["header"] = 0

result += reservoir(file_path, sample_nrows)

fo = StringIO("".join([i if (i[-1] == "\n") else i + "\n" for i in result]))
return fo


def read_csv_df(
file_path: Union[str, BytesIO, TextIOWrapper],
delimiter: Optional[str],
header: Optional[int],
sample_nrows: Optional[int] = None,
selected_columns: List[str] = [],
read_in_string: bool = False,
encoding: Optional[str] = "utf-8",
@@ -314,19 +414,28 @@

# account for py3.6 requirement for pandas, can remove if >= py3.7
is_buf_wrapped = False
is_file_open = False
if isinstance(file_path, BytesIO):
# a BytesIO stream has to be wrapped in order to properly be detached
# in 3.6 this avoids read_csv wrapping the stream and closing too early
file_path = TextIOWrapper(file_path, encoding=encoding)
is_buf_wrapped = True

fo = pd.read_csv(file_path, **args)
elif isinstance(file_path, str):
file_path = open(file_path, encoding=encoding)
is_file_open = True

file_data = file_path
if sample_nrows:
file_data = rsample(file_path, sample_nrows, args)
fo = pd.read_csv(file_data, **args)
data = fo.read()

# if the buffer was wrapped, detach it before returning
if is_buf_wrapped:
file_path = cast(TextIOWrapper, file_path)
file_path.detach()
elif is_file_open:
file_path.close()
fo.close()

return data
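
The `reservoir()` function above is an optimized variant that skips ahead geometrically instead of drawing a random number for every row. For intuition, the classic Algorithm R below gives the same guarantee — every row survives with equal probability k/n — in its simplest form (a standalone sketch, not the library code):

```python
import random


def algorithm_r(iterable, k: int, seed: int = 0) -> list:
    """Classic reservoir sampling: keep the first k items, then replace
    a uniformly random slot with probability k/i for the i-th item."""
    rng = random.Random(seed)
    reservoir: list = []
    for i, item in enumerate(iterable, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = rng.randrange(i)  # uniform in [0, i)
            if j < k:             # happens with probability k/i
                reservoir[j] = item
    return reservoir
```

The geometric-skip version avoids one random draw per row, which matters when sampling a few thousand rows out of millions.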
2 changes: 1 addition & 1 deletion dataprofiler/data_readers/graph_data.py
@@ -255,7 +255,7 @@ def _format_data_networkx(self) -> nx.Graph:
self.input_file_path,
self._delimiter,
cast(Optional[int], self._header),
- [],
+ selected_columns=[],
read_in_string=True,
encoding=self.file_encoding,
)
4 changes: 3 additions & 1 deletion dataprofiler/labelers/base_data_labeler.py
@@ -637,7 +637,9 @@ def load_from_library(cls, name: str) -> BaseDataLabeler:
:return: DataLabeler class
:rtype: BaseDataLabeler
"""
- return cls(os.path.join(default_labeler_dir, name))
+ labeler = cls(os.path.join(default_labeler_dir, name))
+ labeler._default_model_loc = name
+ return labeler

@classmethod
def load_from_disk(cls, dirpath: str, load_options: dict = None) -> BaseDataLabeler:
5 changes: 4 additions & 1 deletion dataprofiler/labelers/data_labelers.py
@@ -102,7 +102,7 @@ def __new__( # type: ignore
trainable: bool = False,
) -> BaseDataLabeler:
"""
- Create structured and unstructred data labeler objects.
+ Create structured and unstructured data labeler objects.
:param dirpath: Path to load data labeler
:type dirpath: str
@@ -143,6 +143,9 @@ def load_from_library(cls, name: str, trainable: bool = False) -> BaseDataLabele
"""
if trainable:
return TrainableDataLabeler.load_from_library(name)
for _, labeler_class_obj in cls.labeler_classes.items():
if name in labeler_class_obj._default_model_loc:
return labeler_class_obj()
return BaseDataLabeler.load_from_library(name)

@classmethod
88 changes: 87 additions & 1 deletion dataprofiler/profilers/__init__.py
@@ -1,12 +1,98 @@
"""Package for providing statistics and predictions for a given dataset."""
from . import json_decoder
from .base_column_profilers import BaseColumnProfiler
from .categorical_column_profile import CategoricalColumn
from .column_profile_compilers import (
BaseCompiler,
ColumnDataLabelerCompiler,
ColumnPrimitiveTypeProfileCompiler,
ColumnStatsProfileCompiler,
)
from .data_labeler_column_profile import DataLabelerColumn
from .datetime_column_profile import DateTimeColumn
from .float_column_profile import FloatColumn
from .int_column_profile import IntColumn
from .numerical_column_stats import NumericStatsMixin
from .order_column_profile import OrderColumn
from .profile_builder import Profiler, StructuredProfiler, UnstructuredProfiler
from .profile_builder import (
Profiler,
StructuredColProfiler,
StructuredProfiler,
UnstructuredProfiler,
)
from .profiler_options import (
BaseInspectorOptions,
BooleanOption,
CategoricalOptions,
CorrelationOptions,
DataLabelerOptions,
DateTimeOptions,
FloatOptions,
HistogramOption,
HyperLogLogOptions,
IntOptions,
ModeOption,
NumericalOptions,
OrderOptions,
PrecisionOptions,
ProfilerOptions,
RowStatisticsOptions,
StructuredOptions,
TextOptions,
TextProfilerOptions,
UniqueCountOptions,
UnstructuredOptions,
)
from .text_column_profile import TextColumn
from .unstructured_labeler_profile import UnstructuredLabelerProfile

# set here to avoid circular imports
json_decoder._profiles = {
CategoricalColumn.__name__: CategoricalColumn,
FloatColumn.__name__: FloatColumn,
IntColumn.__name__: IntColumn,
DateTimeColumn.__name__: DateTimeColumn,
OrderColumn.__name__: OrderColumn,
DataLabelerColumn.__name__: DataLabelerColumn,
TextColumn.__name__: TextColumn,
}


json_decoder._compilers = {
ColumnDataLabelerCompiler.__name__: ColumnDataLabelerCompiler,
ColumnPrimitiveTypeProfileCompiler.__name__: ColumnPrimitiveTypeProfileCompiler,
ColumnStatsProfileCompiler.__name__: ColumnStatsProfileCompiler,
}

json_decoder._options = {
BooleanOption.__name__: BooleanOption,
HistogramOption.__name__: HistogramOption,
ModeOption.__name__: ModeOption,
BaseInspectorOptions.__name__: BaseInspectorOptions,
NumericalOptions.__name__: NumericalOptions,
IntOptions.__name__: IntOptions,
PrecisionOptions.__name__: PrecisionOptions,
FloatOptions.__name__: FloatOptions,
TextOptions.__name__: TextOptions,
DateTimeOptions.__name__: DateTimeOptions,
OrderOptions.__name__: OrderOptions,
CategoricalOptions.__name__: CategoricalOptions,
CorrelationOptions.__name__: CorrelationOptions,
UniqueCountOptions.__name__: UniqueCountOptions,
HyperLogLogOptions.__name__: HyperLogLogOptions,
RowStatisticsOptions.__name__: RowStatisticsOptions,
DataLabelerOptions.__name__: DataLabelerOptions,
TextProfilerOptions.__name__: TextProfilerOptions,
StructuredOptions.__name__: StructuredOptions,
UnstructuredOptions.__name__: UnstructuredOptions,
ProfilerOptions.__name__: ProfilerOptions,
}


json_decoder._profilers = {
StructuredProfiler.__name__: StructuredProfiler,
}

json_decoder._structured_col_profiler = {
StructuredColProfiler.__name__: StructuredColProfiler,
}
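
The registries above map serialized class names back to classes so the decoder can re-instantiate them after a JSON round trip. The lookup reduces to a dict access, roughly like this toy sketch (names and payload shape are illustrative, not the actual json_decoder API):

```python
_profiles: dict = {}


def register(cls):
    """Mimic the module-level registry: map class name -> class."""
    _profiles[cls.__name__] = cls
    return cls


@register
class FloatColumn:
    """Stand-in profile class; real profiles have load_from_dict helpers."""

    def __init__(self, **attrs) -> None:
        self.__dict__.update(attrs)


def decode_profile(serialized: dict):
    """Instantiate the class named in the payload via the registry."""
    cls = _profiles.get(serialized["class"])
    if cls is None:
        raise ValueError(f"Invalid profiler class {serialized['class']!r}")
    return cls(**serialized["data"])
```

Populating the registries in `__init__.py`, as this diff does, sidesteps the circular imports that a direct `json_decoder -> profilers` import would cause.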
