Context column cannot be a sequence key: Need better error message for this case #2097

Open
npatki opened this issue Jun 27, 2024 · 0 comments · May be fixed by #2108
Assignees: gsheni
Labels: bug (Something isn't working), data:sequential (Related to timeseries datasets)

npatki (Contributor) commented Jun 27, 2024

Environment Details

  • SDV version: 1.14.0 (latest)

Error Description

For sequential data, the sequence key should never also be a context column. The sequence key is the identifier for each sequence, whereas a context column is an ordinary column whose value happens never to vary within a sequence. There is no need to declare the sequence key as a context column: it is already guaranteed not to vary within a sequence, because it defines what a sequence is.
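
To make the distinction concrete, here is a minimal, dependency-free sketch (not SDV code) that checks whether a column is constant within each sequence. The helper name `is_constant_within_sequences` is hypothetical, and the values mirror the sample data in the reproduction in this issue. The sequence key passes the check trivially, which is why listing it as a context column adds no information.

```python
from collections import defaultdict


def is_constant_within_sequences(sequence_keys, values):
    """Return True if `values` takes exactly one value per sequence key.

    Hypothetical helper for illustration only; it is not part of SDV.
    """
    seen = defaultdict(set)
    for key, value in zip(sequence_keys, values):
        seen[key].add(value)
    return all(len(distinct) == 1 for distinct in seen.values())


# Same shape as the sample data in this issue.
A = [0, 0, 0, 1, 1, 1]                        # sequence key
C = [12, 13, 34, 10, 45, 21]                  # varies within a sequence
D = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No']   # never varies within a sequence

d_is_context = is_constant_within_sequences(A, D)  # usable as a context column
c_is_context = is_constant_within_sequences(A, C)  # not usable as a context column
a_is_context = is_constant_within_sequences(A, A)  # constant by definition
```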

Yet the code allows me to instantiate a PARSynthesizer whose context column is the same as the sequence key. When I then try to fit it, I get an error that does not point to the actual problem.

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'A': { 'sdtype': 'id' },
        'B': { 'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d' },
        'C': { 'sdtype': 'numerical' },
        'D': { 'sdtype': 'categorical' }
    },
    'sequence_key': 'A'
})

data = pd.DataFrame(data={
    'A': [0, 0, 0, 1, 1, 1],
    'B': ['2020-03-02', '2020-03-04', '2020-03-05', '2020-03-01', '2020-03-03', '2020-03-06'],
    'C': [12, 13, 34, 10, 45, 21],
    'D': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No']
})

synth = PARSynthesizer(metadata, context_columns=['A'])
synth.fit(data)

Error:

/usr/local/lib/python3.10/dist-packages/sdv/sequential/par.py in update_transformers(self, column_name_to_transformer)
    298         """
    299         if set(column_name_to_transformer).intersection(set(self.context_columns)):
--> 300             raise SynthesizerInputError(
    301                 'Transformers for context columns are not allowed to be updated.')
    302 

SynthesizerInputError: Transformers for context columns are not allowed to be updated.

Expected Behavior

I should not be allowed to even instantiate a PARSynthesizer if any of the context columns is the sequence key. Instantiation should immediately raise an error explaining that this is not allowed.

synth = PARSynthesizer(metadata, context_columns=['A'])
SynthesizerInputError: The sequence key ('A') cannot be a context column. To proceed, please remove the sequence key from the 'context_columns' parameter.
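
A minimal sketch of the kind of check that could run at instantiation time to produce the message above. The function name `validate_context_columns` and the stand-in exception class are assumptions made for illustration so the snippet runs without SDV installed; the actual fix in SDV may be structured differently.

```python
class SynthesizerInputError(Exception):
    """Stand-in for SDV's error class so this sketch runs without SDV."""


def validate_context_columns(sequence_key, context_columns):
    """Raise if the sequence key is listed among the context columns.

    Hypothetical helper; not SDV's implementation.
    """
    if sequence_key is not None and sequence_key in (context_columns or []):
        raise SynthesizerInputError(
            f"The sequence key ('{sequence_key}') cannot be a context column. "
            "To proceed, please remove the sequence key from the "
            "'context_columns' parameter."
        )


# A context column that is not the sequence key passes silently.
validate_context_columns('A', ['D'])

# Passing the sequence key as a context column is rejected up front.
try:
    validate_context_columns('A', ['A'])
except SynthesizerInputError as error:
    message = str(error)
```

Running the check in the constructor, before any transformers are configured, would surface the problem before `fit` is ever called.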

npatki added the bug and data:sequential labels Jun 27, 2024
gsheni self-assigned this Jul 3, 2024