Performance issues with GaussianCopula training on tabular data #194

Open
jalr4ever opened this issue Jul 3, 2024 · 0 comments
Labels: enhancement (New feature or request)

Problem

When training on tabular data with millions of rows and hundreds of columns, the current GaussianCopulaSynthesizer uses an excessive amount of memory (approximately 37 GB on macOS with an M3 Max).

Proposed Solution

Reduce resource consumption (for example, to around 4 GB of memory for the test file in this issue) and make it possible to train on larger datasets while maintaining good performance.
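One possible direction, offered only as a rough sketch and not as a description of how GaussianCopulaSynthesizer works internally: the correlation matrix that a Gaussian copula relies on can be accumulated chunk by chunk, so that only one chunk of rows is in memory at a time. The function name `streaming_correlation` is hypothetical, and the sketch assumes the columns have already been converted to numeric values; it does not cover fitting the marginal distributions.

```python
import numpy as np


def streaming_correlation(chunks):
    """Estimate a Pearson correlation matrix from an iterable of 2-D arrays.

    Hypothetical helper: each chunk has shape (rows, columns) and only one
    chunk needs to be held in memory at a time.
    """
    n = 0
    total = None  # running column sums, shape (d,)
    cross = None  # running sum of X^T X, shape (d, d)
    for chunk in chunks:
        x = np.asarray(chunk, dtype=np.float64)
        if total is None:
            d = x.shape[1]
            total = np.zeros(d)
            cross = np.zeros((d, d))
        n += x.shape[0]
        total += x.sum(axis=0)
        cross += x.T @ x
    if n < 2:
        raise ValueError("need at least two rows to estimate a correlation")
    mean = total / n
    # Sample covariance from the accumulated first and second moments.
    cov = (cross - n * np.outer(mean, mean)) / (n - 1)
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)
```

The chunks could come from any source that yields row blocks, for example `pandas.read_csv("./test.csv", chunksize=...)` after the columns have been encoded numerically. The extra memory is then bounded by one chunk plus a d × d matrix, where d is the number of columns, instead of growing with the full row count.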

Additional context

Reproduction Code & Files:

test file: test.csv

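# Load the CSV lazily through a connector and data loader
data_connector = CsvConnector(path="./test.csv")
data_loader = DataLoader(data_connector)
# Infer column metadata from the loaded data
loan_metadata = Metadata.from_dataloader(data_loader)
# Fit the synthesizer; this is the step where memory usage grows
model = GaussianCopulaSynthesizer()
model.fit(loan_metadata, data_loader)
# Draw a small sample and write it out
sampled_data = model.sample(10)
sampled_data.to_csv("./aaaa.csv", index=False)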
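To put a comparable number on the memory footprint of the reproduction above, the peak resident set size of the process can be read from the standard library on Unix-like systems. This is a minimal sketch; the helper names `peak_rss_bytes` and `run_and_report` are hypothetical and not part of the library under discussion.

```python
import resource
import sys


def peak_rss_bytes():
    """Return the peak resident set size of this process in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


def run_and_report(fn, *args, **kwargs):
    """Call fn, then print the process-wide peak RSS observed so far."""
    result = fn(*args, **kwargs)
    print(f"peak RSS: {peak_rss_bytes() / 2**30:.2f} GiB")
    return result
```

For the script above, the fit step could be wrapped as `run_and_report(model.fit, loan_metadata, data_loader)`. The figure is a high-water mark for the whole process, so it also includes memory used while loading the data and building the metadata.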

jalr4ever added the enhancement label Jul 3, 2024