Performance issues with GaussianCopula training on tabular data #194

jalr4ever · 2024-07-03T03:33:23Z

Problem

When training on tabular data with millions of rows and hundreds of columns, the current GaussianCopulaSynthesizer uses an excessive amount of memory (approximately 37 GB on macOS with an M3 Max).

Proposed Solution

Reduce resource consumption (for example, to around 4 GB of memory for the dataset described above) and make it possible to train on larger datasets while maintaining good performance.
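To illustrate one direction such a reduction could take, here is a minimal sketch of computing a correlation matrix chunk by chunk, so that only one chunk plus a columns-by-columns accumulator is in memory at a time. This is not the library's implementation: `streaming_correlation` is a hypothetical helper, it assumes the columns are already numeric (for a Gaussian copula, already transformed to normal scores), and the sum-of-products formula it uses can lose precision on badly scaled data.

```python
import numpy as np


def streaming_correlation(chunks):
    """Correlation matrix from an iterable of (rows, cols) arrays.

    Hypothetical helper: accumulates column sums and cross-products
    so the full dataset never has to be held in memory at once.
    """
    n = 0
    total = None
    cross = None
    for chunk in chunks:
        x = np.asarray(chunk, dtype=np.float64)
        if total is None:
            total = np.zeros(x.shape[1])
            cross = np.zeros((x.shape[1], x.shape[1]))
        n += x.shape[0]
        total += x.sum(axis=0)
        cross += x.T @ x
    if n < 2:
        raise ValueError("need at least two rows")
    mean = total / n
    # Sample covariance from the accumulated sums.
    cov = (cross - n * np.outer(mean, mean)) / (n - 1)
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)


# Synthetic stand-in data, split into 10 chunks.
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 5))
data[:, 1] += 0.5 * data[:, 0]

chunked = streaming_correlation(np.array_split(data, 10))
full = np.corrcoef(data, rowvar=False)
```

With a real CSV, the chunks could come from `pandas.read_csv(path, chunksize=...)` instead of `np.array_split`. The accumulator costs memory proportional to the square of the column count, independent of the row count.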

Additional context

Reproduction Code & Files:

test file: test.csv

# Load the CSV and infer metadata from it
data_connector = CsvConnector(path="./test.csv")
data_loader = DataLoader(data_connector)
loan_metadata = Metadata.from_dataloader(data_loader)

# Train the synthesizer (memory usage grows here)
model = GaussianCopulaSynthesizer()
model.fit(loan_metadata, data_loader)

# Sample 10 rows and write them out
sampled_data = model.sample(10)
sampled_data.to_csv("./aaaa.csv", index=False)
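To make a memory figure like the one above easier to compare across machines, a small wrapper around the standard-library `tracemalloc` module can record the peak of Python-level allocations during a call. This is a rough sketch: `peak_memory_mib` and `allocate` are hypothetical names, and `tracemalloc` only sees allocations made through Python's allocator, so it may report less than the process memory shown by the operating system.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def allocate(n_bytes):
    # Stand-in workload; replace with the call being measured.
    return bytearray(n_bytes)


result, peak = peak_memory_mib(allocate, 8_000_000)
print(f"peak: {peak:.1f} MiB")
```

In the reproduction above, the measured call would be the `model.fit(loan_metadata, data_loader)` step, passed as `peak_memory_mib(model.fit, loan_metadata, data_loader)`.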


jalr4ever added the enhancement New feature or request label Jul 3, 2024
