[1.0.0] CTGAN Optimization #77

Open
MooooCat opened this issue Dec 19, 2023 · 2 comments

@MooooCat
Contributor

Problem

When a large amount of real data is used to train a CTGAN model, the current implementation does not work well.

Since all the data (DataFrame) is loaded into memory during training, memory consumption becomes very high, which is not an elegant solution.

Proposed Solution

Fortunately, in this refactoring, sdgx provides the new DataLoader and the NDArrayLoader (under development).

We can use these new data-related components to modify the data transformer, the data sampler, and the CTGAN model.

The data will not be loaded into memory all at once. Instead, it will be loaded in chunks of rows or columns as needed, and each chunk will then be used to train the model.

This will effectively reduce memory consumption and make it possible to process larger datasets.
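To illustrate the chunked-loading idea, here is a minimal sketch using only the Python standard library. It is not the sdgx DataLoader or NDArrayLoader API; the helper names `iter_chunks` and `count_categories` and the sample data are hypothetical and exist only for this example.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List


def iter_chunks(fileobj, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `chunk_size` rows from a CSV file object.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    reader = csv.DictReader(fileobj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def count_categories(fileobj, column: str, chunk_size: int = 2) -> Dict[str, int]:
    """Accumulate category frequencies for one column, chunk by chunk."""
    counts: Dict[str, int] = {}
    for chunk in iter_chunks(fileobj, chunk_size):
        for row in chunk:
            counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# Tiny in-memory stand-in for a large CSV file (made-up data).
SAMPLE = "age,city\n31,Paris\n45,Lima\n27,Paris\n52,Oslo\n38,Lima\n"

chunks = list(iter_chunks(io.StringIO(SAMPLE), chunk_size=2))
city_counts = count_categories(io.StringIO(SAMPLE), "city")
```

The same pattern would apply to statistics a transformer or sampler needs (for example category frequencies): they can be accumulated across chunks instead of being computed on a fully loaded DataFrame.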

Additional context

TBD

@MooooCat MooooCat added documentation Improvements or additions to documentation enhancement New feature or request difficulty-hard labels Dec 19, 2023
@MooooCat MooooCat added this to the 0.1.0 milestone Dec 19, 2023
@Wh1isper
Collaborator

CTGAN one-hot encodes all discrete columns. If columns of random strings are present, they will form a huge matrix during vectorisation, leading to memory overflow.

Based on this, we need to identify random discrete columns (such as home addresses, names, etc.) in DataProcessor and Metadata, and process them probabilistically or on-the-fly using tools like Faker.
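As a rough sketch of the problem and of one possible detection rule: the code below estimates the size of a dense one-hot matrix and flags columns whose share of unique values is high. The function names, the 0.5 threshold, and the 4-bytes-per-cell (float32) assumption are illustrative choices for this example, not part of sdgx or CTGAN.

```python
from typing import Dict, List, Sequence


def one_hot_bytes(n_rows: int, n_categories: int, bytes_per_cell: int = 4) -> int:
    """Size of a dense one-hot matrix, assuming float32 cells (an assumption)."""
    return n_rows * n_categories * bytes_per_cell


def find_high_cardinality(
    columns: Dict[str, Sequence[str]], max_unique_ratio: float = 0.5
) -> List[str]:
    """Return names of columns whose unique/total ratio exceeds the threshold.

    Hypothetical heuristic: such columns (names, addresses, random IDs)
    are candidates for on-the-fly generation instead of one-hot encoding.
    """
    flagged = []
    for name, values in columns.items():
        if not values:
            continue
        ratio = len(set(values)) / len(values)
        if ratio > max_unique_ratio:
            flagged.append(name)
    return flagged


# Made-up sample: "gender" has 2 distinct values, "name" is unique per row.
columns = {
    "gender": ["F", "M", "F", "M", "F", "M"],
    "name": ["Ana", "Bo", "Cy", "Di", "Ed", "Flo"],
}
flagged = find_high_cardinality(columns)

# One million rows with one million distinct strings, stored densely.
worst_case = one_hot_bytes(1_000_000, 1_000_000)
```

Under these assumptions, a single column of one million unique strings would need about 4 TB as a dense one-hot matrix, which is why such columns have to be detected and handled separately before encoding.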

@MooooCat
Contributor Author

> CTGAN one-hot encodes all discrete columns. If columns of random strings are present, they will form a huge matrix during vectorisation, leading to memory overflow.
>
> Based on this, we need to identify random discrete columns (such as home addresses, names, etc.) in DataProcessor and Metadata, and process them probabilistically or on-the-fly using tools like Faker.

In response to this problem, I will start designing the metadata and data processor, and post updates in this issue or in the discussion section.

@Wh1isper Wh1isper modified the milestones: 0.1.0, 0.2.0 Dec 20, 2023
@Wh1isper Wh1isper changed the title [0.1.0] CTGAN Optimization [0.2.0] CTGAN Optimization Dec 20, 2023
@Wh1isper Wh1isper removed their assignment Feb 2, 2024
@MooooCat MooooCat modified the milestones: 0.2.0, 1.0.0 Feb 26, 2024
@MooooCat MooooCat changed the title [0.2.0] CTGAN Optimization [1.0.0] CTGAN Optimization Feb 26, 2024