
Surprising results on a simple regression dataset #231

Open
Sun-Haotian opened this issue Jul 15, 2022 · 2 comments
Labels
question (General question about the software) · under discussion (Issue is currently being discussed)

Comments

@Sun-Haotian

Environment details

If you are already running CTGAN, please indicate the following details about the environment in
which you are running it:

  • CTGAN version: 0.5.1
  • Python version: 3.7.13
  • Operating System: Google Colab

Problem description

I am trying to use CTGAN to generate synthetic data for my regression dataset. The full dataset has 159 data points, which I have manually split into a training set of 111 points and a test set of 48 points. There are 7 input features (the first 7 columns) and 1 output feature, "Normerr" (the last column). Because of their mechanical meaning, "su/sy" must be smaller than 1, "D/t" larger than 22, and "a/t" smaller than 0.8, and these limits are reflected in the data.
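As a side note, the stated physical limits can be verified programmatically, both on the real data and on any synthetic sample; a minimal sketch, assuming a pandas DataFrame with the column names used later in this issue (the `toy` frame and the helper name are invented for illustration, not part of the original code):

```python
import pandas as pd

def check_bounds(df: pd.DataFrame) -> dict:
    """Return True/False per stated constraint: su/sy < 1, D/t > 22, a/t < 0.8."""
    return {
        'su/sy < 1': bool((df['su/sy'] < 1).all()),
        'D/t > 22': bool((df['D/t'] > 22).all()),
        'a/t < 0.8': bool((df['a/t'] < 0.8).all()),
    }

# Toy example (not the real data):
toy = pd.DataFrame({'su/sy': [0.7, 0.9], 'D/t': [30.0, 45.0], 'a/t': [0.3, 0.5]})
print(check_bounds(toy))
```

Running the same check on CTGAN output would show whether the model respects the bounds that hold in the training data.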

My goal is to generate synthetic data that exhibits ML utility. Using Gaussian process regression from sklearn, training on the 111 points and testing on the 48 points (i.e. the original training and test sets) gives a coefficient of determination of 0.97 on the training set and 0.87 on the test set, which is satisfactory to me. I then used the 111 training points to generate 300 synthetic points and considered 2 scenarios to assess ML utility: 1. train on the 111 real points and test on the 300 synthetic points; 2. train on the 300 synthetic points, using the same hyperparameters as the GPR model fitted to the 111 real points, and test on the 48 real points. Both scenarios give poor results. I also notice that the synthetic data contains many extreme values (e.g. the lowest value in the Cv column, 15.2, repeats more than 10 times across many rounds of trials).
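The repeated-minimum observation can be quantified directly; a minimal sketch (the helper name and the toy series are mine, not from the original code):

```python
import pandas as pd

def min_repeat_count(s: pd.Series) -> int:
    """How many times the column minimum occurs in the sample."""
    return int((s == s.min()).sum())

# Toy example (not real CTGAN output); in the real run, applying this to
# fake_data['Cv'] would reportedly show the minimum 15.2 appearing 10+ times.
fake_cv = pd.Series([15.2, 15.2, 15.2, 18.0, 21.4, 15.2])
print(min_repeat_count(fake_cv))  # 4
```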

What I already tried

I think I have tried adjusting every parameter the package exposes, but I have not noticed any improvement.

A successful application of CTGAN would be a great help to my study, and I would sincerely appreciate it if anyone could help me generate high-quality synthetic data with the ML utility described above. Please forgive my limited coding skills.

One more thing: my colleague successfully used TGAN to generate synthetic data on a similar dataset. However, he used a GPU in Colab and it took hours to get results, whereas with CTGAN I get results on a CPU in minutes with epochs = 300 or 500. Is something wrong here? Sorry, I am not familiar with deep learning; I just followed the manual.

Thanks in advance!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
----------------------------------------------------------------------------------------------------------
!pip install sdv  # shell command; the leading '!' is needed in a Colab cell
----------------------------------------------------------------------------------------------------------
# read in our training data
try:
  data = pd.read_excel('159trainingCTGAN.xlsx')
except Exception as e:
  print('reading failed')
  print(e)
----------------------------------------------------------------------------------------------------------
from sdv.tabular import CTGAN
from sdv.tabular import TVAE
from sdv.tabular import CopulaGAN
----------------------------------------------------------------------------------------------------------
model = CTGAN(embedding_dim = 128,
        generator_dim = (256, 256),
        discriminator_dim = (256, 256),
        generator_lr = 2e-4,
        generator_decay = 1e-6,
        discriminator_lr = 2e-4,
        discriminator_decay = 1e-6,
        batch_size = 10,
        discriminator_steps = 1,
        verbose = True,
        epochs = 300,
        pac = 10)
model.fit(data)
----------------------------------------------------------------------------------------------------------
fake_data = model.sample(num_rows=300, randomize_samples=True)
fake_data.head(10)
----------------------------------------------------------------------------------------------------------
# read in our test data
try:
  test_data = pd.read_excel('159testCTGAN.xlsx')
except Exception as e:
  print('reading failed')
  print(e)
----------------------------------------------------------------------------------------------------------
# use GPR to train on 111 and test on 48
X_training = data[['sy','su/sy','D/t','t','a/t','2c','Cv']]
y_training = data[['Normerr']]
X_test = test_data[['sy','su/sy','D/t','t','a/t','2c','Cv']]
y_test = test_data[['Normerr']]

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process.kernels import WhiteKernel
from sklearn.metrics import r2_score

kernel = WhiteKernel(noise_level=1e-15) + 1**2 * RBF(length_scale=[1, 1, 1, 1, 1, 1, 1])
gp = GaussianProcessRegressor(kernel=kernel,random_state=0)
gp.fit(X_training, y_training)
print(gp.kernel_)
print(gp.log_marginal_likelihood_value_)

# summarize results
pred_test = gp.predict(X_test)
# print(pred_test)
testovsp = []
for i in range(len(y_test)):
  testovsp.append(y_test.values[i]/pred_test[i])
# print(testovsp)
testovspmean = np.mean(testovsp)
# print('Test Mean of Observed over predicted is: ' + str(testovspmean))
testovspstd = np.std(testovsp)
# print('Test Std of Observed over predicted is: ' + str(testovspstd))
# print('Test COV of Observed over predicted is: ' + str(testovspstd/testovspmean))
print('Test score of Observed over predicted is: ' + str(r2_score(y_test.values,pred_test)))
plt.scatter(pred_test,y_test.values)
plt.xlim((-0.5,0.5))
plt.ylim((-0.5,0.5))
plt.xlabel('Predicted Burst Capacity')
plt.ylabel('Observed Burst Capacity')
plt.show()

pred_training = gp.predict(X_training)
# print(pred_training)
trainingovsp = []
for i in range(len(y_training)):
  trainingovsp.append(y_training.values[i]/pred_training[i])
# print(trainingovsp)
trainingovspmean = np.mean(trainingovsp)
# print('Training Mean of Observed over predicted is: ' + str(trainingovspmean))
trainingovspstd = np.std(trainingovsp)
# print('Training Std of Observed over predicted is: ' + str(trainingovspstd))
# print('Training COV of Observed over predicted is: ' + str(trainingovspstd/trainingovspmean))
print('Training score of Observed over predicted is: ' + str(r2_score(y_training.values,pred_training)))
plt.scatter(pred_training,y_training.values)
plt.xlim((-0.5,0.5))
plt.ylim((-0.5,0.5))
plt.xlabel('Predicted Burst Capacity')
plt.ylabel('Observed Burst Capacity')
plt.show()
----------------------------------------------------------------------------------------------------------
# train on 111 and test on generated fake 300

X_training = data[['sy','su/sy','D/t','t','a/t','2c','Cv']]
y_training = data[['Normerr']]
X_test = fake_data[['sy','su/sy','D/t','t','a/t','2c','Cv']]
y_test = fake_data[['Normerr']]

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process.kernels import WhiteKernel
from sklearn.metrics import r2_score

kernel = WhiteKernel(noise_level=1e-15) + 1**2 * RBF(length_scale=[1, 1, 1, 1, 1, 1, 1])
gp = GaussianProcessRegressor(kernel=kernel,random_state=0)
gp.fit(X_training, y_training)
print(gp.kernel_)
print(gp.log_marginal_likelihood_value_)

# summarize results
pred_test = gp.predict(X_test)
# print(pred_test)
testovsp = []
for i in range(len(y_test)):
  testovsp.append(y_test.values[i]/pred_test[i])
# print(testovsp)
testovspmean = np.mean(testovsp)
# print('Test Mean of Observed over predicted is: ' + str(testovspmean))
testovspstd = np.std(testovsp)
# print('Test Std of Observed over predicted is: ' + str(testovspstd))
# print('Test COV of Observed over predicted is: ' + str(testovspstd/testovspmean))
print('Test score of Observed over predicted is: ' + str(r2_score(y_test.values,pred_test)))
plt.scatter(pred_test,y_test.values)
plt.xlim((-0.5,0.5))
plt.ylim((-0.5,0.5))
plt.xlabel('Predicted Burst Capacity')
plt.ylabel('Observed Burst Capacity')
plt.show()

pred_training = gp.predict(X_training)
# print(pred_training)
trainingovsp = []
for i in range(len(y_training)):
  trainingovsp.append(y_training.values[i]/pred_training[i])
# print(trainingovsp)
trainingovspmean = np.mean(trainingovsp)
# print('Training Mean of Observed over predicted is: ' + str(trainingovspmean))
trainingovspstd = np.std(trainingovsp)
# print('Training Std of Observed over predicted is: ' + str(trainingovspstd))
# print('Training COV of Observed over predicted is: ' + str(trainingovspstd/trainingovspmean))
print('Training score of Observed over predicted is: ' + str(r2_score(y_training.values,pred_training)))
plt.scatter(pred_training,y_training.values)
plt.xlim((-0.5,0.5))
plt.ylim((-0.5,0.5))
plt.xlabel('Predicted Burst Capacity')
plt.ylabel('Observed Burst Capacity')
plt.show()
----------------------------------------------------------------------------------------------------------
# train on fake 300 and test on 48 using fixed hyperparameters of GPR

X_training = fake_data[['sy','su/sy','D/t','t','a/t','2c','Cv']]
y_training = fake_data[['Normerr']]
test_data = pd.read_excel('159testCTGAN.xlsx')
X_test = test_data[['sy','su/sy','D/t','t','a/t','2c','Cv']]
y_test = test_data[['Normerr']]

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process.kernels import WhiteKernel
from sklearn.metrics import r2_score

kernel = WhiteKernel(noise_level=0.000711, noise_level_bounds="fixed") + 0.121**2 * RBF(length_scale=[327, 0.184, 41.9, 1.53e+04, 0.311, 27.4, 13.1], length_scale_bounds= "fixed")
gp = GaussianProcessRegressor(kernel=kernel,random_state=0)
gp.fit(X_training, y_training)
print(gp.kernel_)
print(gp.log_marginal_likelihood_value_)

# summarize results
pred_test = gp.predict(X_test)
# print(pred_test)
testovsp = []
for i in range(len(y_test)):
  testovsp.append(y_test.values[i]/pred_test[i])
# print(testovsp)
testovspmean = np.mean(testovsp)
# print('Test Mean of Observed over predicted is: ' + str(testovspmean))
testovspstd = np.std(testovsp)
# print('Test Std of Observed over predicted is: ' + str(testovspstd))
# print('Test COV of Observed over predicted is: ' + str(testovspstd/testovspmean))
print('Test score of Observed over predicted is: ' + str(r2_score(y_test.values,pred_test)))
plt.scatter(pred_test,y_test.values)
plt.xlim((-0.5,0.5))
plt.ylim((-0.5,0.5))
plt.xlabel('Predicted Burst Capacity')
plt.ylabel('Observed Burst Capacity')
plt.show()

pred_training = gp.predict(X_training)
# print(pred_training)
trainingovsp = []
for i in range(len(y_training)):
  trainingovsp.append(y_training.values[i]/pred_training[i])
# print(trainingovsp)
trainingovspmean = np.mean(trainingovsp)
# print('Training Mean of Observed over predicted is: ' + str(trainingovspmean))
trainingovspstd = np.std(trainingovsp)
# print('Training Std of Observed over predicted is: ' + str(trainingovspstd))
# print('Training COV of Observed over predicted is: ' + str(trainingovspstd/trainingovspmean))
print('Training score of Observed over predicted is: ' + str(r2_score(y_training.values,pred_training)))
plt.scatter(pred_training,y_training.values)
plt.xlim((-0.5,0.5))
plt.ylim((-0.5,0.5))
plt.xlabel('Predicted Burst Capacity')
plt.ylabel('Observed Burst Capacity')
plt.show()

Attachments:
159testCTGAN.xlsx
159trainingCTGAN.xlsx
train111test48
train111testfake300
trainfake300test48

@Sun-Haotian added the labels pending review (This issue needs to be further reviewed, so work cannot be started) and question (General question about the software) on Jul 15, 2022
@npatki
Contributor

npatki commented Jul 25, 2022

Hi @Sun-Haotian, thanks for the detailed issue, code snippets and data.

The CTGAN model is designed & tested on enterprise datasets that contain records from real-world user behavior or natural events.

I’m curious how you are identifying that your dataset is “simple”. It contains several properties that may not be suitable for CTGAN:

  1. There are not that many rows.
  2. Judging by the names and descriptions, there seems to be a mathematical relationship between the columns.
  3. The rows don’t appear to be fully independent or naturally collected. For example, the first 9 rows of the training data contain exactly the same floating-point value in 5 of the 8 columns.

Was this dataset created using a formula or manual properties? What problem are you hoping to solve using the synthetic data?
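Point 3 above can be checked mechanically; a sketch, assuming the training DataFrame and the column names used earlier in this issue (the helper name and the toy frame are hypothetical):

```python
import pandas as pd

def constant_columns(df: pd.DataFrame, k: int = 9) -> list:
    """Columns whose first k rows all hold one identical value."""
    head = df.head(k)
    return [c for c in df.columns if head[c].nunique() == 1]

# Toy example (not the real data): two columns constant over the first 9 rows.
toy = pd.DataFrame({'sy': [450] * 9, 't': [8.0] * 9, 'Cv': range(9)})
print(constant_columns(toy))  # ['sy', 't']
```

Applied to the attached training file, this would list the 5 columns the comment describes.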

@npatki added the label under discussion (Issue is currently being discussed) and removed the label pending review (This issue needs to be further reviewed, so work cannot be started) on Jul 25, 2022
@Sun-Haotian
Author

Hi @npatki, thank you very much for your reply. I will answer your questions in detail and propose some other thoughts based on my observations since my last post.

> Was this dataset created using a formula or manual properties? What problem are you hoping to solve using the synthetic data?

Generally, this dataset is real-world civil engineering test data, so it is natural. I am trying to generate synthetic data to augment my dataset in order to build machine learning models. Therefore, I need the synthetic data to have similar statistical properties (i.e. CDF, PDF) and machine learning properties (e.g. a Gaussian process regression model trained on the real dataset performs well on the synthetic data; or two GPR models, trained respectively on the real and synthetic datasets, both work well on a separate test dataset).
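The "similar CDF" requirement can be made concrete with a per-column two-sample comparison. Below is a minimal numpy sketch of the Kolmogorov–Smirnov statistic, the maximum gap between the two empirical CDFs (`scipy.stats.ks_2samp` computes the same statistic plus a p-value); the function name and toy arrays are mine:

```python
import numpy as np

def ks_statistic(real: np.ndarray, synth: np.ndarray) -> float:
    """Max absolute difference between two empirical CDFs (0 = identical)."""
    grid = np.sort(np.concatenate([real, synth]))
    cdf_real = np.searchsorted(np.sort(real), grid, side='right') / len(real)
    cdf_synth = np.searchsorted(np.sort(synth), grid, side='right') / len(synth)
    return float(np.max(np.abs(cdf_real - cdf_synth)))

# Identical samples give 0; fully disjoint samples give 1.
print(ks_statistic(np.array([1., 2., 3.]), np.array([1., 2., 3.])))  # 0.0
print(ks_statistic(np.array([0., 1.]), np.array([10., 11.])))        # 1.0
```

Running this per column on `data` vs. `fake_data` would quantify how far the synthetic marginals drift from the real ones.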

> There are not that many rows.

This is because such civil engineering data is precious: carrying out these experiments is expensive.

> Judging by the names and descriptions, there seems to be a mathematical relationship between the columns.

I think the correlation between the different columns is not too high.

> The rows don’t appear to be fully independent or naturally collected. For example, the first 9 rows of the training data have the same exact floating point value for 5 out of the 8 columns.

Each row is collected from an independent civil engineering experiment. Some rows share the same floating-point values because people tend to use the same material or specimen properties across these experiments.

> I’m curious how you are identifying that your dataset is “simple”.

I think the dataset is simple because it is small, with only 7 continuous input features and 1 continuous output feature. My colleague applied TGAN, also developed by Xu et al. at MIT, to a dataset very similar to mine and got very good synthetic data. I understand CTGAN to be an improved version of TGAN that generally works better, so it should work well on my dataset too. However, I have not yet seen good machine learning performance on the synthetic data.

Another interesting observation: if I set both batch_size and epochs to very large values, like 50,000, I can get good synthetic data. However, by definition batch_size should not exceed the number of rows in the dataset. In TGAN, setting batch_size larger than the number of rows raises an error; in CTGAN it does not, and in fact it leads to good synthetic data. Would you please provide some instruction or explanation for this behavior?
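One plausible mechanism (an assumption about the implementation, not a confirmed answer): if each training batch is drawn by sampling row indices with replacement, a batch_size larger than the number of rows never errors; it simply repeats rows within a batch. A minimal numpy sketch of the difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, batch_size = 111, 50_000  # dataset size vs. oversized batch

# Sampling indices WITH replacement succeeds even when batch_size > n_rows:
idx = rng.integers(0, n_rows, size=batch_size)
print(idx.shape, int(idx.max()) < n_rows)  # (50000,) True

# Sampling WITHOUT replacement fails for the same sizes, which would explain
# the error a stricter implementation raises:
try:
    rng.choice(n_rows, size=batch_size, replace=False)
except ValueError as e:
    print('without replacement:', e)
```

Whether CTGAN actually samples this way would need to be confirmed against its source.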

Thank you again for your reply and attention to my question. I sincerely look forward to discussing this with you further.
