Adding a structured dataset

Rick Wierenga edited this page Feb 4, 2020 · 1 revision

Adding an S5TF dataset

With CSVDataLoader, adding a structured dataset to S5TF through a pull request is straightforward.

This document shows how to add a structured dataset to S5TF in five steps:

  1. Create a new file
  2. Write a header with a title, description, and BibTeX citation
  3. Copy and paste the boilerplate code
  4. Add general information about your dataset
  5. Fill in the column names

1. Create a new file

$ touch Sources/Datasets/datasets/structured/AdultIncome.swift

2. Write a header with a title, description, and BibTeX citation

// The Adult Data Set. // TODO: title
//
// Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records
// was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)) // TODO: description
//
// Prediction task is to determine whether a person makes over 50K a year. 
// 
// BibTeX citation: // TODO: BibTeX
// @misc{Dua:2019,
//     author = "Dua, Dheeru and Graff, Casey",
//     year = "2017",
//     title = "{UCI} Machine Learning Repository",
//     url = "http://archive.ics.uci.edu/ml",
//     institution = "University of California, Irvine, School of Information and Computer Sciences"
// }

3. Copy and paste the boilerplate code

Use S5TFCategoricalBatch for classification and S5TFNumericalBatch for regression problems.

import Foundation
import S5TF
import TensorFlow

public struct AdultDataSet: S5TFDataset {
    typealias DataLoader = CSVDataLoader
    public static var train: CSVDataLoader<S5TFCategoricalBatch> { // TODO: S5TFCategoricalBatch/S5TFNumericalBatch
        guard let localURL = Downloader.download(
            fileAt: URL(string: "TODO://add.a.url.com")!,
            cacheName: "cacheName", // the name of the dataset
            fileName: "descriptive_filename.csv"
        ) else {
            fatalError("File not downloaded correctly.")
        }

        return CSVDataLoader<S5TFCategoricalBatch>(
            fromFileAt: localURL.absoluteString,
            columnNames: // TODO(5) ,
            inputColumnNames: // TODO(5) ,
            outputColumnNames: // TODO(5)
        )
    }

    public static let info = yourInfo

    private init() {}
}

// swiftlint:disable:next private_over_fileprivate
fileprivate let yourInfo = S5TFDatasetInfo(
    name: // TODO(4) ,
    version: // TODO(4) ,
    description: """
    TODO(4)
    """,
    homepage: // TODO(4) ,
    numberOfTrainExamples: // TODO(4) ,
    numberOfValidExamples: // TODO(4) ,
    numberOfTestExamples: // TODO(4) ,
    numberOfFeatures: // TODO(4)
)
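
The boilerplate passes localURL.absoluteString to the loader. This page does not say whether CSVDataLoader expects a URL string or a plain file system path, so the following sketch only shows how the two Foundation properties differ for a file URL. The path /tmp/adult.csv is a made-up example, not something Downloader is known to produce.

```swift
import Foundation

// Hypothetical local path, for illustration only.
let localURL = URL(fileURLWithPath: "/tmp/adult.csv")

// absoluteString keeps the URL scheme.
print(localURL.absoluteString) // file:///tmp/adult.csv

// path is the plain file system path without the scheme.
print(localURL.path)           // /tmp/adult.csv
```

If loading fails with a "file not found" style error, checking which of the two forms the loader receives is a reasonable first step.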

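4. Add general information about your dataset

Replace each TODO(4) with general information about your dataset. If you rename yourInfo, as the example below does with adultIncomeInfo, update the reference in the struct's info property to match.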

fileprivate let adultIncomeInfo = S5TFDatasetInfo(
    name: "Adult Data Set",
    version: "0.0.1",
    description: """
    Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records
    was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
    """,
    homepage: URL(string: "http://archive.ics.uci.edu/ml")!,
    numberOfTrainExamples: 48842,
    numberOfValidExamples: 0,
    numberOfTestExamples: 0,
    numberOfFeatures: 15
)
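
If the dataset page does not list the example and column counts, one option is to count them in the downloaded file. The sketch below is a minimal, Foundation-only way to do that. csvShape is a hypothetical helper, not part of S5TF, and the sample rows are made up for illustration. Note that the example above sets numberOfFeatures to 15, which equals the total number of columns listed in step 5, including the income column.

```swift
import Foundation

/// Hypothetical helper (not part of S5TF): counts data rows and columns in CSV text.
func csvShape(of text: String, hasHeader: Bool) -> (rows: Int, columns: Int) {
    // Split into lines and drop blank ones, such as a trailing empty line.
    let lines = text
        .split(whereSeparator: { $0.isNewline })
        .filter { !$0.trimmingCharacters(in: .whitespaces).isEmpty }
    guard let first = lines.first else { return (0, 0) }
    // Naive split on commas; quoted fields that contain commas are not handled.
    let columns = first.split(separator: ",", omittingEmptySubsequences: false).count
    // A header row is not a data row, so leave it out of the count.
    let rows = hasHeader ? lines.count - 1 : lines.count
    return (rows, columns)
}

// Made-up sample rows, for illustration only.
let sample = """
39, State-gov, 13, <=50K
50, Self-emp, 13, <=50K
38, Private, 9, >50K

"""

let shape = csvShape(of: sample, hasHeader: false)
print("rows: \(shape.rows), columns: \(shape.columns)") // rows: 3, columns: 4
```

For a real file, the text could be read with String(contentsOf:encoding:) before calling the helper.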

5. Fill in the column names

You can find these on the dataset page.

If your CSV file contains column names in the first row, you can pass nil to columnNames.
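
To decide whether the first row holds column names, it can help to inspect it programmatically. The following is a rough heuristic sketch, not an S5TF feature: firstRowLooksLikeHeader is a hypothetical helper that assumes a header row contains no numeric fields, and the sample text is made up.

```swift
import Foundation

/// Hypothetical helper (not part of S5TF): guesses whether the first CSV row is a header.
/// Heuristic: a header row usually contains no numeric fields.
func firstRowLooksLikeHeader(_ text: String) -> Bool {
    guard let firstLine = text.split(whereSeparator: { $0.isNewline }).first else {
        return false // Empty input has no header.
    }
    // Naive split on commas; quoted fields that contain commas are not handled.
    let fields = firstLine
        .split(separator: ",", omittingEmptySubsequences: false)
        .map { $0.trimmingCharacters(in: .whitespaces) }
    // Double(String) returns nil for text that is not a number.
    return !fields.contains { Double($0) != nil }
}

// Made-up samples, for illustration only.
let withHeader = "age,workclass,income\n39,State-gov,<=50K\n"
let withoutHeader = "39,State-gov,<=50K\n50,Private,>50K\n"

print(firstRowLooksLikeHeader(withHeader))    // true
print(firstRowLooksLikeHeader(withoutHeader)) // false
```

The heuristic fails for datasets whose first data row is entirely non-numeric, so a manual look at the file remains the reliable check.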

return CSVDataLoader<S5TFCategoricalBatch>(
    fromFileAt: localURL.absoluteString,
    columnNames: ["age",
                  "workclass",
                  "fnlwgt",
                  "education",
                  "education-num",
                  "marital-status",
                  "occupation",
                  "relationship",
                  "race",
                  "sex",
                  "capital-gain",
                  "capital-loss",
                  "hours-per-week",
                  "native-country",
                  "income"],
    inputColumnNames: ["age",
                       "workclass",
                       "fnlwgt",
                       "education",
                       "education-num",
                       "marital-status",
                       "occupation",
                       "relationship",
                       "race",
                       "sex",
                       "capital-gain",
                       "capital-loss",
                       "hours-per-week",
                       "native-country"],
    outputColumnNames: ["income"]
)
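
Because inputColumnNames repeats most of columnNames, a typo in one list can go unnoticed. As a sketch of one way to avoid the duplication, the input list can be derived from the full list. inputColumns is a hypothetical helper, not part of S5TF; the column names are the ones used above.

```swift
/// Hypothetical helper (not part of S5TF): every column that is not an output becomes an input.
func inputColumns(from all: [String], excluding outputs: [String]) -> [String] {
    // Fail early if an output name is misspelled or missing from the full list.
    precondition(outputs.allSatisfy { all.contains($0) }, "Unknown output column")
    return all.filter { !outputs.contains($0) }
}

let columnNames = ["age", "workclass", "fnlwgt", "education", "education-num",
                   "marital-status", "occupation", "relationship", "race", "sex",
                   "capital-gain", "capital-loss", "hours-per-week", "native-country",
                   "income"]
let outputColumnNames = ["income"]

let inputColumnNames = inputColumns(from: columnNames, excluding: outputColumnNames)
print(inputColumnNames.count) // 14
print(inputColumnNames.last!) // native-country
```

The resulting arrays could then be passed to the CSVDataLoader initializer shown above in place of the literal lists.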