Kernel Regression

A small collection of functions and classes for machine learning research on multidimensional datasets, based on kernel regression.

Why another regression system

Regression systems are useful tools for analyzing datasets that consist of inputs and targets. Some of these machine learning systems are based on a hypothesis, and their success is bound to the quality of that hypothesis. Often it is practical to define a specific hypothesis function with particular features, so that the dataset can be analyzed with respect to these features. In other cases the regression model should simply fit the dataset as well as possible, and the hypothesis only has to fit the data. An automatically generated hypothesis is suitable when no special features are needed.

What does my kernel_regression do

Kernel regression is a nonparametric regression concept that produces its own hypothesis. The given feature tuples {x, y} are used to generate the hypothesis. A kernel function K(u) weights the significance of the individual feature points. The hypothesis is calculated with the Nadaraya-Watson estimator m_i = sum_j(y_j Kh(u_ij)) / sum_j(Kh(u_ij)). The mean squared error (MSE) is implemented as the cost function.
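The estimator above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of this package: the function names, the Gaussian kernel form and the use of Euclidean distances divided by h for u_ij are assumptions.

```python
import numpy as np

def gaussian_kernel(u):
    """Unscaled Gaussian-shaped kernel; output lies in (0, 1]."""
    return np.exp(-0.5 * u ** 2)

def nadaraya_watson(x_query, x_feat, y_feat, h):
    """Estimate targets at x_query from the feature tuples (x_feat, y_feat).

    x_query: (m, d), x_feat: (n, d), y_feat: (n, t), h: scalar bandwidth.
    """
    # Pairwise Euclidean distances between query and feature points, scaled by h
    diff = x_query[:, None, :] - x_feat[None, :, :]
    u = np.linalg.norm(diff, axis=2) / h           # shape (m, n)
    w = gaussian_kernel(u)                         # kernel weights Kh(u_ij)
    # m_i = sum_j(y_j * Kh(u_ij)) / sum_j(Kh(u_ij))
    return (w @ y_feat) / w.sum(axis=1, keepdims=True)

def mse(y_true, y_est):
    """Mean squared error, the cost function named above."""
    return float(np.mean((y_true - y_est) ** 2))
```

With a very small bandwidth the estimate at a feature point approaches that point's own target; with a large bandwidth it approaches the mean of all targets.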

My kernel regression supports different modes for parameterizing the kernel function. It is possible to use one general bandwidth h for all feature points or to optimize a separate bandwidth for each. You can also choose between a scaled kernel function Kh(u) = 1/h*K(u) and an unscaled kernel function K(u). The common kernel functions are implemented (gaussian, cauchy, picard, uniform, triangle, cosinus and epanechnikov). Custom kernel functions can be added.
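The following sketch illustrates why the choice between scaled and unscaled kernels matters mainly when each feature point has its own bandwidth. It assumes a Gaussian-shaped kernel and u = (x - x_j)/h; the helper names are hypothetical and not part of this package.

```python
import numpy as np

def kernel(u):
    # Unscaled Gaussian-shaped kernel K(u), values in (0, 1]
    return np.exp(-0.5 * u ** 2)

def estimate(x0, x_feat, y_feat, h, scale_kernel):
    """Nadaraya-Watson estimate at scalar x0; h is a scalar or one value per feature point."""
    h = np.broadcast_to(np.asarray(h, dtype=float), x_feat.shape)
    w = kernel((x0 - x_feat) / h)
    if scale_kernel:
        w = w / h          # Kh(u) = 1/h * K(u)
    return float(np.sum(w * y_feat) / np.sum(w))

x_feat = np.array([0.0, 1.0, 2.0])
y_feat = np.array([0.0, 1.0, 4.0])

# One general bandwidth: the factor 1/h cancels in the ratio
same_scaled = estimate(0.5, x_feat, y_feat, 0.7, True)
same_unscaled = estimate(0.5, x_feat, y_feat, 0.7, False)

# One bandwidth per feature point: the factor no longer cancels
h_each = np.array([0.3, 0.7, 1.5])
multi_scaled = estimate(0.5, x_feat, y_feat, h_each, True)
multi_unscaled = estimate(0.5, x_feat, y_feat, h_each, False)
```

With a single bandwidth both variants give the same estimate. With per-point bandwidths the scaled kernel gives narrow kernels more weight, so the two estimates differ.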

How to use it

Create a Kernel Regression dataset

First you have to load your data into the KrDataSet class. This class prepares and manages the dataset: it splits, normalizes and reduces it. A method to save the data is also provided.

The dataset has to consist of matrices and vectors. The rows have to be the individual examples and the columns the features. Multiple features are allowed for both input and target data.
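As an illustration of this layout, the following snippet builds a synthetic dataset with two input features and two targets. The variable names and the generated values are purely hypothetical.

```python
import numpy as np

n_examples = 100
rng = np.random.default_rng(0)

# Inputs: one row per example, one column per input feature (here 2)
x = rng.uniform(-1.0, 1.0, size=(n_examples, 2))

# Targets: one row per example, one column per target (here 2)
y = np.column_stack([np.sin(np.pi * x[:, 0]), x[:, 0] * x[:, 1]])
```

Both arrays share the same number of rows, so row i of x and row i of y form one example.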

Splitting the data

The dataset has to be split into at least three subsets. These three subsets are needed for the three calculation steps:

Generating the feature data
Validating the regression parameter
Testing the learned behavior to estimate the learning success

The data is split when the class is initialized. The default setting is distribution = (60,20,20). This means that the first subset (features) contains 60%, the second subset (validation) 20% and the third subset (testing) also 20% of the tuples of the whole dataset. If you put four or more entries into this list, the method produces four or more subsets. The subsets can be named with the option nameString = ('feature', 'validate', 'test'). Accessing a subset by its name is currently not implemented, but is planned.

data = KrDataSet(inputData = x, targetData = y, distribution = (70, 10, 20), nameString = ('subset1', 'subset2', 'subset3'))
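A percentage-based split of this kind could look like the sketch below. It is only an illustration under assumptions: the function name is hypothetical, the rows are taken in their original order without shuffling, and the class may well split differently internally.

```python
import numpy as np

def split_by_percent(x, y, distribution=(60, 20, 20)):
    """Split rows of x and y into consecutive subsets sized by percentage."""
    if sum(distribution) != 100:
        raise ValueError("distribution must sum to 100")
    n = x.shape[0]
    # Cumulative boundaries, e.g. (60, 20, 20) with n = 10 -> [6, 8]
    bounds = (np.cumsum(distribution)[:-1] * n) // 100
    return list(zip(np.split(x, bounds), np.split(y, bounds)))
```

Each returned pair holds the inputs and targets of one subset, in the order given by distribution. A distribution with four entries yields four pairs.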

Reducing the feature subset

The KrDataSet class implements a method to reduce the feature subset, which is useful for shortening the calculation time. Every row in the feature subset serves as one basis point of the solution, but not every feature is as useful as the others. The method .reduceFeature(N) filters the feature subset. The parameter N defines the degree of reduction. If N >= 1, it is the number of features that will be reduced. If 0 <= N < 1, it is the fraction of features that will be reduced.

data.reduceFeature(0.5)

Create a Kernel Regression model

The class KRModel provides the methods to calculate the kernel regression model. Several options are available for the calculation:

multiH: boolean; if true each feature gets its own optimized bandwidth, else one bandwidth for all
scaleKernel: boolean; if true the area below the kernel function is 1, else the output of the kernel function is between 0 and 1
kernel: string; defines the kernel function
- gaussian
- cauchy
- picard
- uniform
- triangle
- cosinus
- epanechnikov
  - epanechnikov1
  - epanechnikov2
  - epanechnikov3
powerList: list; extends the features with each entry as power
- Example 1: [1] => Kh(u) = f(u_ij)
- Example 2: [1, 2] => Kh(u) = f(u_ij + (u_ij)^2)
- Example 3: [0.5, 1, 3] => Kh(u) = f((u_ij)^0.5 + u_ij + (u_ij)^3)
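The three powerList examples can be reproduced with a small helper. This is a sketch under assumptions: the name expand_powers is hypothetical, and u is taken to be non-negative so that fractional powers are defined.

```python
def expand_powers(u, power_list=(1,)):
    """Return the sum of u raised to each entry of power_list."""
    return sum(u ** p for p in power_list)

# Example 1: [1]          -> u
# Example 2: [1, 2]       -> u + u^2
# Example 3: [0.5, 1, 3]  -> u^0.5 + u + u^3
value = expand_powers(4.0, [0.5, 1, 3])   # 2.0 + 4.0 + 64.0
```

The result is what the kernel function f would receive as its argument in the examples above.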

The default is:

options = {'multiH':False, 'scaleKernel':True, 'kernel':'gaussian', 'powerList':[1]}

The data has to be an instance of the class KrDataSet. It can be set at initialization:

model = KRModel(krData = data, options = {'multiH':True, 'scaleKernel':False, 'kernel':'gaussian', 'powerList':[1, 2]})

or set later:

model.setData(krData = data)

The options can also be changed:

model.setOptions(options = {'multiH':True, 'scaleKernel':True, 'kernel':'epanechnikov2', 'powerList':[1/2, 2]})

Learning

The method .learnModel runs the learning algorithm. It also provides the option to update the learning options.

model.learnModel(options = {'multiH':True, 'scaleKernel':False, 'kernel':'cauchy', 'powerList':[1]})
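What the learning step does conceptually can be sketched as a search for the bandwidth that minimizes the MSE on the validation subset. The grid search below is an assumption chosen for brevity; the optimizer actually used by .learnModel is not described here, and all names in the sketch are hypothetical.

```python
import numpy as np

def nw_estimate(x_query, x_feat, y_feat, h):
    """Nadaraya-Watson estimate with a Gaussian-shaped kernel (1-D inputs)."""
    u = (x_query[:, None] - x_feat[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    return (w @ y_feat) / w.sum(axis=1)

def learn_bandwidth(x_feat, y_feat, x_val, y_val, candidates):
    """Pick the bandwidth with the lowest MSE on the validation subset."""
    errors = [np.mean((y_val - nw_estimate(x_val, x_feat, y_feat, h)) ** 2)
              for h in candidates]
    best = int(np.argmin(errors))
    return candidates[best], float(errors[best])

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0 * np.pi, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

# First 60% as feature subset, next 20% as validation subset
x_feat, y_feat = x[:120], y[:120]
x_val, y_val = x[120:160], y[120:160]

h_best, mse_best = learn_bandwidth(x_feat, y_feat, x_val, y_val,
                                   candidates=[0.05, 0.1, 0.2, 0.5, 1.0, 2.0])
```

The feature subset supplies the basis points, while the validation subset is used only to score each candidate bandwidth. The test subset stays untouched until the final evaluation.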

Estimate values

To estimate values from the model, some information is needed. The main information is the set of points at which the model estimates the function. The point matrix has to be laid out like the input data. To handle normalized and unnormalized data, set the options normalInput=False and normalOutput=False. To choose the targets, use the option i_target = 'all'. Valid values for i_target are:

'all': all targets will be estimated
number: only this target, e.g. [1]
list: only the listed targets, e.g. [2,3]

estimatedTarget = model.estimateFunction(x = points, normalInput = False, normalOutput = True, i_target = 'all')

Validate model

Currently one value-based and one graphical method are implemented to validate the model. The method .goodnessOfFit calculates the goodness of fit based on the test subset. The option i_target = 'all' (same as for the method .estimateFunction) can be used to select the targets of interest. If more than one target is chosen, the returned goodness of fit represents the value over all chosen data.

r2 = model.goodnessOfFit(i_target = 1)
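The exact measure computed by .goodnessOfFit is not stated here. The variable name r2 in the call above suggests the coefficient of determination, so the following sketch assumes that measure; the function name is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_est):
    """Coefficient of determination over all given targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_est = np.asarray(y_est, dtype=float)
    ss_res = np.sum((y_true - y_est) ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```

A perfect estimate gives 1, and an estimate no better than the mean of the true targets gives 0.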

The graphical method .plotRegression plots the relation between the true and the estimated data. The goodness of fit is shown in the title. As options, the subsets dataset = 'all' and the targets i_target = 'all' (same as for the method .estimateFunction) can be chosen.

model.plotRegression(dataset = [2,3], i_target = 2)
