Online learning is important in machine learning because it lets a model incorporate new data samples without recalculating its parameters from all of the earlier data. The aim of this exercise is to explore this concept.
We will now look into online estimation of a mean vector. The objective is to apply the following formula for estimating a mean (see Bishop Section 2.3.5):

$$\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right)$$
Let's first create a data generator. Create a function `gen_data(n, k, mean, var)` which returns an $n \times k$ array in which each row is a sample from $N_k()$ parameterized by $\mu$, $\sigma$ and $I_k$, where:
- $N_k()$ is the $k$-variate normal distribution
- $\mu$ (or `mean`) is the mean vector of dimension $k$
- $\sigma$ (or `var`) is the variance
- $I_k$ is the identity matrix

You should use `np.random.multivariate_normal` for this.
Example inputs and outputs:
gen_data(2, 3, np.array([0, 1, -1]), 1.3)
[[-0.60340102 0.8046998 1.39181858]
[-0.46788591 0.73089018 0.01772348]]
gen_data(5, 1, np.array([0.5]), 0.5)
[[-0.42461036]
[ 0.45739507]
[ 0.25729006]
[-0.17926144]
[ 0.1403905 ]]
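A minimal sketch of such a generator is shown below. It assumes that `var` plays the role of $\sigma$, so the covariance is taken to be $\sigma^2 I_k$; if `var` is meant to be the variance itself, the square should be dropped. The shape check on `mean` is an extra not required by the task. Because the samples are random, your numbers will differ from the example outputs above.

```python
import numpy as np


def gen_data(n: int, k: int, mean: np.ndarray, var: float) -> np.ndarray:
    """Draw n samples from a k-variate normal with isotropic covariance."""
    mean = np.asarray(mean, dtype=float)
    if mean.shape != (k,):
        raise ValueError(f"mean must have shape ({k},), got {mean.shape}")
    # Assumption: `var` acts as sigma, so the covariance is sigma^2 * I_k.
    # If `var` is the variance itself, use `var * np.identity(k)` instead.
    cov = (var ** 2) * np.identity(k)
    # Returns an array of shape (n, k): one sample per row.
    return np.random.multivariate_normal(mean, cov, n)
```

Calling `gen_data(2, 3, np.array([0, 1, -1]), 1.3)` then returns an array of shape `(2, 3)`.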
Answer this question via Mimir
Let's create some data.
You can visualize your data using `tools.scatter_3d_data` to get a plot similar to the following:
You can also use `tools.bar_per_axis` to visualize the distribution of the data per dimension:
Do you expect the batch estimate to be exactly
We will now implement the sequential estimate.
We want a function that returns
Create a function `update_sequence_mean(mu, x, n)` which performs the update in the equation above.
Example inputs and outputs:
mean = np.mean(X, 0)
new_x = gen_data(1, 3, np.array([0, 0, 0]), 1)
update_sequence_mean(mean, new_x, X.shape[0])
Results in [[-0.21653761 -0.00721158 -0.15876203]]
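One way to write this update is sketched below. It assumes that `n` is the number of points already summarized by `mu`, as the call with `X.shape[0]` in the example suggests, so the new point is the $(n+1)$-th. The small array `X` is made-up data used only to show that the sequential update agrees with the batch mean.

```python
import numpy as np


def update_sequence_mean(mu: np.ndarray, x: np.ndarray, n: int) -> np.ndarray:
    """Fold one new point x into a mean mu that already summarizes n points."""
    # Assumption: n counts the points seen *before* x, so the new count is n + 1.
    return mu + (x - mu) / (n + 1)


# Made-up data: three 3-dimensional points.
X = np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 0.0, 1.0]])
mean = np.mean(X[:2], 0)  # batch mean of the first two points: [2, 3, 4]
# x has shape (1, 3), like the output of gen_data(1, 3, ...).
updated = update_sequence_mean(mean, X[2:3], 2)
# updated is [[3., 2., 3.]], the batch mean of all three points.
```

The result has shape `(1, 3)` because a `(3,)` mean is broadcast against a `(1, 3)` point, which matches the nested brackets in the example output above.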
Let's plot the estimates on all dimensions as the sequence estimate gets updated. You can use `_plot_sequence_estimate()` as a template. You should:
- Generate 100 3-dimensional points with the same mean and variance as above.
- Set the initial estimate as $(0, 0, 0)$.
- Perform `update_sequence_mean` for each point in the set.
- Collect the estimates as you go.
For a different set of points this plot looks like the following:
Turn in your plot as 1_5_1.png
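The steps for this plot could be sketched roughly as follows. The helper names `sequence_estimates` and `plot_estimates` are hypothetical and not part of the provided template. The data is placeholder data with an illustrative mean and unit covariance, not the assignment's values. As before, `n` is assumed to count the points already included in the estimate.

```python
import numpy as np


def update_sequence_mean(mu, x, n):
    # Assumption: n counts the points already included in mu.
    return mu + (x - mu) / (n + 1)


def sequence_estimates(X: np.ndarray) -> np.ndarray:
    """Return the running mean estimate after each point (one row per update)."""
    mu = np.zeros(X.shape[1])  # initial estimate (0, ..., 0)
    estimates = []
    for i in range(X.shape[0]):
        mu = update_sequence_mean(mu, X[i], i)
        estimates.append(mu)
    return np.array(estimates)


def plot_estimates(estimates: np.ndarray) -> None:
    """Plot one line per dimension (requires matplotlib)."""
    import matplotlib.pyplot as plt

    for d in range(estimates.shape[1]):
        plt.plot(estimates[:, d], label=f"dimension {d + 1}")
    plt.legend(loc="upper center")
    plt.show()


# Placeholder data: the mean and covariance here are illustrative only.
np.random.seed(1234)
X = np.random.multivariate_normal(np.array([0.0, 1.0, -1.0]), np.identity(3), 100)
estimates = sequence_estimates(X)
# plot_estimates(estimates)  # uncomment to draw the figure
```

With this convention the first update replaces the initial estimate with the first point, and the last row of `estimates` equals the batch mean of `X`.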
Let's now plot the squared error between the estimate and the actual mean after every update.
The squared error between, e.g., a ground truth $y$ and a prediction $\hat{y}$ is $(y - \hat{y})^2$.
Of course, our data is 3-dimensional, so the squared error will also be 3-dimensional. Take the mean of those three values to get the average error across all three dimensions and plot those values.
You can use `_plot_square_error` and `_square_error` for this.
For a different distribution this plot looks like the following:
Turn in your plot as 1_6_1.png
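The error computation could look roughly like the sketch below. The function names `square_error` and `mean_square_errors` are hypothetical and chosen so as not to imply the signatures of the template helpers. The two estimates are made-up values for illustration.

```python
import numpy as np


def square_error(y: np.ndarray, y_hat: np.ndarray) -> np.ndarray:
    """Element-wise squared error between a ground truth and a prediction."""
    return np.power(np.asarray(y) - np.asarray(y_hat), 2)


def mean_square_errors(true_mean: np.ndarray, estimates: np.ndarray) -> np.ndarray:
    """Average the squared error over dimensions: one value per update."""
    return square_error(true_mean, estimates).mean(axis=1)


# Two made-up estimates of a true mean (0, 1, -1).
true_mean = np.array([0.0, 1.0, -1.0])
estimates = np.array([[1.0, 1.0, -1.0], [0.0, 1.0, 2.0]])
errors = mean_square_errors(true_mean, estimates)
# errors is [1/3, 3.0]: squared errors (1, 0, 0) and (0, 0, 9), averaged per row.
```

Plotting `errors` against the update index then gives the error curve for the sequence.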
What happens if the mean value changes (perhaps slowly) with time? What if
Create this type of data and formulate a method for tracking the mean.
Plot the estimate of all dimensions and the mean squared error over all three dimensions. Turn in these plots as bonus_1.png and bonus_2.png.
Write a short summary of how your method works.
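As one possible starting point, not the expected answer, the sketch below replaces the shrinking $1/N$ step with a constant step size `alpha`. Old points are then gradually forgotten and the estimate can follow a moving mean. The function name `update_tracking_mean`, the value of `alpha`, and the linearly drifting data are all assumptions made for illustration.

```python
import numpy as np


def update_tracking_mean(mu: np.ndarray, x: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Move the estimate a fixed fraction alpha of the way towards x."""
    return mu + alpha * (x - mu)


# Hypothetical drifting data: the true mean moves linearly
# from (0, 1, -1) to (1, 2, 0) over 500 points.
np.random.seed(1234)
n_points = 500
start = np.array([0.0, 1.0, -1.0])
true_means = start + np.linspace(0.0, 1.0, n_points)[:, None]
X = true_means + 0.1 * np.random.randn(n_points, 3)

mu = np.zeros(3)  # initial estimate
estimates = []
for x in X:
    mu = update_tracking_mean(mu, x)
    estimates.append(mu)
estimates = np.array(estimates)
```

A larger `alpha` follows changes faster but gives a noisier estimate, while a smaller `alpha` is smoother but lags further behind the true mean.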