Description
It took me ages to figure out why, but I finally pinpointed some extremely weird behavior.
Here's the simplest code that reproduces the issue:
$dataset = \Rubix\ML\Datasets\Labeled::build([
[5.1, 3.5, 1.4, 0.2],
[4.9, 3, 1.4, 0.2]
], ['setosa', 'variosa'])->apply(new \Rubix\ML\Transformers\NumericStringConverter());
$transformer = new \Rubix\ML\Transformers\TruncatedSVD(2);
$dataset->apply($transformer);
var_dump($dataset);
Output
object(Rubix\ML\Datasets\Labeled)#2 (2) {
["labels":protected]=>
array(2) {
[0]=>
string(6) "setosa"
[1]=>
string(7) "variosa"
}
["samples":protected]=>
array(2) {
[0]=>
array(2) {
[0]=>
float(0)
[1]=>
float(0)
}
[1]=>
array(2) {
[0]=>
float(0)
[1]=>
float(0)
}
}
}
As you can see - it's all zeros.
Now, removing the NumericStringConverter:
$dataset = \Rubix\ML\Datasets\Labeled::build([
[5.1, 3.5, 1.4, 0.2],
[4.9, 3, 1.4, 0.2]
], ['setosa', 'variosa']);
$transformer = new \Rubix\ML\Transformers\TruncatedSVD(2);
$dataset->apply($transformer);
var_dump($dataset);
This gives the following output:
object(Rubix\ML\Datasets\Labeled)#2 (2) {
["labels":protected]=>
array(2) {
[0]=>
string(6) "setosa"
[1]=>
string(7) "variosa"
}
["samples":protected]=>
array(2) {
[0]=>
array(2) {
[0]=>
float(-6.3431263560806)
[1]=>
float(-0.1573150685585)
}
[1]=>
array(2) {
[0]=>
float(-5.9145190147327)
[1]=>
float(0.16871521675666)
}
}
}
It took me hours to figure out what was happening, because nothing in this code looks out of the ordinary. But I traced the issue to the following loop in NumericStringConverter:
protected function convertToNumber(array &$sample) : void
{
foreach ($sample as &$value) {
if (is_string($value)) {
if (is_numeric($value)) {
$value = (int) $value == $value
? (int) $value
: (float) $value;
continue;
}
This foreach loop, which iterates over the sample by reference (&$value), is the culprit! By replacing it with:
foreach ($sample as $key => $value) {
if (is_string($value)) {
if (is_numeric($value)) {
$sample[$key] = (int) $value == $value
? (int) $value
: (float) $value;
continue;
}
everything works as expected!
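For anyone who wants to try the by-key variant in isolation, here is a minimal, self-contained sketch. The function name convertNumericStrings is hypothetical and not part of Rubix ML; it only mirrors the conversion logic shown above, and it returns a new array instead of modifying the sample through a reference.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical standalone helper (not part of Rubix ML): converts numeric
 * strings in a sample to int or float, writing back by key instead of
 * iterating by reference.
 */
function convertNumericStrings(array $sample): array
{
    foreach ($sample as $key => $value) {
        if (is_string($value) && is_numeric($value)) {
            // Strings holding a whole number become int, all others float.
            $sample[$key] = (int) $value == $value
                ? (int) $value
                : (float) $value;
        }
    }

    return $sample;
}

// Numeric strings are converted; existing floats and non-numeric strings are left alone.
$converted = convertNumericStrings(['5.1', '3', 1.4, 'setosa']);
var_dump($converted);
```

Because no reference is ever taken on the array elements, the array that comes out holds plain values only, which is the property the replacement loop above relies on.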
This leads me to conclude that, for whatever reason, something happens internally that breaks the SVD process. The problem is that SVD is written as a C extension, and I honestly have no clue how to debug that :)
My question is: do you see this as a bug in NumericStringConverter or in the C extension? If it's the former, I'd be happy to submit a bugfix!