
Weird issue with TruncatedSVD and NumericStringConverter #226

Open
@MihailoJoksimovic

So it took me ages to figure out the WHY, but I finally pinpointed some extremely weird behavior.

Namely, here's the simplest code that reproduces the issue:

$dataset = \Rubix\ML\Datasets\Labeled::build([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3, 1.4, 0.2]
], ['setosa', 'variosa'])->apply(new \Rubix\ML\Transformers\NumericStringConverter());

$transformer = new \Rubix\ML\Transformers\TruncatedSVD(2);

$dataset->apply($transformer);

var_dump($dataset);

Output

object(Rubix\ML\Datasets\Labeled)#2 (2) {
  ["labels":protected]=>
  array(2) {
    [0]=>
    string(6) "setosa"
    [1]=>
    string(7) "variosa"
  }
  ["samples":protected]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      float(0)
      [1]=>
      float(0)
    }
    [1]=>
    array(2) {
      [0]=>
      float(0)
      [1]=>
      float(0)
    }
  }
}

As you can see - it's all zeros.

Now, removing the NumericStringConverter:

$dataset = \Rubix\ML\Datasets\Labeled::build([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3, 1.4, 0.2]
], ['setosa', 'variosa']);

$transformer = new \Rubix\ML\Transformers\TruncatedSVD(2);

$dataset->apply($transformer);

var_dump($dataset);

gives the following output:

object(Rubix\ML\Datasets\Labeled)#2 (2) {
  ["labels":protected]=>
  array(2) {
    [0]=>
    string(6) "setosa"
    [1]=>
    string(7) "variosa"
  }
  ["samples":protected]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      float(-6.3431263560806)
      [1]=>
      float(-0.1573150685585)
    }
    [1]=>
    array(2) {
      [0]=>
      float(-5.9145190147327)
      [1]=>
      float(0.16871521675666)
    }
  }
}

It took me hours to figure out what was happening, because nothing in the code looks unusual. But I traced the issue to the following loop in NumericStringConverter:

    protected function convertToNumber(array &$sample) : void
    {
        foreach ($sample as &$value) {
            if (is_string($value)) {
                if (is_numeric($value)) {
                    $value = (int) $value == $value
                        ? (int) $value
                        : (float) $value;

                    continue;
                }

This foreach loop, which iterates over $sample with $value bound by reference, is the culprit! After replacing it with:

        foreach ($sample as $key => $value) {
            if (is_string($value)) {
                if (is_numeric($value)) {
                    $sample[$key] = (int) $value == $value
                        ? (int) $value
                        : (float) $value;

                    continue;
                }

everything works as expected.
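For context, here is a minimal standalone sketch of the documented PHP behavior behind by-reference foreach loops: the loop variable stays bound to the last array element after the loop ends, and that binding survives an array copy. It does not use Rubix ML at all, and the variable names are only illustrative. Whether this lingering reference is what trips up the SVD step is an assumption on my part, not something confirmed here.

```php
<?php

$sample = ['5.1', '3'];

foreach ($sample as &$value) {
    $value = (float) $value;
}
// No unset($value) here, so $value is still bound to $sample[1].

$copy = $sample; // references inside an array are preserved when it is copied
$value = 99.0;   // writes through the lingering reference

var_dump($sample[1] === 99.0); // the original array was modified
var_dump($copy[1] === 99.0);   // and so was the "copy"

// Breaking the reference right after the loop avoids the surprise.
$fixed = ['5.1', '3'];

foreach ($fixed as &$v) {
    $v = (float) $v;
}
unset($v);

$fixedCopy = $fixed;
$v = 99.0; // $v is now an ordinary, unrelated variable

var_dump($fixed[1] === 3.0);
var_dump($fixedCopy[1] === 3.0);
```

Both the key-based rewrite above and an unset() after the loop avoid leaving a live reference behind; the key-based version never creates one in the first place.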

This leads me to the conclusion that, for some reason, the by-reference loop leaves the samples in a state that breaks the SVD step. The problem is that the SVD is implemented in a C extension, and I honestly have no clue how to debug that :)
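Since stepping through the extension is hard, one thing that can be checked from userland is whether any sample element is still a live reference. The sketch below uses ReflectionReference::fromArrayElement(), which is available from PHP 7.4; the helper name findReferencedKeys is hypothetical and not part of Rubix ML. It only reports elements that something else still holds a reference to, so it is a rough diagnostic rather than a full picture of what the extension sees.

```php
<?php

/**
 * Hypothetical helper: return the keys of $sample whose elements are
 * live references (i.e. still bound to another variable).
 */
function findReferencedKeys(array $sample): array
{
    $keys = [];

    foreach (array_keys($sample) as $key) {
        // fromArrayElement() returns null when the element is not a reference.
        if (ReflectionReference::fromArrayElement($sample, $key) !== null) {
            $keys[] = $key;
        }
    }

    return $keys;
}

$sample = ['5.1', '3.5', '1.4'];

foreach ($sample as &$value) {
    $value = (float) $value;
}

$before = findReferencedKeys($sample); // $value is still bound to the last element

unset($value);

$after = findReferencedKeys($sample);  // the binding has been broken

var_dump($before, $after);
```

Running a check like this on the samples right before they are handed to TruncatedSVD would show whether a reference is still attached at that point.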

My question is: do you see this as a bug in NumericStringConverter or in the C extension? If it's the former, I'd be happy to submit a fix!

Labels: bug, help wanted
