
Conversation

@richardwu
Collaborator

Latest changes include:

  • New featurizer EmbeddingFeaturizer: supports mixed-value repair and can be used to replace OccurAttrFeaturizer.
  • Quantization to generate better domains on mixed datasets (hc.quantize_numericals); see the usage sketch below.
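
For context, a minimal pipeline sketch of how the two additions might be used together. This is not the confirmed API: the numerical_attrs argument to load_data, the (bins, attribute-group) format passed to hc.quantize_numericals, and EmbeddingFeaturizer's default constructor are assumptions inferred from this PR; the rest follows the repo's existing example pipeline.

    import holoclean
    from detect import NullDetector, ViolationDetector
    from repair.featurize import InitAttrFeaturizer, FreqFeaturizer, ConstraintFeaturizer, EmbeddingFeaturizer

    hc = holoclean.HoloClean(db_name='holo', epochs=10, verbose=True).session
    # numerical_attrs is assumed to mark which columns hold numerical values
    hc.load_data('hospital', 'data/hospital.csv', numerical_attrs=['Score'])
    hc.load_dcs('data/hospital_constraints.txt')
    hc.ds.set_constraints(hc.get_dcs())

    # Quantize numerical attributes into a small number of bins so that domain
    # generation works on mixed (categorical + numerical) data.
    hc.quantize_numericals([(10, ['Score'])])  # assumed format: (num_bins, [attribute group])

    hc.detect_errors([NullDetector(), ViolationDetector()])
    hc.setup_domain()

    # EmbeddingFeaturizer (backed by the TupleEmbedding model) stands in for OccurAttrFeaturizer.
    hc.repair_errors([InitAttrFeaturizer(), FreqFeaturizer(), ConstraintFeaturizer(), EmbeddingFeaturizer()])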

@richardwu richardwu requested a review from thodrek June 22, 2019 19:46
richardwu and others added 29 commits June 22, 2019 12:47

@thodrek thodrek left a comment

Minor changes; will continue tomorrow.

@@ -1,6 +1,5 @@
language: python
python:
- "2.7"

Are we sure that this does not break other dependencies?

from .dataset import Dataset
from .dataset import AuxTables
from .dataset import CellStatus
from .dataset import Source

What is a source?

:param src_col: (str) if not None, for fusion tasks
specifies the column containing the source for each "mention" of an
entity.
:param exclude_attr_cols:

What are the types of these inputs? Format them appropriately.

specifies the column containing the source for each "mention" of an
entity.
:param exclude_attr_cols:
:param numerical_attrs:

Same as above.
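
For example, these could follow the format of src_col above; the type annotations and descriptions below are guesses from the parameter names, not the actual documentation:

    :param exclude_attr_cols: (list[str]) columns to load but exclude from repair
        (e.g. index or ID columns).
    :param numerical_attrs: (list[str]) columns whose values should be treated as
        numerical rather than categorical.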

:return: the data after quantization in pandas.DataFrame
"""
if self.quantized_data is None:
raise Exception('ERROR No dataset quantized')

Fix the message. This is not proper English.
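
For example, one possible rewording (just a sketch; whether the caller is expected to run quantization first is an assumption):

    if self.quantized_data is None:
        raise Exception('quantized data is not available: quantization has not been run on this dataset')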

If infer_mode = 'dk', these attributes correspond only to attributes that contain at least
one potentially erroneous cell. Otherwise all attributes are returned.
If applicable, in the provided :param:`train_attrs` variable.

What does this second comment mean?

from utils import NULL_REPR


def quantize_km(env, df_raw, num_attr_groups_bins):

The name is not informative. Switch to kmeans (it is not long). Also, I would expect to specify "k" here. Do the bins refer to clusters? If so, I would rename bins to clusters to follow common convention.
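
For reference, a minimal sketch of this kind of k-means quantization, where each group of numerical attributes is clustered into k "bins" and every value is replaced by its cluster centroid. The function and parameter names are illustrative, not the actual implementation; the real quantize_km also has to handle NULL_REPR placeholders, which is omitted here.

    from sklearn.cluster import KMeans

    def quantize_kmeans(df_raw, num_attr_groups_bins):
        """Replace each numerical value (vector) with its k-means cluster centroid.

        num_attr_groups_bins: list of (k, [attrs]) pairs, where k is the number of
        clusters ("bins") for that group of numerical attributes.
        """
        df = df_raw.copy()
        for k, attrs in num_attr_groups_bins:
            vals = df[attrs].astype(float).values
            km = KMeans(n_clusters=k).fit(vals)
            # Map every row to the centroid of its assigned cluster.
            df[attrs] = km.cluster_centers_[km.labels_]
        return df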

minafarid and others added 2 commits September 26, 2019 17:10
* Quantization and handling of numerical/mixed data.
* Relocated test data into subdirectories.
* Move active attributes to right after error detection and inside Dataset. Move correlations to separate module.
* Refactor domain generation sort domain by co-occurrence probability and
also domain generation for tuple embedding model.
* Make co-occurrence featurizer only generate co-occurrence features for
active attributes. Refactored domain to run estimator separately from
domain generation.
* Implemented TupleEmbedding model as an estimator.
* Always load clean/ground truth as strings since we load/store raw data as strings.
* Added featurizer for learned embeddings from TupleEmbedding model.
* Support multiple layers during repair and made TupleEmbedding dump/load more sophisticated.
* Improved validation logging and fixed a few bugs.
* Improve validation in TupleEmbedding using pandas dataframes.
* Support multi-dimensional quantization.
* Quantize from dict rather than numerical attrs.
* Mean/var normalize numerical attributes in context and added non-linearity to numerical spans.
* Support specifying n-dimensional numerical attr groups vs splitting on columns.
* Fixed None numerical_attr_groups.
* Fixed report RMS error and converting to floats for quantization.
* Added store_to_fb flag to load_data, added LR schedule to TupleEmbedding, added multiple ground truth in evaluation, changed EmbeddingFeat to return probability instead of embedding vectors.
* Pre-split domain and ground truth values.
* Fixed batch size argument in EmbeddingFeaturizer.
* Removed numerical_attrs reference from Table.
* Fix to how multi-ground truth is handled. Use simplified numerical regression TupleEmbedding with nonlinearity.
* Max domain size need only be as large as the largest categorical attribute domain.
* Remove domain for numerical attributes in TupleEmbedding.
* Fixed some reference issues and added infer all mode.
* Fixed _nan_ replacement, max_cat_domain being possibly nan, and evaluation for sample accuracy.
* Do not weak label clean cells and fixed raw data in Logistic estimator.
* Added ReLU after context for numerical targets in TupleEmbedding and refactored EmbeddingFeat to support numerical feature (RMSE) from TupleEmbedding.
* Use cosine annealing with restart LR schedule and use weak_label instead of init.
* Fixed memory issues with get_features and predict_pp_batch.
* Fixed bug in get_features.
* Added comment to EmbeddingFeat.
* Finally fixed memory issues with torch.no_grad.
* ConstraintFeaturizer runs on un-quantized values.
* Do not drop single value cells (for evaluation).
* Do not generate queries/features for DCs that do not pertain to attributes we are training on.
* Fixed ConstraintFeaturizer to handle no DCs.
* Removed deprecated code and added dropout.
* Fixed calculation of num_batches in learning loop.
* Do not drop null-init cells with domain length <= 1.
* Fixed z-scoring with 0 std and deleting e-notation numerical values (see the z-scoring sketch after this list).
* Do not quantize if bins > unique.
* Fixed some things in domain.
* Added repair w/ validation set and removed multiple correct values in evaluation.
* Fixed domain generation to include single value cells in domain.
* Handle untrained context values properly and added code for domain co-occurrence in tupleembedding.
* Regression fix for moving raw_data_dict before z-normalization and removed code references to domain_cooccur (for the most part).
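
To illustrate the z-scoring fix mentioned above, a minimal sketch of mean/variance normalization that guards against a zero standard deviation (constant columns); the function name is illustrative, not the actual code.

    import numpy as np

    def z_normalize(values):
        # Z-score a numerical column; if the std is 0 (constant column),
        # return zeros instead of dividing by zero.
        values = np.asarray(values, dtype=float)
        mean, std = values.mean(), values.std()
        if std == 0:
            return np.zeros_like(values)
        return (values - mean) / std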