
Conversation

@richardwu
Collaborator

Latest changes include:

  • New featurizer EmbeddingFeaturizer: supports mixed-value repair and can be used to replace OccurAttrFeaturizer.
  • Quantization to generate better domains on mixed datasets (hc.quantize_numericals); see the usage sketch below.
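
For context, a minimal pipeline sketch of how the two additions might be used together. This is not the confirmed API: the numerical_attrs argument to load_data, the (bins, attribute-group) format passed to hc.quantize_numericals, and EmbeddingFeaturizer's default constructor are assumptions inferred from this PR; the rest follows the repo's existing example pipeline.

    import holoclean
    from detect import NullDetector, ViolationDetector
    from repair.featurize import InitAttrFeaturizer, FreqFeaturizer, ConstraintFeaturizer, EmbeddingFeaturizer

    hc = holoclean.HoloClean(db_name='holo', epochs=10, verbose=True).session
    # numerical_attrs is assumed to mark which columns hold numerical values
    hc.load_data('hospital', 'data/hospital.csv', numerical_attrs=['Score'])
    hc.load_dcs('data/hospital_constraints.txt')
    hc.ds.set_constraints(hc.get_dcs())

    # Quantize numerical attributes into a small number of bins so that domain
    # generation works on mixed (categorical + numerical) data.
    hc.quantize_numericals([(10, ['Score'])])  # assumed format: (num_bins, [attribute group])

    hc.detect_errors([NullDetector(), ViolationDetector()])
    hc.setup_domain()

    # EmbeddingFeaturizer (backed by the TupleEmbedding model) stands in for OccurAttrFeaturizer.
    hc.repair_errors([InitAttrFeaturizer(), FreqFeaturizer(), ConstraintFeaturizer(), EmbeddingFeaturizer()])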

@richardwu richardwu requested a review from thodrek June 22, 2019 19:46
richardwu and others added 29 commits June 22, 2019 12:47

@thodrek thodrek left a comment

Minor changes; will continue tomorrow.

@@ -1,6 +1,5 @@
language: python
python:
- "2.7"

Are we sure that this does not break other dependencies?

from .dataset import Dataset
from .dataset import AuxTables
from .dataset import CellStatus
from .dataset import Source

What is a source?

:param src_col: (str) if not None, for fusion tasks
specifies the column containing the source for each "mention" of an
entity.
:param exclude_attr_cols:

What are the types of these inputs? Format them appropriately.

specifies the column containing the source for each "mention" of an
entity.
:param exclude_attr_cols:
:param numerical_attrs:

Same as above.
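
For example, these could follow the format of src_col above; the type annotations and descriptions below are guesses from the parameter names, not the actual documentation:

    :param exclude_attr_cols: (list[str]) columns to load but exclude from repair
        (e.g. index or ID columns).
    :param numerical_attrs: (list[str]) columns whose values should be treated as
        numerical rather than categorical.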

:return: the data after quantization in pandas.DataFrame
"""
if self.quantized_data is None:
raise Exception('ERROR No dataset quantized')

Fix the message. This is not proper English.
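
For example, one possible rewording (just a sketch; whether the caller is expected to run quantization first is an assumption):

    if self.quantized_data is None:
        raise Exception('quantized data is not available: quantization has not been run on this dataset')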

If infer_mode = 'dk', these attributes correspond only to attributes that contain at least
one potentially erroneous cell. Otherwise all attributes are returned.
If applicable, in the provided :param:`train_attrs` variable.

What does this second comment mean?

from utils import NULL_REPR


def quantize_km(env, df_raw, num_attr_groups_bins):

The name is not informative. Switch to kmeans (it is not long). Also, I would expect to specify "k" here. Do the bins refer to clusters? If so, I would rename bins to clusters to follow common convention.
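
For reference, a minimal sketch of this kind of k-means quantization, where each group of numerical attributes is clustered into k "bins" and every value is replaced by its cluster centroid. The function and parameter names are illustrative, not the actual implementation; the real quantize_km also has to handle NULL_REPR placeholders, which is omitted here.

    from sklearn.cluster import KMeans

    def quantize_kmeans(df_raw, num_attr_groups_bins):
        """Replace each numerical value (vector) with its k-means cluster centroid.

        num_attr_groups_bins: list of (k, [attrs]) pairs, where k is the number of
        clusters ("bins") for that group of numerical attributes.
        """
        df = df_raw.copy()
        for k, attrs in num_attr_groups_bins:
            vals = df[attrs].astype(float).values
            km = KMeans(n_clusters=k).fit(vals)
            # Map every row to the centroid of its assigned cluster.
            df[attrs] = km.cluster_centers_[km.labels_]
        return df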

minafarid and others added 2 commits September 26, 2019 17:10
* Quantization and handling of numerical/mixed data.
* Relocated test data into subdirectories.
* Move active attributes to right after error detection and inside Dataset. Move correlations to separate module.
* Refactor domain generation sort domain by co-occurrence probability and
also domain generation for tuple embedding model.
* Make co-occurrence featurizer only generate co-occurrence features for
active attributes. Refactored domain to run estimator separately from
domain generation.
* Implemented TupleEmbedding model as an estimator.
* Always load clean/ground truth as strings since we load/store raw data as strings.
* Added featurizer for learned embeddings from TupleEmbedding model.
* Support multiple layers during repair and made TupleEmbedding dump/load more sophisticated.
* Improved validation logging and fixed a few bugs.
* Improve validation in TupleEmbedding using pandas dataframes.
* Support multi-dimensional quantization.
* Quantize from dict rather than numerical attrs.
* Mean/var normalize numerical attributes in context and added non-linearity to numerical spans.
* Support specifying n-dimensional numerical attr groups vs splitting on columns.
* Fixed None numerical_attr_groups.
* Fixed report RMS error and converting to floats for quantization.
* Added store_to_fb flag to load_data, added LR schedule to TupleEmbedding, added multiple ground truth in evaluation, changed EmbeddingFeat to return probability instead of embedding vectors.
* Pre-split domain and ground truth values.
* Fixed batch size argument in EmbeddingFeaturizer.
* Removed numerical_attrs reference from Table.
* Fix to how multi-ground truth is handled. Use simplified numerical regression TupleEmbedding with nonlinearity.
* Max domain size need only be as large as the largest categorical attribute domain.
* Remove domain for numerical attributes in TupleEmbedding.
* Fixed some reference issues and added infer all mode.
* Fixed _nan_ replacement, max_cat_domain being possibly nan, and evaluation for sample accuracy.
* Do not weak label clean cells and fixed raw data in Logistic estimator.
* Added ReLU after context for numerical targets in TupleEmbedding and refactored EmbeddingFeat to support numerical feature (RMSE) from TupleEmbedding.
* Use cosine annealing with restart LR schedule and use weak_label instead of init.
* Fixed memory issues with get_features and predict_pp_batch.
* Fixed bug in get_features.
* Added comment to EmbeddingFeat.
* Finally fixed memory issues with torch.no_grad.
* ConstraintFeaturizer runs on un-quantized values.
* Do not drop single value cells (for evaluation).
* Do not generate queries/features for DCs that do not pertain to attributes we are training on.
* Fixed ConstraintFeaturizer to handle no DCs.
* Removed deprecated code and added dropout.
* Fixed calculation of num_batches in learning loop.
* Do not drop null-init cells with domain length <= 1.
* Fixed z-scoring with 0 std and deleting e-notation numerical values (see the z-scoring sketch after this list).
* Do not quantize if bins > unique.
* Fixed some things in domain.
* Added repair w/ validation set and removed multiple correct values in evaluation.
* Fixed domain generation to include single value cells in domain.
* Handle untrained context values properly and added code for domain co-occurrence in tupleembedding.
* Regression fix for moving raw_data_dict before z-normalization and removed code references to domain_cooccur (for the most part).
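
To illustrate the z-scoring fix mentioned above, a minimal sketch of mean/variance normalization that guards against a zero standard deviation (constant columns); the function name is illustrative, not the actual code.

    import numpy as np

    def z_normalize(values):
        # Z-score a numerical column; if the std is 0 (constant column),
        # return zeros instead of dividing by zero.
        values = np.asarray(values, dtype=float)
        mean, std = values.mean(), values.std()
        if std == 0:
            return np.zeros_like(values)
        return (values - mean) / std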