Merge latest changes into master from dev (quantization and embedding model) #90
base: master
thodrek
left a comment
Minor changes, will continue tomorrow
| @@ -1,6 +1,5 @@ | |||
| language: python | |||
| python: | |||
| - "2.7" | |||
are we sure that this does not break other dependencies?
| from .dataset import Dataset | ||
| from .dataset import AuxTables | ||
| from .dataset import CellStatus | ||
| from .dataset import Source |
what is a source?
| :param src_col: (str) if not None, for fusion tasks | ||
| specifies the column containing the source for each "mention" of an | ||
| entity. | ||
| :param exclude_attr_cols: |
what are the types of these inputs? format them appropriately.
| specifies the column containing the source for each "mention" of an | ||
| entity. | ||
| :param exclude_attr_cols: | ||
| :param numerical_attrs: |
same as above.
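One way to address the two docstring comments above is to state a type for every parameter. The sketch below is only an illustration: `normalize_load_args` is a hypothetical helper, and the `list[str]` types for `exclude_attr_cols` and `numerical_attrs` are assumptions, since the diff does not show how these arguments are consumed.

```python
def normalize_load_args(src_col=None, exclude_attr_cols=None, numerical_attrs=None):
    """
    Hypothetical helper showing a fully typed docstring.

    :param src_col: (str) if not None, for fusion tasks specifies the column
        containing the source for each "mention" of an entity.
    :param exclude_attr_cols: (list[str]) ASSUMED type: column names to exclude.
    :param numerical_attrs: (list[str]) ASSUMED type: names of numerical columns.
    :return: (dict) the arguments with None replaced by empty lists.
    """
    if src_col is not None and not isinstance(src_col, str):
        raise TypeError("src_col must be a str or None")

    # Treat a missing list as an empty one so callers can iterate safely.
    exclude = list(exclude_attr_cols or [])
    numerical = list(numerical_attrs or [])

    for arg_name, cols in (("exclude_attr_cols", exclude),
                           ("numerical_attrs", numerical)):
        if not all(isinstance(col, str) for col in cols):
            raise TypeError(arg_name + " must contain only str column names")

    return {
        "src_col": src_col,
        "exclude_attr_cols": exclude,
        "numerical_attrs": numerical,
    }
```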
| :return: the data after quantization in pandas.DataFrame | ||
| """ | ||
| if self.quantized_data is None: | ||
| raise Exception('ERROR No dataset quantized') |
Fix the message. This is not proper English.
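A possible rewording, shown as a minimal sketch. `QuantizedDataHolder` is a hypothetical stand-in for the class in the diff, and using `RuntimeError` instead of a bare `Exception` is a suggestion, not something the PR does.

```python
class QuantizedDataHolder:
    """Hypothetical stand-in for the class that owns `quantized_data`."""

    def __init__(self):
        # Stays None until some quantization step stores its result here.
        self.quantized_data = None

    def get_quantized_data(self):
        """
        :return: the data after quantization.
        """
        if self.quantized_data is None:
            # A specific exception type and a full sentence make the
            # failure easier to diagnose than a bare Exception.
            raise RuntimeError(
                "No quantized data available; quantize the dataset first."
            )
        return self.quantized_data
```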
| If infer_mode = 'dk', these attributes correspond only to attributes that contain at least | ||
| one potentially erroneous cell. Otherwise all attributes are returned. | ||
| If applicable, in the provided :param:`train_attrs` variable. |
what does this second comment mean?
| from utils import NULL_REPR | ||
| def quantize_km(env, df_raw, num_attr_groups_bins): |
The name is not informative. Switch to kmeans (it is not long). Also, I would expect to specify "k" here. Do bins refer to clusters? I would rename bins to clusters to follow common convention.
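To make the naming suggestion concrete, here is a self-contained sketch of one-dimensional k-means quantization with an explicit `n_clusters` argument. It is not the PR's implementation: `quantize_kmeans`, its signature, and the evenly spaced initialization are all assumptions chosen for illustration, and it works on a plain list rather than the `env`, `df_raw` arguments in the diff.

```python
def quantize_kmeans(values, n_clusters, n_iters=100):
    """
    Replace each 1-D numeric value with the centroid of its k-means cluster.

    Hypothetical sketch: the name, signature, and init scheme are assumptions.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")

    unique = sorted(set(values))
    # Nothing to gain from quantizing when there are at least as many
    # clusters as distinct values, so return the values unchanged.
    if n_clusters >= len(unique):
        return list(values)

    # Deterministic init: spread the centroids evenly over the sorted
    # unique values (step > 1 here, so the chosen indices are distinct).
    step = (len(unique) - 1) / (n_clusters - 1) if n_clusters > 1 else 0
    centroids = [unique[int(i * step)] for i in range(n_clusters)]

    for _ in range(n_iters):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            idx = min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            clusters[idx].append(v)

        # Update step: move each centroid to the mean of its members,
        # keeping the old position if a cluster ended up empty.
        new_centroids = [
            sum(members) / len(members) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return [min(centroids, key=lambda c: abs(v - c)) for v in values]
```

For example, `quantize_kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], n_clusters=2)` maps the first three values to `2.0` and the last three to `11.0`.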
* Quantization and handling of numerical/mixed data.
* Relocated test data into subdirectories.
* Move active attributes to right after error detection and inside Dataset. Move correlations to separate module.
* Refactor domain generation sort domain by co-occurrence probability and also domain generation for tuple embedding model.
* Make co-occurrence featurizer only generate co-occurrence features for active attributes. Refactored domain to run estimator separately from domain generation.
* Implemented TupleEmbedding model as an estimator.
* Always load clean/ground truth as strings since we load/store raw data as strings.
* Added featurizer for learned embeddings from TupleEmbedding model.
* Support multiple layers during repair and made TupleEmbedding dump/load more sophisticated.
* Improved validation logging and fixed a few bugs.
* Improve validation in TupleEmbedding using pandas dataframes.
* Support multi-dimensional quantization.
* Quantize from dict rather than numerical attrs.
* Mean/var normalize numerical attributes in context and added non-linearity to numerical spans.
* Support specifying n-dimensional numerical attr groups vs splitting on columns.
* Fixed None numerical_attr_groups.
* Fixed report RMS error and converting to floats for quantization.
* Added store_to_fb flag to load_data, added LR schedule to TupleEmbedding, added multiple ground truth in evaluation, changed EmbeddingFeat to return probability instead of embedding vectors.
* Pre-split domain and ground truth values.
* Fixed batch size argument in EmbeddingFeaturizer.
* Removed numerical_attrs reference from Table.
* Fix to how multi-ground truth is handled. Use simplified numerical regression TupleEmbedding with nonlinearity.
* Max domain size need only be as large as largest for categorical attributes.
* Remove domain for numerical attributes in TupleEmbedding.
* Fixed some reference issues and added infer all mode.
* Fixed _nan_ replacement, max_cat_domain being possibly nan, and evaluation for sample accuracy.
* Do not weak label clean cells and fixed raw data in Logistic estimator.
* Added ReLU after context for numerical targets in TupleEmbedding and refactored EmbeddingFeat to support numerical feature (RMSE) from TupleEmbedding.
* Use cosine annealing with restart LR schedule and use weak_label instead of init.
* Fixed memory issues with get_features and predict_pp_batch.
* Fixed bug in get_features.
* Added comment to EmbeddingFeat.
* Finally fixed memory issues with torch.no_grad.
* ConstraintFeaturizer runs on un-quantized values.
* Do not drop single value cells (for evaluation).
* Do not generate queries/feature for DC that does not pertain to attributes we are training on.
* Fixed ConstraintFeaturizer to handle no DCs.
* Removed deprecated code and added dropout.
* Fixed calculation of num_batches in learning loop.
* Do not drop null inits cells with dom(len) <= 1.
* Fixed z-scoring with 0 std and deleting e-notation numerical values.
* Do not quantize if bins > unique.
* Fixed some things in domain.
* Added repair w/ validation set and removed multiple correct values in evaluation.
* Fixed domain generation to include single value cells in domain.
* Handle untrained context values properly and added code for domain co-occurrence in tupleembedding.
* Regression fix for moving raw_data_dict before z-normalization and removed code references to domain_cooccur (for the most part).
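The "z-scoring with 0 std" item in the list above refers to a common pitfall: a constant column has zero standard deviation, so naive normalization divides by zero. The sketch below shows one way to guard against it. The `z_score` helper is hypothetical, and mapping a constant column to all zeros is an assumption; the commit message does not say how the PR handles this case.

```python
import math


def z_score(values):
    """
    Mean/variance normalize a list of numbers (population standard deviation).

    Hypothetical sketch: a constant column is mapped to all zeros instead of
    dividing by a zero standard deviation.
    """
    n = len(values)
    if n == 0:
        return []

    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)

    if std == 0:
        # Every value equals the mean, so the centered value is 0 everywhere.
        return [0.0 for _ in values]

    return [(v - mean) / std for v in values]
```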
Latest changes include:
EmbeddingFeaturizer: supports mixed value repair. Can be used to replace OccurAttrFeaturizer
hc.quantize_numericals)