Added EM iterations to repair process, allow multiple init values, and select best init value as current via co-occurrence probability #32
base: dev
How on earth is F1 and Recall > 1.0? See repairing F1 and repairing recall.
Pushed up a patch to update single/co-occur stats after each EM iteration for I attempted to do more iterations but there is an issue with how we use
Sounds good.
32ae5ef to fb01e01
(singular value, old 'init_value').
iterations for repair.
multiple init values by specifying init values in raw data separated by |||.
fb01e01 to 39088f4
Newest results with this patch with fix to
Latest changes:
Ready for another review 👀
This PR introduces EM iterations to the repair process where after every iteration as well as supporting multiple init values:
- current_value and renamed from e.g. InitFeaturizer to CurrentFeaturizer
- current_values in cell_domain with inferred values from inf_vals_dom
- current_values (featurizers such as CurrentAttrFeaturizer or CurrentXFeaturizer) can take advantage of the updated current values
- InitSimFeaturizer where it wasn't computing the similarity metrics correctly between the init_value and values in the domain
- NULL values in NullDetector
- current_value is initialized with the value from init_values with the highest sum of co-occurrence probabilities with the other init_values in the tuple

I've tested this with 3 iterations with the hospital dataset. On the second iteration we see an improvement in recall (with a slight hit to precision) due to the increased number of repairs made. It seems to converge after the 2nd iteration.
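To make the initialization rule concrete, here is a minimal sketch of picking current_value from several init values by summed co-occurrence probability. It is not the code in this PR: the helper names `parse_init_values` and `pick_current_value`, and the layout of the `cooccur_prob` dictionary, are assumptions made for illustration. Only the `|||` separator and the "highest sum of co-occurrence probabilities" rule come from the description.

```python
SEP = "|||"  # separator for multiple init values in the raw data, per this PR


def parse_init_values(raw):
    """Split one raw cell into its list of candidate init values."""
    return raw.split(SEP)


def pick_current_value(attr, tuple_init_values, cooccur_prob):
    """Choose the candidate for `attr` with the highest summed co-occurrence.

    tuple_init_values: dict attr -> list of init values for one tuple.
    cooccur_prob: hypothetical lookup keyed by
        (attr, value, other_attr, other_value) -> probability.
        Missing pairs are treated as probability 0.0 (an assumption).
    """
    def score(candidate):
        # Sum over every init value of every other attribute in the tuple.
        return sum(
            cooccur_prob.get((attr, candidate, other_attr, other_val), 0.0)
            for other_attr, other_vals in tuple_init_values.items()
            if other_attr != attr
            for other_val in other_vals
        )

    # max() returns the first candidate on ties, so raw-data order breaks ties.
    return max(tuple_init_values[attr], key=score)
```

For example, with a raw tuple `{"city": "Boston|||Bostn", "state": "MA"}` and a lookup where `("city", "Boston", "state", "MA")` has a higher probability than `("city", "Bostn", "state", "MA")`, the sketch selects `"Boston"` as the current value for `city`.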
NB: this PR does not currently include the detection process in the EM iterations: this might be worth considering.
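The outer loop described in this PR (infer, write the inferred values back as current values, repeat) can be sketched roughly as follows. This is only an illustration under stated assumptions: `run_repair_iterations` and the `infer` callable are hypothetical names, the cell key `(tuple id, attribute)` is an assumed layout, and any statistics refresh is assumed to happen inside `infer`. Detection stays outside the loop, matching the note above.

```python
def run_repair_iterations(current_values, infer, max_iters=3):
    """Repeat inference, feeding inferred values back in as current values.

    current_values: dict mapping a cell key, assumed here to be
        (tuple id, attribute), to its current value.
    infer: hypothetical callable that takes the current values and returns
        a dict of cell key -> inferred value for the cells it would repair.
    Returns the final values and the number of iterations that were run.
    """
    values = dict(current_values)  # copy so the caller's dict is untouched
    for iteration in range(1, max_iters + 1):
        inferred = infer(values)
        updated = {**values, **inferred}
        if updated == values:
            # This iteration changed nothing, so treat the loop as converged.
            return values, iteration
        values = updated
    return values, max_iters
```

Stopping when an iteration changes no cell is one simple convergence check; a real run could instead compare repair counts or quality metrics between iterations.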