Optimization for text column profile ksneab #791
@@ -144,7 +144,7 @@ def _update_vocab(
         :type subset_properties: dict
         :return: None
         """
-        data_flat = list(itertools.chain(*data))
+        data_flat = set(itertools.chain(*data))
This is the actual code change being investigated in the description.
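To illustrate why the `list` to `set` swap should not change the resulting vocabulary, here is a minimal sketch. The body of `_update_vocab` is not shown in this hunk, so `combine_unique` below is a hypothetical stand-in for whatever merges `data_flat` into the existing vocab, and the sample `data` is made up.

```python
import itertools


def combine_unique(vocab, new_items):
    """Hypothetical stand-in for the vocab merge: union of both inputs."""
    return sorted(set(vocab) | set(new_items))


# Made-up sample column of text entries.
data = ["hello world", "hello there", "world wide web"]

# Before the change: one entry per character, repeats kept.
flat_list = list(itertools.chain(*data))
# After the change: one entry per distinct character.
flat_set = set(itertools.chain(*data))

vocab_from_list = combine_unique([], flat_list)
vocab_from_set = combine_unique([], flat_set)

print(len(flat_list), len(flat_set))
print(vocab_from_list == vocab_from_set)
```

Under these assumptions the merged vocab is identical either way; the `set` version simply hands far fewer elements to the merge step, which is where a saving on long text columns would come from.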
class NpEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)
What is this?
@@ -236,7 +236,8 @@ def dp_space_time_analysis(
     if not os.path.exists(os.path.dirname(path)):
         os.makedirs(os.path.dirname(path))
     with open(path, "w") as fp:
-        json.dump(profile_times, fp, indent=4)
+        json.dump(profile_times, fp, indent=4, cls=NpEncoder)
Maybe call this NumpyEncoder?
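For context on what the encoder is for: the standard `json` module rejects NumPy integers and arrays, and `cls=` lets `json.dump`/`json.dumps` fall back to a custom `default` hook. The sketch below reuses the class from this PR; the contents of `profile_times` are invented for illustration and are not the real timing output.

```python
import json

import numpy as np


class NpEncoder(json.JSONEncoder):
    """Convert NumPy scalars and arrays to plain Python types."""

    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)


# Invented example values; the real dict is built by the analysis script.
profile_times = {
    "rows": np.int64(1000),
    "seconds": np.float32(0.25),
    "sizes": np.array([1, 2, 3]),
}

# Without the custom encoder, serialization raises TypeError.
try:
    json.dumps(profile_times)
    plain_ok = True
except TypeError:
    plain_ok = False

# With the encoder, NumPy values are converted before being written.
encoded = json.dumps(profile_times, indent=4, cls=NpEncoder)
decoded = json.loads(encoded)
print(plain_ok, decoded)
```

The name is a matter of taste; renaming the class to `NumpyEncoder` as suggested would not change this behavior.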
7e794c2 to e451b9f
The tests run to generate these numbers can be found in the dataprofiler/tests/space_time_analysis/structured_space_time_analysis.py script.
In repo dataset:
Generated Dataset:
Findings:
This change is most effective on datasets with many text and string columns. The "In repo" dataset contains few columns of these data classes, so it benefits less from the change than the generated dataset does.
(The generated dataset contains all 7 data classes mentioned in dataprofiler/tests/space_time_analysis/throughput-test-guidelines.md.)