Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNO-172-upgrade-to-elasticsearch7 #296

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from
Open

SNO-172-upgrade-to-elasticsearch7 #296

wants to merge 4 commits into from

Conversation

Bek
Copy link
Collaborator

@Bek Bek commented Oct 5, 2020

No description provided.

@Bek Bek changed the title ES Upgrade SNO-172 ES Upgrade Oct 5, 2020
@keenangraham keenangraham changed the title SNO-172 ES Upgrade SNO-172-upgrade-to-elasticsearch7 Oct 6, 2020
src/snovault/elasticsearch/create_mapping.py Outdated Show resolved Hide resolved
src/snovault/elasticsearch/create_mapping.py Show resolved Hide resolved
src/snovault/elasticsearch/create_mapping.py Outdated Show resolved Hide resolved
src/snovault/elasticsearch/indexer.py Show resolved Hide resolved
src/snovault/elasticsearch/indexer.py Show resolved Hide resolved
@Bek Bek force-pushed the SNO-172-ES7-upgrade branch from 916faeb to 2b86c6e Compare October 14, 2020 18:25
@Bek Bek force-pushed the SNO-172-ES7-upgrade branch 2 times, most recently from 33d8cba to bdc6355 Compare October 15, 2020 18:08
Copy link
Contributor

@keenangraham keenangraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks close, just a couple of questions.

Also looks like real test failure where cherry^ is no longer returning results (404). I noticed the same query between demo and prod also returning less cherry^ results (looks like Files specifically)..

base.ini Outdated Show resolved Hide resolved
development.ini Outdated Show resolved Hide resolved
src/snovault/elasticsearch/indexer.py Show resolved Hide resolved
src/snovault/elasticsearch/indexer.py Show resolved Hide resolved
return {
'type': 'object',
'include_in_all': False,
'properties': properties,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this always had properties why the if/else now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is no mapping directive exists in the schema, object should be stored without its fields being analyzed. else clause prevents unnecessary dynamic mapping and update_mapping events in elasticsearch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be working properly for some fields, e.g. cloud_metadata in files. (Might have something to do with it being a calculated_property.)

Screen Shot 2020-10-27 at 12 18 01 PM

Screen Shot 2020-10-27 at 12 18 19 PM

Before and after. Now you can't look up cloud_metadata.url=* for example.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, previous assumption was that if the field was important, it would be in the schema. This can be fixed, implication is that any other object inserted that doesn't have a mapping directive in the schema will have all of its field values be analyzed as keyword.

@@ -133,10 +134,6 @@ def schema_mapping(name, schema):
'type': field_type
}

# these fields are unintentially partially matching some small search
# keywords because fields are analyzed by nGram analyzer
if name in NON_SUBSTRING_FIELDS:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No equivalent replacement of copy_to: _all here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

@Bek Bek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks close, just a couple of questions.

Also looks like real test failure where cherry^ is no longer returning results (404). I noticed the same query between demo and prod also returning less cherry^ results (looks like Files specifically)..

we are back to the same assertions in the test and cherry^ is returning 1 result as before in the tests. With the latest change, the File count is equal now.

@@ -133,10 +134,6 @@ def schema_mapping(name, schema):
'type': field_type
}

# these fields are unintentially partially matching some small search
# keywords because fields are analyzed by nGram analyzer
if name in NON_SUBSTRING_FIELDS:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bek Bek force-pushed the SNO-172-ES7-upgrade branch 2 times, most recently from f1e5cae to 8cb4e1a Compare October 26, 2020 08:51
src/snovault/elasticsearch/create_mapping.py Outdated Show resolved Hide resolved
src/snovault/elasticsearch/create_mapping.py Outdated Show resolved Hide resolved
if name in NON_SUBSTRING_FIELDS:
sub_mapping['include_in_all'] = False
if name not in NON_SUBSTRING_FIELDS:
if depth == 1 or (depth == 2 and parent == 'array'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the logic I'm most uncomfortable with, though the results certainly look closer than before. What's the implication of not tracking depth, as it didn't seem like it was tracked before? Is it possible to always copy to _all if name not in NON_SUBSTRING_FIELDS?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Older demos showed that setting "copy_to": "_all" always here is equivalent to boosting every single field in the nested objects (minus the NON_SUBSTRING_FIELDS). Depth tracking stops "copy_to": "_all" after level 1. Before, if file or experiment award description has "brain" and "blood" in the text, the files or experiments show up under either "blood" or "brain" search term. Non-specific experiments that showed up in the results in earlier demos were due to "copy_to" being applied to the fields of deeply nested objects. Older include_in_all was applied by default and depth tracking is applied under the hypothesis that, this default application of the setting was weighed more towards top fields rather than nested object fields. Result is that, now all top fields of the object has copy_to_all, and nested fields need to be boosted like so "file.award.description": 1.0. We could have achieved the same matching results by boosting all of the individual fields of a type also without depth tracking, which is still an option but may take more effort to match results.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like maybe defining in boost would make sense given we already have top-level fields there, e.g.:

"boost_values": {
       "accession": 20.0,
       "@type": 1.0,
       "alternate_accessions": 1.0,
       "assay_term_name": 20.0,
       "assay_term_id": 1.0,
       "assay_title": 20.0,
       "assay_slims": 5.0,
       "dbxrefs": 1.0,
       "aliases": 1.0,
       "biosample_ontology.term_id": 1.0,
       "biosample_ontology.term_name": 10.0,

this default application of the setting was weighed more towards top fields rather than nested object fields.

I don't think this explains the case of award.description with blood or brain. Even if the weight was low on these they would still show up in results, though at the bottom. However we're not seeing them at all on current prod.

My understanding is that if _all is enabled on old ES then include_in_all is True by default. However you can set it to include_in_all: False for a certain field and it will apply to all subfields unless otherwise specified. I think looking at new and old mappings informative here. For file:

Screen Shot 2020-10-27 at 11 26 38 AM

Screen Shot 2020-10-27 at 11 27 55 AM

I'm a little concerned since the default has changed from don't include unless specifically set to always include if in first level (probably explains why we are seeing more hits on new demo?). The boost offers more specific control, while always including top-level fields doesn't.

In any case might be close enough for our current purposes, and could be tweaked later. But I think worth looking more into.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that boost offers more specific control. I find that it's not feasible to determine what to boost for the number of types we have multiplied by the number of top fields for each type, make demo for each combination, and compare results with the old ES. ctcf search bringing up File type results was due to unintentional boost on the aliases field. Determining which other objects and fields were important to match the existing result count in the old ES maybe challenging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's likely the difference is just due to not copying the links and unique_keys stuff to _all, which is by default include_in_all: True on old ES:
Screen Shot 2020-10-27 at 4 02 44 PM

Pretty sure it was the alias in unique keys that was matching ctcf, not the the alias in embedded properties. (Can confirm that alias in embedded properties is False for old mapping.)

It doesn't look like any of the embedded properties in old mapping are include_in_all: True unless they are defined in boost values for that schema. Could we go back to only copying embedded properties to _all if they are defined in boost, and just make the dynamic template links/unique keys have copy_to?

ctcf search bringing up File type results was due to unintentional boost on the aliases field

Yeah, looks like only field ctcf is mentioned in file is alias and submitted_file name, but it's probably fine to have these intentionally searchable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keenangraham keenangraham requested a review from pedrohr October 26, 2020 23:52
@Bek Bek force-pushed the SNO-172-ES7-upgrade branch from 8cb4e1a to 157a13e Compare October 27, 2020 16:03
return {
'type': 'object',
'include_in_all': False,
'properties': properties,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be working properly for some fields, e.g. cloud_metadata in files. (Might have something to do with it being a calculated_property.)

Screen Shot 2020-10-27 at 12 18 01 PM

Screen Shot 2020-10-27 at 12 18 19 PM

Before and after. Now you can't look up cloud_metadata.url=* for example.

@keenangraham
Copy link
Contributor

keenangraham commented Oct 27, 2020

One other thing. I suppose we should take a look at the dynamic_template definitions one more time. Looking at the original mapping definition I guess these are included in _all by default. I don't think being able to search principals_allowed is that useful but maybe links and unique_keys would be.

'dynamic_templates': [
            {
                'template_principals_allowed': {
                    'path_match': "principals_allowed.*",
                    'match_mapping_type': "string",
                    'mapping': {
                        'type': 'keyword',
                    },
                },
            },
            {
                'template_unique_keys': {
                    'path_match': "unique_keys.*",
                    'match_mapping_type': "string",
                    'mapping': {
                        'type': 'keyword',
                    },
                },
            },
            {
                'template_links': {
                    'path_match': "links.*",
                    'match_mapping_type': "string",
                    'mapping': {
                        'type': 'keyword',
                    },
                },
....

It does seem like group.admin and system.Everyone give search results on prod, which seems a bit silly right?

Screen Shot 2020-10-27 at 2 00 25 PM

Possibly this also explains the file type search differences before you added the depth tracking?

@keenangraham
Copy link
Contributor

It looks like links and unique_keys are also by default included in all (though not linked_uuids). Could also explain result differences before adding all top-level fields.

Screen Shot 2020-10-27 at 2 10 47 PM

Screen Shot 2020-10-27 at 2 10 42 PM

@Bek
Copy link
Collaborator Author

Bek commented Oct 27, 2020

One other thing. I suppose we should take a look at the dynamic_template definitions one more time. Looking at the original mapping definition I guess these are included in _all by default. I don't think being able to search principals_allowed is that useful but maybe links and unique_keys would be.

'dynamic_templates': [
            {
                'template_principals_allowed': {
                    'path_match': "principals_allowed.*",
                    'match_mapping_type': "string",
                    'mapping': {
                        'type': 'keyword',
                    },
                },
            },
            {
                'template_unique_keys': {
                    'path_match': "unique_keys.*",
                    'match_mapping_type': "string",
                    'mapping': {
                        'type': 'keyword',
                    },
                },
            },
            {
                'template_links': {
                    'path_match': "links.*",
                    'match_mapping_type': "string",
                    'mapping': {
                        'type': 'keyword',
                    },
                },
....

It does seem like group.admin and system.Everyone give search results on prod, which seems a bit silly right?

Screen Shot 2020-10-27 at 2 00 25 PM

Possibly this also explains the file type search differences before you added the depth tracking?

Right, some an existing test rely on searching user uuid, but we don't boost user uuid, so it was getting a hit based on "principals_allowed.edit": ["group.admin", "userid.81a6cc12-2847-4e2e-8f2c-f566699eb29e"] because it contained user uuid...

@Bek
Copy link
Collaborator Author

Bek commented Oct 27, 2020

One other thing. I suppose we should take a look at the dynamic_template definitions one more time. Looking at the original mapping definition I guess these are included in _all by default. I don't think being able to search principals_allowed is that useful but maybe links and unique_keys would be.

'dynamic_templates': [
            {
                'template_principals_allowed': {
                    'path_match': "principals_allowed.*",
                    'match_mapping_type': "string",
                    'mapping': {
                        'type': 'keyword',
                    },
                },
            },
            {
                'template_unique_keys': {
                    'path_match': "unique_keys.*",
                    'match_mapping_type': "string",
                    'mapping': {
                        'type': 'keyword',
                    },
                },
            },
            {
                'template_links': {
                    'path_match': "links.*",
                    'match_mapping_type': "string",
                    'mapping': {
                        'type': 'keyword',
                    },
                },
....

It does seem like group.admin and system.Everyone give search results on prod, which seems a bit silly right?

Screen Shot 2020-10-27 at 2 00 25 PM

Possibly this also explains the file type search differences before you added the depth tracking?

File type searches not showing up was due to aliases field not being copied to _all. This brings up a worrisome implication for search, which is that if the file name didn't have ctcf in the string of it's name, then it wouldn't be included in the results even if the file was part of the experiment that targets CTCF.

@keenangraham
Copy link
Contributor

keenangraham commented Oct 27, 2020

This brings up a worrisome implication for search, which is that if the file name didn't have ctcf in the string of it's name, then it wouldn't be included in the results even if the file was part of the experiment that targets CTCF.

Think this is fine. File for a long time didn't have any access to experiment biosample or assay information. Now that it does we should probably explicitly add these to boost values.

(In other words it's not that we expect ctcf to be a great search query for files, just that the conspicuous absence of files from new results was worth investigating.)

@Bek Bek force-pushed the SNO-172-ES7-upgrade branch from 00dc931 to 0c22a37 Compare November 11, 2020 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants