Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit user fields which are indexed by default #408

Open
4 tasks
kadamwhite opened this issue May 10, 2022 · 7 comments
Open
4 tasks

Limit user fields which are indexed by default #408

kadamwhite opened this issue May 10, 2022 · 7 comments
Labels
should have Should be done, medium priority for now

Comments

@kadamwhite
Copy link

kadamwhite commented May 10, 2022

On a large multisite network, we are encountering many errors of this type when indexing users:

{
    "index": {
        "_index": "ep-sitename-user",
        "_type": "_doc",
        "_id": "338",
        "status": 400,
        "error": {
            "type": "illegal_argument_exception",
            "reason": "Limit of total fields [5000] has been exceeded"
        }
    }
}

We have manually excluded some fields using the ep_prepare_user_meta_data hook, but are still running into this error. This ticket proposes further limiting the fields which get indexed by default, so that it is less likely that user indexing will fail on a large network.

Specifically, I wonder whether these fields need to be indexed:

  • hm.workflows.*
  • metaboxhidden_*
  • meta-box-order_*
  • wp_##_capabilities
  • wp_##_user_level
  • wp_##_user_settings
  • wp_##_user-settings-time
  • wp_##_media_library_mode
  • wp_##_yoast_notifications
  • wp_##_dashboard_quick_press_last_post_id
  • Actually, any wp_##_... field
  • Possibly more...?

@rmccue suggested the roles and capabilities fields may be needed for user queries, although we weren't sure off the top whether Elasticpress involves itself in the normal queries Core does. But these fields represent a significant percentage of the meta stored against a user, and as a layperson I struggle to imagine how I'd want to search for somebody based on these values.

Acceptance crtieria:

  • Check field mapping for the user index on 3 of the larger client sites and note down the most common patterns that make up the bulk of the keys - use wp elasticpress get-mapping to find this
  • Ensure we are excluding specific meta key patterns from being indexed, at a least:
    • hm.workflows.*
    • any meta keys that include the user ID
    • any key patterns found from step 1 that account for more unique key names than the number of sites on a multisite network
  • Document how to filter user meta keys from being indexed
  • Document the error message noted above under an FAQs or troubleshooting section (add this if it doesn't exist) with the possible mitigations e.g. using filters to exclude certain meta keys and how to find common meta key patterns
@roborourke
Copy link
Contributor

hm.workflows.* definitely doesn't need to be and I suspect that's responsible for the majority of entries. I would ideally look at the complete list of fields in the mapping to get an idea of whether the others really count as heavily towards the total field count.

ElasticPress can be used for regular queries but in Altis it only kicks in when performing a search by default. Some projects might be using the ep_integrate query var to improve performance of non search related queries however.

@roborourke
Copy link
Contributor

I've updated the acceptance criteria with what I think will give us a good baseline and also start some docs top make these common errors more searchable.

@veselala veselala added the should have Should be done, medium priority for now label May 11, 2022
@goldenapples
Copy link
Contributor

In projects using the Roles to Taxonomy plugin, the capabilities meta fields can probably be ignored by default. In other cases, I think we should be able to specify the index type of that field, so that we're only indexing it once instead of multiple times for each of the different indexable types.

@roborourke
Copy link
Contributor

Would be good to know if that plugin plays nicely with capability queries actually 🤔

@veselala
Copy link

Hey team! Please add your planning poker estimate with ZenHub @robindevitt @shadyvb

@veselala
Copy link

@roborourke we've estimate this one to 5 SP and moved it to the backlog

@kadamwhite
Copy link
Author

@ivankristianto notes in the issue which spawned this suggestion, that the default fields ES does use for indexing are very limited:

$prepared_search_fields = [
	'user_login',
	'user_nicename',
	'display_name',
	'user_url',
	'user_email',
	'meta.first_name',
	'meta.last_name',
	'meta.nickname',
];

He elaborates there,

unless you add more search fields, all other meta in the ES index is useless.

I would recommend to change to treat meta with inclusion method. Only index the fields that you want to search against. You can use this filter ep_prepare_user_meta_excluded_public_keys, get all the user meta keys, filter by the meta you want to use, and then exclude everything else.

This would be more concise overall than maintaining a long list of patterns to test against in order to omit data, and runs less risk of throwing away something that an Altis site does end up wanting to search on.

One final thought: Fields get interpreted into many different types of mapping in ES. I wonder if it is necessary to have long and double representations of a field which is not numeric? or whether we should throw away date-time mapping values if the date resolves to 1970-01-01 00:00:01, as it would with a non-date value? ES computes these because it's not sure what you might expect it to be, but it seems like it contributes to the ballooning number of mappings without providing any benefit. I may misunderstand and these additional mapping types might be necessary to keep, but if they aren't, they could be removed with the ep_user_sync_args filter by looping through the $user_meta['meta'] array.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
should have Should be done, medium priority for now
Projects
None yet
Development

No branches or pull requests

4 participants