Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove duplicates from pattern results #314

Open
iandunn opened this issue Jul 20, 2021 · 5 comments
Open

Remove duplicates from pattern results #314

iandunn opened this issue Jul 20, 2021 · 5 comments
Labels
[Component] Pattern Directory The backend of the pattern directory: submission, management, etc i18n Translations and internationalization of patterns
Milestone

Comments

@iandunn
Copy link
Member

iandunn commented Jul 20, 2021

When merged, #307 will change the results to include both the site/user locale, and en_US as a fallback. That has a side effect for en_* locales, where e.g., the en_AU and en_US versions of About me cards will both appear.

https://en-au.wordpress.org/patterns/search/about/

We can probably de-dupe via post_parent.

This might intersect w/ #28 because of integrating ElasticSearch.

@iandunn iandunn added [Type] Bug Something isn't working [Component] Pattern Directory The backend of the pattern directory: submission, management, etc labels Jul 20, 2021
@iandunn iandunn added this to the Later milestone Jul 20, 2021
@iandunn
Copy link
Member Author

iandunn commented Jul 20, 2021

For a browsing example, https://api.wordpress.org/patterns/1.0/?pattern-keywords=11&wp-version=5.8&locale=en_GB&per_page=100 shows two Heading patterns: ID 184 for en_US, and 1512 for en_GB.

@iandunn
Copy link
Member Author

iandunn commented Jul 20, 2021

It seems like this should also de-dupe en_US patterns if a translated version exists. i.e., we should remove Three columns of text from https://api.wordpress.org/patterns/1.0/?pattern-keywords=11&wp-version=5.8&locale=es_MX&per_page=100, because it has 'Tres columnas de texto'.

I think doing that would match user expectations, remove clutter, and avoid offending folks who are sensitive to English-centricism.

@dd32 said,

This also has the benefit that searching an English phrase would result in matching an english pattern, and not just returning an empty data set if it's not included in the locale translated patterns.

...but I'm guessing he didn't mean that we should keep en_US patterns if there is a translated version.

@iandunn iandunn self-assigned this Jul 20, 2021
@dd32
Copy link
Member

dd32 commented Jul 21, 2021

@dd32 said,

This also has the benefit that searching an English phrase would result in matching an english pattern, and not just returning an empty data set if it's not included in the locale translated patterns.

...but I'm guessing he didn't mean that we should keep en_US patterns if there is a translated version.

Correct-ish. Part of the limitation of using WordPress search means that we kind of have to keep the en_US variant even if a translated variant exists. Ideally however, search would be search both/all locales and only return the most "relevant" of each pattern. When viewing an archive however, I think it should indeed deduplicate it, although that's not as easy to do with the current schema (as it can't be done at the SQL level I don't think)

As an example of a situation where one might want both returned; A company based in Switzerland is creating a Pricing page the page creator prefers German, they search for Preisgestaltung (Google translate 'pricing' in german) and get a german pricing translation. Next they go to create the French variant of the pricing page, maybe they search for Pricing or maybe they search for Preisgestaltung, but perhaps they search for Prix (Google translate 'pricing' in italian) instead so that they get a french-centric pricing page.

IMHO that suggests that perhaps we should be searching all locales, but archives should definitely be limited to the locale fallback (which presently is just $locale => US English, although maybe Australian English => British English => US English makes more sense; likewise, es_MX => es_* => en_US) and only returning a singular entry.

@dd32
Copy link
Member

dd32 commented Jul 21, 2021

This modified SQL does result in de-duplicated search:

- SELECT SQL_CALC_FOUND_ROWS wp_posts.ID
+ SELECT SQL_CALC_FOUND_ROWS MAX( wp_posts.ID )
FROM wp_posts
INNER JOIN wp_postmeta ON ( wp_posts.id = wp_postmeta.post_id )
WHERE  1 = 1
       AND (( ( wp_posts.post_title LIKE '%about%' ) OR ( wp_posts.post_excerpt LIKE '%about%' ) OR ( wp_posts.post_content LIKE '%about%' ) ))
       AND (( wp_postmeta.meta_key = 'wpop_locale' AND wp_postmeta.meta_value IN ( 'en_AU', 'en_US' ) ))
       AND wp_posts.post_type = 'wporg-pattern'
       AND wp_posts.post_status = 'publish'

- GROUP  BY wp_posts.id
+ GROUP  BY if ( wp_posts.post_parent > 0, wp_posts.post_parent, wp_posts.ID )

ORDER  BY FIELD(wp_postmeta.meta_value, 'en_US', 'en_AU') DESC,
          wp_posts.post_title LIKE '%about%' DESC,
          wp_posts.post_date DESC
LIMIT  0, 10 

Results before/after:

about-me-cards-en_au
- about-me-cards
button-links-with-image-background
large-header-with-text-and-a-button
three-buttons

Didn't quite work the same for a category index, in this case the Australian english header category, but it's not a bad result

- SELECT SQL_CALC_FOUND_ROWS wp_posts.ID
+ SELECT SQL_CALC_FOUND_ROWS MAX( wp_posts.ID )
FROM wp_posts
LEFT JOIN wp_term_relationships ON ( wp_posts.id = wp_term_relationships.object_id )
INNER JOIN wp_postmeta ON ( wp_posts.id = wp_postmeta.post_id )

WHERE  1 = 1
       AND ( wp_term_relationships.term_taxonomy_id IN ( 5 ) )
       AND (( wp_postmeta.meta_key = 'wpop_locale' AND wp_postmeta.meta_value IN ( 'en_AU', 'en_US' ) ))
       AND wp_posts.post_type = 'wporg-pattern'
       AND (( wp_posts.post_status = 'publish' ))

- GROUP  BY wp_posts.ID
+ GROUP  BY if ( wp_posts.post_parent, wp_posts.post_parent, wp_posts.ID)
ORDER  BY Field(wp_postmeta.meta_value, 'en_US', 'en_AU') DESC,
          wp_posts.post_date DESC
LIMIT  0, 100 

Results:

- contact-and-social-links-en_au
- large-header-with-a-heading-en_au
large-background-image-with-title-and-description
cover-with-title-and-audio
section-with-navigation
text-column-over-fixed-background
image-with-angled-overlay-and-text
poster-layout
triptych-with-two-images-and-a-text-area
event
- contact-and-social-links
+ contact-and-social-links-en_au
title-area-with-image-and-heading
header-with-halftone-pattern
header-with-hatched-pattern
header-area-with-geometric-pattern
full-width-image-of-space-with-big-headline
media-text-in-a-full-height-container
media-text-with-image-on-the-right
media-text-with-image-on-the-left
large-header-with-text-and-a-button
large-header-with-left-aligned-text
- large-header-with-a-heading
+ large-header-with-a-heading-en_au
large-header-with-a-heading-and-a-button

Explanation of the SQL change:

  • It's assumed that a translation will always be created after the original, so the translated post_id should be higher than that of the parent. So using MAX( ID ) works to find that.
  • By grouping conditionally, it can group post id 3 (parent: 1) and post_id 1 together, resulting in only post id 3 showing up in the results.

I'm not sure why it's affecting the archive in a different way than the search results, I'm guessing the query change is causing the ORDER BY FIELD() to be ignored, as it's actually internally returning the en_US posts rather than the $locale ones, which explains the need for the MAX() in the first place.

I think the only way to do it "right" in SQL is a join against itself.. I'm not sure how to achieve it in Elastic Search either.. deduplicating in PHP might be a better option, especially as most queries seem to be unlimited in result size right now

@iandunn iandunn removed their assignment Jul 26, 2021
@iandunn
Copy link
Member Author

iandunn commented Jul 26, 2021

I've switched focus to some higher priority things for a bit, so unassigning.

I may come back to this when I get back to #28, but feel free if anyone wants to move it forward in the meantime.

@ryelle ryelle added the i18n Translations and internationalization of patterns label Jan 6, 2022
@ryelle ryelle removed the [Type] Bug Something isn't working label Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Component] Pattern Directory The backend of the pattern directory: submission, management, etc i18n Translations and internationalization of patterns
Projects
None yet
Development

No branches or pull requests

3 participants