Skip to content

Conversation

@kazuhidelee
Copy link

@kazuhidelee kazuhidelee commented Apr 7, 2025

Fixes #463

Description: This PR introduces some potential enhancement to the validation and handling of the <title> tag in HTML front articles. It ensures that the <title> tag is neither empty nor composed entirely of numeric characters/special characters, which could potentially cause issues with the Kiwix suggestion system. If the <title> tag is a invalid string, a default title is generated to ensure robustness in the system.

Changes: Added the isValidTitle function to validate that the <title> tag:

  1. Has a minimum length.

  2. Is not purely numeric.

  3. Contains at least one alphanumeric character.

  4. Updated the parseAndAdaptHtml method to handle missing or invalid <title> tags by generating a default title.

  5. Added warning logs when an invalid title is detected to make debug easier.

  6. Added unit testing to test the functionality and correctness of the isValidTitle function.

This enhancement improves the reliability of the Kiwix suggestion system and ensures that all HTML front articles will have a valid title tag to prevent any unexpected behavior.

Possible Enhancements:
The length and criteria for a valid title can be changed to make it fit better to the Kiwix system.
Explore additional validation rules for the <title> tag to account for other edge cases.

@kazuhidelee kazuhidelee marked this pull request as draft April 7, 2025 04:10
@kazuhidelee kazuhidelee changed the title Add:isValidTitle function and unit testing Add: Possible Enhancement: Ensure non-empty <title> tag in HTML front articles Apr 7, 2025
@kazuhidelee kazuhidelee marked this pull request as ready for review April 7, 2025 04:17
@kazuhidelee kazuhidelee changed the title Add: Possible Enhancement: Ensure non-empty <title> tag in HTML front articles Possible Enhancement: Ensure non-empty <title> tags in HTML front articles contain valid strings Apr 7, 2025
@kelson42
Copy link
Contributor

kelson42 commented Apr 7, 2025

@kazuhidelee Please open the issue corresponding to this PR. There is no reason pure alphanumeric titles shoukd not work.

@kazuhidelee
Copy link
Author

I've opened a new issue and modified to code to allow pure alphanumeric titles
Link to the newly opened issue

@kelson42
Copy link
Contributor

kelson42 commented Apr 8, 2025

Please commit only the necessary code to fix the issue. Don't change the code indentation.

@kazuhidelee kazuhidelee force-pushed the add-validatetitle-tests branch from a27cc29 to b22f478 Compare April 8, 2025 22:21
@kazuhidelee
Copy link
Author

  • Rebased the branch to revert back the auto-styling changes in the indentation.
  • Removed unnecessary .vscode/settings.json changes that were unrelated to the PR's purpose.

@kelson42
Copy link
Contributor

kelson42 commented Apr 9, 2025

@kazuhidelee Who says that points (1) and (2) lead to invalid titles? Why?

@kelson42 kelson42 self-requested a review April 9, 2025 05:35
@kazuhidelee
Copy link
Author

I initially made the decision to treat short or purely numeric titles as invalid based on two things:

  • Based on personal experiences, short or purely numeric titles often came from malformed or autogenerated HTML pages especially in scraped content, and they didn’t work well as an identifier of the page.
  • Also if title was just 1 or 12345, I thought that the title might lacked descriptiveness so I chose to mark it as not valid and convert it to a default title.
  • I've removed the check for purely numeric based on the previous feedback to allow more flexibility.

With that being said, I’d love your input here, and I’m happy to adjust accordingly!

@kelson42
Copy link
Contributor

kelson42 commented Apr 10, 2025

I anything is not working well with a short title, then a dedicated issue should be open because this sounds like a bug.

The only invlid scenario to me is the emoty title and I would like to know:

  • How it behqves today?
  • How it should behqve to your opinion?

@kazuhidelee
Copy link
Author

  • Current behavior: If the <title> tag is missing or empty, the code generates a fallback title from the filename using the last part of the URL without the extension.

  • After reviewing the current behavior, I realized that the only strictly invalid case is when the <title> tag is completely empty. The fallback logic always generates a non-empty string—for example, _____.html becomes " " after replacing underscores with spaces. While this isn’t technically empty, it may not be meaningful for users or for the suggestion system.
    Based on this, I added extra validation checks to catch these edge cases and ensure the program can produce a more useful title when needed.

@kelson42
Copy link
Contributor

Then I don't understand why current behaviour (without your patch) is not correct!? Can you please be very precise, list cases with DOM title and filename and resukted ZIM title?

@kelson42
Copy link
Contributor

@kazuhidelee ?

@kelson42 kelson42 force-pushed the add-validatetitle-tests branch from b22f478 to 363e9f3 Compare November 5, 2025 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Possible enhancement for handling missing or invalid <title> tags in HTML front articles

3 participants