Skip to content
This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

RFC: How to prepare for a future merge of nokogumbo into the nokogiri gem #170

Closed
flavorjones opened this issue Mar 12, 2021 · 6 comments
Closed

Comments

@flavorjones
Copy link
Collaborator

flavorjones commented Mar 12, 2021

As discussed elsewhere, Nokogumbo and Nokogiri will be merging. The net result of this will be a single gem that provides both Nokogiri::HTML (nokogiri 1.11 and earlier) and Nokogiri::HTML5 (nokogumbo) in the same gem.

My current plan is simply to lift-and-shift the Nokogumbo code into the Nokogiri repository and gem in Nokogiri v1.12. Unless we do something, this will mean that users of Nokogumbo will see warning messages like this when they upgrade Nokogiri:

/home/flavorjones/code/oss/nokogumbo/lib/nokogumbo/html5.rb:7: warning: method redefined; discarding old HTML5
/home/flavorjones/code/oss/nokogiri/lib/nokogiri/html.rb:42: warning: previous definition of HTML5 was here
/home/flavorjones/code/oss/nokogumbo/lib/nokogumbo/html5.rb:13: warning: already initialized constant Nokogiri::HTML5::HTML_NAMESPACE
/home/flavorjones/code/oss/nokogiri/lib/nokogiri/html.rb:48: warning: previous definition of HTML_NAMESPACE was here

These warnings aren't just annoying; they tell us that Nokogumbo's implementation is clobbering the Nokogiri implementation, meaning that until the issue is resolved, updates to Nokogiri's code won't have any effect.

@flavorjones
Copy link
Collaborator Author

One idea is to have the next release of Nokogumbo check to see if Nokogiri has already defined these modules, methods, and constants, and to avoid loading itself. We could do this with a Nokogiri version check (< 1.12) or by checking defined?(Nokogiri::HTML5) (or some other module or constant).

Another idea is for the next release of Nokogumbo require nokogiri < 1.12, and the following release of Nokogumbo be an empty gem that depends on nokogiri >= 1.12.

I'd love feedback on these ideas, and to hear other folks' ideas.

@stevecheckoway
Copy link
Collaborator

I think having an empty gem as a final nokogumbo release is a great idea along with some release notes that it's a purely transitional release.

For the penultimate release, I don't have an informed opinion on whether it is better to check if things are defined or do a version check. In many cases, functionality checks are better than version checks (I'm thinking of things like feature tests in compilers), but I'm not sure that applies here. Requiring nokogiri < 1.12 might be sufficient.

Does Nokogiri follow semantic versioning? It might be best for the combined Nokogiri to have version 2. The situation I'm worried about is existing versions of Nokogumbo being used with Nokogiri that already defines the symbols. I believe that current Nokogumbo requires Nokogiri to be less than 2.0

  s.add_runtime_dependency 'nokogiri', '~> 1.8', '>= 1.8.4'

thus this situation would be avoided.

Separately from this, have you thought about importing the tests? I wrote a lot of tests for gumbo, but there are also a lot of ruby tests, including tests for the html tree-construction. I had a bunch of pull requests for those tests fixing the number of errors that should be emitted, but they never got merged. (See this in particular.) So I've just been using my own fork of the tests which isn't ideal.

@rubys
Copy link
Owner

rubys commented Mar 12, 2021

defined? seems cleanest to me. An empty gem would pose problems to people that are still using what would then be a backlevel of nokogiri.

@flavorjones
Copy link
Collaborator Author

Thanks for the thoughtful responses. @stevecheckoway I agree that I like having a (basically) empty gem at the end of Nokogumbo's lifecycle. @rubys I also agree that using defined? is the cleanest option. I think we can do both by using both defined? and by setting gem requirements.

For the purposes of this analysis, let's assume

  • Nokogiri current is v1.11 or earlier
  • Nokogiri next includes Nokogumbo's Nokogiri::HTML5 module
  • Nokogumbo current is v2.0.4 or earlier
  • Nokogumbo next is a potential intermediate migration release
  • Nokogumbo final is the final release after having been merged into Nokogiri
  • ✔ or ✖ indicates whether the combination is allowed or disallowed by dependency requirements

If we use the defined? strategy starting in Nokogumbo next, there's no need for a two-stage rollout, but the long-term state of the gem will contain all the current code:

Nokogiri Current Nokogiri Next
Nokogumbo current ✔ Uses Nokogumbo code ✔ Uses Nokogiri code
Emits "redefined" warnings
Nokogumbo next ✔ Uses Nokogumbo code ✔ Uses Nokogiri code
Prompts to remove Nokogumbo

If we use the "requirements constraints" strategy starting in Nokogumbo next:

Nokogiri Current Nokogiri Next
Nokogumbo current ✔ Uses Nokogumbo code ✔ Uses Nokogiri code
Emits "redefined" warnings
Nokogumbo next ✔ Uses Nokogumbo code
Nokogumbo final ✔ Uses Nokogiri code
Prompts to remove Nokogumbo

If we use both strategies, though, we could still end up in the desired final state regardless of the upgrade path followed by users. Imagine "next" using the defined? check, and then a "final" that sets an dependency constraint:

Nokogiri Current Nokogiri Next
Nokogumbo current ✔ Uses Nokogumbo code ✔ Uses Nokogiri code
Emits "redefined" warnings
Nokogumbo next ✔ Uses Nokogumbo code ✔ Uses Nokogiri code
Prompts to remove Nokogumbo
Nokogumbo final ✔ Uses Nokogiri code
Prompts to remove Nokogumbo

I think I like this last option best, because it gives us flexibility: the change between Nokogumbo "current" and "next" is small, and we can roll out the final release at our leisure. WDYT?


have you thought about importing the tests?

I don't have a strong opinion, but if we're depending on the tests and upstream isn't responsive, then we may as well keep them in-repo. If upstream merges and/or becomes more active, we can always periodically update the local files from upstream. WDYT?

@rubys
Copy link
Owner

rubys commented Mar 13, 2021

I may need an ELI5 :-)

Here is an example application that currently uses nokogumbo: https://github.com/apache/whimsy/blob/5273b3067f939a98f7e40805e5fcf1b7b5c6e020/www/board/agenda/Gemfile#L20

If Nokogumbo were changed to add an if defined? now, it would continue to work as Nokogiri current doesn't define Nokogiri::HTML5

Once Nokogiri ships Nokogiri::HTML5, then things would continue to work with no changes and no prompting.

At some future date, perhaps a warning could be added to Nokogumbo indicating that this gem is now unnecessary and obsolete, recommending that Gemfiles be changed to reference nokogiri instead.

@flavorjones
Copy link
Collaborator Author

@rubys Sorry for using more words than necessary, writing helps me think but has the unfortunate side effect of being a wall of words. We're in complete agreement, that's exactly what I'm suggesting.

flavorjones added a commit that referenced this issue Mar 14, 2021
Closes #170

A future version of Nokogiri will provide Nokogumbo's API (see
sparklemotion/nokogiri#2204). This change
will allow Nokogumbo to detect whether Nokogiri provides the HTML5 API
and become a "shim" -- gracefully defer to Nokogiri by refusing to
load itself.

Some contractual assumptions I'm making about Nokogiri:

- Nokogiri will faithfully reproduce the `::Nokogiri::HTML5` singleton
  method, module, and namespace (including classes
  `Nokogiri::HTML5::Node`, `Nokogiri::HTML5::Document`, and
  `Nokogiri::HTML5::DocumentFragment`)

- Nokogiri will not provide a `::Nokogumbo` module/namespace, but will
  provide a similar `::Nokogiri::Gumbo` module which will provide the
  same constants and singleton methods as `::Nokogumbo`:

  - `Nokogumbo.parse()` will be provided as `Nokogiri::Gumbo.parse()`
  - `Nokogumbo.fragment()` → `Nokogiri::Gumbo.fragment()`
  - `Nokogumbo::DEFAULT_MAX_ATTRIBUTES` → `Nokogiri::Gumbo::DEFAULT_MAX_ATTRIBUTES`
  - `Nokogumbo::DEFAULT_MAX_ERRORS` → `Nokogiri::Gumbo::DEFAULT_MAX_ERRORS`
  - `Nokogumbo::DEFAULT_MAX_TREE_DEPTH` → `Nokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH`

This change checks for the existence of `Nokogiri::HTML5`,
`Nokogiri::Gumbo`, and an expected singleton method on each. We could
do a more- or less-thorough check here.

This change also provides an "escape hatch" using an environment
variable `NOKOGUMBO_IGNORE_NOKOGIRI_HTML5` which can be set to avoid
the "shim" behavior. This escape hatch might be unnecessary, but this
change is invasive enough to make me want to be cautious.

In "shim" mode, `Nokogumbo.parse()` and `.fragment()` will be
forwarded to the Nokogiri implementation. The `Nokogumbo::DEFAULT*`
constants will always be defined, but when in "shim" mode will be set
to the `Nokogiri`-provided values.

Nokogumbo will emit a single warning message at `require`-time when it
is in "shim" mode. This message points users to
sparklemotion/nokogiri#2205 which will
explain what's going on and help people migrate their
applications (but is an empty placeholder right now).

I did not include deprecation warning messages in `Nokogumbo.parse`
and `.fragment`. If you feel strongly that we should, let me know.
flavorjones added a commit that referenced this issue Mar 15, 2021
Closes #170

A future version of Nokogiri will provide Nokogumbo's API (see
sparklemotion/nokogiri#2204). This change
will allow Nokogumbo to detect whether Nokogiri provides the HTML5 API
and whether to use Nokogiri's implementation or Nokogumbo's
implementation.

Some contractual assumptions I'm making about Nokogiri:

- Nokogiri will faithfully reproduce the `::Nokogiri::HTML5` singleton
  method, module, and namespace (including classes
  `Nokogiri::HTML5::Node`, `Nokogiri::HTML5::Document`, and
  `Nokogiri::HTML5::DocumentFragment`)

- Nokogiri will not provide a `::Nokogumbo` module/namespace, but will
  provide a similar `::Nokogiri::Gumbo` module which will provide the
  same public API as `::Nokogumbo`.

This change checks for the existence of `Nokogiri::HTML5`,
`Nokogiri::Gumbo`, and an expected singleton method on each. We could
do a more- or less-thorough check here.

This change also provides an "escape hatch" using an environment
variable `NOKOGUMBO_IGNORE_NOKOGIRI_HTML5` which can be set to force
Nokogumbo to use its own implementation. This escape hatch might be
unnecessary, but this change is invasive enough to make me want to be
cautious.

Nokogumbo will emit a single warning message at `require`-time when it
is uses Nokogiri's implementation. This message points users to
sparklemotion/nokogiri#2205 which will
explain what's going on and help people migrate their
applications (but is an empty placeholder right now).
flavorjones added a commit to sparklemotion/nokogiri that referenced this issue Jun 17, 2021
flavorjones added a commit to sparklemotion/nokogiri that referenced this issue Jun 18, 2021
flavorjones added a commit to sparklemotion/nokogiri that referenced this issue Jun 19, 2021
flavorjones added a commit to sparklemotion/nokogiri that referenced this issue Jun 19, 2021
flavorjones added a commit to sparklemotion/nokogiri that referenced this issue Jun 19, 2021
ci: nokogumbo test suite

---

**What problem is this PR intended to solve?**

Merge the remaining Nokogumbo tests into Nokogiri's test suite.

- [x] submodule @stevecheckoway's fork of html5lib-tests (see see rubys/nokogumbo#170 (comment))
- [x] make sure these tests only run under CRuby
- [x] gumbo tests

**Have you included adequate test coverage?**

Tests are the entire substance of this PR.


**Does this change affect the behavior of either the C or the Java implementations?**

Notably, these additional tests should only run under CRuby.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants