Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid potential place where the world map could be modified after its iterator is created #8315

Merged
merged 2 commits into from
Sep 24, 2022

Conversation

willkroboth
Copy link
Contributor

The code in net.minecraft.server.MinecraftServer previously looked like this:

public abstract class MinecraftServer ... {
    ...
    public boolean isIteratingOverLevels = false; // Paper
    ...

    public void tickChildren(BooleanSupplier shouldKeepTicking) {
        ...
        Iterator iterator = this.getAllLevels().iterator();

        // CraftBukkit start
        // Run tasks that are waiting on processing
        MinecraftTimings.processQueueTimer.startTiming(); // Spigot
        while (!this.processQueue.isEmpty()) {
            this.processQueue.remove().run();
        }
        MinecraftTimings.processQueueTimer.stopTiming(); // Spigot

        MinecraftTimings.timeUpdateTimer.startTiming(); // Spigot // Paper
        // Send time updates to everyone, it will get the right time from the world the player is in.
        // Paper start - optimize time updates
        for (final ServerLevel world : this.getAllLevels()) {
            ...
        }
        // Paper end
        MinecraftTimings.timeUpdateTimer.stopTiming(); // Spigot // Paper

        this.isIteratingOverLevels = true; // Paper
        while (iterator.hasNext()) {
            ...
        }
        this.isIteratingOverLevels = false; // Paper
        ...
    }
}

The flag added by Paper, isIteratingOverLevels, is used to prevent CraftServer#createWorld and CraftServer#unloadWorld from adding or removing from the levels map while the levels map is being iterated through. However, in between the Iterator being created and the flag being set, the process queue and time updates are processed. The time updates are probably not an issue, but the code run by the process queue could potentially try to load or unload a world. As Multiverse/Multiverse-Core#2800 discovered, this causes a crash on older versions of Paper.

With this PR, the code in net.minecraft.server.MinecraftServer now looks like this:

public abstract class MinecraftServer ... {
    ...
    public boolean isIteratingOverLevels = false; // Paper
    ...

    public void tickChildren(BooleanSupplier shouldKeepTicking) {
        ...

        // CraftBukkit start
        // Run tasks that are waiting on processing
        MinecraftTimings.processQueueTimer.startTiming(); // Spigot
        while (!this.processQueue.isEmpty()) {
            this.processQueue.remove().run();
        }
        MinecraftTimings.processQueueTimer.stopTiming(); // Spigot

        MinecraftTimings.timeUpdateTimer.startTiming(); // Spigot // Paper
        // Send time updates to everyone, it will get the right time from the world the player is in.
        // Paper start - optimize time updates
        for (final ServerLevel world : this.getAllLevels()) {
            ...
        }
        // Paper end
        MinecraftTimings.timeUpdateTimer.stopTiming(); // Spigot // Paper

        this.isIteratingOverLevels = true; // Paper
        Iterator iterator = this.getAllLevels().iterator(); // Paper - move down
        while (iterator.hasNext()) {
            ...
        }
        this.isIteratingOverLevels = false; // Paper
        ...
    }
}

Simple fix, Iterator iterator = this.getAllLevels().iterator(); was just moved down to after the processQueue and time updates. Now, there is no instance where the levels iterator has been created but isIteratingOverLevels has been set. Following a minimal diff policy, I did not replace the Iterator+while loop with an enhanced for loop.

A note on the importance of this change:

I said earlier that the current code had the potential to crash on old versions of Paper. The example from Multiverse was using Paper#91. On the current version of Paper, doing the same thing that the Multiverse user did would likely not cause a crash. This is because this problem was already "fixed" by SPIGOT-7089, which was resolved by making the levels map effectively copy-on-write, avoiding the ConcurrentModificationException that crashed the server. This fix has been in Paper since Paper#126.

However, following the precedence of #8300, Paper aims to avoid the unsafe operation of unloading a world while it is being ticked for its potential to cause crashes in the future. Without this PR, it is possible for a world to be unloaded by a Runnable in the processQueue, then ticked after it has been unloaded. To avoid this potential problem, this PR should be included.

…et that indicates the levels are being ticked.
@willkroboth willkroboth requested a review from a team as a code owner August 19, 2022 16:31
@Kihsomray
Copy link

Hello, any idea when this will be implemented?

@willkroboth
Copy link
Contributor Author

Hello, any idea when this will be implemented?

I don't know when it'll be implemented, whenever the Paper devs pick it up I suppose. It's not really an important fix though. All it does is avoid a potentially problematic situation where a world might be ticked after it is unloaded. Thanks to the Spigot "fix" mentioned above this PR might actually be redundant, though following #8300, I'm erring on the safe side of avoiding any potential unexpected problems. With or without this PR, the server will probably not crash.

Since you're asking about it, I have to ask: did you encounter a concrete issue this PR would fix? It'd be nice to be able to point at a real situation this PR could measurably improve.

@Kihsomray
Copy link

Kihsomray commented Aug 26, 2022

Hello, any idea when this will be implemented?

I don't know when it'll be implemented, whenever the Paper devs pick it up I suppose. It's not really an important fix though. All it does is avoid a potentially problematic situation where a world might be ticked after it is unloaded. Thanks to the Spigot "fix" mentioned above this PR might actually be redundant, though following #8300, I'm erring on the safe side of avoiding any potential unexpected problems. With or without this PR, the server will probably not crash.

Since you're asking about it, I have to ask: did you encounter a concrete issue this PR would fix? It'd be nice to be able to point at a real situation this PR could measurably improve.

I own a plugin by the name of CyberWorldReset, which resets worlds via command or scheduler/timer with a bunch of preset values.

I recently had reports of an issue where the server would freeze/crash after a thread dump. This occurred when the world was created. Being a somewhat responsible dev, I decided to research this issue. After a few Google searches, I ran into Multiverse/Multiverse-Core#2800. It just so happened that you posted and referred to this PR. Since then, I have been waiting for this PR to be merged.

Am I in the right place?

@electronicboy
Copy link
Member

The linked thread post is 100% irrelevant to this, the linked post is down to the server taking too long to tick due to loading a world due to sync thread loads which creating a world creates if you don't mitigate that;

This PR solves an individual concern, out of the many dozen, in regards to unloading worlds mid-tick

@Kihsomray
Copy link

Hey, thanks for the reply. What should I do in this scenario? I really want to fix this issue.

@Lulu13022002
Copy link
Contributor

Lulu13022002 commented Aug 26, 2022

This doesn't prevent any CME, because the CME is already catch by the flag, this exception occur when you modify and iterate the world list in the same time. But in this case, even if a plugin load/unload a world into the process queue, there are no way to iterate while modify the list because the field is local and that the process queue is sync to main thread. In short the iteration of the list is already safe (see iterator#next/time loop) and happens later into the server tick. But it prevent well the unsafe unload which is a good point but not really related to the issue.

@willkroboth
Copy link
Contributor Author

If the server isn't crashing (it's still running once the error messages go away) then that is a different "problem". Creating a world can sometimes just be expensive and take a while.

#8177 might be the PR you want to watch. I think the third point in the TODO list is most helpful. It sounds like there are plans to make chuck generation asynchronous to free the main thread of that burden, which should help world creation not stall the main thread.

Right now, you can try increasing the settings.timeout-time value in your spigot.yml. From your log, it looks like it took 22 seconds to load the world, so if you increase timeout-time to more than that, you can at least avoid the watchdog errors. This won't make the world load any faster, but it should remove the error messages.

@willkroboth
Copy link
Contributor Author

This doesn't prevent any CME, because the CME is already catch by the flag, this exception occur when you modify and iterate the world list in the same time. But in this case, even if a plugin load/unload a world into the process queue, there are no way to iterate while modify the list because the field is local and that the process queue is sync to main thread. In short the iteration of the list is already safe (see iterator#next/time loop) and happens later into the server tick. But it prevent well the unsafe unload which is a good point but not really related to the issue.

I think I understand what you mean and I disagree, but correct me if what I'm about to say actually agrees with you.

Multiverse/Multiverse-Core#2800 is an example that shows that a world unload in the process queue did cause a CME on older versions of Paper (1.19 build 91 at least). Something in their code triggered CraftServer#createWorld via the process queue (seen in the watchdog dump: MinecraftServer.tickChildren called Waitable.run which eventually called CraftServer.createWorld). Immedietly after this CME occurred:

[12:25:42 ERROR]: Encountered an unexpected exception
--
  | java.util.ConcurrentModificationException: null
  | at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:756) ~[?:?]
  | at java.util.LinkedHashMap$LinkedValueIterator.next(LinkedHashMap.java:783) ~[?:?]
  | at net.minecraft.server.MinecraftServer.tickChildren(MinecraftServer.java:1532) ~[paper-1.19.1.jar:git-Paper-91]
  | at net.minecraft.server.dedicated.DedicatedServer.tickChildren(DedicatedServer.java:446) ~[paper-1.19.1.jar:git-Paper-91]
  | at net.minecraft.server.MinecraftServer.tickServer(MinecraftServer.java:1415) ~[paper-1.19.1.jar:git-Paper-91]
  | at net.minecraft.server.MinecraftServer.runServer(MinecraftServer.java:1191) ~[paper-1.19.1.jar:git-Paper-91]
  | at net.minecraft.server.MinecraftServer.lambda$spin$0(MinecraftServer.java:305) ~[paper-1.19.1.jar:git-Paper-91]
  | at java.lang.Thread.run(Thread.java:833) ~[?:?]

which is because the levels LinkedHashMap was modified after it's iterator was created (it didn't matter that this was the first call: the iterator was created, the original map was modified, then the iterator was used, so it throws a CME).

In the latest Paper build, this CME dose not happen because of Spigot's fix. The isIteratingOverLevels flag did not catch the CME. Currently, the iterator is created, the process queue is run, then the isIteratingOverLevels flag is set. Since the isIteratingOverLevels flag is not set when the process queue is run, calls to CraftServer#createWorld and CraftServer#unloadWorld could still go through and modify the underlying LinkedHashMap used by the iterator. The only reason this doesn't happen is because Spigot's fix made the levels variable effectively copy-on-write.

Even though the server dose not crash, #8300 established that Paper is wary of accepting Spigot's fix and just allowing a world to be ticked after it is unloaded. This PR should make it impossible for a plugin to unload a world after the iterator is created since it makes it so there is no time where the iterator is created and the flag is not set.

Bringing it back to your comment, the flag dose not catch this CME, the Spigot fix dose. However, according to Paper this still doesn't make the iterator over the levels map safe, as they're not just going to use the Spigot fix for the CME.

@Lulu13022002
Copy link
Contributor

After relooking at this nevermind you've right the CME was masked by the spigot fix during my test, and the CME can occur when i have tried in the PlayerChatEvent. But this is still a niche case bc this event is deprecated (for it's main thread use) and it's strange to load/unload a world in this event

@willkroboth
Copy link
Contributor Author

What's the consensus on the necessity of this PR? I wouldn't guess pulling this is a question of whether or not Paper wants to align with Spigot on the issue of #8300 since either way this PR is equally useful/useless.

Currently, Paper is not aligned, so this PR helps avoid the 'dangerous' situation of a world being ticked after its resources are closed by something in the process queue. What I got from #8300 is that Paper wants to be cautious here, and this PR helps Paper be cautious.

If Paper (eventually) decides to align with Spigot, it would still be as easy as removing the single patch file. In the meantime, this PR would still help Paper be cautious while they figure out whether or not to align.

Speaking of Paper deciding to align with Spigot, what would it take to address the concerns of this comment and finalize the decision there? Is it even possible to definitively prove loading/unloading worlds while they are being ticked safe or unsafe? The only way I can think of proving that is to test every conceivable situation or read every line of related code. As this comment describes, that probably isn't possible, and there's no guarantee that behavior would continue to hold in the future.
In my opinion, any issues that arise from aligning with Spigot should just be dealt with when they come up instead of teetering on uncertainty right now. I think those issues are also likely to be shared with Spigot, so it should be Spigot's problem to deal with them most of the time.

Coming back to this issue, the logic for pulling this PR now assumes this PR is useful. As Lulu13022002 said, it might be strange to load/unload a world during the PlayerChatEvent or equivalent. I disagree, using Multiverse/Multiverse-Core#2800 as a counterexample to that. I think a good chat-based guided command system can be very intuitive to use, and it's reasonable that a plugin developer might try to create such a system to manage and unload worlds. As this comment mentions, other events could probably also be used to cause this dangerous state, and making this change would document that for plugin developers (I still think updating the documentation of Bukkit#createWorld and Bukkit#unloadWorld might be useful for plugin developers as well).

What do other people think? Is this PR too niche to be pulled as Lulu13022002 suggests, or does it help further the cause of #8300 by being cautious with allowing worlds to tick after having their resources shut down? How does resolving Spigot alignment on this issue continue? Can a decision be made now or is there a concrete path forwards to make progress with a final decision?
I implore others to think about this carefully. If you don't have time to think about this now, wait until you do. I'm fine waiting longer for an answer if it means it is well thought through :)

@electronicboy
Copy link
Member

having had a bit of a deeper look, there is some good news here in that mojangs mailbox mess has allowed them to isolate the work between worlds, which means that there is much less concerns around unloading worlds mid tick as the chunk system is more contained as they can just shutdown the isolated mailbox rather than dealing with the global work queues we had to deal with; the fact that stuff is flushed out, things are unloaded, etc, appears to help mitigate some other further concerns

having tested for a few minutes having disabled the check, even without these set of changes, it seems to cope; I'd imagine in part mostly for the improvements mojang has made over the past few dozen releases in isolating this mess.

I think in part we should expose the iterating value and let plugin authors decide what to do, there's 0 promise ever that this will be safe, I feel iffy on this given that these types of issues will end up at our door before the plugin devs un/loading worlds in precarious places, which is ofc the biggest headache in this type of stuff

@electronicboy electronicboy changed the title Avoid all potential places where the world map could be modified after its iterator is created Avoid potential place where the world map could be modified after its iterator is created Sep 24, 2022
@electronicboy electronicboy merged commit d332623 into PaperMC:master Sep 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants