No DocIDs will be created if maxPagesToFetch is reached (most times). #430
Fixes #413
This is not a thread-safe solution: one thread may fill the DocIDServer while another thread is looping. However, the amount of memory wasted will be reduced.
The current problem with this hotfix:
Suppose there's only 1 slot left before maxPagesToFetch is reached.
There should be a method to remove docB from the DocIDServer when it gets rejected, making the Frontier aware of the DocIDServer. But since CrawlController exposes setFrontier and setDocIdServer methods, this wouldn't be safe either: someone working with multiple DocIDServers and Frontiers could cause unpredictable situations.
Since that isn't a perfect solution either, I decided to keep the hotfix as simple as possible. However, if you think it would be OK, I can quickly add another commit that lets the Frontier remove unscheduled URLs from the DocIDServer, given the problem stated above.
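To make the race concrete, here is a minimal, hypothetical sketch of the check-then-act pattern involved. The class and method names (`DocIdStore`, `tryCreateDocId`) are illustrative only and are not crawler4j's actual API; the point is the gap between checking the counter and incrementing it:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not crawler4j code. Illustrates the
// check-then-act race described above: two threads can both
// observe "1 slot left" and both create a DocID, even though
// only one of the URLs will actually be scheduled.
class DocIdStore {
    private final int maxPagesToFetch;
    private final AtomicInteger docCount = new AtomicInteger();

    DocIdStore(int maxPagesToFetch) {
        this.maxPagesToFetch = maxPagesToFetch;
    }

    boolean tryCreateDocId() {
        // check: is there room left?
        if (docCount.get() >= maxPagesToFetch) {
            return false;
        }
        // act: another thread may have incremented in between,
        // so the count can overshoot maxPagesToFetch.
        docCount.incrementAndGet();
        return true;
    }
}
```

Single-threaded the behavior is as expected (the second call past the limit returns false); the overshoot only appears when two threads interleave between the check and the increment, which is why a per-URL cleanup (or a single atomic reserve-slot operation) would be needed for a complete fix.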