Skip to content

Commit

Permalink
[Fix rails#51242] Rework in_batches(use_ranges: true) to be more effi…
Browse files Browse the repository at this point in the history
…cient

When `use_ranges: true` option is used, we do not need to return the
whole list of primary keys for each range. This wastes network resources
and takes longer than necessary, as only the last primary key from the
range would be needed.

We can instead use OFFSET to peek at the last primary key of the next
batch, instead of returning the whole list of pks from the range.

There is a trade-off, however, as we need an additional query to confirm
the size of the very last batch, and the value of the last last primary
key. Unless we only have a handful of small batches, this strategy is a
winner overall, as it reduces the time and network resources spent
generating batches.
  • Loading branch information
maximerety committed Apr 2, 2024
1 parent 4a7c86a commit 58a0796
Show file tree
Hide file tree
Showing 3 changed files with 73 additions and 21 deletions.
16 changes: 16 additions & 0 deletions activerecord/CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,19 @@
* Optimize Active Record batching further when using ranges.

When used for full table iterations, or with the `use_ranges: true` option, `in_batches`
previously retrieved all primary keys in the range. For example, `[10, 11, 12, ..., 20]` was
loaded to generate a relation in the form of `WHERE id > 10 AND id <= 20` for a batch.

But since `10` would already be known from the previous iteration, it only needs to retrieve
the last key in the range (`20`) using a `LIMIT 1 OFFSET ω` construct (ω = batch size - 1),
thus avoiding the unnecessary loading and discarding of all other primary keys in the range.

E.g., tested on a PostgreSQL table with 10M records and batches of 10k records, the
generation of relations for the 1000 batches was `x2.4` times faster (`5.6s` vs `2.3s`) and
used `x900` less bandwidth (`180MB` vs. less than `0.2MB`).

*Maxime Réty*

* Retry known idempotent SELECT queries on connection-related exceptions

SELECT queries we construct by walking the Arel tree and / or with known model attributes
Expand Down
28 changes: 20 additions & 8 deletions activerecord/lib/active_record/relation/batches.rb
Original file line number Diff line number Diff line change
Expand Up @@ -373,32 +373,44 @@ def batch_on_unloaded_relation(relation:, start:, finish:, load:, order:, use_ra
if load
records = batch_relation.records
ids = records.map(&:id)
ids_size = ids.size
ids_last = ids.last
yielded_relation = where(primary_key => ids)
yielded_relation.load_records(records)
elsif (empty_scope && use_ranges != false) || use_ranges
ids = batch_relation.ids
finish = ids.last
if finish
yielded_relation = apply_finish_limit(batch_relation, finish, batch_orders)
# Efficiently peak at the last id for the next batch using offset and limit.
ids_size = batch_limit
ids_last = batch_relation.offset(batch_limit - 1).limit(1).ids.first
# If the last id is not found using offset, there is at most one more batch of size < batch_limit.
# We retry getting the list of ids so that we have the exact size and last id.
unless ids_last
ids = batch_relation.ids
ids_size = ids.size
ids_last = ids.last
end
if ids_last
yielded_relation = apply_finish_limit(batch_relation, ids_last, batch_orders)
yielded_relation = yielded_relation.except(:limit, :order)
yielded_relation.skip_query_cache!(false)
end
else
ids = batch_relation.ids
ids_size = ids.size
ids_last = ids.last
yielded_relation = where(primary_key => ids)
end

break if ids.empty?
break if ids_size == 0

primary_key_offset = ids.last
primary_key_offset = ids_last
raise ArgumentError.new("Primary key not included in the custom select clause") unless primary_key_offset

yield yielded_relation

break if ids.length < batch_limit
break if ids_size < batch_limit

if limit_value
remaining -= ids.length
remaining -= ids_size

if remaining == 0
# Saves a useless iteration when the limit is a multiple of the
Expand Down
50 changes: 37 additions & 13 deletions activerecord/test/cases/batches_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -327,15 +327,15 @@ def test_in_batches_has_attribute_readers
end

def test_in_batches_should_yield_relation_if_block_given
assert_queries_count(6) do
assert_queries_count(7) do
Post.in_batches(of: 2) do |relation|
assert_kind_of ActiveRecord::Relation, relation
end
end
end

def test_in_batches_should_be_enumerable_if_no_block_given
assert_queries_count(6) do
assert_queries_count(7) do
Post.in_batches(of: 2).each do |relation|
assert_kind_of ActiveRecord::Relation, relation
end
Expand Down Expand Up @@ -370,7 +370,7 @@ def test_in_batches_each_record_should_be_ordered_by_id
end

def test_in_batches_update_all_affect_all_records
assert_queries_count(6 + 6) do # 6 selects, 6 updates
assert_queries_count(7 + 6) do # 7 selects, 6 updates
Post.in_batches(of: 2).update_all(title: "updated-title")
end
assert_equal Post.all.pluck(:title), ["updated-title"] * Post.count
Expand Down Expand Up @@ -416,7 +416,7 @@ def test_in_batches_should_be_loaded
end

def test_in_batches_if_not_loaded_executes_more_queries
assert_queries_count(@total + 1) do
assert_queries_count(@total + 2) do
Post.in_batches(of: 1, load: false) do |relation|
assert_not_predicate relation, :loaded?
end
Expand Down Expand Up @@ -490,7 +490,7 @@ def test_in_batches_when_loaded_can_return_an_enum
end

def test_in_batches_should_return_relations
assert_queries_count(@total + 1) do
assert_queries_count(@total + 2) do
Post.in_batches(of: 1) do |relation|
assert_kind_of ActiveRecord::Relation, relation
end
Expand All @@ -507,7 +507,7 @@ def test_in_batches_should_start_from_the_start_option

def test_in_batches_should_end_at_the_finish_option
post = Post.order("id DESC").where("id <= ?", 5).first
assert_queries_count(7) do
assert_queries_count(8) do
relation = Post.in_batches(of: 1, finish: 5, load: true).reverse_each.first
assert_equal post, relation.last
end
Expand All @@ -516,32 +516,56 @@ def test_in_batches_should_end_at_the_finish_option
def test_in_batches_executes_range_queries_when_unconstrained
c = Post.lease_connection
quoted_posts_id = Regexp.escape(c.quote_table_name("posts.id"))

relations = assert_queries_match(/ORDER BY #{quoted_posts_id} ASC LIMIT \S+ OFFSET \S+\z/i, count: 6) do
assert_queries_match(/ORDER BY #{quoted_posts_id} ASC LIMIT \S+\z/i, count: 1) do
Post.in_batches(of: 2).to_a
end
end

assert_queries_match(/WHERE #{quoted_posts_id} > .+ AND #{quoted_posts_id} <= .+/i) do
Post.in_batches(of: 2) { |relation| assert_kind_of Post, relation.first }
relations.each { |relation| assert_kind_of Post, relation.first }
end
end

def test_in_batches_executes_in_queries_when_unconstrained_and_opted_out_of_ranges
c = Post.lease_connection
quoted_posts_id = Regexp.escape(c.quote_table_name("posts.id"))

relations = assert_queries_match(/ORDER BY #{quoted_posts_id} ASC LIMIT \S+\z/i, count: 6) do
Post.in_batches(of: 2, use_ranges: false).to_a
end

assert_queries_match(/#{quoted_posts_id} IN \(.+\)/i) do
Post.in_batches(of: 2, use_ranges: false) { |relation| assert_kind_of Post, relation.first }
relations.each { |relation| assert_kind_of Post, relation.first }
end
end

def test_in_batches_executes_in_queries_when_constrained
c = Post.lease_connection
quoted_posts_id = Regexp.escape(c.quote_table_name("posts.id"))

relations = assert_queries_match(/ORDER BY #{quoted_posts_id} ASC LIMIT \S+\z/i, count: 3) do
Post.where("id < ?", 5).in_batches(of: 2).to_a
end

assert_queries_match(/#{quoted_posts_id} IN \(.+\)/i) do
Post.where("id < ?", 5).in_batches(of: 2) { |relation| assert_kind_of Post, relation.first }
relations.each { |relation| assert_kind_of Post, relation.first }
end
end

def test_in_batches_executes_range_queries_when_constrained_and_opted_in_into_ranges
c = Post.lease_connection
quoted_posts_id = Regexp.escape(c.quote_table_name("posts.id"))

relations = assert_queries_match(/ORDER BY #{quoted_posts_id} ASC LIMIT \S+ OFFSET \S+\z/i, count: 3) do
assert_queries_match(/ORDER BY #{quoted_posts_id} ASC LIMIT \S+\z/i, count: 1) do
Post.where("id < ?", 5).in_batches(of: 2, use_ranges: true).to_a
end
end

assert_queries_match(/#{quoted_posts_id} > .+ AND #{quoted_posts_id} <= .+/i) do
Post.where("id < ?", 5).in_batches(of: 2, use_ranges: true) { |relation| assert_kind_of Post, relation.first }
relations.each { |relation| assert_kind_of Post, relation.first }
end
end

Expand All @@ -554,11 +578,11 @@ def test_in_batches_no_subqueries_for_whole_tables_batching
end

def test_in_batches_shouldnt_execute_query_unless_needed
assert_queries_count(2) do
assert_queries_count(3) do
Post.in_batches(of: @total) { |relation| assert_kind_of ActiveRecord::Relation, relation }
end

assert_queries_count(1) do
assert_queries_count(2) do
Post.in_batches(of: @total + 1) { |relation| assert_kind_of ActiveRecord::Relation, relation }
end
end
Expand Down Expand Up @@ -819,7 +843,7 @@ def test_find_in_batches_should_return_a_sized_enumerator

test ".in_batches bypasses the query cache for its own queries" do
Post.cache do
assert_queries_count(2) do
assert_queries_count(4) do
Post.in_batches { }
Post.in_batches { }
end
Expand Down

0 comments on commit 58a0796

Please sign in to comment.