Consider tools for safely denormalizing data fetches in complex situations #9394

asmecher · 2023-10-10T18:37:38Z

asmecher
Oct 10, 2023
Maintainer

The new submission lists toolset (#7495) includes information in the submission lists about submissions, editorial assignments, individual review assignments, etc. This mixes different "depths" of data much more freely than our submission lists have done before. Building these new lists will be a heavyweight process, and loading everything from the database one record at a time will not scale:

// BAD pattern: not scalable, regardless of whether it's done in PHP land or via subsequent API requests
- get submission list, filtering by status
- for submission 1, get all editorial assignments
- for submission 1, get all review assignments
- for submission 2, get all editorial assignments
- for submission 2, get all review assignments
- ...
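
To make the cost concrete, here is a rough sketch that counts the queries issued by the per-submission pattern versus a batched one. The function names are hypothetical and the page size of 30 is only an example figure.

```php
<?php
// Hypothetical helper: queries issued by the per-submission pattern above,
// where each listed submission triggers one fetch per related entity type.
function perSubmissionQueryCount(int $submissions, int $relatedEntityTypes): int
{
    return 1 + $submissions * $relatedEntityTypes;
}

// Hypothetical helper: queries issued when each related entity type is
// fetched once for the whole list (e.g. WHERE submission_id IN (...)).
function batchedQueryCount(int $relatedEntityTypes): int
{
    return 1 + $relatedEntityTypes;
}

// A page of 30 submissions with editorial and review assignments:
echo perSubmissionQueryCount(30, 2), "\n"; // 61
echo batchedQueryCount(2), "\n";           // 3
```

The first count grows with the page size; the second depends only on how many kinds of related data the list needs.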

Compounding the problem is that aggregate information like the submission lists will have to reimplement access control: we already have access policies for each submission ("can I view the submission's workflow") but we will need the same for aggregate data ("should I include this submission's review information in the list"). This will get more complicated as we enrich the submission lists to include more data that was previously only available on the individual submission level. Our access policies are not currently written with lists in mind -- there is only one "submission" authorized context object per request, for example.

We might be able to solve several of these problems at once.

asmecher · 2023-10-10T19:21:24Z

asmecher
Oct 10, 2023
Maintainer Author

Idea: Denormalize (enrich) the data in SQL.

We can use the submission collector as a basis for the submission list fetch. Consider this (currently used) code:

$collector = Repo::submission()->getCollector()
    ->filterByContextIds([$context->getId()]); // In a way, this is already an access policy check

// ... add filters ...

$submissions = $collector->getMany();
return $response->withJson([
    'itemsMax' => $collector->limit(null)->offset(null)->getCount(),
    'items' => some_method_to_map_to_json($submissions),
]);

This is good enough for the general-purpose API (and single-submission front end). However, it's not rich enough for the submissions list, which needs e.g. review assignment and editorial assignment information.

We could augment the data without re-implementing the code, with SQL like the following:

-- MySQL/MariaDB
SELECT s.submission_id, reviewer_list
FROM submissions s
JOIN (
  SELECT ra.submission_id AS submission_id,
    GROUP_CONCAT('[', JSON_OBJECT('user_id', u.user_id, 'username', u.username), ']') AS reviewer_list 
  FROM review_assignments ra JOIN users AS u ON (ra.reviewer_id = u.user_id)
  GROUP BY ra.submission_id
) AS rd ON (rd.submission_id = s.submission_id);

-- PostgreSQL
SELECT s.submission_id, reviewer_list
FROM submissions s
JOIN (
  SELECT ra.submission_id AS submission_id,
    JSON_AGG(JSON_BUILD_OBJECT('user_id', u.user_id, 'username', u.username)) AS reviewer_list
  FROM review_assignments ra JOIN users AS u ON (ra.reviewer_id = u.user_id)
  GROUP BY ra.submission_id
) AS rd ON (rd.submission_id = s.submission_id);

Typical results:

+---------------+----------------------------------------------------------------------------------------------------------------------------+
| submission_id | reviewer_list                                                                                                              |
+---------------+----------------------------------------------------------------------------------------------------------------------------+
|             1 | [{"user_id": 7, "username": "jjanssen"}],[{"user_id": 9, "username": "amccrae"}],[{"user_id": 10, "username": "agallego"}] |
|             3 | [{"user_id": 9, "username": "amccrae"}],[{"user_id": 10, "username": "agallego"}]                                          |
|             5 | [{"user_id": 8, "username": "phudson"}],[{"user_id": 10, "username": "agallego"}]                                          |
|             6 | [{"user_id": 10, "username": "agallego"}],[{"user_id": 7, "username": "jjanssen"}]                                         |
|             7 | [{"user_id": 8, "username": "phudson"}],[{"user_id": 9, "username": "amccrae"}],[{"user_id": 10, "username": "agallego"}]  |
|             9 | [{"user_id": 7, "username": "jjanssen"}],[{"user_id": 10, "username": "agallego"}]                                         |
|            10 | [{"user_id": 10, "username": "agallego"}],[{"user_id": 9, "username": "amccrae"}]                                          |
|            12 | [{"user_id": 10, "username": "agallego"}],[{"user_id": 8, "username": "phudson"}],[{"user_id": 7, "username": "jjanssen"}] |
|            13 | [{"user_id": 7, "username": "jjanssen"}],[{"user_id": 9, "username": "amccrae"}],[{"user_id": 10, "username": "agallego"}] |
|            15 | [{"user_id": 8, "username": "phudson"}],[{"user_id": 9, "username": "amccrae"}]                                            |
|            17 | [{"user_id": 8, "username": "phudson"}],[{"user_id": 7, "username": "jjanssen"}]                                           |
|            19 | [{"user_id": 8, "username": "phudson"}],[{"user_id": 9, "username": "amccrae"}]                                            |
+---------------+----------------------------------------------------------------------------------------------------------------------------+

(The syntax is unfortunately slightly different for the two DBMSs -- this can be finessed by a Laravel querybuilder macro, I think. There are better functions in more recent releases of MariaDB/MySQL than GROUP_CONCAT, but I don't think we need to bump our requirements that high.)
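
As a rough illustration of how the per-DBMS difference could be hidden behind one call, here is a sketch of a plain PHP helper that returns the driver-specific expression. The function name `jsonArrayAggregate` is hypothetical (it is not part of pkp-lib or Laravel), and the column names are interpolated without escaping, so it is only suitable for trusted, hard-coded input. Note that the MySQL/MariaDB branch differs from the query above: it wraps the whole concatenation in a single pair of brackets so the result parses as one JSON array.

```php
<?php
/**
 * Hypothetical helper: returns a raw SQL expression that aggregates the
 * given columns into one JSON array per group.
 *
 * @param string $driver Driver name, e.g. 'mysql', 'mariadb' or 'pgsql'
 * @param array<string,string> $fields JSON key => SQL column (trusted input only)
 */
function jsonArrayAggregate(string $driver, array $fields): string
{
    $pairs = [];
    foreach ($fields as $key => $column) {
        $pairs[] = "'" . $key . "', " . $column;
    }
    $args = implode(', ', $pairs);

    return match ($driver) {
        // One pair of brackets around the comma-separated objects; the
        // output length is still subject to group_concat_max_len.
        'mysql', 'mariadb' => "CONCAT('[', GROUP_CONCAT(JSON_OBJECT($args)), ']')",
        'pgsql' => "JSON_AGG(JSON_BUILD_OBJECT($args))",
        default => throw new InvalidArgumentException("Unsupported driver: $driver"),
    };
}

$fields = ['user_id' => 'u.user_id', 'username' => 'u.username'];
echo jsonArrayAggregate('pgsql', $fields), "\n";
echo jsonArrayAggregate('mysql', $fields), "\n";
```

A query builder macro could presumably wrap something like this, choosing the branch from the active connection's driver.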

Characteristics:

We can add the denormalized column(s) by building cleanly on the existing general-purpose submission query builder (collector).
This does not require a main GROUP BY clause (good for performance, and avoids collisions with other similar requirements)
This keeps the complexity within a single subquery -- does not e.g. pollute the table namespace
Performs extremely well on SciELO dataset (0.165 seconds for a query with 372 candidate rows out of a submissions table containing ~200k entries and a review_assignments table containing ~300k) -- very fast compared with any sort of repeated round trips to/from PHP and/or the client side!

In PHP in the _submissions endpoint, where we want to denormalize for performance, this would look something like...

$collector = Repo::submission()->getCollector()
    ->filterByContextIds([$context->getId()]) // In a way, this is already an access policy check
    ->augmentWithReviewerData()
    ->augmentWithStageAssignmentData()
...

(I think we could also apply this idea to our localizations -- so that we can return the full set of localized data to the UI, rather than using multiple joins to get the "primary" and "current" locale data. But that's another story.)

4 replies

jonasraoni Oct 12, 2023
Collaborator

I've been using these new aggregation functions and they work well, even though they have limits (e.g. GROUP_CONCAT output is truncated at the length set by the group_concat_max_len variable). If we require a higher MySQL version, its JSON aggregators will be available.

Another variant that I've used in the past is just to bring all data together (which would allow sharing some filters and doesn't require fancy functions). It was pretty good performance-wise (network traffic was cheaper than extra requests), and just required a nice way to loop through the data.

I mean, instead of grouping data at the SQL level, you'd have a simple query like this (where the submission data would be duplicated):

SELECT submission.id AS submission_id, submission.title, author.id AS author_id, author.name
FROM submission
INNER JOIN author ON author.submission_id = submission.id
ORDER BY submission.id, author.id -- Required

And the deduplication would happen at the code level, which could look like this:

while ($resultSet->next('submission_id')) { // advances when submission.id changes
  while ($resultSet->next('author_id')) { // advances when author.id changes
    //...
  }
}
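
For comparison, here is a minimal sketch of the same deduplication over plain arrays, with no result-set cursor. The column names follow the query above and the sample rows are made up. Because the result is keyed by submission ID, this version does not strictly depend on the ORDER BY; a streaming version that emits each submission as soon as the ID changes would.

```php
<?php
/**
 * Groups flat joined rows (one row per submission/author pair) into one
 * entry per submission, each with its list of authors.
 */
function groupAuthorsBySubmission(iterable $rows): array
{
    $submissions = [];
    foreach ($rows as $row) {
        $id = $row['submission_id'];
        // The submission columns repeat on every row; keep the first copy only.
        $submissions[$id] ??= ['id' => $id, 'title' => $row['title'], 'authors' => []];
        $submissions[$id]['authors'][] = ['id' => $row['author_id'], 'name' => $row['name']];
    }
    return array_values($submissions);
}

// Made-up rows, shaped like the result of the join above.
$rows = [
    ['submission_id' => 1, 'title' => 'First', 'author_id' => 10, 'name' => 'Author A'],
    ['submission_id' => 1, 'title' => 'First', 'author_id' => 11, 'name' => 'Author B'],
    ['submission_id' => 2, 'title' => 'Second', 'author_id' => 12, 'name' => 'Author C'],
];

$grouped = groupAuthorsBySubmission($rows);
print_r($grouped);
```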

I guess the Laravel toolset must be doing something like this internally.

asmecher Oct 12, 2023
Maintainer Author

@jonasraoni, that would be the simplest approach from the database perspective, but I suspect it won't scale well with the number of relationships we work with. Consider...

one submission row
x 4 authors = 4 rows
x 3 review rounds = 12 rows
x 3 reviewers per round = 36 rows
x 4 stage assignments (a sub editor, a copyeditor, a layout editor, an author) = 144 rows
x 3 publications = 432 rows (possibly even worse, as author metadata is replicated on publications)
...and so on (for example, it's very likely we'd want to denormalize the _settings tables for many of the above in the same way)

We'd also have to make sure the rows are read in the same order as the joins. I suspect there's lots of room for things to go wrong here.

jonasraoni Oct 13, 2023
Collaborator

Yep! This is flexible, but it has its drawbacks/requirements, just wanted to share another way :)

I think it's better to invest time on the "Eloquent: Relationships" with the aggregate functions for local optimizations (e.g. retrieve author names in all languages for an author query).

jonasraoni Oct 13, 2023
Collaborator

Yeah, it's possible to compose both, this worked fine for me:

use Illuminate\Contracts\Database\Query\Builder;
use Illuminate\Support\Facades\DB;
use Illuminate\Database\Eloquent\Model;

class Publication extends Model
{
    protected $primaryKey = 'publication_id';
}

class Authors extends Model
{
    protected $primaryKey = 'author_id';

    public function scopeWithName(Builder $query)
    {
        return $query->select('*', DB::raw(
            "(
                SELECT JSON_OBJECTAGG(s.locale, s.setting_value)
                FROM author_settings s
                WHERE s.setting_name = 'givenName'
                AND s.author_id = authors.author_id
            ) AS given_name"
        ));
    }

    public function getGivenNameAttribute($value)
    {
        return json_decode($value);
    }

    public function publication()
    {
        return $this->belongsTo(Publication::class, 'publication_id');
    }
}

foreach (Authors::with('publication')->withName()->get() as $x) {
    print_r([
        'ID' => $x->author_id,
        'Publication' => $x->publication->publication_id,
        'Name' => $x->given_name
    ]);
}

asmecher · 2023-10-10T19:36:14Z

asmecher
Oct 10, 2023
Maintainer Author

Idea: Use query builders to implement access policies in SQL.

Building on Idea: Denormalize (enrich) the data in SQL, use querybuilders to build access control in SQL (in place of the current access policy toolset, which is aging poorly).

This would be usable in both aggregate (submission lists) and individual (single submission workflow) cases using a single implementation. It might also prove a general replacement for our current access policy toolkit.

For simple cases, this is clear: PHP checks that require objects to be fetched and instantiated first get replaced by collector calls.

Old patterns (used e.g. in access policies):

$reviewAssignment = $reviewAssignmentDao->getById($reviewId);

// Only supports one submission; hard to know where in the policy set it's stored in the authorized context!
$submission = $this->getAuthorizedContextObject(Application::ASSOC_TYPE_SUBMISSION);

if ($reviewAssignment->getSubmissionId() != $submission->getId()) {
    return AuthorizationPolicy::AUTHORIZATION_DENY;
}

New pattern that might get a list of reviews by composing collectors:

$submissionCollector = Repo::submission()->getCollector()
    ->filterByContextIds([$context->getId()])
    ->filterByAssignedUserIds([$user->getId()]);

$reviewAssignmentCollector = Repo::reviewAssignment()->getCollector()
    ->filterBySubmissionIds($submissionCollector);

// The $reviewAssignmentCollector can now be used to get information about one or many
// review assignments that are assigned to a specific user, in a specific context, without needing
// to execute a large series of queries to build up data. In fact, using this object would cause only
// a single query to be executed if e.g. `getMany` is called.
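
To show why composing collectors can stay a single query, here is a minimal sketch in which one query is nested inside another as a subquery. `SqlFragment` is a hypothetical stand-in, not the actual collector or query builder, and the table and column names are used only for illustration.

```php
<?php
/**
 * Hypothetical, minimal stand-in for a collector: it only knows how to render
 * itself as SQL plus bindings, so that one instance can be nested in another.
 */
final class SqlFragment
{
    public function __construct(public string $sql, public array $bindings = [])
    {
    }

    /** Restricts $column to the values selected by $subquery. No query is run here. */
    public function whereIn(string $column, SqlFragment $subquery): self
    {
        $glue = str_contains($this->sql, ' WHERE ') ? ' AND ' : ' WHERE ';
        return new self(
            $this->sql . $glue . $column . ' IN (' . $subquery->sql . ')',
            array_merge($this->bindings, $subquery->bindings)
        );
    }
}

// "Submissions in context 1", expressed as a query rather than a list of IDs.
$submissionIds = new SqlFragment(
    'SELECT s.submission_id FROM submissions s WHERE s.context_id = ?',
    [1]
);

// The review assignment query embeds it; the database sees one statement.
$reviewAssignments = (new SqlFragment('SELECT ra.* FROM review_assignments ra'))
    ->whereIn('ra.submission_id', $submissionIds);

echo $reviewAssignments->sql, "\n";
```

The access restriction travels with the inner query, so the same fragment could serve both a list and a single-record lookup.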

To be determined:

whether this is suitable for complex access policies like the submission access policy (where several role-based conditions can satisfy an allow/deny decision through building a decision tree);
where the code would live
how/whether it plays with Laravel's authorization toolset

1 reply

jonasraoni Oct 12, 2023
Collaborator

Random comments:

I haven't worked much with the "access policy" toolset, but I found it a bit confusing to debug, and it's really not ideal when dealing with lots of data.
It's tough to have a lot of rules spread/duplicated across the code. Perhaps dynamic roles/permissions would help a bit with the disentangling? So I'd welcome anything that centralizes these business rules a bit, to decrease the risk of permissions getting out of sync (e.g. updating the permission in one place and forgetting to apply it to similar cases).
Yeah, the "Collector" needs to be checked against the Laravel toolset (e.g. lazy/eager loaders). I think it looks OK in general (simple to understand/use), so it might be fine to keep and augment it, as long as its functionality doesn't completely overlap with Laravel's toolset.
I agree with replacing the access policy with natural filters, as in the example above, wherever possible.

jardakotesovec · 2023-10-11T12:25:04Z

jardakotesovec
Oct 11, 2023
Collaborator

I would like to mention an interesting middle-ground approach (between 1 query and 500 SQL queries), which I learned while building GraphQL APIs and found quite easy and efficient. The initial JS implementation came from Facebook (https://github.com/graphql/dataloader), but there are implementations in other languages as well; here is a PHP one: https://github.com/overblog/dataloader-php

To briefly explain it, let's say that you need submissions, submission->review_assignments and submission->review_assignments->user to build your detailed response.

You can achieve that with 3 queries:

Fetch submissions with the relevant filters. Let's say this fetches submissions with IDs 1, 3, 4, 6.
Fetch all review_assignments whose submission_id is in the previous list: WHERE submission_id IN (1, 3, 4, 6). Let's say this results in review_assignments with the following user IDs: (1, 5, 8, 12, 13, 16, 18, 19, 20).
Using these user IDs (1, 5, 8, 12, 13, 16, 18, 19, 20), fetch the relevant users.

The dataloader is then responsible for matching up the correct IDs. That logic has to be correct, but it is usually a very simple loop, and in practice a couple of such dataloaders cover these dependency use cases.
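
The three steps and the matching loop can be sketched in plain PHP as follows. The arrays stand in for database tables and the fetch functions stand in for the two `IN (...)` queries; all names and sample data are made up, and no promise-based dataloader library is involved.

```php
<?php
// In-memory stand-ins for tables (made-up sample data).
$reviewAssignmentsTable = [
    ['review_id' => 1, 'submission_id' => 1, 'reviewer_id' => 5],
    ['review_id' => 2, 'submission_id' => 1, 'reviewer_id' => 8],
    ['review_id' => 3, 'submission_id' => 3, 'reviewer_id' => 5],
    ['review_id' => 4, 'submission_id' => 9, 'reviewer_id' => 12],
];
$usersTable = [5 => 'reviewer_a', 8 => 'reviewer_b', 12 => 'reviewer_c'];

/** Stand-in for: SELECT * FROM review_assignments WHERE submission_id IN (...) */
function fetchReviewAssignments(array $table, array $submissionIds): array
{
    return array_values(array_filter(
        $table,
        fn (array $row) => in_array($row['submission_id'], $submissionIds, true)
    ));
}

/** Stand-in for: SELECT * FROM users WHERE user_id IN (...) */
function fetchUsernames(array $table, array $userIds): array
{
    return array_intersect_key($table, array_flip($userIds));
}

$submissionIds = [1, 3, 4, 6]; // result of query 1
$assignments = fetchReviewAssignments($reviewAssignmentsTable, $submissionIds); // query 2
$userIds = array_values(array_unique(array_column($assignments, 'reviewer_id')));
$usernames = fetchUsernames($usersTable, $userIds); // query 3

// The matching step: attach reviewers to their submissions in PHP.
$reviewersBySubmission = array_fill_keys($submissionIds, []);
foreach ($assignments as $assignment) {
    $reviewersBySubmission[$assignment['submission_id']][] = $usernames[$assignment['reviewer_id']];
}
print_r($reviewersBySubmission);
```

However many submissions are on the page, the number of queries stays at three.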

So what I imagine in our codebase is that in the Submission mapping class, when you calculate stages, you could ask the dataloader for the review_assignments for a given submission_id. The dataloader is then responsible for collecting all these submission IDs within one tick of the synchronous event loop, making one query after that tick, and resolving the promises afterwards, which allows execution to carry on. So it requires using promises to get this working. In Node.js that is all you use, so it is a very native pattern. I am not sure whether it would be a desirable pattern in a PHP codebase.

In short, it would really just be batching the SQL requests that come from the mappers.

Obviously building one super SQL query will always be fastest, but this approach still scales well using basic SQL queries, which always return the same object shapes.

Happy to do a deeper dive into what it could look like in our codebase if there is interest.

8 replies

jonasraoni Oct 12, 2023
Collaborator

I think we can discuss GraphQL in the other discussion that I've linked. In general, OJS isn't a high-demand system, and I doubt many journals would have performance issues with properly implemented data loaders (i.e. it's not in the nature of such a system to receive lots of traffic).

At this moment we still have 2 active patterns to access the database in the codebase (just like the frontend which still has jQuery + Vue), so before adopting a new pattern, we should commit to first getting rid of the oldest pattern.

jardakotesovec Oct 12, 2023
Collaborator

@jonasraoni Agreed, moved my graphql comment to the existing discussion.

asmecher Oct 12, 2023
Maintainer Author

@jardakotesovec, it looks like Laravel's eager loading implements fetches with relationships as you have described above. From their documentation, using with('...') sets up (per use case) eager loading of related entities. A pattern like this:

$books = Book::with('author')->get();
 
foreach ($books as $book) {
    echo $book->author->name;
}

...results in just 2 queries, one per entity:

select * from books
 
select * from authors where id in (1, 2, 3, 4, 5, ...)

This feels like a very promising direction, as it doesn't set up a massive split between denormalized (high volume/complexity) and everyday implementations.

(Credit to @touhidurabir for pointing this out)

jardakotesovec Oct 12, 2023
Collaborator

@asmecher Agreed, I referenced these docs yesterday in my response below.

asmecher Dec 18, 2023
Maintainer Author

I think we can implement this to add review assignment data to submission lists without many infrastructural changes.

Currently we are fetching review assignment data on a submission-by-submission basis:

pkp-lib/classes/submission/maps/Schema.php

Line 282 in af8ad0f

    
           $reviewAssignments = Repo::reviewAssignment()->getCollector()->filterBySubmissionIds([$submission->getId()])->getMany();

That's fine for a single submission, but we can optimize it with just a few lines of code to fetch a single collection containing review assignments for all submissions. Changes get made just in the schema mapper class for submissions.

We'll need to consider the public entry points to the schema mapper:

Single-submission entry points: map, summarize, mapToSubmissionList, summarizeWithoutPublication
Multiple-submission entry points: mapMany, summarizeMany, mapManyToSubmissionList

I propose we add a new property to the mapper called $reviewAssignments.

In the single-submission entry points, we initialize it for the same behaviour as before:

$this->reviewAssignments = Repo::reviewAssignment()->getCollector()->filterBySubmissionIds([$submission->getId()])->getMany();

For multiple-submission entry points, we can use the $collection provided to e.g. mapMany (which would need to be converted to a Collection typehint, rather than an Enumerable):

$this->reviewAssignments = Repo::reviewAssignment()->getCollector()->filterBySubmissionIds($collection->keys()->toArray())->getMany()->remember();

(Note the remember, which allows us to iterate through the list multiple times. Otherwise it would be burn-after-reading.)

Then, wherever we need review assignment data, we read it from the $reviewAssignments property, filtering by submission ID to ensure we don't include data from other submissions:

diff --git a/classes/submission/maps/Schema.php b/classes/submission/maps/Schema.php
index fd700b09fa..6c2f11ce08 100644
--- a/classes/submission/maps/Schema.php
+++ b/classes/submission/maps/Schema.php
@@ -279,7 +285,7 @@ class Schema extends \PKP\core\maps\Schema
      */
     protected function getPropertyReviewAssignments(Submission $submission): array
     {
-        $reviewAssignments = Repo::reviewAssignment()->getCollector()->filterBySubmissionIds([$submission->getId()])->getMany();
+        $reviewAssignments = $this->reviewAssignments->filter(fn($reviewAssignment) => $reviewAssignment->getSubmissionId() == $submission->getId());
 
         $reviews = [];
         foreach ($reviewAssignments as $reviewAssignment) {
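
One possible refinement, sketched here as an assumption rather than as part of the proposal: filtering the shared list once per submission rescans every review assignment each time, so the list could instead be indexed by submission ID in a single pass. The rows below are plain arrays with made-up data; the real objects would expose getSubmissionId() instead, and Laravel collections offer groupBy for the same purpose.

```php
<?php
/**
 * Indexes review assignments by submission ID in a single pass, so that each
 * submission's lookup no longer scans the whole list.
 */
function indexBySubmissionId(iterable $reviewAssignments): array
{
    $index = [];
    foreach ($reviewAssignments as $reviewAssignment) {
        $index[$reviewAssignment['submission_id']][] = $reviewAssignment;
    }
    return $index;
}

// Made-up review assignments for two submissions.
$reviewAssignments = [
    ['review_id' => 1, 'submission_id' => 7],
    ['review_id' => 2, 'submission_id' => 9],
    ['review_id' => 3, 'submission_id' => 7],
];

$index = indexBySubmissionId($reviewAssignments);

// Per-submission lookup; submissions without reviews get an empty list.
$forSubmission7 = $index[7] ?? [];
$forSubmission8 = $index[8] ?? [];
print_r(array_column($forSubmission7, 'review_id'));
```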

jardakotesovec · 2023-10-11T18:38:35Z

jardakotesovec
Oct 11, 2023
Collaborator

@asmecher Just to follow up on our discussion.
There is a way to do something very similar to what dataloaders achieve (batching) with Eloquent: https://laravel.com/docs/10.x/eloquent-relationships#eager-loading. From their example it looks like the easiest option, but I'm not sure how well it fits our existing database layer :-).

1 reply

jonasraoni Oct 12, 2023
Collaborator

Yeah, we have plans to adopt this. I'm not sure the document is public (to post the link here), but there's a pinned item on the "Optimization" channel in Mattermost, with an older document that still has some relevance.

I think this is the best option performance-wise, where we still have some control over what's being retrieved and how.
