Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include charts' parent tags in the algolia index #3790

Merged
merged 2 commits into from
Jul 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 9 additions & 1 deletion baker/algolia/indexChartsToAlgolia.ts
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ import {
countries,
orderBy,
removeTrailingParenthetical,
uniq,
} from "@ourworldindata/utils"
import { MarkdownTextWrap } from "@ourworldindata/components"
import { getAnalyticsPageviewsByUrlObj } from "../../db/model/Pageview.js"
Expand Down Expand Up @@ -161,6 +162,8 @@ const getChartsRecords = async (

const pageviews = await getAnalyticsPageviewsByUrlObj(knex)

const parentTagsByChildName = await db.getParentTagsByChildName(knex)

const records: ChartRecord[] = []
for (const c of parsedRows) {
// Our search currently cannot render explorers, so don't index them because
Expand All @@ -181,6 +184,11 @@ const getChartsRecords = async (
fontSize: 10, // doesn't matter, but is a mandatory field
}).plaintext

const parentTags = c.tags.flatMap(
// a chart can be tagged with a tag that isn't in the tag graph
(tag) => parentTagsByChildName[tag] || []
)

const record = {
objectID: c.id.toString(),
chartId: c.id,
Expand All @@ -192,7 +200,7 @@ const getChartsRecords = async (
numDimensions: parseInt(c.numDimensions),
publishedAt: c.publishedAt,
updatedAt: c.updatedAt,
tags: c.tags as any as string[],
tags: uniq([...c.tags, ...parentTags]),
keyChartForTags: c.keyChartForTags as string[],
titleLength: c.title.length,
// Number of references to this chart in all our posts and pages
Expand Down
45 changes: 43 additions & 2 deletions db/db.ts
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ import {
GRAPHER_DB_PORT,
} from "../settings/serverSettings.js"
import { registerExitHandler } from "./cleanup.js"
import { keyBy } from "@ourworldindata/utils"
import { createTagGraph, keyBy } from "@ourworldindata/utils"
import {
DbChartTagJoin,
ImageMetadata,
Expand All @@ -26,8 +26,10 @@ import {
DbPlainPostGdocLink,
OwidGdocLinkType,
OwidGdoc,
DbPlainTag,
TagGraphNode,
} from "@ourworldindata/types"
import { groupBy } from "lodash"
import { groupBy, uniq } from "lodash"
import { gdocFromJSON } from "./model/Gdoc/GdocFactory.js"

// Return the first match from a mysql query
Expand Down Expand Up @@ -525,6 +527,45 @@ export async function getFlatTagGraph(knex: KnexReadonlyTransaction): Promise<
return { ...tagGraphByParentId, __rootId: tagGraphRootIdResult.id }
}

// DFS through the tag graph and create a map of parent tags for each child tag
// e.g. { "Child": [ "Parent", "Grandparent" ], "Parent": [ "Grandparent" ] }
// parent tags are listed in no particular order
export async function getParentTagsByChildName(
trx: KnexReadonlyTransaction
): Promise<Record<DbPlainTag["name"], DbPlainTag["name"][]>> {
const { __rootId, ...flatTagGraph } = await getFlatTagGraph(trx)
const tagGraph = createTagGraph(flatTagGraph, __rootId)

const tagsById = await trx("tags")
.select("id", "name")
.then((tags) => keyBy(tags, "id"))

const parentTagsByChildName: Record<
DbPlainTag["name"],
DbPlainTag["name"][]
> = {}

function trackParents(node: TagGraphNode): void {
for (const child of node.children) {
trackParents(child)
}

const preexistingParents = parentTagsByChildName[node.name] ?? []
// node.path is an array of tag ids from the root to the current node
// slice to remove the root node and the current node, then map them into tag names
const newParents = node.path.slice(1, -1).map((id) => tagsById[id].name)

parentTagsByChildName[node.name] = uniq([
...preexistingParents,
...newParents,
])
}

trackParents(tagGraph)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that this code works, but the structure of how we're building and working with this graph is more complex than it needs to be.

For example, you need to build the graph twice, once flat, once with the object model, and you also need to do a second query to fetch the tag names. You have to rewrite the parents multiple times instead of computing it just once.

A more standard graph representation would be something like idToNode (and maybe idToChildIds or idToParentId), and would make the subsequent traversal much easier. Then you also skip entirely the fake root node.

In this case, the flat tag graph is basically idToChildNodes, plus the root id. The graph would be better if it was just idToNode, ungrouped, since the node already contains the parent id. Let's imagine that scenario.

const idToNode = getFlatTagGraph(trx)

const tagNameToParentTags = {}
for (const node of Object.values(idToNode)) {
  const name = node.name
  const parents = []

  let parentId = node.parentId
  while (parentId) {
    const parent = idToNode[parentId]
    parents.append(name)
    parentId = parent.parentId
  }

  tagNameToParentTags[name] = parents
}

It becomes easy to iterate across every node, and easy to follow the parent chain for every node too.

How much would that mess up other code elsewhere?

Copy link
Member Author

@ikesau ikesau Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe #3800 might be adding some confusion here 😛

A few things:

  1. With the tag_graph table (not the tags' parentId column) a node can have multiple parents:
image
  1. The fake root node was added as a simple way to coexist with a tags table that has non-topic/area tags in it. Anything that's a child of the root tag node is part of the graph. Things that aren't (e.g. Abstract) can be ignored.

  2. If I hadn't already written code for the tag graph UI, I could have written this function with idToParentIds and idToNode maps, but given I had those 2 functions already (which, if you squint, create the same data structure) it seemed simpler at the time to do it this way.

We should have a call about this to make sure I'm understanding you correctly though. 🙂


return parentTagsByChildName
}

larsyencken marked this conversation as resolved.
Show resolved Hide resolved
export async function updateTagGraph(
knex: KnexReadWriteTransaction,
tagGraph: FlatTagGraph
Expand Down