Skip to content

Enhance tfact_problem_events with Submission Details #1548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

quazi-h
Copy link
Contributor

@quazi-h quazi-h commented May 7, 2025

What are the relevant tickets?

https://github.com/mitodl/hq/issues/7195

Description (What does it do?)

Combine data from tracking logs, studentmodule, and studentmodulehistoryextended in order to enhance the tfact_problem_events data used for reporting student activity.

Affected tables:
tfact_problem_events

How can this be tested?

dbt build --select tfact_problem_events

@quazi-h quazi-h changed the title WIP first attempt at incorporating coursemodule data into the problem… Enhance tfact_problem_events with Submission Details May 7, 2025
@quazi-h quazi-h marked this pull request as ready for review May 20, 2025 22:50
quazi-h added 2 commits May 21, 2025 19:03
…le rows.

Added a new tfact model to handle the studentmodule problem events for each platform.
@@ -101,6 +101,191 @@ with mitxonline_problem_events as (
select * from {{ ref('dim_platform') }}
)

, content as (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a dedupe query in this file, as simply doing a UNION ALL on data from studentmodule would introduce duplicates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the dbt test to enforce uniqueness between these columns in tfact_problem_events:
["openedx_user_id", "courserun_readable_id", "problem_block_fk", "event_timestamp"]
When I built the tfact_problem_events table locally I did not get a failure on that test but I should still add some dedupe logic to make sure we don't have dupes with production data. I added dedupe logic to the model using the same fields as the dedupe key. I am now just testing to make sure it builds successfully.

@rachellougee
Copy link
Contributor

I meant to leave comments earlier. If we plan to integrate the studentmodulehistory data into tfact_problem_events, we should ensure there are deduplication queries in place. Otherwise, we end up getting duplicate records, since studentmodulehistory and tracking log contain overlapping data. Also, I'm not sure why we would want to add the same data in two different places if there is a new tfact_studentmodule_problems. Let me know if you want any help handling the deduplication.

quazi-h added 6 commits May 27, 2025 09:42
Updated table definition.
Updated queries in studentmodule macro.
Refactored tfact_problem_events to pull in studentmodule data from intermediate table.
Adding in guards to address null or empty maps with the correctness and answer history arrays.
Building the studentmodule tfact model as an actual table instead of a view.
@quazi-h
Copy link
Contributor Author

quazi-h commented Jun 18, 2025

/gemini review

@quazi-h quazi-h requested a review from Copilot June 18, 2025 16:56
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the tfact_problem_events data by incorporating submission details from studentmodule and studentmodulehistoryextended sources. Key changes include:

  • Introduction of a new model (tfact_studentmodule_problems.sql) that generates problem event records from multiple platforms.
  • Modifications to tfact_problem_events.sql to union in data from the new studentmodule events and perform deduplication.
  • Updates to the models YAML configuration and the transformation macro to support the new submission details.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/ol_dbt/models/dimensional/tfact_studentmodule_problems.sql New model aggregating studentmodule problem events.
src/ol_dbt/models/dimensional/tfact_problem_events.sql Updated to union studentmodule events and deduplicate records.
src/ol_dbt/models/dimensional/_dim__models.yml Added metadata for tfact_studentmodule_problems and adjusted tests.
src/ol_dbt/macros/transform_studentmodule_data.sql New macro to transform studentmodule data for problem events.
Comments suppressed due to low confidence (1)

src/ol_dbt/models/dimensional/_dim__models.yml:487

  • Duplicate column 'event_type' is defined in the tfact_studentmodule_problems model; consider consolidating or renaming one occurrence to avoid ambiguity.
  - name: event_type

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances tfact_problem_events by integrating detailed submission data from studentmodule and its history. A new macro and an intermediate model tfact_studentmodule_problems are introduced. There are a few critical and high-severity issues in the YML schema definitions (duplicate column, type mismatches, incorrect test key) that need to be addressed.

Comment on lines +105 to +124
, case
when correct_item is not null and cardinality(map_keys(cast(correct_item as map(varchar, json)))) >= 1
then map_keys(cast(correct_item as map(varchar, json)))[1]
else null
end as problem_id
, case
when correct_item is not null and cardinality(map_keys(cast(correct_item as map(varchar, json)))) >= 1
then json_extract_scalar(
cast(correct_item as map(varchar, json))[map_keys(cast(correct_item as map(varchar, json)))[1]],
'$.correctness'
)
else null
end as correctness -- noqa: PRS
, case
when answer_item is not null and cardinality(map_keys(cast(answer_item as map(varchar, json)))) >= 1
then json_format(
cast(answer_item as map(varchar, json))[map_keys(cast(answer_item as map(varchar, json)))[1]]
)
else null
end as answers_json -- noqa: PRS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for extracting problem_id, correctness, and answers_json relies on map_keys(...)[1], assuming that the relevant map (correct_item or answer_item) always has at least one key and that the first key is the desired one. If a map could legitimately contain multiple keys, this logic would only pick the first part. Consider adding a comment to clarify this assumption about the JSON structure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants