Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Browse files
Browse the repository at this point in the history
## Explanation Fixes #4897 ### Overview of changes This PR follows up from the changes introduced in #4846 by adding new functionality to help improve event reliability investigations, as well as telemetry diagnostics for research project facilitators that need to ensure events have been successfully uploaded for processing. To help with this, the following features have been introduced: - A new "force upload" button that immediately uploads any pending cached events. - Event counts to show both the number of events waiting to be uploaded & the events that are successfully uploaded. - An improvement on the 'share' button to include a Base64 deflated binary representation of the entire event log store on the device. Leading up to, and during, development, a number of theoretical issues were encountered that were also fixed in case they are playing a role in some of the ongoing concerns around Firebase event delivery reliability: - The sync status reported on the learner analytics screen can be wrong in some cases, leading to the assumption that events have been uploaded when they may not have been. - Events can, in some cases, be lost from previous app instances if they haven't been uploaded yet and, in other cases, duplicated. The former is much more likely to happen in practice, particularly with the connectivity patterns being used in the research project. ### Technical specifics - ``FakeSyncStatusManager`` was renamed to ``TestSyncStatusManager`` since it's actually augmenting the production implementation of ``SyncStatusManager``, not replacing it. - Events are now being cached indefinitely if the learner study feature is enabled. This allows for some robustness improvements (see below), and the ability to share the events themselves if the team suspects events are not arriving to Firebase. - The issue of sync status being incorrect came from an unreliable means of reporting sync status during different event scenarios. Specifically, if the ``LogUploadWorker`` doesn't kick off and there's network connectivity, a new event being logged could trigger the sync status to become "data uploaded" even though there are still pending events. The new solution is to leverage the now-tracked uploaded events to determine whether there are still events yet waiting to upload. In this way, the sync status should always be the most up-to-date information with one caveat: network connectivity changes won't reflect until a new log is requested (this is a limitation in how the app tracks network connectivity, and it didn't seem very important to fix it). - The issue of possible event duplication comes from ``PersistentCacheStore`` using ``mergeFrom`` with the *current* cached state and the on-disk proto being loaded. I originally thought that all pathways to loading the proto from disk could only execute once, but there's at least one pathway where it's possible to chain multiple calls into the cache store such that the load happens twice (and both results end up being merged together, hence the event duplication). This has been fixed by never reloading the cache store if it's already been loaded which should always be correct. - The issue if possible events being lost comes from ``AnalyticsController`` previously not priming the event log cache. This means that if a new event is logged before the cache is read (e.g. early events like "open profile screen" can race against ``LogUploadWorker`` kicking off on app startup) then the cache will be completely reset and, once written, will overwrite any events currently cached on disk. This has been fixed by ensuring that the event log cache has been primed before any interactions with it (read or write) occur. Note that all of the above has been thoroughly tested through automated testing to ensure that these issues have been properly fixed, and stay that way in the long-term. ### Test notes & exemptions - In order to simplify testing ``SyncStatusManagerImpl`` and ``TestSyncStatusManager``, they ought to both pass a common test suite specific to ``SyncStatusManager`` itself. This has been done by introducing a ``SyncStatusManagerTestBase`` class that's inherited by both implementation test suites so that they inherit common API tests. This pattern is used in one other location in the codebase currently: https://github.com/oppia/oppia-android/blob/0290ef8037f798d8ba2721441612a80dec4a1f1a/testing/src/test/java/org/oppia/android/testing/threading/TestCoroutineDispatcherTestBase.kt#L33 - ``CircularProgressIndicatorAdaptersTestActivity`` has been exempted for an a11y label for the same reason as all other test activities: it technically doesn't need one. - ``TestSyncStatusManagerTest`` has been allow-listed to use parameterized tests since it's aggressively verifying many different force/override sync status cases. Making sure that the test utility works correctly is extremely important to avoid false failures/passes of dependent tests (which are, in turn, guarding against regressions for the core issues covered by this PR, among other event system behaviors). - ``SyncStatusManagerTestBase`` has been exempted from requiring KDocs since it's technically a test suite, so KDocs aren't required. - A number of classes have been exempted from having new test suites as per existing conventions and requirements: ``ControlButtonsViewModel`` (renamed from ``ShareIdsViewModel``), ``CircularProgressIndicatorAdaptersTestActivity`` (simple test activities don't require their own tests), ``CircularProgressIndicatorAdaptersTestViewModel`` (view models aren't tested directly), ``SyncStatusManagerTestBase`` (the test base is more or less a test suite itself, so it doesn't make sense for it to have tests as well). - ``TestSyncStatusManagerTest`` has been disabled in Gradle since it depends on ``SyncStatusManagerTestBase`` which is part of a different module. To make this work, the test base would have to be moved to src/main of the testing module which isn't a very clean approach. The current setup works fine with Bazel, and longer term this will get cleaner by moving testing utilities like ``TestSyncStatusManager`` to be near their corresponding production implementations (we just can't do this today because of Gradle limitations). - Due to the asynchronous and data race nature of some of the issues discovered during test writing, changed & added tests were sometimes found to be flaky (and, in some cases, very rarely, e.g. 1/50). To ensure that all of the new & changed tests are stable, I made sure to run all 8 affected test suites 128 times to ensure stability. Here are the results of that run ([buildbuddy link](https://app.buildbuddy.io/invocation/5da310e1-3cc9-451b-a1d0-951a3322c8d7)): ![128xruns_followup_mr6_fixes](https://user-images.githubusercontent.com/12983742/224290120-5bb2991d-59c1-421a-8b73-e05838734253.png) ## Essential Checklist <!-- Please tick the relevant boxes by putting an "x" in them. --> - [x] The PR title and explanation each start with "Fix #bugnum: " (If this PR fixes part of an issue, prefix the title with "Fix part of #bugnum: ...".) - [x] Any changes to [scripts/assets](https://github.com/oppia/oppia-android/tree/develop/scripts/assets) files have their rationale included in the PR explanation. - [x] The PR follows the [style guide](https://github.com/oppia/oppia-android/wiki/Coding-style-guide). - [x] The PR does not contain any unnecessary code changes from Android Studio ([reference](https://github.com/oppia/oppia-android/wiki/Guidance-on-submitting-a-PR#undo-unnecessary-changes)). - [x] The PR is made from a branch that's **not** called "develop" and is up-to-date with "develop". - [x] The PR is **assigned** to the appropriate reviewers ([reference](https://github.com/oppia/oppia-android/wiki/Guidance-on-submitting-a-PR#clarification-regarding-assignees-and-reviewers-section)). ## For UI-specific PRs only Following is a video demonstrating the new functionality introduced in this PR: https://user-images.githubusercontent.com/12983742/224289397-73c5e3a7-4f7a-4116-8e71-3316e30fa9d7.mp4 Note that affected UI tests that could run on Espresso were **not** verified in this PR similar for the reasons listed in #4846 (specifically, the ongoing issue of Espresso tests being time consuming to investigate, often broken, and this PR being high priority). Localization & accessibility changes aren't relevant for this change, and any tablet-specific issues aren't important since the affected screen is only expected to be accessed on handset devices.
- Loading branch information