
[maui-scenarios] MAUI scenarios and XHarness versions #4574

Open
ivanpovazan opened this issue Nov 14, 2024 · 7 comments
ivanpovazan commented Nov 14, 2024

Description

Problem 1: using an outdated version

MAUI scenarios on Android are using an outdated version of XHarness.
More specifically, the version used in perf jobs is:

`<MicrosoftDotNetXHarnessCLIVersion>1.0.0-prerelease.21566.2</MicrosoftDotNetXHarnessCLIVersion>`

For reference, the current version of XHarness is: 10.0.0-prerelease.24524.9

Problem 2: using outdated commands

Bumping the version manually will not be enough to fix this issue, because the code in:

`cmdline = xharnesscommand() + ['android', 'state', '--adb']`

is invoking `xharness android state --adb`, which is no longer a supported command.
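Assuming the intended replacement is the `android adb --` passthrough that current XHarness builds support (it is what the Helix log later in this thread invokes, and what dotnet/xharness#782 added), the change could look roughly like this; the `xharnesscommand()` helper below is a stand-in for the perf repo's real one:

```python
# Sketch only: swap the removed `android state --adb` subcommand for the
# `android adb --` passthrough supported by current XHarness versions.
def xharnesscommand():
    # stand-in for the perf repo's helper that builds the XHarness invocation
    return ['dotnet', 'xharness']

# old (no longer supported):
# cmdline = xharnesscommand() + ['android', 'state', '--adb']

# new: arguments after `--` are forwarded verbatim to adb
cmdline = xharnesscommand() + ['android', 'adb', '--', 'devices']
print(cmdline)
```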

Problem 3: mismatch between referenced xharness versions

MAUI scenarios for iOS are using a different xharness version:

`<MicrosoftDotNetXHarnessCLIVersion>9.0.0-prerelease.23606.1</MicrosoftDotNetXHarnessCLIVersion>`

It is recommended to align the xharness references.
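Concretely, aligning could mean pinning both the Android and iOS scenario configurations to one shared property value; the current XHarness version quoted above is used purely for illustration:

```xml
<!-- sketch: a single shared pin referenced by all MAUI perf scenarios -->
<MicrosoftDotNetXHarnessCLIVersion>10.0.0-prerelease.24524.9</MicrosoftDotNetXHarnessCLIVersion>
```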

NOTE

An additional consideration would be to switch from hardcoding xharness versions to using darc subscriptions instead.

Security

While this currently "works" on CI, any updates to the tool or its dependencies (such as adb) are not being picked up for perf testing.
As a result, all recent security/SDL improvements made to XHarness are missing from these runs.


/cc: @vitek-karas

@ivanpovazan

@LoopedBard3 please feel free to link the testing CI runs here so we can help investigate the failures if needed.


matouskozak commented Nov 29, 2024

Based on the logs from https://dev.azure.com/dnceng/internal/_build/results?buildId=2590246&view=logs&j=efa3ffcd-91e9-5b69-9db7-650958b3131c&t=a635f724-5afe-5774-89bd-de12fd2d4e6e:

```
(.venv) D:\h\w\A39508DB\w\B8830A85\e>python test.py devicestartup --device-type android --package-path pub\com.companyname.netandroiddefault-Signed.apk --package-name com.companyname.NetAndroidDefault --scenario-name "Device Startup - .NET Android Default"  --upload-to-perflab-container 
[2024/11/26 16:51:02][INFO] ----------------------------------------------
[2024/11/26 16:51:02][INFO] Initializing logger 2024-11-26 16:51:02.220086
[2024/11/26 16:51:02][INFO] ----------------------------------------------
[2024/11/26 16:51:02][INFO] Clearing potential previous run nettraces
[2024/11/26 16:51:02][INFO] Preparing ADB
[2024/11/26 16:51:02][INFO] $ dotnet exec D:\h\w\A39508DB\p\microsoft.dotnet.xharness.cli\10.0.0-prerelease.24524.9\tools\net8.0\any\Microsoft.DotNet.XHarness.CLI.dll android adb -- shell wm size
[2024/11/26 16:51:04][INFO] Physical size: 1080x2340
[2024/11/26 16:51:04][INFO] * daemon not running; starting now at tcp:5037
[2024/11/26 16:51:04][INFO] * daemon started successfully
['Device Startup - .NET Android Default' END OF WORK ITEM LOG: Command timed out, and was killed]
```

it appears that we are executing XHarness and getting the output successfully.

However, I did some experiments in my own branch and it appears that the XHarness command is executed but gets stuck inside:

```python
def __runinternal(self, working_directory: Optional[str] = None) -> Tuple[int, str]:
    should_pipe = self.verbose
    with push_dir(working_directory):
        quoted_cmdline = '$ '
        quoted_cmdline += list2cmdline(self.cmdline)
        if '-AzureFeed' in self.cmdline or '-FeedCredential' in self.cmdline:
            quoted_cmdline = "<dotnet-install command contains secrets, skipping log>"
        getLogger().info(quoted_cmdline)
        with Popen(
                self.cmdline,
                stdout=PIPE if should_pipe else DEVNULL,
                stderr=STDOUT,
                universal_newlines=False,
                encoding=None,
                bufsize=0) as proc:
            if proc.stdout is not None:
                with proc.stdout:
                    self.__stdout = StringIO()
                    for raw_line in iter(proc.stdout.readline, b''):
                        line = raw_line.decode('utf-8', errors='backslashreplace')
                        self.__stdout.write(line)
                        line = line.rstrip()
                        getLogger().info(line)
            proc.wait()
            return (proc.returncode, quoted_cmdline)
```

My guess is that something changed in XHarness + adb that prevents Python's Popen from finishing successfully (XHarness or the adb subprocess not exiting?). I couldn't reproduce this locally, so I think we might need to take one of the machines out of the Helix queue and investigate it there.

For reference: dotnet/xharness#782 is the PR that added the `xharness android adb --` functionality.
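The suspected mechanism can be sketched in isolation (an assumption for illustration, not a confirmed diagnosis): if the direct child starts a long-lived grandchild, the way adb starts its daemon, the grandchild inherits the write end of the stdout pipe, and reading the pipe to EOF then blocks until the grandchild also exits:

```python
# Assumption sketch: a grandchild (think: adb daemon) inherits the stdout
# pipe's write end, so the parent's read-to-EOF outlives the direct child.
# (POSIX demo; on Windows the adb daemon inherits handles in its own way.)
import subprocess
import sys
import time

# hypothetical child: spawns a grandchild that lives 2s, then exits right away
child_code = (
    "import subprocess, sys; "
    "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(2)']); "
    "print('child done', flush=True)"
)

start = time.monotonic()
with subprocess.Popen([sys.executable, '-c', child_code],
                      stdout=subprocess.PIPE, stderr=subprocess.STDOUT) as proc:
    out = proc.stdout.read()  # blocks until *every* holder of the pipe exits
    proc.wait()
elapsed = time.monotonic() - start

print(b'child done' in out)   # True: the child itself finished quickly
print(elapsed > 1.5)          # True: the read was held open by the grandchild
```

This would be consistent with the Helix log above, where the adb daemon is started during the run and the work item later times out.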


Update

Reproduction attempts with xharness 10.0.0-prerelease.24524.9:

  • macOS using Perf infra (hacked together to work with non-windows host) -> didn't reproduce
  • Windows host in separate python file without Perf infra using Popen and xharness android adb -- devices. Using Python 3.7 or Python 3.9. -> didn't reproduce
  • Windows host and running the Perf infra as setup on Helix
  • Helix host

@LoopedBard3

From what I recall from the last time I tried tracking this down, the only place it repro'd was when running through Helix on the machines. We should still try running it as the Perf infra on a Windows host regardless.

@ivanpovazan

Thanks for providing the update and additional info on previous attempts to discover the problem.
I think we should connect to the Helix machine, and try to run:

  • adb on its own
  • adb through xharness
  • xharness through python

to narrow down the problem.
If it turns out to be Helix configuration, we should look into what is specific about the queue that dotnet perf is using, as opposed to all the other CIs (dotnet, MAUI and Xamarin), which run with the latest xharness without any issues.
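A throwaway script along these lines could automate the first two narrowing steps on the Helix machine; the command lines are illustrative guesses, not the repo's exact invocations:

```python
# Run each layer with a timeout to see which one hangs.
import subprocess

candidates = [
    ['adb', 'devices'],                                          # adb on its own
    ['dotnet', 'xharness', 'android', 'adb', '--', 'devices'],   # adb through xharness
]

results = []
for cmd in candidates:
    try:
        res = subprocess.run(cmd, capture_output=True, timeout=60)
        results.append(f"{cmd[0]} exited with code {res.returncode}")
    except subprocess.TimeoutExpired:
        results.append(f"{cmd[0]} timed out -> likely the hanging layer")
    except FileNotFoundError:
        results.append(f"{cmd[0]} not found on this machine")

for line in results:
    print(line)
```

The third step (xharness through Python) is then this script itself versus the Perf infra's own Popen wrapper shown earlier in the thread.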

@LoopedBard3

I did some more modifications to the run setup, and it seems the issue may be something with the stdout pipes: my latest test gets stuck trying to close stdout after the process has returned. I will give it a shot manually on the Helix machine to get a better idea of where this issue may be coming from.
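One defensive pattern worth trying here (a sketch, not a vetted fix) is to drain stdout on a daemon thread and bound the join, so a pipe that never reaches EOF cannot hang the main thread indefinitely:

```python
import subprocess
import sys
import threading

def drain(pipe, chunks):
    # read until EOF; runs on a daemon thread so it can be abandoned safely
    for raw in iter(pipe.readline, b''):
        chunks.append(raw)

proc = subprocess.Popen([sys.executable, '-c', "print('ok')"],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
chunks = []
reader = threading.Thread(target=drain, args=(proc.stdout, chunks), daemon=True)
reader.start()
proc.wait()
reader.join(timeout=10)   # bounded: give up instead of blocking on a held pipe
captured = b''.join(chunks).decode()
print(captured.strip())   # -> ok
```

If the reader thread is still alive after the join times out, that is direct evidence the pipe is being held open by something other than the direct child.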


LoopedBard3 commented Dec 11, 2024

I was able to run the same python script that hits the hang in the pipeline manually on the machine, and it runs past the spot that hangs when run inside Helix. Specifically, when running the `python test ...` command manually, the script gets past the point of the current hang (it still seems to be failing, but it runs most of the commands successfully, so that failure is likely unrelated). I get the same results after putting the command line in an execute.cmd file and running the file instead.

More recently (the above is from a few days ago but didn't get enough testing to send), I modified the testing workflow, with the updates pushed here: https://github.com/LoopedBard3/performance/tree/UpdateXHarnessAndroidNov2024. With the latest updates (manually closing specific streams, etc.) the DNCENGWIN-063 machine makes it past the hang while DNCENGWIN-065 does not (runs: https://dev.azure.com/dnceng/internal/_build/results?buildId=2598361&view=logs&j=efa3ffcd-91e9-5b69-9db7-650958b3131c&t=a635f724-5afe-5774-89bd-de12fd2d4e6e). Interestingly, the 063 machine does not print that it is starting the adb daemon, while the 065 machine does. Given that, I am not sure whether my recent update actually fixed anything.

In the next round of testing, I think I am going to restart the 063 machine, or at least kill the ADB service to see if maybe something in that output is causing the hang.

@matouskozak

> I was able to run the same python script that is hitting the hang in the pipeline on the machine manually, and it is able to run past the spot that is hanging when run inside helix. [...] More recently, I modified the workflow of the testing code with the updates being pushed here: https://github.com/LoopedBard3/performance/tree/UpdateXHarnessAndroidNov2024. With the latest updates the DNCENGWIN-063 machine is making it past the hang while DNCENGWIN-065 is not. [...] In the next round of testing, I think I am going to restart the 063 machine, or at least kill the ADB service to see if maybe something in that output is causing the hang.

I see that DNCENGWIN-063 is reporting INSTALL_FAILED_INSUFFICIENT_STORAGE; it seems something is off with that machine, which could be causing the subsequent failure.

I think it is a good idea to restart the 063 machine and clean up its storage to see if that fixes it.
