
[maui-scenarios] MAUI scenarios and XHarness versions #4574

Open
ivanpovazan opened this issue Nov 14, 2024 · 7 comments
ivanpovazan commented Nov 14, 2024

Description

Problem 1: using an outdated version

MAUI scenarios on Android are using an outdated version of XHarness.
More specifically, the version used in perf jobs is:

`<MicrosoftDotNetXHarnessCLIVersion>1.0.0-prerelease.21566.2</MicrosoftDotNetXHarnessCLIVersion>`

For reference, the current version of XHarness is: 10.0.0-prerelease.24524.9

Problem 2: using outdated commands

Bumping the version manually will not be enough to fix this issue, because the code in:

`cmdline = xharnesscommand() + ['android', 'state', '--adb']`

is invoking `xharness android state --adb`, which is no longer a supported command.
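Assuming the intended replacement is the `android adb --` passthrough that current XHarness builds support (it is what the Helix log later in this thread invokes, and what dotnet/xharness#782 added), the change could look roughly like this; the `xharnesscommand()` helper below is a stand-in for the perf repo's real one:

```python
# Sketch only: swap the removed `android state --adb` subcommand for the
# `android adb --` passthrough supported by current XHarness versions.
def xharnesscommand():
    # stand-in for the perf repo's helper that builds the XHarness invocation
    return ['dotnet', 'xharness']

# old (no longer supported):
# cmdline = xharnesscommand() + ['android', 'state', '--adb']

# new: arguments after `--` are forwarded verbatim to adb
cmdline = xharnesscommand() + ['android', 'adb', '--', 'devices']
print(cmdline)
```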

Problem 3: mismatch between referenced xharness versions

MAUI scenarios for iOS are using a different xharness version:

`<MicrosoftDotNetXHarnessCLIVersion>9.0.0-prerelease.23606.1</MicrosoftDotNetXHarnessCLIVersion>`

It is recommended to align the xharness references.
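Concretely, aligning could mean pinning both the Android and iOS scenario configurations to one shared property value; the current XHarness version quoted above is used purely for illustration:

```xml
<!-- sketch: a single shared pin referenced by all MAUI perf scenarios -->
<MicrosoftDotNetXHarnessCLIVersion>10.0.0-prerelease.24524.9</MicrosoftDotNetXHarnessCLIVersion>
```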

NOTE

An additional consideration would be to switch from hardcoding xharness versions to using darc subscriptions instead.

Security

While this currently "works" on CI, any updates to the tool or its dependencies (such as adb) are not being picked up for perf testing.
As a result, all recent security/SDL improvements made to XHarness are missing from these runs.


/cc: @vitek-karas

@ivanpovazan

@LoopedBard3 please feel free to link the testing CI runs here so we can help investigate the failures if needed.


matouskozak commented Nov 29, 2024

Based on the logs from https://dev.azure.com/dnceng/internal/_build/results?buildId=2590246&view=logs&j=efa3ffcd-91e9-5b69-9db7-650958b3131c&t=a635f724-5afe-5774-89bd-de12fd2d4e6e:

```
(.venv) D:\h\w\A39508DB\w\B8830A85\e>python test.py devicestartup --device-type android --package-path pub\com.companyname.netandroiddefault-Signed.apk --package-name com.companyname.NetAndroidDefault --scenario-name "Device Startup - .NET Android Default"  --upload-to-perflab-container 
[2024/11/26 16:51:02][INFO] ----------------------------------------------
[2024/11/26 16:51:02][INFO] Initializing logger 2024-11-26 16:51:02.220086
[2024/11/26 16:51:02][INFO] ----------------------------------------------
[2024/11/26 16:51:02][INFO] Clearing potential previous run nettraces
[2024/11/26 16:51:02][INFO] Preparing ADB
[2024/11/26 16:51:02][INFO] $ dotnet exec D:\h\w\A39508DB\p\microsoft.dotnet.xharness.cli\10.0.0-prerelease.24524.9\tools\net8.0\any\Microsoft.DotNet.XHarness.CLI.dll android adb -- shell wm size
[2024/11/26 16:51:04][INFO] Physical size: 1080x2340
[2024/11/26 16:51:04][INFO] * daemon not running; starting now at tcp:5037
[2024/11/26 16:51:04][INFO] * daemon started successfully
['Device Startup - .NET Android Default' END OF WORK ITEM LOG: Command timed out, and was killed]
```

it appears that we are executing XHarness and getting the output successfully.

However, I did some experiments in my own branch and it appears that the XHarness command is executed but gets stuck inside:

```python
def __runinternal(self, working_directory: Optional[str] = None) -> Tuple[int, str]:
    should_pipe = self.verbose
    with push_dir(working_directory):
        quoted_cmdline = '$ '
        quoted_cmdline += list2cmdline(self.cmdline)
        if '-AzureFeed' in self.cmdline or '-FeedCredential' in self.cmdline:
            quoted_cmdline = "<dotnet-install command contains secrets, skipping log>"
        getLogger().info(quoted_cmdline)
        with Popen(
                self.cmdline,
                stdout=PIPE if should_pipe else DEVNULL,
                stderr=STDOUT,
                universal_newlines=False,
                encoding=None,
                bufsize=0) as proc:
            if proc.stdout is not None:
                with proc.stdout:
                    self.__stdout = StringIO()
                    for raw_line in iter(proc.stdout.readline, b''):
                        line = raw_line.decode('utf-8', errors='backslashreplace')
                        self.__stdout.write(line)
                        line = line.rstrip()
                        getLogger().info(line)
            proc.wait()
            return (proc.returncode, quoted_cmdline)
```

My guess is that something changed in XHarness + adb that prevents Python's Popen from finishing successfully (XHarness or the adb subprocess not exiting?). I couldn't reproduce this locally, so I think we might need to take one of the machines out of the Helix queue and investigate it there.

For reference: dotnet/xharness#782 is the PR that added the `xharness android adb --` functionality.
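The suspected mechanism can be sketched in isolation (an assumption for illustration, not a confirmed diagnosis): if the direct child starts a long-lived grandchild, the way adb starts its daemon, the grandchild inherits the write end of the stdout pipe, and reading the pipe to EOF then blocks until the grandchild also exits:

```python
# Assumption sketch: a grandchild (think: adb daemon) inherits the stdout
# pipe's write end, so the parent's read-to-EOF outlives the direct child.
# (POSIX demo; on Windows the adb daemon inherits handles in its own way.)
import subprocess
import sys
import time

# hypothetical child: spawns a grandchild that lives 2s, then exits right away
child_code = (
    "import subprocess, sys; "
    "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(2)']); "
    "print('child done', flush=True)"
)

start = time.monotonic()
with subprocess.Popen([sys.executable, '-c', child_code],
                      stdout=subprocess.PIPE, stderr=subprocess.STDOUT) as proc:
    out = proc.stdout.read()  # blocks until *every* holder of the pipe exits
    proc.wait()
elapsed = time.monotonic() - start

print(b'child done' in out)   # True: the child itself finished quickly
print(elapsed > 1.5)          # True: the read was held open by the grandchild
```

This would be consistent with the Helix log above, where the adb daemon is started during the run and the work item later times out.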


Update

Reproduction attempts with xharness 10.0.0-prerelease.24524.9:

  • macOS using Perf infra (hacked together to work with non-windows host) -> didn't reproduce
  • Windows host in separate python file without Perf infra using Popen and xharness android adb -- devices. Using Python 3.7 or Python 3.9. -> didn't reproduce
  • Windows host and running the Perf infra as setup on Helix
  • Helix host

@LoopedBard3

From what I recall from the last time I tried tracking this down, the only place it repro'd was when running through Helix on the machines. We should still try running it as the Perf infra on a Windows host regardless.

@ivanpovazan

Thanks for providing the update and additional info on previous attempts to discover the problem.
I think we should connect to the Helix machine, and try to run:

  • adb on its own
  • adb through xharness
  • xharness through python

to narrow down the problem.
If it turns out to be Helix configuration, we should look into what is specific about the queue that dotnet perf is using, as opposed to all the other CIs (dotnet, MAUI and Xamarin), which run with the latest xharness without any issues.
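A throwaway script along these lines could automate the first two narrowing steps on the Helix machine; the command lines are illustrative guesses, not the repo's exact invocations:

```python
# Run each layer with a timeout to see which one hangs.
import subprocess

candidates = [
    ['adb', 'devices'],                                          # adb on its own
    ['dotnet', 'xharness', 'android', 'adb', '--', 'devices'],   # adb through xharness
]

results = []
for cmd in candidates:
    try:
        res = subprocess.run(cmd, capture_output=True, timeout=60)
        results.append(f"{cmd[0]} exited with code {res.returncode}")
    except subprocess.TimeoutExpired:
        results.append(f"{cmd[0]} timed out -> likely the hanging layer")
    except FileNotFoundError:
        results.append(f"{cmd[0]} not found on this machine")

for line in results:
    print(line)
```

The third step (xharness through Python) is then this script itself versus the Perf infra's own Popen wrapper shown earlier in the thread.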

@LoopedBard3

I did some more modifications to the run setup, and it seems the issue may be something with the stdout pipes: my latest test gets stuck trying to close stdout after the process has returned. I will give it a shot manually on the Helix machine to get a better idea of where this issue may be coming from.
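One defensive pattern worth trying here (a sketch, not a vetted fix) is to drain stdout on a daemon thread and bound the join, so a pipe that never reaches EOF cannot hang the main thread indefinitely:

```python
import subprocess
import sys
import threading

def drain(pipe, chunks):
    # read until EOF; runs on a daemon thread so it can be abandoned safely
    for raw in iter(pipe.readline, b''):
        chunks.append(raw)

proc = subprocess.Popen([sys.executable, '-c', "print('ok')"],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
chunks = []
reader = threading.Thread(target=drain, args=(proc.stdout, chunks), daemon=True)
reader.start()
proc.wait()
reader.join(timeout=10)   # bounded: give up instead of blocking on a held pipe
captured = b''.join(chunks).decode()
print(captured.strip())   # -> ok
```

If the reader thread is still alive after the join times out, that is direct evidence the pipe is being held open by something other than the direct child.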


LoopedBard3 commented Dec 11, 2024

I was able to run the same python script that hits the hang in the pipeline manually on the machine, and it runs past the spot that hangs when run inside Helix. Specifically, when running the `python test ...` command manually, the script gets past the point of the current hang (it still seems to be failing, but it runs most of the commands successfully, so that failure is likely unrelated). I get the same results after putting the command line in an execute.cmd file and running the file instead.

More recently (the above is from a few days ago but didn't get enough testing to send), I modified the testing workflow, with the updates pushed here: https://github.com/LoopedBard3/performance/tree/UpdateXHarnessAndroidNov2024. With the latest updates (manually closing specific streams, etc.) the DNCENGWIN-063 machine makes it past the hang while DNCENGWIN-065 does not (runs: https://dev.azure.com/dnceng/internal/_build/results?buildId=2598361&view=logs&j=efa3ffcd-91e9-5b69-9db7-650958b3131c&t=a635f724-5afe-5774-89bd-de12fd2d4e6e). Interestingly, the 063 machine does not print that it is starting the adb daemon, while the 065 machine does. Given that, I am not sure whether my recent update actually fixed anything.

In the next round of testing, I think I am going to restart the 063 machine, or at least kill the ADB service to see if maybe something in that output is causing the hang.

@matouskozak

> I was able to run the same python script that is hitting the hang in the pipeline on the machine manually, and it is able to run past the spot that is hanging when run inside helix. [...] More recently, I modified the workflow of the testing code with the updates being pushed here: https://github.com/LoopedBard3/performance/tree/UpdateXHarnessAndroidNov2024. With the latest updates the DNCENGWIN-063 machine is making it past the hang while DNCENGWIN-065 is not. [...] In the next round of testing, I think I am going to restart the 063 machine, or at least kill the ADB service to see if maybe something in that output is causing the hang.

I see that DNCENGWIN-063 is reporting INSTALL_FAILED_INSUFFICIENT_STORAGE; it seems something is off with that machine, which could be causing the subsequent failure.

I think it is a good idea to restart the 063 machine and clean up its storage to see if that fixes it.
