Running Noscribe 0.5 with nvidia 4080 - 20min for a 30min WAV file or 2minutes - depending on mouse down on title bar #105

Open
phb911 opened this issue Nov 19, 2024 · 18 comments

@phb911

phb911 commented Nov 19, 2024

Strange issue.
When I run a sample file through noScribe and just let it do its work, it takes 20 minutes for a 30-minute WAV file.
But when I just hold down the mouse button on the title bar of noScribe, it takes 2 minutes!

This system is a little different from a standard PC, because it is running Windows 10 inside a VM hosted by Server 2022. The Nvidia card is routed to the VM by PCI passthrough (Hyper-V). The CPU is an Intel i9-14900K, with 64 GB RAM. It is currently the only VM on this Hyper-V host.

I also tested whether it makes a difference if I use remote control software to access the noScribe VM, or if I use RDP or the Hyper-V console. It makes no difference.

What can be done to make this fantastic piece of software run with the expected performance?

@kaixxx
Owner

kaixxx commented Nov 19, 2024

This is really strange. We had a somewhat similar observation a few months back: #50 (comment)

I cannot reproduce this behavior on my machine. But I also don't have an NVIDIA graphics card, which may be the reason. If you find anything that we can do to improve the performance, let me know (but I don't want to steal mouse control from the user...).

@phb911
Author

phb911 commented Nov 20, 2024

I tried another variant: this time an older PC (Threadripper 1950X, Nvidia 3050, 64 GB RAM, Server 2022, no Hyper-V role installed).
I see similar symptoms; however, the difference is not a factor of 10 but rather a factor of 2 to 3.
There has to be a problem with filling a buffer or memory pool, or alternatively a problem with taking the results of a calculation out of a queue.
Why: if, for example, I do nothing for 10 seconds and then hold down the mouse button, the speedup is high at first and later goes down to a value between doing nothing and the point where I started to hold down the mouse. The longer I do nothing, the longer (up to a limit X) the speedup lasts.
If I monitor the RAM usage of the Nvidia card, it makes no difference whether the mouse button is pressed or not.
Maybe the problem is instead related to a queue containing results that is not being emptied fast enough, so the processes doing the work have to wait?
The behaviour occurs in every stage of the transcription process.
How difficult would it be to make a version of noScribe without a UI? If the UI is just a wrapper around a command-line program, it should not be hard. Having a command-line app and testing it would give insight into the source of the problem.

@kaixxx
Owner

kaixxx commented Nov 20, 2024

It would be interesting to know if this is really related to CUDA (which I suspect). You could temporarily disable CUDA by editing the file config.yml in C:\Users\AppData\Local\noScribe\noScribe\ (make sure noScribe is closed first) and set both pyannote_xpu and whisper_xpu to cpu.

Command line:
The speaker identification runs a command line program "diarize.exe". You can use it like so:

diarize.exe cuda "<path to audio wav file>" "<path to output.yaml>" <number of speakers expected or "auto">

Note that this only works with wav files.
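
To compare the command-line timing with a GUI run, the call above could be wrapped in a small timing script. This is only a sketch: it assumes `diarize.exe` can be found from the current working directory, and the helper names `build_diarize_command` and `time_command` are made up for illustration. The argument order follows the usage line above.

```python
import subprocess
import time


def build_diarize_command(wav_path, out_path, speakers="auto", device="cuda",
                          exe="diarize.exe"):
    """Build the argument list in the order shown in the usage line:
    diarize.exe <device> <wav file> <output yaml> <speakers or "auto">."""
    return [exe, device, str(wav_path), str(out_path), str(speakers)]


def time_command(cmd):
    """Run a command, wait for it to finish, return (exit_code, seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd)
    return result.returncode, time.perf_counter() - start
```

For example, `time_command(build_diarize_command("sample.wav", "output.yaml"))` would time one run, where both file names are placeholders. Passing `device="cpu"` should give the non-CUDA comparison, assuming the executable accepts that value in the first position.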

If you want to run faster-whisper from command line, you can try this:
https://github.com/Softcatala/whisper-ctranslate2

@phb911
Author

phb911 commented Nov 20, 2024

Thank you,
I tried the diarize executable. If run from commandline against my sample file, it will take 40 seconds. CPU at about 50%, graphics card at about 2%.
The same file using the UI: I stopped the clock after 12 minutes, when "embeddings" reached about 50%. Maybe it would have completed within 18 minutes?
Is it possible that the UI is not able to chomp the output from the console (where the percentage is reported) fast enough?
When starting the diarize process with the UI, the CPU goes up to 50% for a short time, then drops to 3%.
If run on the command line, the CPU starts at 50% and stays at 50% until the end.
Is there an option to tell diarize to suspend its output to the console until the job is done?

@kaixxx
Owner

kaixxx commented Nov 20, 2024

> If run from commandline against my sample file, it will take 40 seconds.

That's very fast. Are you sure it didn't crash silently? Please check if the output file contains data.

> Is it possible that the UI is not able to chomp the output from the console

I don't think that this is the problem, for several reasons:

  • you get the same behavior with whisper. We do not use a command line tool for whisper
  • there is not much output to the console, especially during the "embeddings" phase which takes the most time
  • I don't see how this could be related with the mouse (this is the biggest mystery of the whole issue...)

@phb911
Author

phb911 commented Nov 20, 2024

The YAML file looks fine. Also, if I let it run through the GUI (with mouse down), the processing time is similar, and the resulting transcription makes sense regarding the recognition of the speakers.
By the way, we have another PC that does not experience these problems with noScribe. It needs 3 minutes. It runs Windows 11 on bare metal with an Nvidia 4070.
I will try Windows 11 instead of 10. In addition, I will try what happens if I run the software using a real console with keyboard, mouse and monitor. Also, if you are interested in this problem, I could give you access to the machine if you think this could help and want to invest your time.
The next option would be to build noScribe myself and then try to debug the problem. However, I found a very similar program called aTrain. It does not reach 2 minutes on the same machine but 4 minutes, which is still better than 20 minutes.

@kaixxx
Owner

kaixxx commented Nov 20, 2024

It would be great if you could investigate this a little more. You can run noScribe directly from the python source, no need to "build" it first.

I know aTrain, it's very similar, but misses some features of noScribe (no editor, no marking of pauses or overlapping speech - but you might not need that anyway). It's made by some colleagues from Austria. I don't know why they reinvented the wheel instead of collaborating, but yeah.

@phb911
Author

phb911 commented Nov 20, 2024

I will try to dig further.
The last thing I found out is: I can also get the full speed (2 minutes) when I right-click the top bar of the window and select "move". I can then release the mouse. So it's not holding the mouse button down, it is the focus of the application window.

@phb911
Author

phb911 commented Nov 22, 2024

I did a new Windows installation on the PC with the AMD 1950X CPU and the Nvidia 3050 card. This time I installed a vanilla Windows 11 and nothing else: no Windows domain, no GPO rules. Onboard audio is also on.
Still Bob Seeger: 11 minutes when just letting the program do its work, 4 minutes if I activate the "power" switch by selecting "move window".
For now, I will leave it as it is. However, I would encourage everyone else who is using an Nvidia card to check what happens to the speed when you try the move-window trick.

@gernophil
Collaborator

I know this is about the focus (so maybe more about the old thread), but this might still be relevant:

https://answers.microsoft.com/en-us/windows/forum/all/application-focus-having-a-significant-effect-on/c43ff3d5-4483-4bbd-9b0c-a29054344eb3

@phb911
Author

phb911 commented Nov 23, 2024

I got it fixed, or rather made a workaround. Now I get 2 minutes even when I just let the application run.
In diarize.py:
removed the callback hook (set to None).
In noScribe.py:
removed the progress bar advancing during the spawn of whisper.

CPU is now always >50% (if involved in the transcription), and the Nvidia card also shows permanent load.
The system used is a VM under Server 2022 running Windows 11. I still have to try with Windows 10.

@kaixxx
Owner

kaixxx commented Nov 24, 2024

Interesting, thanks for investigating. Could you show me the exact code changes?

@phb911
Author

phb911 commented Nov 24, 2024

in noScribe.py:
Line 1228, it reads: self.log(pause_str)
commented out, this saves about 1minute
Line 1321 it reads: self.set_progress(3, progr)
commented out, this saves about 6 Minutes

in diarize.py:
Line 101, it reads:

    with SimpleProgressHook(parent=None) as hook:
        if my_num_speakers is not None:
            diarization = pipeline(audio_file, hook=hook, num_speakers=my_num_speakers) # apply the pipeline to the audio file
        else:
            diarization = pipeline(audio_file, hook=hook)

replaced with:

    with SimpleProgressHook(parent=None) as hook:
        if my_num_speakers is not None:
            diarization = pipeline(audio_file, hook=None, num_speakers=my_num_speakers) # apply the pipeline to the audio file
        else:
            diarization = pipeline(audio_file, hook=None)

Saves about 8minutes

I also changed in line 344:

            self.geometry(f"{1100}x{725}")

Before, it was 650. This fixes the annoying problem that I have to resize the window every time I start the program.

It seems that my initial assumption about something not being able to consume the logs fast enough might be true? I don't think that Python is just slow, but maybe it is related to the context switches that have to occur every time output is generated? But then, my knowledge of Python is rather limited; I am not even used to the language.
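
One way to test this assumption would be to measure how much wall-clock time is actually spent inside the suspected calls (for example `self.set_progress` or `self.log`). A minimal sketch, assuming the calls can be wrapped with a decorator; `accumulate_time` and the `stats` dictionary are hypothetical helpers, not part of noScribe:

```python
import time
from functools import wraps


def accumulate_time(stats, key):
    """Decorator factory: add call count and elapsed seconds to stats[key]."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # stats[key] is a two-item list: [number of calls, total seconds]
                entry = stats.setdefault(key, [0, 0.0])
                entry[0] += 1
                entry[1] += time.perf_counter() - start
        return wrapper
    return decorator


stats = {}


@accumulate_time(stats, "set_progress")
def set_progress(step, value):
    """Stand-in for the real GUI update; replace with the actual call."""
    return step, value
```

After a full transcription run, `stats["set_progress"]` would hold the number of calls and the total seconds spent in them, which can be compared with the overall run time. Note that this only captures time spent inside the call itself, not GUI work that is deferred to the event loop.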

@gernophil
Collaborator

gernophil commented Nov 24, 2024

I fail to see the change in diarize.py. I put your code into backticks, so we can see the indentation:
```

code

```

EDIT: Ah, I've found the difference (hook=hook to hook=None) two times.

@kaixxx
Owner

kaixxx commented Nov 24, 2024

Great. Thanks to your detailed investigation, I might have found the issue. For some reason, updating the progress bar takes much more time than it should. If I leave everything like stock (all hooks in place) and just disable the progress bar update, I consistently gain around 10% in speed on my non-CUDA system, which is quite surprising.
I think on a CUDA system, the speed gains will be much more noticeable since everything else runs much faster. (Let's say I save 1 minute on a 10-minute job on my machine. If the same job takes only 2 minutes on a CUDA system, saving 1 minute will cut the total time in half.)
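
The arithmetic behind this estimate can be checked quickly. The numbers are the illustrative ones from the paragraph above, not measurements, and `overhead_share` is a made-up helper name:

```python
def overhead_share(total_minutes, overhead_minutes):
    """Fraction of the total run time that is spent on a fixed overhead."""
    return overhead_minutes / total_minutes


# The same 1-minute overhead on a 10-minute CPU job and a 2-minute CUDA job.
cpu_share = overhead_share(10, 1)
cuda_share = overhead_share(2, 1)
print(f"CPU: {cpu_share:.0%} of the run, CUDA: {cuda_share:.0%} of the run")
```

The fixed cost stays the same in absolute terms, so its share grows as the rest of the job gets faster.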

To disable the progress bar update, go to line 659 and add return as the first line of set_progress:

    def set_progress(self, step, value):
        """ Update state of the progress bar """
        return

Could you test this on your system? Remember to revert the other code changes back to stock.
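
If removing the progress bar entirely turns out to be too drastic, an alternative would be to rate-limit the updates instead of disabling them. The sketch below is not noScribe code: `ThrottledProgress` is a hypothetical wrapper, the 0.25 s interval is an arbitrary choice, and it assumes that progress values run up to 100.

```python
import time


class ThrottledProgress:
    """Forward progress updates to update_fn at most once per min_interval."""

    def __init__(self, update_fn, min_interval=0.25, final_value=100,
                 clock=time.monotonic):
        self.update_fn = update_fn        # the real (expensive) GUI update
        self.min_interval = min_interval  # seconds between forwarded updates
        self.final_value = final_value    # assumed end of the progress range
        self.clock = clock                # injectable for testing
        self._last_time = None

    def set_progress(self, step, value):
        """Return True if the update was forwarded, False if it was skipped."""
        now = self.clock()
        too_soon = (self._last_time is not None
                    and now - self._last_time < self.min_interval)
        # Always let the final value through so the bar can reach the end.
        if too_soon and value < self.final_value:
            return False
        self._last_time = now
        self.update_fn(step, value)
        return True
```

The worker code would then call `throttled.set_progress(step, value)` as often as it likes, while the GUI update itself runs only a few times per second. Whether this is enough to recover the speed would still have to be measured on an affected system.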

@phb911
Author

phb911 commented Nov 25, 2024

Yes, I can confirm the progress bar is the main culprit.
Additionally removing the logging only makes a difference of 15 seconds (I did not take a large enough sample, but it is not 1 minute).
I wonder what else could be optimized, because during processing with whisper the GPU is not always loaded; it is mostly idle and then makes short spikes to 100%, one for every segment that is processed.

@gernophil
Collaborator

I wonder if it might be worth decoupling the two processes (diarize and whisper) completely from the GUI and running them as subprocesses to save resources?

@kaixxx
Owner

kaixxx commented Nov 25, 2024

@phb911: Thank you for testing. It's really strange that updating the progress bar is bogging down the system so much. But I'm happy that we can leave the other user-feedback in place (logging to the screen) and still have the performance gains. I will remove the progress bar altogether in the next release, which is planned for December.

Regarding other performance optimizations: Running such AI models is a complex task, and only certain operations benefit from GPU-acceleration. PyAnnote and faster-whisper both rely on the pytorch library (developed and maintained by Meta) which is heavily optimized for CUDA-support. I don't think that there is much potential for further improvements.

@gernophil: I'm not completely sure what you mean by running them as subprocesses. The way we do this right now with the compiled diarize.exe is consuming more resources instead of less: a second instance of the python interpreter is loaded, together with heavy libraries... We had to do this for compatibility reasons, but when it comes to resources, it's not optimal.


3 participants