A portal for speech synthesis #1570

eeejay · 2025-01-13T23:41:13Z

Background

Speech synthesis is typically implemented as a desktop service. Spiel is a new speech framework that takes advantage of the distributed nature of D-Bus to allow speech providers to ship as separate services. A client library collates all of the providers and the "voices" they support into a unified interface for client applications to use. To do this, the client needs access to the session bus to search for running and activatable services. This is discouraged in sandboxed apps, so we need a portal. Some discussion about this started in a libspiel issue (project-spiel/libspiel#19).

Proposal

I propose that the speech providers portal API have a Providers property that is an array of object paths to provider proxies. Each provider proxy would implement the org.freedesktop.Speech.Provider interface and would be an intermediary between the sandboxed app and the real speech provider.

We would pass on the Voices property to the sandboxed app, and notify of changes to it.
The portal would also pass the Synthesize method and associated file descriptor from the app to the actual provider.
Clients would get notified on changes to the Providers property when a speech provider is removed or installed (via ActivatableServicesChanged and NameOwnerChanged on the host side).

This kind of design would allow the client library (libspiel) to work almost identically to talking to the session bus directly. This will minimize duplication and offer predictable behavior for the app whether it is sandboxed or not.
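To make the host-side bookkeeping concrete, here is a minimal sketch of how the Providers list could be kept up to date from NameOwnerChanged and ActivatableServicesChanged. It is pure Python with no D-Bus binding, so the handlers only mirror the signals. The class name, the `.Speech.Provider` name suffix, and the idea of passing a fresh list of activatable names into the handler are assumptions for illustration, not the actual portal or libspiel code.

```python
from dataclasses import dataclass, field

# Assumption: providers are recognized by the suffix of their well-known name.
PROVIDER_SUFFIX = ".Speech.Provider"


@dataclass
class ProviderTracker:
    """Hypothetical helper tracking which providers the portal exposes."""

    running: set[str] = field(default_factory=set)
    activatable: set[str] = field(default_factory=set)

    @property
    def providers(self) -> list[str]:
        # A provider is listed if it is running or could be activated.
        return sorted(self.running | self.activatable)

    def on_name_owner_changed(self, name: str, old_owner: str, new_owner: str) -> bool:
        """Mirror of NameOwnerChanged; returns True if the list changed."""
        if not name.endswith(PROVIDER_SUFFIX):
            return False
        before = self.providers
        if new_owner:
            self.running.add(name)      # name was acquired
        else:
            self.running.discard(name)  # name was released
        return self.providers != before

    def on_activatable_services_changed(self, activatable_names: list[str]) -> bool:
        """Call with a fresh list of activatable names after the signal fires."""
        before = self.providers
        self.activatable = {
            n for n in activatable_names if n.endswith(PROVIDER_SUFFIX)
        }
        return self.providers != before
```

Whenever a handler returns True, the portal would notify clients that the Providers property changed.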

eeejay · 2025-03-31T19:38:22Z

Ok, here is a proposed portal interface for your review. I have a work in progress for this. From a backend perspective I think a simple dialog will do via the access API, so no backend updates are needed.

<node name="/" xmlns:doc="http://www.freedesktop.org/dbus/1.0/doc.dtd">
  <!--
      org.freedesktop.portal.SpeechSynthesis:
      @short_description: Portal for speech synthesis

      This simple interface lets sandboxed applications query available speech providers and voices.
      It then lets applications request speech synthesis from those providers.

      This documentation describes version 1 of this interface.
  -->
  <interface name="org.freedesktop.portal.SpeechSynthesis">
    <!--
        CreateSession:
        @options: Vardict with optional further information
        @handle: Object path for the created :ref:`org.freedesktop.portal.Session` object

        Create a speech session. A successfully created session can at
        any time be closed using :ref:`org.freedesktop.portal.Session.Close`, or may
        at any time be closed by the portal implementation, which will be
        signalled via :ref:`org.freedesktop.portal.Session::Closed`.

        Supported keys in the @options vardict include:

        * ``session_handle_token`` (``s``)

              A string that will be used as the last element of the session handle. Must be a valid
              object path element. See the :ref:`org.freedesktop.portal.Session` documentation for
              more information about the session handle.

    -->
    <method name="CreateSession">
      <annotation name="org.qtproject.QtDBus.QtTypeName.In0" value="QVariantMap"/>
      <arg type="a{sv}" name="options" direction="in"/>
      <arg type="o" name="handle" direction="out"/>
    </method>

    <!--
        GetProviders:
        @session_handle: Object path for the :ref:`org.freedesktop.portal.Session` object
        @parent_window: Identifier for the application window, see :doc:`window-identifiers`
        @options: Vardict with optional further information
        @handle: Object path for the :ref:`org.freedesktop.portal.Request` object representing this call

        Get available speech providers.

        Supported keys in the @options vardict include:

        * ``handle_token`` (``s``)

          A string that will be used as the last element of the @handle. Must be a valid
          object path element. See the :ref:`org.freedesktop.portal.Request` documentation for
          more information about the @handle.

        The following results get returned via the :ref:`org.freedesktop.portal.Request::Response` signal:

        * ``providers`` (``a(ss)``)

           An array of providers. Each provider in the array is a structure with the following members:
             * A well-known name
             * A human-readable name
    -->
    <method name="GetProviders">
      <arg type="o" name="session_handle" direction="in"/>
      <arg type="s" name="parent_window" direction="in"/>
      <arg type="a{sv}" name="options" direction="in"/>
      <arg type="o" name="handle" direction="out"/>
    </method>

    <!--
        GetVoices:
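        @session_handle: Object path for the :ref:`org.freedesktop.portal.Session` object
        @parent_window: Identifier for the application window, see :doc:`window-identifiers`
        @provider_well_known_name: Well-known name of the speech provider to query
        @options: Vardict with optional further information
        @handle: Object path for the :ref:`org.freedesktop.portal.Request` object representing this call

        Get the synthesis voices available from the given provider.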

        Supported keys in the @options vardict include:

        * ``handle_token`` (``s``)

          A string that will be used as the last element of the @handle. Must be a valid
          object path element. See the :ref:`org.freedesktop.portal.Request` documentation for
          more information about the @handle.

        The following results get returned via the :ref:`org.freedesktop.portal.Request::Response` signal:

        * ``voices`` (``a(ssstas)``)

          An array of voices. Each voice in the array is a structure with the following members:
            * A human-readable name
            * A unique identifier
            * Synthesis output format
            * A voice features bit field
            * A list of languages the voice supports, represented as BCP 47 tags
    -->
    <method name="GetVoices">
      <arg type="o" name="session_handle" direction="in"/>
      <arg type="s" name="parent_window" direction="in"/>
      <arg type="s" name="provider_well_known_name" direction="in"/>
      <arg type="a{sv}" name="options" direction="in"/>
      <arg type="o" name="handle" direction="out"/>
    </method>

    <!--
        Synthesize:
        @session_handle: Object path for the :ref:`org.freedesktop.portal.Session` object
        @parent_window: Identifier for the application window, see :doc:`window-identifiers`
        @provider_well_known_name: Well-known name of the speech provider to synthesize with
        @pipe_fd: File descriptor of the pipe to write to.
        @text: The text to be spoken.
        @voice_id: The identifier of the voice to speak with.
        @pitch: The voice pitch at which the text should be spoken.
        @rate: The rate at which the text should be spoken.
        @is_ssml: True if the text should be interpreted as an SSML snippet.
        @language: The language the utterance should be spoken in. Some voices support more than one language.
        @options: Vardict with optional further information
        @handle: Object path for the :ref:`org.freedesktop.portal.Request` object representing this call

        This is the basic synthesis method.
        When called, the speech provider will send the synthesized output to the given file descriptor.
        Depending on the voice's advertised format it will be raw audio or composite audio and events.
    -->
    <method name="Synthesize">
      <annotation name="org.gtk.GDBus.C.UnixFD" value="true"/>
      <arg direction="in"  type="o" name="session_handle"/>
      <arg direction="in"  type="s" name="parent_window" />
      <arg direction="in"  type="s" name="provider_well_known_name" />
      <arg direction="in"  type="h" name="pipe_fd" />
      <arg direction="in"  type="s" name="text" />
      <arg direction="in"  type="s" name="voice_id" />
      <arg direction="in"  type="d" name="pitch" />
      <arg direction="in"  type="d" name="rate" />
      <arg direction="in"  type="b" name="is_ssml" />
      <arg direction="in"  type="s" name="language" />
      <arg type="a{sv}" name="options" direction="in"/>
      <arg direction="out" type="o" name="handle" />
    </method>

    <signal name="ProvidersChanged">
      <arg type="o" name="session_handle" direction="out"/>
    </signal>

    <signal name="VoicesChanged">
      <arg type="o" name="session_handle" direction="out"/>
      <arg type="s" name="provider_well_known_name" direction="out"/>
    </signal>

    <property name="version" type="u" access="read"/>
  </interface>
</node>
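
As a rough illustration of the pipe_fd flow in Synthesize, the sketch below shows the client creating a pipe, handing the write end to a provider, and reading the output until the provider closes it. There is no D-Bus here: `fake_provider_synthesize` is a hypothetical stand-in that runs in a thread and writes placeholder bytes instead of audio, and both function names are made up for this example.

```python
import os
import threading


def fake_provider_synthesize(write_fd: int, text: str) -> None:
    """Hypothetical stand-in for a speech provider."""
    # Closing the file object closes the write end, which signals EOF.
    with os.fdopen(write_fd, "wb") as pipe:
        for word in text.split():
            # A real provider would write audio in the voice's advertised
            # format; placeholder bytes keep the sketch self-contained.
            pipe.write(word.encode("utf-8") + b"\n")


def synthesize_to_bytes(text: str) -> bytes:
    """Client side: create a pipe, hand over the write end, read to EOF."""
    read_fd, write_fd = os.pipe()
    # With the portal, write_fd would be passed as the pipe_fd argument and
    # the client would then close its own copy of the write end.
    worker = threading.Thread(
        target=fake_provider_synthesize, args=(write_fd, text)
    )
    worker.start()
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()  # returns once the writer closes its end
    worker.join()
    return data
```

The important detail is that the reader only sees end-of-stream once every copy of the write end is closed, which is why the client should not keep its copy open after the call.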

Mikenux · 2025-04-01T03:06:07Z

Could I know the target cases? Thanks.

11 replies

eeejay Apr 2, 2025
Author

The pitch, rate, language, and voice are all per-instance settings, not system-wide ones. So this portal does not and should not change any settings the user has set.

The panel in the GNOME Settings app would change the defaults, but any app is welcome to synthesize speech on its own terms. For example, if the system settings set the default English voice to Patricia at a speech rate of 110%, and an app uses Spiel and does something similar to speaker.speak(new Utterance( text: "Hello world" )), the user will hear it spoken with Patricia at 110%. But if the app tweaks the Utterance properties, it can change the voice and rate as it wants.

A good analogy is fonts. There are system defaults, but an app is free to render text using whatever font is available.
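
The default-versus-override behavior can be sketched like this. It is a minimal illustration only: `SpeechDefaults`, `Utterance`, and `resolve` are hypothetical names made up for this example and are not libspiel's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpeechDefaults:
    """System-wide defaults, as a settings panel might store them."""

    voice: str
    rate: float  # 1.0 means normal speed, 1.1 means 110%


@dataclass
class Utterance:
    """Per-utterance settings; None means 'use the system default'."""

    text: str
    voice: Optional[str] = None
    rate: Optional[float] = None


def resolve(utterance: Utterance, defaults: SpeechDefaults) -> tuple[str, float]:
    """Pick the app's choice when set, otherwise fall back to the default."""
    voice = utterance.voice if utterance.voice is not None else defaults.voice
    rate = utterance.rate if utterance.rate is not None else defaults.rate
    return voice, rate


defaults = SpeechDefaults(voice="Patricia", rate=1.1)
plain = resolve(Utterance(text="Hello world"), defaults)
custom = resolve(Utterance(text="Hello world", voice="Other", rate=0.9), defaults)
```

Here `plain` falls back to Patricia at 110%, while `custom` uses the app's own voice and rate. Nothing in the system defaults is modified in either case.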

As an aside, I don't think the system settings should include volume, or probably pitch for that matter.

eeejay Apr 2, 2025
Author

I don't know what the succinct language in the portal UI would be, but it would be something like:

Allow spoken content?
App wants to use system services to retrieve available voices and speak

Mikenux Apr 2, 2025

Edit:

So if apps are in control of those, then it should be clear in the UI design, that they are controllable by the app - so these can't be modified by the user.

I meant GNOME Settings should distinguish spoken content settings for accessibility from those for apps.

The system also needs to tell an app its "spoken content" has been paused when reading a portal window.

For the portal UI, I currently don’t know how to ask. I will take a look at that later.

eeejay Apr 2, 2025
Author

> Edit:
>
> So if apps are in control of those, then it should be clear in the UI design, that they are controllable by the app - so these can't be modified by the user.
>
> I meant GNOME Settings should distinguish spoken content settings for accessibility from those for apps.

Orca, for example, has its own speech settings dialog. If/when Orca settings are integrated into GNOME Settings, they will live there too. So in that sense a user will know they are specifically configuring the screen reader speech.

> The system also needs to tell an app its "spoken content" has been paused when reading a portal window.

The "speak" method is async and dispatches a signal when an utterance starts and ends, so this is already accounted for.

> For the portal UI, I currently don’t know how to ask. I will take a look at that later.

Mikenux Apr 2, 2025

> The "speak" method is async and dispatches a signal when utterance starts and ends so this is already accounted for.

I'm not knowledgeable about code, but doesn't having a signal in the portal make sure that a library matches its code to the relevant signals and methods, especially if the library changes (e.g. libspiel has a successor or the system is using another library)?

Mikenux · 2025-04-06T03:24:40Z

If we want to let apps be speech providers, another model is needed, particularly so that they can expose their speech service on the session bus without the stores lowering their sandbox level (e.g., the sandbox badge in GNOME Software).

The key is that the system is responsible for obtaining the list of providers. Users then select the providers they approve. This allows users to deny permission to apps they consider not to be true speech providers.

When it comes to protections for the providers themselves, here are two things to think about:

Permission restrictions. Here, we define the permissions a provider can or cannot have. The stricter, the better. App permissions cannot be changed by either the user or the app (especially via an update).
Only one provider can be selected. Trusting one provider is better than trusting many, especially if we don't enforce a strict set of permissions on providers.

4 replies

Mikenux Apr 8, 2025

For "Only one provider can be selected", that’s one provider that is selected for all clients; this isn’t per client.

eeejay Apr 8, 2025
Author

This is not in the scope of this PR, and it isn't necessary. Sandboxed apps can provide services on the session bus.

Mikenux Apr 8, 2025

> Sandboxed apps can provide services on the session bus.

Since when?

Mikenux Apr 8, 2025

Mistake. Yes, they can. And any app can do that. Since such a service can acquire data, it is not possible to have apps asking to access and use a speech service (in my opinion). It is then necessary to ask the user to select the providers to be used, in addition to determining whether to use other restrictions.

eeejay · 2025-04-25T22:23:22Z

Closed in favor of #1690

Mikenux · 2025-04-30T04:00:32Z

Please reopen and link the discussion to the PR. In my opinion, comments on a PR should focus on briefly reporting issues and fixing the code.

Returning to what I wrote earlier, it's important to know what to do with data exchange (even if it is text) between apps: the provider can obtain data through another app, and it can also give it to this same app (by returning the providers and voices). This is somewhat similar to the case of the "Web Extensions" portal (although, here, the data leak may be greater).

A second point concerns the object requested. In my opinion, we cannot attest that an app claiming to provide providers, voices, and speech synthesis actually does so. Therefore, we cannot clearly ask users: "App wants to use system services to retrieve available voices and speak," as you wrote earlier. What we ask for must not attest to something that we are not sure will work as intended.

1 reply

Mikenux May 11, 2025

@swick: This plus #1570 (comment)
