SRT 32 character line limit doesn't take into account wide/fullwidth characters #370

Open
ryan-lp opened this issue Apr 28, 2022 · 19 comments
Labels
discussion needed, help wanted

Comments

@ryan-lp

ryan-lp commented Apr 28, 2022

Characters in Chinese, Japanese and Korean are twice the width of characters in the Latin alphabet. As such, applying the same 32 character line limit in SRT files to Chinese as you would do to English would effectively allow Chinese SRT lines to be twice as long as English lines and may overflow the viewing window. Probably for Chinese, the character limit should be half of 32, i.e. 16.

If we allow for multilingual transcripts (#367 ), then there may be a mix of Chinese words and English words in the same sentence, and so it would make more sense to specify that each Chinese/Japanese/Korean character counts as 2 characters when computing the 32 character line limit.

For more information, see: https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms
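A minimal sketch of the "count each wide character as 2" idea, assuming Python's standard `unicodedata.east_asian_width` is an acceptable way to detect wide characters. The name `display_width` and the choice to treat the `W` (wide) and `F` (fullwidth) classes as double width are illustrative assumptions, not part of the namespace spec:

```python
import unicodedata

LIMIT = 32  # the namespace's current per-line limit


def display_width(line: str) -> int:
    """Count wide/fullwidth characters as 2 and everything else as 1.

    Assumption: East Asian Width classes "W" and "F" stand in for
    "Chinese/Japanese/Korean characters". Combining marks and emoji
    sequences are not handled specially in this sketch.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in line
    )


print(display_width("Hello and welcome to this week's"))  # 32
print(display_width("こんにちは"))  # 10 (5 characters, each counted as 2)
print(display_width("こんにちは") <= LIMIT)  # True
```

With this kind of counting, a line that mixes scripts is measured with one rule instead of needing a separate limit per language.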

@daveajones
Contributor

Is this a reference to something in the namespace transcript spec or something in the SRT spec itself?

@ryan-lp
Author

ryan-lp commented Apr 28, 2022

It is in reference to the podcast namespace spec:

https://github.com/Podcastindex-org/podcast-namespace/blob/main/transcripts/transcripts.md#srt

As far as I'm aware, there is no official specification for SRT on maximum line length, although various organisations have their own guidelines. The general idea is the same, to put just enough text on the screen that it appears for at least 1-2 seconds, and doesn't exceed two lines. The namespace's standard of 32 characters is possibly a little bit narrower than most other recommendations I have found for English, but too wide for Chinese/Japanese/Korean due to the double-width characters.

@ryan-lp
Author

ryan-lp commented May 12, 2022

Looking across the web for existing guidelines, I found the following page from JBI which also arrived at the same 16 character line limit:

https://jbilocalization.com/tips-subtitling-captioning-japanese/

It would be relatively straightforward to amend the spec to halve the character limit for languages with double-width characters, although I think it would be more flexible when considering multilingual transcripts to instead count each double-width character as 2 characters. That way, if a single line contains a mix of both Chinese and English words, it should still add up to the correct line length.

@daveajones
Contributor

Linking this on Podcastindex.social for more visibility, along with #407. https://podcastindex.social/@dave/110083279500496922

@daveajones added the "help wanted" and "discussion needed" labels on Mar 25, 2023
@jamescridland
Contributor

How's this?

"SRT files are for use in simple closed-captions or can be parsed for display on a website. SRT lines should be limited to 32 visible characters, to ensure they are visible on a wide variety of screen sizes without alteration. SRT files are more widely supported than other formats, and are highly recommended."

"JSON files are for use in more advanced circumstances, and contain timing for each word's start and end."

@ryan-lp
Author

ryan-lp commented Mar 26, 2023

> How's this?
>
> "SRT files are for use in simple closed-captions or can be parsed for display on a website. SRT lines should be limited to 32 visible characters, to ensure they are visible on a wide variety of screen sizes without alteration. SRT files are more widely supported than other formats, and are highly recommended."

This doesn't address the issue at hand (maybe you posted it on the wrong issue?). I would suggest wording along these lines:

"SRT lines should be limited to 32 characters with each full-width character (Chinese/Japanese/Korean) counting as 2 characters."

@ryan-lp
Author

ryan-lp commented Mar 26, 2023

As another reference, here are Netflix's limits.

They give a special character limit for Chinese, Japanese and Korean, and then they have an "everything else" category. The CJK character limit is roughly half that of the "everything else" category. However, Netflix doesn't actually specify what should happen when using a mix of scripts in a single line, so the wording I proposed above (or something like it) would better handle that case (not only for multilingual podcasts, but also for the Japanese language itself, whose writing system already mixes 4 different scripts including Latin).

@ryan-lp
Author

ryan-lp commented Mar 27, 2023

> > SRT lines should be limited to 32 visible characters
>
> This doesn't address the issue at hand

Ah, unless the word "visible" is intended to indirectly do all the work of accounting for full-width characters, not by mentioning them explicitly but just by implying that anything "invisible" would be bad. Although I find the wording a bit problematic.

On Netflix and Teletext, we can tell subtitle publishers to make sure their text will be visible precisely because the Netflix and Teletext font proportions and layout are specified, and visibility can be determined from that. But in podcast apps, different developers are free to use different screen layouts and font proportions. SRT publishers can't simply be told to make sure their text will be "visible" in any app, and then somehow be able to derive that the character limit for CJK languages should be 16. I think it's got to be the other way around, since SRT publishers only deal with character limits, not font proportions and layouts which vary from app to app. So we need to give SRT publishers a concrete character limit, and THEN it would be up to the app developers to ensure visibility.

@jamescridland
Contributor

jamescridland commented Apr 6, 2023

Yes, I meant "visible" as exactly that - what is visible on the screen once you've decoded it.

"SRT files are for use in simple closed-captions or can be parsed for display on a website. SRT lines should be limited to 32 visible characters on-screen after decoding, to ensure they are visible on a wide variety of screen sizes without alteration. When using some alphabets or characters, that may mean an SRT file is longer than 32 characters before decoding. SRT files are more widely supported than other formats, and are highly recommended."

Does that work better?

So - 💩 is (I think) three characters - U+1F4A9 - but displays as one character on-screen after decoding. Or maybe it's two characters. But it displays as one character on-screen.

That gives us a hard and fast rule - character limits, not worrying about the proportions of text and trying to calculate the difference between the length of 'm' and of 'i'. That's up to the app to deal with (and the app knows that were they to use fixed-width text then everything would be a maximum of 32 characters and the same physical length once decoded).

@jamescridland
Contributor

I'd probably add: it's likely that Netflix shows bigger text for complex characters like Chinese and Korean, to aid in comprehension. My memory from watching SD television on a CRT screen in Hong Kong would suggest that's the case: the resolution wasn't capable of showing text as small as the characters we use.

@ryan-lp
Author

ryan-lp commented Apr 6, 2023

> I'd probably add: it's likely that Netflix shows bigger text for complex characters like Chinese and Korean, to aid in comprehension. My memory from watching SD television on a CRT screen in Hong Kong would suggest that's the case: the resolution wasn't capable of showing text as small as the characters we use.

But that's not a side note or an afterthought, that is actually THE whole purpose of this issue :-)

Compare these two lines, the first in English, the second in Japanese:

Hello and welcome to this week's
こんにちは、そして今週のあなたの好きな番組のエピソードにようこそ

Both of these lines contain exactly 32 characters each. The 32 character limit suffered from the same sort of oversight that often happens with CJK languages (Chinese, Japanese, Korean) (see this), although these 3 languages are a well-recognised exception in typography, which is why subtitle guidelines tend to have a general rule that applies to most languages, and then tend to have language-specific exceptions for these specific languages.

It's not as though we have 100s of languages and therefore we have 100s of potentially different character limits to address each language. No, actually Latin script has been highly influential throughout the world's languages, with many languages simply borrowing and extending the Latin alphabet, and even for scripts that don't follow the Latin alphabet, their typography has followed Latin conventions in that they have adopted a similar average character width to Latin. The exceptions are very few, and so we can list these exceptions explicitly to make it absolutely clear. That is to say, we don't need to come up with some general and vague wording that could apply to any language in the future, we can actually list out the exceptions right now and make it clear and unambiguous.

This issue is not about how 'i' and 'm' may have different widths in proportional fonts, that's ultimately irrelevant when designing a publishing guideline for subtitles because the guidelines are based on the averages that are appropriate for the particular language. And in English, lines of text with similar character counts will tend to have roughly similar widths due to the way averages play out, even though proportional fonts allow for theoretically worst case extremes. This is somewhat addressed in most guidelines which advise having some margin space on either side when rendering, which means that there is some extra room for proportional fonts to grow.

This issue is also not about Unicode encoding and decoding. The number of bytes used to encode a Unicode code point does not determine the visible width of that character, and those bytes are not "characters" by modern character counting conventions (on UNIX, see wc -m vs the vestigial wc -c). I think if you start bringing this low-level terminology into what should be a really simple spec, it will only confuse the matter. When programmers implement the spec and count how many characters are in a line, they're going to use the standard Unicode-aware string length function of their programming language. And for the above two example lines, they're going to get a count of 32 characters of English, and 32 characters of Japanese. What's then needed is a clear spec that simply says that these particular languages have a different character limit, which is XYZ, and aside from those special cases, every other language has a character limit of 32.
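To make the distinction between characters and bytes concrete, here is a small sketch in Python, where `len()` on a string counts Unicode code points. The variable names are just for illustration, and UTF-8 is assumed as the file encoding:

```python
english = "Hello and welcome to this week's"
japanese = "こんにちは、そして今週のあなたの好きな番組のエピソードにようこそ"

for text in (english, japanese):
    code_points = len(text)                  # what a string length function reports
    utf8_bytes = len(text.encode("utf-8"))   # what `wc -c` would report for UTF-8
    print(code_points, utf8_bytes)
# prints "32 32" for the English line and "32 96" for the Japanese line

# The emoji mentioned earlier in this thread is a single code point,
# whatever its encoded size happens to be.
poo = "\U0001F4A9"
print(len(poo))                           # 1 code point
print(len(poo.encode("utf-8")))           # 4 bytes in UTF-8
print(len(poo.encode("utf-16-le")) // 2)  # 2 code units in UTF-16
```

Neither the byte count nor the code point count says anything about how wide the line is on screen, which is why both example lines come out as 32 characters.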

(In my original proposal, I was also trying to address the case of multilingual podcasts. But even then, I think we should avoid going to the level of byte encodings which has nothing to do with rendering, and would confuse the matter. In a UNIX terminal of 80 columns by 24 rows, originally designed for Latin script in ASCII encoding, that originally meant an 80 character line limit. But when this was extended to support CJK languages, that meant a line limit of 40 CJK characters, or 80 Latin characters, or when a line contains some combination of both kinds of characters, each CJK character occupies the same amount of space as 2 Latin characters. If we are to have a special case for certain languages like this, rather than define a special character limit of 16, we could instead specify a special width factor of 2, where that number could contain decimal places.)

@jamescridland
Contributor

Please: I gave a (hamfisted) example of what the spec should say. Here's where you can do similar.

Propose what the spec should say, and we can all agree to it. This isn't a space for a long argument, this is a space for "here's what the spec should say in my experience - everyone agree?"

@ryan-lp
Author

ryan-lp commented Apr 7, 2023

> Please: I gave a (hamfisted) example of what the spec should say. Here's where you can do similar.

"Please", no need to suggest that. I already did precisely that, and you disagreed with my proposed wording to explicitly mention special case languages in the spec. We are at a point now where we disagree, and you haven't given any reasons for what you have against my wording that aims to mention these special case languages, and it was your continued resistance to that that brought about more detailed reasoning from me for why we really do need to do this. If you want to put a plug in hearing further reasons, "please" engage with the specifics of the proposal and tell me why do you not accept my proposed wording to explicitly specify a special limit for the C/J/K languages. Sure, like your own, my initial wording may be hamfisted in some ways, but I could have actually improved on that if you had engaged directly with the proposal and said specifically why you don't accept it.

> Propose what the spec should say, and we can all agree to it. This isn't a space for a long argument, this is a space for "here's what the spec should say in my experience - everyone agree?"

We have already done that. Everyone did not agree. The next step would be to hear the reasons on "both" sides and then make an informed decision weighing both sides. A simple vote that is not an informed vote would be terrible for international languages, and the fact that most people who tend to contribute to projects such as this are probably not well informed on CJK typesetting issues is the very reason why the current specification made the oversight on international language issues in the first place.

You don't need to rush a decision about an important issue that most contributors may happen to be not-so-well informed about. Let's hear from you what you specifically don't like about explicitly mentioning the special case languages (which is actually a common industry practice!) and I promise I am the type of person who is amenable to reason.

@jamescridland
Contributor

> We are at a point now where we disagree

We're not. We're at a point where I have tried to suggest wording for a specification, and you have told me it's wrong.

> Let's hear from you what you specifically don't like about explicitly mentioning the special case languages

I do not disagree with mentioning them. I was under the impression you were discussing encoding, and nothing more. I'm grateful for the additional information.

Please write what the spec should say. I'm not disagreeing with you. I just don't know what you want the spec to say.

@ryan-lp
Author

ryan-lp commented Apr 7, 2023

See the above comment from two weeks ago.

@jamescridland
Contributor

OK,

"SRT lines should be limited to 32 characters with each full-width character (Chinese/Japanese/Korean) counting as 2 characters."

Could I suggest a simpler:

"SRT lines should be limited to 32 characters, except where the transcript is in Chinese, Japanese or Korean, in which case SRT lines should be limited to 16 characters."

I don't understand the reference to "full-width characters".

So, the full wording proposed is:

"SRT files are for use in simple closed-captions or can be parsed for display on a website. SRT lines should be limited to 32 characters, except where the transcript is in Chinese, Japanese or Korean, in which case SRT lines should be limited to 16 characters. This is to ensure they are visible on a wide variety of screen sizes without alteration. SRT files are more widely supported than other formats, and are highly recommended."

"JSON files are for use in more advanced circumstances, and contain timing for each word's start and end."

@ryan-lp
Author

ryan-lp commented Apr 7, 2023

That's sort of the "first level" of what I'm proposing, and I agree it is simple to express at this level.

But the second level is to account for multilingual podcasts where the speaker switches between different languages. Let's consider a primarily English language podcast where the host gives you Japanese lessons in English. In other words, 80-90% of the words will be in English, but occasionally new vocabulary will be introduced in Japanese and then explained in English. Now neither the 32 character limit nor the 16 character limit is appropriate, but my proposed (admittedly less simple) wording will handle it. Here is an example, where we choose the 32 character limit and the second line ends up being too wide:

Today we'll learn to write words
in カタカ like フライドポテト and ステーキハウス。

Since it's mixed, the second line is not as wide as a line that is 100% in Japanese, and it's not as narrow as a line that's 100% in English.

To account for multilingual podcasts, what we want here is to set the character limit at 32 standard characters, but then to say that each full-width CJK character counts as 2 standard characters (and to be clear, I'm not extending that to the odd English character that happens to be wide, like 'M', so just CJK full-width characters here). Then, we have a specification of how long the second line should be before the line break:

Today we'll learn to write words
in カタカ like フライドポテト

where we count "in " (3) + "カタカ" (6) + " like " (6) + "フライドポテト" (14) totaling 29. The second line leaves room for 3 more characters, but had we added " and" (4) onto the end of that line, it would have totaled 33 and exceeded the limit by one character, hence we break here before " and". This way of wording the spec produces reasonable line widths even for multilingual podcasts, even though I grant that the wording is less simple.
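The arithmetic above can be checked with a short sketch. It assumes wide characters are detected with Python's `unicodedata.east_asian_width`, and uses a simple greedy wrap on spaces; the helper names `width` and `wrap` are hypothetical, and a real implementation would also need to break inside long runs of CJK text, which contain no spaces:

```python
import unicodedata


def width(text: str) -> int:
    """Wide/fullwidth characters count as 2, everything else as 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def wrap(text: str, limit: int = 32) -> list[str]:
    """Greedy wrap on spaces so that each line's weighted width <= limit.

    A single word wider than the limit is left on its own line unsplit.
    """
    lines: list[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and width(candidate) > limit:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


text = "Today we'll learn to write words in カタカ like フライドポテト and ステーキハウス。"
for line in wrap(text):
    print(width(line), line)
```

Under these assumptions the second line comes out as "in カタカ like フライドポテト" with a weighted width of 29, and " and" is pushed to the next line because it would bring the total to 33.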

So if you want to approach this in stages and delay the multilingual aspect for a later decision, then yes, the wording in your last comment is a clear expression of the first level of this proposal.

@jamescridland
Contributor

> So if you want to approach this in stages and delay the multilingual aspect for a later decision

I'm merely asking for what you think is the whole proposal. I'm not wanting to complicate this - just gently probing to ensure we have the correct wording that won't confuse people.

In order to expedite this, how's this:

"SRT lines should be limited to 32 characters. Because of their complexity, Chinese, Japanese or Korean characters are counted as two characters wide."

Your link to "full-width" links to a page about unicode, which takes us back down the confusing rabbit hole of encoding methods, which is what I initially thought you were talking about. I'm merely trying to find something that is easily understandable. You're clearly knowledgeable about this, so I'd welcome your proposed wording.

(I know it says "maintainer" on my name here; but I'm a volunteer just like you are, and I only have additional privileges here to assist others.)

@ryan-lp
Author

ryan-lp commented Apr 7, 2023

> "SRT lines should be limited to 32 characters. Because of their complexity, Chinese, Japanese or Korean characters are counted as two characters wide."

Not because of their complexity, you could delete that clause.

> Your link to "full-width" links to a page about unicode, which takes us back down the confusing rabbit hole of encoding methods, which is what I initially thought you were talking about.

"Full-width" is just the name of the category of the "visually" wide CJK characters we're talking about. If you don't like the previous link, here's another link which explains it independently of Unicode, just so we're clear we're not talking about how a CJK character is encoded (i.e. what bytes and how many bytes), we are just talking about the specific category of wide characters that are used in CJK.

If you object to that word, instead of saying "full-width CJK characters", I suppose you could say it more colloquially as "wide CJK characters" or "CJK characters" etc.

> I'm merely trying to find something that is easily understandable. You're clearly knowledgeable about this, so I'd welcome your proposed wording.

I've suggested a concise wording, but I think clarity would be improved by making it less concise:


The maximum number of characters per line is 32, with the following exceptions:

  • Chinese: 16 characters.
  • Japanese: 16 characters.
  • Korean: 16 characters.

When more than one language appears within the same line, count the number of characters belonging to each language and divide each character count by the respective character limit for that language. The sum of these fractions should not exceed 1. For example, if a line contains 16 English characters and 8 Japanese characters, then it fits perfectly within the character limit, because 16/32 + 8/16 is exactly 1.
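The fraction rule can be sketched as follows. Since characters do not carry a language tag, this sketch makes the loud assumption that any wide or fullwidth character (per `unicodedata.east_asian_width`) belongs to the 16-character group and everything else, including spaces, to the 32-character group. The names `LIMITS`, `bucket`, `line_fill` and `fits` are hypothetical:

```python
import unicodedata
from collections import Counter
from fractions import Fraction

# Per-group character limits from the proposed wording.
# Further exceptions could be added as new groups.
LIMITS = {"cjk": 16, "default": 32}


def bucket(ch: str) -> str:
    """Assumed classification: wide/fullwidth characters are treated as CJK."""
    return "cjk" if unicodedata.east_asian_width(ch) in ("W", "F") else "default"


def line_fill(line: str) -> Fraction:
    """Sum of (character count / limit) over each group present in the line."""
    counts = Counter(bucket(ch) for ch in line)
    return sum((Fraction(n, LIMITS[group]) for group, n in counts.items()), Fraction(0))


def fits(line: str) -> bool:
    return line_fill(line) <= 1


example = "a" * 16 + "あ" * 8   # 16 English + 8 Japanese characters
print(line_fill(example))        # 1, i.e. 16/32 + 8/16
print(fits(example))             # True
print(fits(example + "a"))       # False, one character over
```

With limits of 32 and 16 this gives the same answer as counting each CJK character as 2 against a limit of 32. The fractional form only behaves differently once a group has a limit that is not a simple divisor of 32.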


By enumerating out the language-specific exceptions, it allows us to add further exceptions in the future for languages we didn't consider right now (e.g. Arabic).

By explaining the multilingual case separately, it simplifies the language for the common case of monolingual transcripts.
