SRT 32 character line limit doesn't take into account wide/fullwidth characters #370
Comments
Is this a reference to something in the namespace transcript spec or something in the SRT spec itself? |
It is in reference to the podcast namespace spec: https://github.com/Podcastindex-org/podcast-namespace/blob/main/transcripts/transcripts.md#srt As far as I'm aware, there is no official specification for SRT on maximum line length, although various organisations have their own guidelines. The general idea is the same, to put just enough text on the screen that it appears for at least 1-2 seconds, and doesn't exceed two lines. The namespace's standard of 32 characters is possibly a little bit narrower than most other recommendations I have found for English, but too wide for Chinese/Japanese/Korean due to the double-width characters. |
Looking across the web for existing guidelines, I found the following page from JBI which also arrived at the same 16 character line limit: https://jbilocalization.com/tips-subtitling-captioning-japanese/ It would be relatively straightforward to amend the spec to halve the character limit for languages with double-width characters, although I think it would be more flexible when considering multilingual transcripts to instead count each double-width character as 2 characters. That way, if a single line contains a mix of both Chinese and English words, it should still add up to the correct line length. |
Linking this on Podcastindex.social for more visibility, along with #407. https://podcastindex.social/@dave/110083279500496922 |
How's this? "SRT files are for use in simple closed-captions or can be parsed for display on a website. SRT lines should be limited to 32 visible characters, to ensure they are visible on a wide variety of screen sizes without alteration. SRT files are more widely supported than other formats, and are highly recommended." "JSON files are for use in more advanced circumstances, and contain timing for each word's start and end." |
This doesn't address the issue at hand (maybe you posted it on the wrong issue?). I would suggest wording along these lines: "SRT lines should be limited to 32 characters with each full-width character (Chinese/Japanese/Korean) counting as 2 characters." |
As another reference, here are Netflix's limits. They give a special character limit for Chinese, Japanese and Korean, and then they have an "everything else" category. The CJK character limit is roughly half that of the "everything else" category. However, Netflix doesn't actually specify what should happen when using a mix of scripts in a single line, so the wording I proposed above (or something like it) would better handle that case (not only for multilingual podcasts, but also for the Japanese language itself, whose writing system already mixes 4 different scripts including Latin.) |
Ah, unless the word "visible" is intended to indirectly do all the work of accounting for full-width characters, not by mentioning them explicitly but just by implying that anything "invisible" would be bad. Although I find the wording a bit problematic. On Netflix and Teletext, we can tell subtitle publishers to make sure their text will be visible precisely because the Netflix and Teletext font proportions and layout are specified, and visibility can be determined from that. But in podcast apps, different developers are free to use different screen layouts and font proportions. SRT publishers can't simply be told to make sure their text will be "visible" in any app, and then somehow be able to derive that the character limit for CJK languages should be 16. I think it's got to be the other way around, since SRT publishers only deal with character limits, not font proportions and layouts which vary from app to app. So we need to give SRT publishers a concrete character limit, and THEN it would be up to the app developers to ensure visibility. |
Yes, I meant "visible" as exactly that - what is visible on the screen once you've decoded it. "SRT files are for use in simple closed-captions or can be parsed for display on a website. SRT lines should be limited to 32 visible characters on-screen after decoding, to ensure they are visible on a wide variety of screen sizes without alteration. When using some alphabets or characters, that may mean an SRT file is longer than 32 characters before decoding. SRT files are more widely supported than other formats, and are highly recommended." Does that work better? So - 💩 is (I think) three characters - U+1F4A9 - but displays as one character on-screen after decoding. Or maybe it's two characters. But it displays as one character on-screen. That gives us a hard and fast rule - character limits, not worrying about the proportions of text and trying to calculate the difference between the length of 'm' and of 'i'. That's up to the app to deal with (and the app knows that were they |
I'd probably add: it's likely that Netflix shows bigger text for complex characters like Chinese and Korean, to aid in comprehension. My memory from watching SD television on a CRT screen in Hong Kong would suggest that's the case: the resolution wasn't capable of showing text as small as the characters we use. |
But that's not a side note or an afterthought, that is actually THE whole purpose of this issue :-) Compare these two lines, the first in English, the second in Japanese: Hello and welcome to this week's Both of these lines contain exactly 32 characters each. The 32 character limit suffered from the same sort of oversight that often happens with CJK languages (Chinese, Japanese, Korean) (see this), although these 3 languages are a well-recognised exception in typography, which is why subtitle guidelines tend to have a general rule that applies to most languages, and then tend to have language-specific exceptions for these specific languages. It's not as though we have 100s of languages and therefore we have 100s of potentially different character limits to address each language. No, actually latin script has been highly influential throughout the world's languages, with many languages simply borrowing and extending the latin alphabet, and even for scripts that don't follow the latin alphabet, their typography has followed latin conventions in that they have adopted a similar average character width to latin. The exceptions are very few, and so we can list these exceptions explicitly to make it absolutely clear. That is to say, we don't need to come up with some general and vague wording that could apply to any language in the future, we can actually list out the exceptions right now and make it clear and unambiguous. This issue is not about how 'i' and 'm' may have different widths in proportional fonts, that's ultimately irrelevant when designing a publishing guideline for subtitles because the guidelines are based on the averages that are appropriate for the particular language. And in English, lines of text with similar character counts will tend to have roughly similar widths due to the way averages play out, even though proportional fonts allow for theoretically worst case extremes. 
This is somewhat addressed in most guidelines which advise having some margin space on either side when rendering, which means that there is some extra room for proportional fonts to grow. This issue is also not about unicode encoding and decoding. The number of bytes used to encode a unicode codepoint does not determine the visible width of that character, and those bytes are not "characters" by modern character counting conventions (on UNIX, see (In my original proposal, I was also trying to address the case of multilingual podcasts. But even then, I think we should avoid going to the level of byte encodings which has nothing to do with rendering, and would confuse the matter. In a UNIX terminal of 80 columns by 24 rows, originally designed for latin script in ASCII encoding, that originally meant an 80 character line limit. But when this was extended to support CJK languages, that meant a line limit of 40 CJK characters, or 80 latin characters, or when a line contains some combination of both kinds of characters, each CJK character occupies the same amount of space as 2 latin characters. If we are to have a special case for certain languages like this, rather than define a special character limit of 16, we could instead specify a special width factor of 2, where that number could contain decimal places.) |
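As an illustration of the point above that encoded bytes are not "characters", here is a minimal Python sketch. It is not part of the namespace spec, and the helper name `describe` is hypothetical. It counts the same text in three units (Unicode code points, UTF-8 bytes, UTF-16 code units) and shows that none of these counts says anything about how wide the text appears on screen.

```python
def describe(text: str) -> dict:
    """Count the same text in three different units."""
    return {
        # Python's len() counts Unicode code points
        "code_points": len(text),
        # Storage size when the text is encoded as UTF-8
        "utf8_bytes": len(text.encode("utf-8")),
        # Number of 16-bit code units when encoded as UTF-16
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


for sample in ("abc", "日本語", "💩"):
    print(repr(sample), describe(sample))
```

Under this counting, "abc" and "日本語" are both 3 code points even though the second takes 9 bytes in UTF-8, and the emoji is a single code point (U+1F4A9) that takes 4 UTF-8 bytes or 2 UTF-16 code units. A line limit expressed in characters would therefore be independent of the encoding, which is why width has to be handled as a separate rule.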
Please: I gave a (hamfisted) example of what the spec should say. Here's where you can do similar. Propose what the spec should say, and we can all agree to it. This isn't a space for a long argument, this is a space for "here's what the spec should say in my experience - everyone agree?" |
"Please", no need to suggest that. I already did precisely that, and you disagreed with my proposed wording to explicitly mention special case languages in the spec. We are at a point now where we disagree, and you haven't given any reasons for what you have against my wording that aims to mention these special case languages, and it was your continued resistance to that that brought about more detailed reasoning from me for why we really do need to do this. If you want to put a plug in hearing further reasons, "please" engage with the specifics of the proposal and tell me why do you not accept my proposed wording to explicitly specify a special limit for the C/J/K languages. Sure, like your own, my initial wording may be hamfisted in some ways, but I could have actually improved on that if you had engaged directly with the proposal and said specifically why you don't accept it.
We have already done that. Everyone did not agree. The next step would be to hear the reasons on "both" sides and then make an informed decision weighing both sides. A simple vote that is not an informed vote would be terrible for international languages, and the fact that most people who tend to contribute to projects such as this are probably not well informed on CJK typesetting issues is the very reason why the current specification made the oversight on international language issues in the first place. You don't need to rush a decision about an important issue that most contributors may happen to be not-so-well informed about. Let's hear from you what you specifically don't like about explicitly mentioning the special case languages (which is actually a common industry practice!) and I promise I am the type of person who is amenable to reason. |
We're not. We're at a point where I have tried to suggest wording for a specification, and you have told me it's wrong.
I do not disagree with mentioning them. I was under the impression you were discussing encoding, and nothing more. I'm grateful for the additional information. Please write what the spec should say. I'm not disagreeing with you. I just don't know what you want the spec to say. |
See the above comment from two weeks ago. |
OK,
Could I suggest a simpler:
I don't understand the reference to "full-width characters". So, the full wording proposed is:
|
That's sort of the "first level" of what I'm proposing, and I agree it is simple to express at this level. But the second level is to account for multilingual podcasts where the speaker switches between different languages. Let's consider a primarily English language podcast where the host gives you Japanese lessons in English. In other words, 80-90% of the words will be in English, but occasionally new vocabulary will be introduced in Japanese and then explained in English. Now neither the 32 character limit nor the 16 character limit is appropriate, but my proposed (admittedly less simple) wording will handle it. Here is an example, where we choose the 32 character limit and the second line ends up being too wide: Today we'll learn to write words Since it's mixed, the second line is not as wide as a line that is 100% in Japanese, and it's not as narrow as a line that's 100% in English. To account for multilingual podcasts, what we want here is to set the character limit at 32 standard characters, but then to say that each full-width CJK character counts as 2 standard characters (and to be clear, I'm not extending that to the odd English character that happens to be wide, like 'M', so just CJK full-width characters here). Then, we have a specification of how long the second line should be before the line break: Today we'll learn to write words where we count "in " (3) + "カタカ" (6) + " like " (6) + "フライドポテト" (14) totaling 29. The second line leaves room for 3 more characters, but had we added " and" (4) onto the end of that line, it would have totaled 33 and exceeded the limit by one character, hence we break here before " and". This way of wording the spec produces reasonable line widths even for multilingual podcasts, even though I grant that the wording is less simple. 
So if you want to approach this in stages and delay the multilingual aspect for a later decision, then yes, the wording in your last comment is a clear expression of the first level of this proposal. |
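The counting rule described in this comment (32 standard characters per line, each full-width CJK character counting as 2) could be sketched in Python roughly as follows. This is only an illustration, not proposed spec text. It assumes that "full-width CJK character" can be approximated by the Unicode East Asian Width classes `W` and `F`, as reported by the standard library's `unicodedata.east_asian_width`; the names `weighted_length` and `fits` are hypothetical.

```python
import unicodedata

LINE_LIMIT = 32  # the 32-character limit discussed in this issue


def weighted_length(line: str) -> int:
    """Count a line, with each wide character counting as 2."""
    total = 0
    for ch in line:
        # Assumption: East Asian Width "W" (wide) or "F" (fullwidth)
        # stands in for "full-width CJK character".
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            total += 2
        else:
            total += 1
    return total


def fits(line: str, limit: int = LINE_LIMIT) -> bool:
    """True if the weighted length does not exceed the limit."""
    return weighted_length(line) <= limit


first_line = "Today we'll learn to write words"
second_line = "in カタカ like フライドポテト"

print(weighted_length(first_line))    # 32
print(weighted_length(second_line))   # 29
print(fits(second_line + " and"))     # False, since 29 + 4 = 33
```

The numbers match the worked example in the comment: the mixed line counts as 29, and appending " and" would bring it to 33, one over the limit. Note that a wide Latin letter such as 'M' still counts as 1 under this sketch, in line with the comment's intent.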
I'm merely asking for what you think is the whole proposal. I'm not wanting to complicate this - just gently probing to ensure we have the correct wording that won't confuse people. In order to expedite this, how's this:
Your link to "full-width" links to a page about unicode, which takes us back down the confusing rabbit hole of encoding methods, which is what I initially thought you were talking about. I'm merely trying to find something that is easily understandable. You're clearly knowledgeable about this, so I'd welcome your proposed wording. (I know it says "maintainer" on my name here; but I'm a volunteer just like you are, and I only have additional privileges here to assist others.) |
Not because of their complexity, you could delete that clause.
"Full-width" is just the name of the category of the "visually" wide CJK characters we're talking about. If you don't like the previous link, here's another link which explains it independently of Unicode, just so we're clear we're not talking about how a CJK character is encoded (i.e. what bytes and how many bytes), we are just talking about the specific category of wide characters that are used in CJK. If you object to that word, instead of saying "full-width CJK characters", I suppose you could say it more colloquially as "wide CJK characters" or "CJK characters" etc.
I've suggested a concise wording, but I think clarity would be improved by making it less concise: The maximum number of characters per line is 32, with the following exceptions:
When more than one language appears within the same line, count the number of characters belonging to each language and divide each character count by the respective character limit for that language. The sum of these fractions should not exceed 1. For example, if a line contains 16 English characters and 8 Japanese characters, then it fits perfectly within the character limit, because 16/32 + 8/16 is exactly 1. By enumerating out the language-specific exceptions, it allows us to add further exceptions in the future for languages we didn't consider right now (e.g. Arabic). By explaining the multilingual case separately, it simplifies the language for the common case of monolingual transcripts. |
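The fraction rule for mixed-language lines can be checked with a few lines of Python. This is a sketch under stated assumptions: the `LIMITS` table, its language keys, and the function names are hypothetical, and the per-language character counts are taken as given rather than detected from the text.

```python
from fractions import Fraction

# Hypothetical per-language limits: 32 by default, 16 for the CJK exception.
LIMITS = {"en": 32, "ja": 16}


def line_load(counts: dict) -> Fraction:
    """Sum of (character count / character limit) over each language."""
    return sum(
        (Fraction(n, LIMITS[lang]) for lang, n in counts.items()),
        Fraction(0),
    )


def within_limit(counts: dict) -> bool:
    """A line is acceptable when the summed fractions do not exceed 1."""
    return line_load(counts) <= 1


print(line_load({"en": 16, "ja": 8}))      # 1
print(within_limit({"en": 16, "ja": 8}))   # True
print(within_limit({"en": 17, "ja": 8}))   # False
```

Exact fractions are used instead of floats so that a line sitting exactly on the limit, like the 16 + 8 example, is not rejected by rounding. With limits of 32 and 16 this rule gives the same result as counting each Japanese character as 2, and it extends naturally if another language were given a different limit.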
Characters in Chinese, Japanese and Korean are twice the width of characters in the Latin alphabet. As such, applying the same 32 character line limit in SRT files to Chinese as you would do to English would effectively allow Chinese SRT lines to be twice as long as English lines and may overflow the viewing window. Probably for Chinese, the character limit should be half of 32, i.e. 16.
If we allow for multilingual transcripts (#367), then there may be a mix of Chinese words and English words in the same sentence, and so it would make more sense to specify that each Chinese/Japanese/Korean character counts as 2 characters when computing the 32 character line limit.
For more information, see: https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms