Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance no entities spike #13

Closed
wants to merge 40 commits into from
Closed

Conversation

pietrop
Copy link
Owner

@pietrop pietrop commented Mar 12, 2020

Is your Pull Request request related to another issue in this repository ?

Fixes this bbc#150 and addresses comment bbc#150 (comment)

Describe what the PR does

TL:DR: Seeing if removing entities can remove the performance issue. (spoiler alert, it seems like it does 🎉 )

At first glance removing entities means that we'd loose the correspondence between words and timecodes. However thanks to stt-align-node (original algo by @chrisbaume and integration with @bbc/react-transcript-editor by @murezzda) it seems like we have an inexpensive way to restore alignment between plain text and timecoded words.

This PR Removes word level entities. But keeps the data attribute at block paragraph level, that contains the words list for each paragraph in the draftJS json.

It introduce paragraph level highlight (vs previous word level) by adding previous timecodes of words in previous blocks in paragraph html data attribute, to maintain css injection similar to how current and previous words where displayed before.

Screen Shot 2020-03-13 at 5 19 04 PM

I tested the performance by selecting current text and replacing with 2h42min.txt from bbc#150 (comment)

But you can test with something over one hour eg with Facebook CEO Mark Zuckerberg FULL testimony before U.S. senate ~ 5 hours (eg using electron-video-downloader to download the video),

this is the json of the transcription in digitalpaperedit format Facebook CEO Mark Zuckerberg FULL testimony before U.S. senate-pXq-5L2ghhg.mp4.dpe.json.txt (remove .txt from the file extension after downloading to get the .json - github doesn't allow to upload json files 🤷‍♂ )

or this is the json of the transcript in draftjs format
Facebook CEO Mark Zuckerberg FULL testimony before U.S. senate-pXq-5L2ghhg.mp4.draftjs.json.txt

Either with the whole 5 hours or a trim of 2 hours.

There might be further optimizations that can be done, like not realigning as often. eg only realigning when exporting or saving content.

State whether the PR is ready for review or whether it needs extra work

Looking for a review

I'd encourage to clone and stress test locally, but I've also deployed this version on the github pages of my fork to remove some friction etc..

Additional context

  • removed entities, commented out the function that creates entities. Lukily it was a shared function across all adapters.
  • removed generateConfidence in Word as not useful/needed
  • removed Word all together
  • added generatePreviousTimes to paragraphs, to do paragraph highlights
  • Removed timecode indications at paragraph level. in favour of this fuzzy double click on word jump to media
  • add sync 🔄 refresh button to resync transcript manually ?
  • seems like local storage auto save could work, after removing entities, . Tested in demo after fixing it, with 2h40min worth of text, and it saved and retrieved fine. Still of the opinion to keep it in demo app as example, and leave it to individual implementations to decide whether they'd want to add it etc..

Issues

  • Some issues around cursor position jumping, during various edge cases.

Cursor position during delete

  • deleting a char using delete key
  • deleting a text selection, seems fine
    Cursor position during enter
  • click enter end of paragraph is disabled
  • click enter beginning of paragraph is also disabled
  • click enter middle of paragraph, cursor jump to top of the page 🤷‍♂
    • partially fixed, if you click enter in between the char of a word, or at the end of a word before the space, then the cursor behaves
    • if you click enter, before a word or after a space, then it jumps to the top of the page - leaving for a separate PR, as in previous version this bug was already present.
  • select text and hit enter, disabled (but you and click delete and then enter)
    • select text, hit delete, hit enter - if you do it fast the line break is removed, almost like an undo (most likely race condition between state updates) - leaving for separate PR
  • for future PR, also handling delete key, to retain cursor position, eg if deleting a selection of text etc..

@pietrop pietrop marked this pull request as ready for review March 13, 2020 21:51
@pbirsinger
Copy link

pbirsinger commented Mar 15, 2020

You are a true hero and I cannot thank you enough for this fine and noble work.

Not to speak for the community, but our team would be 100% ok with going to paragraph level highlight for now (or even sacrificing significant other features) if it means getting rid of lag, as that essentially makes it fully usable again.

Pietro Passarelli added 3 commits March 15, 2020 22:50
added Facebook's Zuck Senate hearing 5 hours example  to the demo app to stress test edge cases
added try catch block around local storage saving to avoid app crashing if it runs out of memory during save in demo app
removed auto sync for performance on longer files, also added sync btn to restore timestamps
@emettely
Copy link

Happy to review it @pietrop 👍 Can you add me to the reviewer?

@emettely
Copy link

emettely commented Mar 16, 2020

  • Trying it out in storybook - see GIF below, it hangs 😢
    storybook-rte

Just realised that the long file could be loaded in the demo - my bad - I just downloaded + reuploaded the file to my S3 bucket... 🤦‍♀ If you notice it's part of a different path in the Storybook, that's why.

The Storybook seems laggy and quite delayed in response, but maybe that's just the limitation of having a 5 hour long transcription service. If you pause and select text to jump to the right place, it seems to do it well, but not so much when you edit - it hangs (e.g. GIF above).

@pietrop
Copy link
Owner Author

pietrop commented Mar 16, 2020

Thanks @pbirsinger,

I think the main thing is also deciding what's the max length we aim to support or that is reasonable for this component. For my interview editing use cases I think it's on average 1hour but there might be occasional edge cases where it could go up to 2 hours, very unlikely 5 hours.

@emettely yeah, sorry I had not updated the comments after last commit.

Some more tweaks in latest commits

  • Added 5 hours example, both in draftJS and in digitalpaperedit format and txt. (I created it by downloading the facebook youtube video with electron- video -downaloder and using electron DPE to get the transcription, with AssemblyAI, I then used transcript editor in electron DPE to export as draftjs, dpe, and txt format, see demo/sample-data folder)
  • for testing/setup I've uploaded facebook video to WSJ S3 bucket, @emettely, you can move it to the BBC S3 bucket that has the Ted Talk as well. (which sounds from previous comment, you might already have)
  • commented out the alignement updateTimestampsForEditorState to improve performance when editing longer transcripts, still present when getting content, and when saving thou
  • default not show timecodes so that they don't show as NA:NA:NA when it's aligned
  • adjusted css of timecode to be responsive if timcodes are not shows (moved from css file to component)
  • saving backup for reference, make sure, that it does that after converting to draftJS, to make it compatibel with other adapters. if it's dpe format, then skip extra conversion?
  • add btn to optional auto align
  • fixed, show hide speaker and timecode labels in settings
  • Done a 2 hour demo version of FB interview, using same video file as the 5 hour one.

In RTE using the 5 hour transcription deleted the rest of the transcript after the 2 hour mark, to delete the next 3 hours. then exported in react transcript editor, as draftjs and digitalpaperedit in demo/sample-data

  • when testing the demo, un check Use local storage in demo options

  • update draftJS version to 11 / latest

Pietro Passarelli added 2 commits March 16, 2020 10:50
added as option to pass in speaker and timecodes labels as attributes to transcript editor with  defaults
… but also save state to reflect in the editor
@pietrop
Copy link
Owner Author

pietrop commented Mar 16, 2020

Another thing I am thinking about is, should we make handleAutoSaveChanges optional? would that improve perfromance?

re 🔄 btn, I tried to add an animation, so that if it takes longer to sync, it start spinning, and then it stops when it's done. But it seems that when the alignment is started, it takes up a lot of memory and blocks the process, so you don't see the animation 🤷‍♂
Thinking maybe offloading that computation to a service worker? (if compatible with electron etc..) or whether there's a way to address it/what's the root cause etc..

@emettely
Copy link

emettely commented Mar 16, 2020

handleAutoSaveChanges() should probably be an option - by default on, but possible to set as off. Is there a way to cleverly do this - e.g. examining the length / size of transcription?

A shame about the animation, I'm sure there's a way.

CSS animations are handled by the browser’s compositor thread rather than the main thread responsible for painting and styling. Consequently, such animations are unaffected by the main thread’s more expensive tasks.

Not too familiar with animations generally, but I've found that quote here

@pietrop
Copy link
Owner Author

pietrop commented Mar 16, 2020

re CSS animation, ok, fair, then I guess the question remain, why is it freezing...

And yeah, handleAutoSaveChanges could have an on/off at interface level etc.. something look into

Copy link

@emettely emettely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me mostly - functionally there 👍 The code is generally simplified because entity map is removed. More of a suggestion than actual fix: I think revisiting the getWordsBeforeBlock might be beneficial - it has comments to describe what it does, but probably could be simplified or renamed with getWordsBeforeParagraphBlock.

const DEMO_MEDIA_URL = demo.url;
const DEMO_TITLE = demo.title;
const DEMO_TYPE = demo.type

if(this.state.useLocalStorage && isPresentInLocalStorage(DEMO_MEDIA_URL)){

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* when testing the demo, un check `Use local storage` in demo options

This is a minor point, but when selecting a sample video in demo before I select the local storage checkbox, the video is automatically loaded and crashes. This might be something you want to uncheck in-code or add a load button, to allow the user to select all the options before loading the demo.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the flow would be, untick user local storage then click load demo.
Coz I have had people using the transcript editor page, in conjunction with autoEdit to correct transcripts, and didn't want to disable the local storage default auto save, just in case they were still using it.

But yeah, it's not ideal UX, probl untick as default would be better for development?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah seems sensible to untick by default

packages/components/media-player/index.js Outdated Show resolved Hide resolved
packages/components/media-player/index.js Outdated Show resolved Hide resolved
packages/components/timed-text-editor/CustomEditor.js Outdated Show resolved Hide resolved
packages/components/transcript-editor/index.js Outdated Show resolved Hide resolved
Comment on lines +50 to +51
showTimecodes: typeof this.props.showTimecodes ==='boolean'? this.props.showTimecodes: true,
showSpeakers: typeof this.props.showSpeakers ==='boolean'? this.props.showSpeakers: true,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a hack - aren't they normally a boolean?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, so basically if you do

showSpeakers: this.props.showSpeakers ? this.props.showSpeakers: true,
  • ifthis.props.showSpeakers it is undefined it becomes true
  • but if this.props.showSpeakers is false it also becomes true
    So I needed a way to check that it's not undefined, but without overriding the param, if that make, sense, so I thought type checking could be a fix?
showSpeakers: typeof this.props.showSpeakers ==='boolean'? this.props.showSpeakers: true,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe the right thing is make sure that in the parent props, it's always defined so you don't have to type check.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you do that if it’s an optional parameter?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like your default option is true so I would've just set that in the parent

packages/components/timed-text-editor/WrapperBlock.js Outdated Show resolved Hide resolved
packages/components/timed-text-editor/WrapperBlock.js Outdated Show resolved Hide resolved
@pbirsinger
Copy link

Happy to beta-test this with our team whenever things are stable

@pbirsinger
Copy link

What are the remaining todos here? Really looking forward to these improvements

@pietrop
Copy link
Owner Author

pietrop commented Apr 6, 2020

@pbirsinger I think it needs some testing and QA, let us know if you'd have time to pull the branch locally and go through some of the QA steps?

As a way to ensure we are not introducing new bugs since it's a pretty major change/refactor etc..

You could also probl use the demo in github pages on my fork if that helps saving time, forking, cloning, etc..

@pbirsinger
Copy link

pbirsinger commented Apr 10, 2020

Pulled in the branch to try out and this definitely does wonders for the lag! Did an informal run through and things mostly seem to be working.

However, inside handleAutoSaveChanges after setting isAutoSave to true, I printed out newAutoSaveData.data and noticed a discrepancy between the data words and text in blocks:

savebug

Note I changed the first word "There" to Thereyyyy" and the change was only reflected in the text in blocks but not in data words. Is this intentional?

@pietrop
Copy link
Owner Author

pietrop commented Apr 10, 2020

Thanks for looking into it @pbirsinger and nice catch!

Yeah, I think this is because handleAutoSaveChanges does not run updateTimestampsForEditorState on the content before returning it.

One of the problem is that running the alignment has a performance cost and if you were to trigger it on every auto save it could slow everything down, for longer files, unless it is optimized in some other way.

I haven't been using handleAutoSaveChanges, and have been thinking that, the logic that returns the content for that function, could be moved outside of the component. eg the function can still trigger to let you know when there have been changes etc.. but then outside of the component you could decide how to handle that, eg do you run the alignment on the client side or send it to the server/backend to do it etc...

eg in the demo/app.js, handleAutoSaveChanges could be modified like this

  handleAutoSaveChanges = newAutoSaveData => {
- const { data, ext } = newAutoSaveData;
+ const { ext } = newAutoSaveData;
+const { data } = this.transcriptEditorRef.current.getEditorContent( ext );
    this.setState({ autoSaveData: data, autoSaveExtension: ext });
    // Saving to local storage 
    if(this.state.useLocalStorage){
      localSave(this.state.mediaUrl, this.state.fileName, data);
    }
  };

Because getEditorContent runs updateTimestampsForEditorState

haven't tested the code snippet above, more of a proof of concept

Let me know what you think

@pbirsinger
Copy link

@pietrop an approach like that makes sense to me. Tested out those changes but got :

image

So think that snippet close to working but might need some more tweaks.

Thanks!!

@pbirsinger
Copy link

As a heads up, this is currently a major blocker for our team, along with other bugs that cause the editor to frequently crash. I think we're going to rebuild our own from scratch in the latest react, typescript, not laggy, etc since we can't really afford to keep waiting on this one unfortunately.

@pietrop
Copy link
Owner Author

pietrop commented Apr 16, 2020

@pbirsinger whatever works for you and your team. You are always welcome to fork this repo, and make an alternative TimedTextEditor if that makes it easier to try something new. (I did that for some of my projects where I use react-transcript-editor, and then I contribute back when I make a breakthrough on things, but that way I don't have to wait on the maintainers for some of those fixes to be incorporated in new releases etc... )

other bugs that cause the editor to frequently crash

It be good to log those and raise issues, so that they can be addressed.

@pietrop pietrop moved this from In progress to Review in progress in autoEdit 3 (Digital Paper Edit) Apr 18, 2020
@pietrop pietrop moved this from Review in progress to In progress in autoEdit 3 (Digital Paper Edit) Apr 18, 2020
@Laurian
Copy link

Laurian commented Apr 20, 2020

@pbirsinger if you're going to rebuild from scratch it would be great if you could open source it, or at least keep in touch with regard to the approach you will try; maybe we could improve this one later based on what you learn or provide feedback or be a sounding board on issues you may encounter.

@pietrop pietrop moved this from In progress to To do in autoEdit 3 (Digital Paper Edit) Apr 21, 2020
@pbirsinger
Copy link

Sure guys I'm going to experiment and will keep you posted

@pietrop pietrop added this to Reviewer approved in Performance no entities spike #13 Apr 23, 2020
@pietrop pietrop moved this from Reviewer approved to Review in progress in Performance no entities spike #13 Apr 23, 2020
@pietrop pietrop moved this from Review in progress to To do in Performance no entities spike #13 Apr 23, 2020
@pietrop pietrop moved this from To do to In progress in autoEdit 3 (Digital Paper Edit) Apr 23, 2020
they where showing up as NAN, eg if a transcript time started at zero, edge case
@pietrop pietrop removed this from In progress in autoEdit 3 (Digital Paper Edit) Apr 24, 2020
@pietrop pietrop closed this Apr 28, 2020
Performance no entities spike #13 automation moved this from To do to Done Apr 28, 2020
@pietrop pietrop reopened this Apr 28, 2020
Performance no entities spike #13 automation moved this from Done to In progress Apr 28, 2020
@pietrop
Copy link
Owner Author

pietrop commented Apr 28, 2020

👋 I made an alternative version using SlateJs slate-transcript-editor

Also see

I am closing this PR as I no longer have a need for it, but fell free to use any of it if useful.

@pietrop pietrop closed this Apr 28, 2020
Performance no entities spike #13 automation moved this from In progress to Done Apr 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants