In the past, I have used the Trafilatura text extraction functions, which have worked very well to produce high-quality text. The main reason we have not added it to NeMo Curator is that Trafilatura is the slowest of the three extractors, ahead of jusText and Resiliparse (execution time: Trafilatura > jusText > Resiliparse).
However, considering Trafilatura is widely used and trusted by many to perform well, I think it would be good to add a TrafilaturaExtractor to NeMo Curator. It should be very straightforward to implement, so I can work on this.
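A rough sketch of what such an extractor might look like, to illustrate the idea rather than the final implementation. The only real library call is `trafilatura.extract`, which returns the extracted text as a string or `None`. The class layout, the `extract_text` method name, the `extract_fn` parameter, and the list-of-paragraphs return shape are all assumptions made for this example and may not match NeMo Curator's actual extractor interface.

```python
from typing import Callable, List, Optional


class TrafilaturaExtractor:
    """Hypothetical wrapper around trafilatura.extract (illustrative only)."""

    def __init__(
        self, extract_fn: Optional[Callable[[str], Optional[str]]] = None
    ) -> None:
        # extract_fn is an assumed hook so a stand-in can be swapped in for
        # testing; when omitted, trafilatura.extract is used.
        self._extract_fn = extract_fn

    def extract_text(self, html: str) -> Optional[List[str]]:
        fn = self._extract_fn
        if fn is None:
            # Imported lazily so the class can be constructed without
            # trafilatura installed when a custom extract_fn is supplied.
            import trafilatura

            fn = trafilatura.extract

        text = fn(html)
        if not text:
            # Nothing usable was extracted from the page.
            return None

        # Split into non-empty, stripped paragraphs (assumed output shape).
        return [line.strip() for line in text.split("\n") if line.strip()]
```

With Trafilatura installed, `TrafilaturaExtractor().extract_text(html)` would call `trafilatura.extract` with its default settings; any tuning of those settings is left out of this sketch.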