Rewrote textrank_sentences() #7

Open · wants to merge 3 commits into master

Conversation

@emillykkejensen (Author)

I rewrote textrank_sentences() as it could not handle my dataset (it ran for 3 days without finishing). In doing so I added pbapply to show progress for the sentence distance computation as well as to enable parallelization.

A pretty solid upgrade to an already pretty solid function, if I do say so myself!
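For illustration, the pbapply pattern looks roughly like this (a minimal sketch with placeholder names - `candidates` for the sentence-pair table and `terms` for the tokens per sentence - not the PR's actual code):

```r
library(pbapply)
library(parallel)

## Optional cluster; with cl = NULL, pblapply runs sequentially but still shows progress
cl <- makeCluster(detectCores() - 1L)
scores <- pblapply(seq_len(nrow(candidates)), function(i, pairs, terms) {
  ## textrank_jaccard() is the package's default sentence distance
  textrank::textrank_jaccard(termsa = terms[[pairs$textrank_id_1[i]]],
                             termsb = terms[[pairs$textrank_id_2[i]]])
}, pairs = candidates, terms = terms, cl = cl)
stopCluster(cl)
```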

@jwijffels (Collaborator)

I think you should really test out the minhash algorithm. It is meant exactly for situations with large volumes of sentences: if you don't use it, the function has to calculate all pairwise sentence similarities.
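Roughly along these lines, following the package vignette (a sketch - the `sentences` token data frame with `lemma`/`textrank_id` columns and `sentence_data` are assumed to be prepared as usual for textrank_sentences()):

```r
library(textrank)
library(textreuse)

## Hash each sentence and only score pairs that LSH flags as likely similar,
## instead of computing all pairwise similarities
minhash <- minhash_generator(n = 1000, seed = 123456789)
candidates <- textrank_candidates_lsh(x = sentences$lemma,
                                      sentence_id = sentences$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)
tr <- textrank_sentences(data = sentence_data, terminology = sentences,
                         textrank_candidates = candidates)
```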

@jwijffels (Collaborator)

On another note, some advice on reducing the number of sentences: it is better to apply text clustering first (e.g. using the topicmodels R package or the BTM package, https://cran.r-project.org/web/packages/BTM/index.html) and then apply textrank within each cluster.
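Something like this (a rough sketch, assuming a tokenised data frame `tokens` whose first column identifies the sentence and whose second column holds the lemma, as BTM expects; k = 10 is arbitrary):

```r
library(BTM)

model  <- BTM(tokens, k = 10, iter = 500)    # fit 10 biterm topics over the sentences
scores <- predict(model, newdata = tokens)   # matrix: one row of topic probabilities per sentence
topic  <- apply(scores, MARGIN = 1, FUN = which.max)
## Split the sentences by 'topic' and run textrank_sentences() within each cluster,
## so pairwise similarities are only computed inside a cluster
```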

@emillykkejensen (Author) commented Jan 22, 2019

Thanks for the advice - I'm trying out different approaches and have had a look at the minhash algorithm, but the textrank_candidates_lsh function itself takes longer to run than the rewritten textrank_sentences. And that's only when it runs at all - if I run it on all my 12,000 sentences, it fails and throws an error.

I also had a look at the BTM package, but again it takes a long time to complete. Really, the fastest way to do it is using textrank_sentences.

@jwijffels (Collaborator)

I've read through the changes a bit. Am I correct that the speed difference is basically because you calculate the overlap in batches by groups of textrank_ids and because you parallelise the mapply loop?

@emillykkejensen (Author)

Not really - actually I haven't even used the parallelisation. The reason is that I used data.table and thus update by reference = lower memory usage and faster speed.
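To illustrate what updating by reference means here (a generic data.table example, not code from this PR):

```r
library(data.table)

DT <- data.table(textrank_id_1 = 1:5, weight = 0)
## ':=' assigns by reference: the column is updated in place and the table is
## not copied, which is what keeps memory low on large candidate tables
DT[, weight := weight + 1]
```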

@jwijffels (Collaborator)

Ok, but in that case, can you drop the usage of the pbapply package? In general I'm against adding package dependencies which are not needed; adding a dependency on another package seems like overkill to me. Why not add a simple trace argument and print out something every, say, 1000 comparisons? That removes another dependency which might give problems in maintenance later on.
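Something along these lines (a hypothetical sketch - the function name and loop body are placeholders for the actual comparison code):

```r
compute_similarities <- function(candidates, trace = FALSE) {
  n <- nrow(candidates)
  for (i in seq_len(n)) {
    ## ... compute the similarity of candidate pair i here ...
    if (trace && i %% 1000 == 0) {
      cat(format(Sys.time()), "- processed", i, "of", n, "comparisons\n")
    }
  }
}
```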

@emillykkejensen (Author)

That is a good principle - one I tend to stick to as well, but I guess I got carried away :)

I'll have a look at it and write the pbapply package out.

@jwijffels (Collaborator)

Great

@emillykkejensen (Author)

Removed the use of pbapply and replaced it with cat - it's not as pretty, but it gets the job done if you want to monitor the progress of the function.

@jwijffels (Collaborator)

Thanks, I'm going to review this soon and incorporate it.

@jwijffels (Collaborator)

I've reviewed your code and updated it according to what I thought was more readable. Can you try it out on your own dataset and let me know if this is fine?
