
LDAvis throws error for some LDA models #84

Open
carlosparadis opened this issue Nov 26, 2017 · 3 comments

Comments

@carlosparadis
Member

The following error is displayed and no visualization is generated:

Error in stats::cmdscale(dist.mat, k = 2) : NA values not allowed in 'd'

Verified to occur with both the old and the new crawler, for the 2013 models of February, April, and December.

@carlosparadis
Member Author

carlosparadis commented Nov 27, 2017

The problem lies upstream, in the LDAvis package itself. See the issue opened on that project.

The problem can be circumvented by defining a replacement for the jsPCA function, which createJSON accepts through its mds.method parameter. The current LDAvis implementation is:

jsPCA <- function(phi) {
  # first, we compute a pairwise distance between topic distributions
  # using a symmetric version of KL-divergence
  # http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
  jensenShannon <- function(x, y) {
    m <- 0.5*(x + y)
    0.5*sum(x*log(x/m)) + 0.5*sum(y*log(y/m))
  }
  dist.mat <- proxy::dist(x = phi, method = jensenShannon)
  # then, we reduce the K by K proximity matrix down to K by 2 using PCA
  pca.fit <- stats::cmdscale(dist.mat, k = 2)
  data.frame(x = pca.fit[,1], y = pca.fit[,2])
}

Alternatively, we can fix the existing function by smoothing the zeros in the probability distributions.
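One possible form of that smoothing is sketched below. It adds a small constant to every entry and renormalises before computing the divergence, so that no term evaluates `log(0)`. The names `smoothDistribution` and `jensenShannonSmoothed` and the default `eps = 1e-10` are assumptions made for illustration; they are not part of LDAvis.

```r
# Hypothetical helper: shift every probability away from zero, then renormalise
# so the vector sums to 1 again.
smoothDistribution <- function(p, eps = 1e-10) {
  p <- p + eps
  p / sum(p)
}

# Jensen-Shannon divergence computed on the smoothed distributions.
# Same formula as the original jensenShannon; only the inputs change.
jensenShannonSmoothed <- function(x, y, eps = 1e-10) {
  x <- smoothDistribution(x, eps)
  y <- smoothDistribution(y, eps)
  m <- 0.5 * (x + y)
  0.5 * sum(x * log(x / m)) + 0.5 * sum(y * log(y / m))
}

x <- c(0.2, 0.3, 0.3)
b <- c(0.2, 0.3, 0)
result <- jensenShannonSmoothed(x, b)  # finite, where the original returns NaN
```

Because the inputs are renormalised, results for vectors that do not already sum to 1 will differ from the unsmoothed function even when no zeros are present.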


When executing createJSON, the following error will be thrown:

Error in stats::cmdscale(dist.mat, k = 2) : NA values not allowed in 'd'

I traced it down to:

https://github.com/cpsievert/LDAvis/blob/51bb51e6f2dd26c9d495a76482018d94a9945ddc/R/createJSON.R#L298-L304

To reproduce the issue:

Reproducible dataset

x <- c(0.2,0.3,0.3)
y <- c(0.2,0.3,0.4) 
b <- c(0.2,0.3,0) 

Using the LDAvis implementation shown at the start of this issue:

> jensenShannon(x=x,y=y)
[1] 0.003583677
> jensenShannon(x=x,y=b)
[1] NaN
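The NaN can be traced to the zero entry in b. The short check below evaluates the per-element terms of the second sum in jensenShannon for the vectors above; the variable name `terms` is introduced here only for illustration.

```r
x <- c(0.2, 0.3, 0.3)
b <- c(0.2, 0.3, 0)

m <- 0.5 * (x + b)

# Per-element terms of sum(y * log(y / m)) with y = b.
# For the last element, log(0 / m) is -Inf and 0 * -Inf is NaN in R,
# so the whole sum becomes NaN.
terms <- b * log(b / m)
is.nan(terms)
```

Only the last element is NaN, which is enough for sum() to return NaN and for cmdscale to reject the resulting distance matrix.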

The same test, using the cosine function from the lsa package:

> cosine(x=x,y=y)
          [,1]
[1,] 0.9897595
> cosine(x=x,y=b)
          [,1]
[1,] 0.7687061

@carlosparadis
Member Author

For usage, plotLDAVis(models[["Jan"]],as.gist=FALSE) now accepts a new parameter, topicSimilarityMethod, whose value replaces the default function accepted by createJSON:

plotLDAVis(models[["Jan"]],as.gist=FALSE,topicSimilarityMethod = CalculateTopicCosineSimilarity)

When the new function is passed through this parameter, plotLDAVis uses the cosine function from the lsa package, which is also the one used to compare topics across different months.
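The definition of CalculateTopicCosineSimilarity is not shown in this thread. As a rough illustration of the idea only, a cosine-based function with the shape createJSON expects from mds.method (it takes phi and returns a K by 2 data.frame) might look like the sketch below. The name `cosineMDS` is hypothetical, and it uses base R rather than lsa. It also swaps in the Euclidean distance between unit-length rows, which equals sqrt(2 - 2 * cosine similarity), instead of 1 - cosine, so that cmdscale receives a true Euclidean distance.

```r
# Hypothetical stand-in, NOT the CalculateTopicCosineSimilarity from this project.
# phi: K x W matrix of topic-term probabilities, one topic per row.
cosineMDS <- function(phi) {
  # Scale each topic (row) to unit length; assumes no row is all zeros.
  unit <- phi / sqrt(rowSums(phi^2))
  # Euclidean distance between unit vectors = sqrt(2 - 2 * cosine similarity).
  # Zeros in phi cause no problem here because no logarithm is taken.
  dist.mat <- stats::dist(unit)
  # Reduce the K by K distance matrix to K by 2, as jsPCA does.
  fit <- stats::cmdscale(dist.mat, k = 2)
  data.frame(x = fit[, 1], y = fit[, 2])
}

# The three vectors from the reproducible dataset, treated as three topics.
phi <- rbind(c(0.2, 0.3, 0.3),
             c(0.2, 0.3, 0.4),
             c(0.2, 0.3, 0))
coords <- cosineMDS(phi)
```

A function like this could be supplied to createJSON as its mds.method argument in place of the default jsPCA.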

@carlosparadis
Member Author

The issue has been fixed upstream, in the original LDAvis code. We should test it locally.
