Topic Modeling using Latent Dirichlet Allocation trained by Gibbs Sampling

Project 3: latent Dirichlet allocation (LDA)

Goals

Use LDA models, trained with Gibbs sampling, to classify documents into a fixed set of classes (e.g. sports, politics).

Method

  • model: LDA
  • training: Gibbs sampling

Implementation goals:

  1. make the inner loop of LDA fast (see the sketch after this list)
  2. write a function to print the highest-probability words for each topic
  3. write a function to visualize documents based on the topics of the trained model (possibly in 3D)
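
For goal 1, a minimal sketch of what the collapsed Gibbs sampling inner loop can look like in Python (the function and variable names here are illustrative, not taken from this repository):

```python
import numpy as np

def gibbs_pass(tokens, z, ndk, nkw, nk, alpha, beta, rng):
    # tokens: list of (doc_id, word_id) pairs; z: current topic of each token.
    # ndk, nkw, nk: doc-topic, topic-word, and topic count tables.
    K, V = nkw.shape
    for i, (d, w) in enumerate(tokens):
        k = z[i]
        # Remove this token's current assignment from the counts.
        ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
        # Collapsed conditional: p(z_i=k) ∝ (n_dk + alpha)(n_kw + beta)/(n_k + V*beta),
        # computed for all K topics at once instead of in an inner Python loop.
        p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
        k = rng.choice(K, p=p / p.sum())
        # Record the newly sampled topic and restore the counts.
        z[i] = k
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
```

Vectorizing the conditional over all K topics, as above, is the usual first step toward a fast inner loop in Python; the top words per topic (goal 2) can then be read off by sorting each row of the smoothed nkw table.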

Datasets

Apply LDA to two datasets.

classic400.mat

The 2D array 'classic400' contains the number of times each word occurs in each document (rows = documents, columns = words in the vocabulary). Example: classic400(1,1) is the count of word 1 in document 1, classic400(100,500) is the count of word 500 in document 100, etc.

The array truelabels shows which of three domains each document came from and can be used as a check on whether LDA recovers the correct classes.
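
A rough loading sketch in Python, assuming the .mat file exposes keys named 'classic400' and 'truelabels' as described above (the exact keys and storage format are an assumption):

```python
from scipy.io import loadmat

data = loadmat("classic400.mat")
counts = data["classic400"]           # documents x vocabulary count matrix
labels = data["truelabels"].ravel()   # domain label for each document
print(counts.shape, labels[:10])
```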

Second Dataset (of our choice)

  • idea #1: an interesting collection of documents
  • idea #2: a non-text dataset for which an LDA model is appropriate

Report

The report should try to answer these questions (which do not have definitive answers):

  1. What is a sensible way to define and compute the goodness-of-fit, for a given dataset, of LDA models with different hyperparameters K, alpha, and beta? (One option is sketched after this list.)
  2. How can you determine whether an LDA model is overfitting its training data?
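
For question 1, one common choice is held-out per-word perplexity. A minimal sketch, assuming point estimates theta (document-topic) and phi (topic-word) taken from the trained model (both names hypothetical):

```python
import numpy as np

def perplexity(counts, theta, phi):
    # counts: D x V doc-word count matrix; theta: D x K; phi: K x V.
    # p(w | d) under the topic mixture, for every (document, word) pair.
    log_pw = np.log(theta @ phi)            # D x V
    total_ll = np.sum(counts * log_pw)      # log-likelihood weighted by counts
    return np.exp(-total_ll / counts.sum())
```

Lower perplexity means a better fit; computing it on held-out documents (or held-out words) rather than on the training set is one way to address question 2, since a widening gap between training and held-out perplexity signals overfitting.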

For the two datasets, present and justify good values for K, alpha, and beta. The values can be chosen informally, but we need to justify our choices.
