NLP Analysis of Washington State Supreme Court Opinions for my capstone project at Galvanize

dvalp/galvanize-capstone-project

NLP Analysis of Washington State Supreme Court Opinions

This is my Galvanize Data Science Immersive capstone project, which uses Apache Spark and NLP to analyze Washington State Supreme Court opinions. The data has been made available by the Free Law Project via the CourtListener website. CourtListener provides the data either through an API or as a bulk download: a compressed archive of individual files, each containing a single JSON object for one court opinion.

Apache Spark is an engine for large-scale data processing that works on large amounts of data in parallel wherever possible. Where Spark does not have native functions to implement machine learning algorithms, it can also run external functions in parallel (in this case, functions imported from Python).

Where functions are not available in Spark, I have primarily made use of the Natural Language Toolkit (NLTK) and Scikit-Learn to process my data.

Project implementation

The main goal of this project is to provide relevant information from the court opinions. I have chosen to do this in two ways. First, I use tf/idf (CountVectorizer and IDF in Spark) to score words by their importance to a document relative to the rest of the corpus. Second, I use the Spark implementation of word2vec to produce a set of vectors that can be used to evaluate document similarity. Ranking documents by squared distance between these vectors has produced sets of documents that show remarkable similarities to a single reference document.

The first step was to import the data. The Free Law Project conveniently makes the opinions of a court available for download in one large batch archive. Each individual file in the archive contains a single JSON object that describes one opinion. Each case that comes before the court can have multiple opinions, most notably lead, concurring, and dissenting opinions. Although these opinions are generally displayed together on the CourtListener website, they can also be connected via a cluster id that uniquely identifies the case. I chose to keep each opinion as a separate document, which leaves open the option of comparing the opinions on a single case with each other. This also makes it possible to compare opinions of a particular type (e.g., dissenting) with each other without interference from the other opinion types on the same case. At least a few cases have more than one dissenting opinion, which may be interesting to explore later.
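The import step can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the Spark code used in the project, and it assumes the bulk download is a tar archive. The field names `cluster` and `id` are assumptions chosen for the example and may not match the actual CourtListener schema.

```python
import io
import json
import tarfile
from collections import defaultdict


def iter_opinions(archive_file):
    """Yield one dict per .json file found in a (possibly compressed) tar archive."""
    # "r:*" lets tarfile detect the compression type on its own.
    with tarfile.open(fileobj=archive_file, mode="r:*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".json"):
                with archive.extractfile(member) as handle:
                    yield json.load(handle)


def group_by_cluster(opinions, cluster_key="cluster", id_key="id"):
    """Map each cluster id (one case) to the ids of the opinions filed under it.

    cluster_key and id_key are hypothetical field names.
    """
    clusters = defaultdict(list)
    for opinion in opinions:
        clusters[opinion[cluster_key]].append(opinion[id_key])
    return dict(clusters)
```

Keeping one record per opinion and grouping only on demand preserves the ability to compare, say, all dissents with each other while still being able to reassemble a case from its cluster id.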

The second step was to limit the data to relevant features. For my first two goals this required preserving both the cluster id (for grouping opinions on a single case) and the document id, which identifies each document uniquely while preserving the documents' relationships to each other. I also included the text of the opinions and the list of opinions cited in each opinion. The text was primarily stored as HTML in one of four columns. I parsed the text from the HTML using BeautifulSoup. The citations have been stored for future work in creating a graph of the connections between opinions. One goal I have is to compare this citation information with the cosine and distance similarity measures I get from the word vectors and see whether there is any relationship.
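As a rough sketch of the text-extraction step: the project used BeautifulSoup, but the example below swaps in the standard library's `html.parser` so that it runs without extra dependencies. The column names in `TEXT_COLUMNS` are hypothetical placeholders for the four HTML columns mentioned above, and taking the first non-empty column is an assumption about how the columns were combined.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and drop the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup):
    """Return the visible text of an HTML string with whitespace normalized."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


# Hypothetical names for the four columns that may hold the opinion HTML.
TEXT_COLUMNS = ("html", "html_lawbox", "html_columbia", "html_with_citations")


def opinion_text(record, columns=TEXT_COLUMNS):
    """Return text from the first non-empty HTML column, or "" if none is set."""
    for column in columns:
        markup = record.get(column)
        if markup:
            return html_to_text(markup)
    return ""
```

With BeautifulSoup installed, `html_to_text` could be replaced by `BeautifulSoup(markup, "html.parser").get_text()`, which is closer to what the project did.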

Using Spark's word count and IDF functions, I created a tf/idf vector for each document, with one weight per word in the vocabulary. I used these vectors to find relevant words in a particular document. As an example, I used the landmark WA Supreme Court opinion State v. Gunwall. The tf/idf vectors identified words such as cocaine, pen, register, toll, and call as important in this document. These words were in fact central to the case itself. Interestingly, the main name in the case (Gunwall) was not recognized as uniquely important to this specific case. This can be explained by the idf weighting, which penalizes terms that are common across documents. State v. Gunwall is a landmark case cited in over 450 other opinions, making the name Gunwall slightly less of a specific identifier, even in the case that bears the name.
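The effect of the idf weighting can be illustrated with a small plain-Python sketch. It is not the Spark pipeline itself. It uses raw term counts as the term frequency and the smoothed formula idf = log((N + 1) / (df + 1)) that Spark's IDF documents, where N is the number of documents and df is the number of documents containing the term. The toy corpus in the usage note is invented for illustration.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Compute log((N + 1) / (df + 1)) for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}


def top_terms(tokens, idf, k=5):
    """Return the k terms of one document with the highest tf * idf score."""
    counts = Counter(tokens)
    scores = {term: count * idf.get(term, 0.0) for term, count in counts.items()}
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```

In a toy corpus of three documents where "gunwall" appears in all of them, its idf is log(4 / 4) = 0, so it drops out of the top terms no matter how often it is repeated, while a term such as "pen" that appears in only one document keeps a positive weight.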

For document similarity (given a single document, which other documents are the most similar?) I used the Spark implementation of word2vec. Using a vector size of 250, I created a single vector for each document. Spark includes a squared distance metric that can be used for document similarity. The squared distance can be affected by document size (particularly word repetition), which can introduce distance between documents that are not actually different in content (for example, when one document duplicates information). Therefore, I have also created a cosine similarity measure. I will compare the results from each metric and check them for (purely subjective) similarity.
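The difference between the two metrics can be sketched in plain Python. The averaging step reflects how Spark's word2vec model turns a document into a single vector (the mean of its word vectors). The function names and the toy vectors in the usage note are illustrative assumptions, not the project's code.

```python
import math


def average_vectors(word_vectors):
    """Average a non-empty list of equal-length word vectors into one vector."""
    size = len(word_vectors[0])
    return [sum(vec[i] for vec in word_vectors) / len(word_vectors) for i in range(size)]


def squared_distance(a, b):
    """Squared Euclidean distance; sensitive to the magnitude of the vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

For example, the vectors [1, 2, 2] and [2, 4, 4] point in the same direction, so their cosine similarity is 1, yet their squared distance is 9. Whether magnitude differences of this kind actually arise between the document vectors in this corpus is something the planned comparison would have to show.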
