- 2 datasets solidified
- Universal Decompositional Semantics (UDS) dataset with Decomp toolkit
- CHILDES Narrative English Hicks Corpus
- project plan written
- read through a few of the transcribed narratives
- currently reading up on how to install and use the Decomp toolkit
UDS dataset (with Decomp toolkit) (notebook)
Overall: This is a massive dataset, and figuring out how to use the Decomp toolkit has been a bit challenging, so it feels like I have yet to make a dent in exploring it. I worked through the Quick Start portion of the documentation and also read through a lot of the rest of the documentation (without implementing the code) to get some context for what I was doing. I now expect this part of my project to take more time than I anticipated before the first progress report. Exploration of the UDS dataset with the Decomp toolkit is one of the main aspects of my project, rather than UDS being just a data source. Additionally, because I have not encountered some of these types of linguistic features and analyses before, I am having to do a lot of outside reading on top of the dataset exploration.
Sharing: The UDS dataset comprises many different datasets and is bundled with the Decomp toolkit, which I have installed using pip. All of the individual datasets, though, are located at http://decomp.io/data/.
CHILDES Narrative English Hicks Corpus (notebook)
Overall: I spent a lot (like, a lot) of time trying to figure out how to read in and look at .cha (CHAT) files, which is the format of most of the CHILDES database data. I started out using the chamd library from ChatReader, but it was not working out the way I wanted it to - I couldn't get to the annotations. I eventually came across PyLangAcq, or Language Acquisition Research in Python (not sure how I didn't come across this before chamd), and it's a lot easier to work with. Because I spent so much time figuring out how to even get to the CHAT data, I didn't have as much time to put it in the format that I need. This dataset, however, is very clean and in the standardized format of the CHILDES database. There is lots of documentation about the transcription and annotation processes that I have been reading through.
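To make the CHAT structure concrete, here is a minimal sketch of how a main-tier utterance and its dependent tiers hang together. The excerpt below is invented but follows CHAT conventions (a main tier starts with *SPEAKER:, and dependent tiers like %mor and %gra follow it); in practice PyLangAcq does this parsing for me.

```python
# Invented excerpt in CHAT format: one utterance plus two dependent tiers.
chat_excerpt = """\
*CHI:\tthe frog jumped out .
%mor:\tdet:art|the n|frog v|jump-PAST adv|out .
%gra:\t1|2|DET 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT
"""

utterances = []
for line in chat_excerpt.splitlines():
    tier, _, content = line.partition("\t")
    if tier.startswith("*"):            # main tier: start a new utterance record
        utterances.append({"speaker": tier.strip("*:"), "utterance": content})
    elif tier.startswith("%"):          # dependent tier: attach to the last utterance
        utterances[-1][tier.strip("%:")] = content

print(utterances[0]["speaker"])   # CHI
print(utterances[0]["gra"])       # the dependency (%gra) tier for that utterance
```

The key point this illustrates is that dependent tiers have no meaning on their own; they only make sense grouped under the main-tier utterance they annotate.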
Sharing: (data_samples/ directory) Since all of the data are readily available from the CHILDES database and the CHAT files can be viewed there without additional software, I have chosen to put only a subset of files (one from each narrative genre, from one child per grade) into my data_samples/ directory.
UDS dataset (with Decomp toolkit) (notebook - this notebook is the same notebook from progress report 1; I have just added to it)
Overall: To be open and frank, I have been getting increasingly overwhelmed by the UDS dataset. There is a lot of documentation on it, which also means there is a lot to read through and be overwhelmed by. I ran into an issue where I could no longer load the decomp package and import the UDSCorpus, and since this is the first step to doing anything with my dataset, I was both unable to run the previous code I had written and unable to explore the dataset further. I ended up having to update a number of individual components of the package, and now I am back on track.
Specifics:
- more exploration
- updated packages
- Continued working through the Decomp tutorials (https://decomp.readthedocs.io/en/latest/tutorial/index.html)
- read all about the visualizations and how they work
- attempted to try out some visualizations, but was unable to
Sharing: The Decomp toolkit and UDS dataset are licensed under the MIT license by Aaron Steven White. I have decided to go with the GNU GPLv3 license for the code elements of my project. This license is compatible with MIT-licensed software, and I want to make my code publicly available. As I noted in my first progress report, the UDS dataset comprises many different datasets and is bundled with the Decomp toolkit, so I am accessing it through the toolkit rather than through the individual data files themselves. All of the individual datasets, though, are located at http://decomp.io/data/.
CHILDES Narrative English Hicks Corpus (notebook - this notebook is the same notebook from progress report 1; I have just added to it)
Overall: A lot more exploration with PyLangAcq - I made my way through the features offered by the library with my specific dataset.
Specifics:
- more exploration
- figured out I am only really interested in the dependent tiers of the dataset
- created a DataFrame with the %mor and %gra tiers for the entire dataset
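As a sketch of how that table is shaped, the snippet below builds one row per utterance from (utterance, %mor, %gra) triples, and also shows that the tiers line up positionally with the words, so they can be exploded to one row per word. The two records are invented stand-ins; in the notebook the utterances come from PyLangAcq's reader, and the rows go into a pandas DataFrame.

```python
# Invented (utterance, %mor tier, %gra tier) triples standing in for the corpus.
records = [
    ("the frog jumped out .",
     "det:art|the n|frog v|jump-PAST adv|out .",
     "1|2|DET 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|3|PUNCT"),
    ("he tried to climb .",
     "pro:sub|he v|try-PAST inf|to v|climb .",
     "1|2|SUBJ 2|0|ROOT 3|4|INF 4|2|XCOMP 5|2|PUNCT"),
]

# One row per utterance, keeping each dependent tier whole.
rows = [{"utterance": utt, "mor": mor, "gra": gra} for utt, mor, gra in records]

# Each word lines up positionally with one %mor item and one %gra item,
# so the same data can be exploded to one row per word.
word_rows = [
    {"word": w, "mor": m, "gra": g}
    for utt, mor, gra in records
    for w, m, g in zip(utt.split(), mor.split(), gra.split())
]
print(word_rows[1])   # {'word': 'frog', 'mor': 'n|frog', 'gra': '2|3|SUBJ'}
```

Either shape drops straight into pandas via pd.DataFrame(rows) or pd.DataFrame(word_rows).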
Sharing: I added a CC license to go along with the non-code elements of my project and to align with the licensing on the TalkBank data from which the CHILDES Narrative English corpus comes. The original CHILDES TalkBank data is licensed under the Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license, so I, too, am required to use this same license. As I noted in my first progress report, since all of the data are readily available from the CHILDES database and the CHAT files can be viewed there without additional software, I have chosen to put only a subset of files (one from each narrative genre, from one child per grade) into my data_samples/ directory.
UDS dataset (with Decomp toolkit) (notebook using CRC v3)
For now, I am putting each of the individual versions of my notebooks on GitHub since I am downloading them from the CRC OnDemand interface. Once I have a final version, I will clean up my notebooks subdirectory. The most recently uploaded notebook is the most up-to-date version.
Overall: Due to the vastness of the UDS dataset combined with a lack of (helpful) documentation, I needed professorial help to sift through UDS. Na-Rae kindly stepped in and checked out UDS and decomp herself. Weird things happened for her as well when she tried to import the dataset and toolkit (the corpus wouldn't even load), but she discovered that she could get it semi-working on the CRC. This discovery occurred at the perfect time, since I was once more unable to load and access the corpus, even after getting it working again for the last progress report (old notebook).
I am now running things using the CRC and have some basic code to work off of. I know the specific linguistic phenomenon I am interested in exploring with the UDS framework (see CHILDES section below), and have begun analyzing it. I think that looking at factuality could be a good avenue, as well as event structure with telicity, and perhaps also pred-hypothetical within the genericity subspace of the semantic types.
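To sketch what querying those properties looks like, the snippet below uses a hand-made stand-in for the predicate-node attributes the UDS documentation describes: each node carries subspaces (factuality, genericity, ...) whose properties have a value and an annotator confidence. The node names and all of the numbers here are invented; in the real corpus they would come from UDSCorpus graph node attributes via decomp.

```python
# Invented stand-in for UDS predicate-node attributes (subspace -> property
# -> {value, confidence}), mimicking the layout in the UDS documentation.
predicate_nodes = {
    "sent-1-semantics-pred-3": {
        "factuality": {"factual": {"value": 1.2, "confidence": 1.0}},
        "genericity": {"pred-hypothetical": {"value": -0.9, "confidence": 0.8}},
    },
    "sent-2-semantics-pred-2": {
        "factuality": {"factual": {"value": -1.1, "confidence": 0.9}},
        "genericity": {"pred-hypothetical": {"value": 1.4, "confidence": 1.0}},
    },
}

# Pull out predicates rated hypothetical (positive pred-hypothetical value):
hypothetical = [
    node for node, attrs in predicate_nodes.items()
    if attrs["genericity"]["pred-hypothetical"]["value"] > 0
]
print(hypothetical)   # ['sent-2-semantics-pred-2']
```

The same comprehension pattern should carry over to filtering on factuality or telicity once the real corpus is loading reliably on the CRC.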
Additional note: After playing around with rendering the visualizations, we were ultimately unsuccessful, due in part to package incompatibilities, so I had to abandon that route of inquiry. I did, however, put a graph provided by the Decomp repository in my repo's visualizations subdirectory because I want to include it in my presentation.
CHILDES Narrative English Hicks Corpus (notebook - continuation from previous progress reports)
This dataset is being used mainly to discover a phenomenon that can be further explored using UDS.
Overall: The DataFrame containing the utterances and tiers is finalized, but the coding tier is a bit wonky. To get a closer look at these CHAT files, I used the command line and grep. I am focusing on the annotation $modv, which came up quite frequently in the coding tier (%cod) of the Hicks transcripts. Utterances annotated with $modv all have the verb "try" in them. I am interested in what using "try" does to the main verb that is dependent on it, particularly what it conveys about the extent to which the action (the verb dependent on "try") has been completed. Here is where UDS comes in.
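The grep-style filter can be sketched in a few lines: keep utterances whose %cod tier carries $modv, then check that each one contains a form of "try". The (utterance, %cod) pairs below are invented stand-ins for the Hicks transcripts, not actual corpus lines.

```python
import re

# Invented (utterance, %cod tier) pairs standing in for the Hicks transcripts.
coded = [
    ("he tried to open the jar .", "$modv"),
    ("he opened the jar .", "$comp"),
    ("she was trying to catch the frog .", "$modv"),
]

# Keep only the utterances whose coding tier contains $modv
# (the Python analogue of grepping the %cod lines for \$modv).
modv_utts = [utt for utt, cod in coded if "$modv" in cod]
print(modv_utts)

# Check the observation that every $modv utterance has a form of "try" in it.
try_pattern = re.compile(r"\btr(?:y|ies|ied|ying)\b")
print(all(try_pattern.search(u) for u in modv_utts))   # True
```

Running the same check over the full set of $modv-coded utterances would confirm (or complicate) the "try" generalization before moving to the UDS side.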