Michael O'Brien, [email protected], 14 February 2022
State of the Word: Russian Textbook Vocabulary Frequencies According to the Russian National Corpus
- Many new textbooks in Slavic pedagogy boast online versions, e-books, and supplemental materials. This project seeks to examine those materials and scrape their vocabulary resources (where available) for comparison against a frequency dictionary to be developed from the Russian National Corpus.
Update 1 - Having gotten the go-ahead to use textbook vocab within fair use, I'm going to start digitizing the relevant materials as my data and curating them so that they are intuitive and straightforward to manipulate during the data-wrangling phase.
Although the development and marketing of textbook materials are largely opaque, I know that using vocabulary from these sources falls within fair use. I have confirmed that access to the online components accompanying the textbooks is restricted; however, it is entirely within my power to create digitized versions of the portions I need, so that will be step 1. Once my vocabulary corpora are made, I will have to look into how I can turn the PDFs into plain-text string data to work with. That will be part of my first progress report.
I have contacted my colleagues in Slavic pedagogy, and they have been able to provide me with physical copies of the textbooks I need.
Much of language teaching is teaching vocabulary, and with the limited time students have to study languages, it is important that the vocabulary they are exposed to matches the use and frequency they would encounter "in the wild." To answer the question of vocabulary suitability by textbook/level/topic, I intend to create frequency lists from the Russian National Corpus against which to compare the textbook vocabulary, much like lextutor's vocabulary profiler for Russian.
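As a rough illustration of the frequency-list idea, here is a minimal Python sketch. It assumes the corpus has already been reduced to a list of word tokens, and it assumes a "k-band" means a block of 1,000 frequency ranks (ranks 1-1000 are band 1, and so on). The names `build_frequency_ranks`, `k_band`, and the toy token list are hypothetical stand-ins, not anything supplied by the Russian National Corpus, and lemmatization is left out entirely.

```python
from collections import Counter


def build_frequency_ranks(tokens):
    """Map each word to its 1-based frequency rank (1 = most frequent)."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}


def k_band(rank, band_size=1000):
    """Return the 1-based band for a rank (assumed: 1000 ranks per band)."""
    return (rank - 1) // band_size + 1


# Toy stand-in for real corpus tokens (hypothetical data).
corpus_tokens = "и в не и на в и я не он".split()
ranks = build_frequency_ranks(corpus_tokens)

# With a tiny band size the toy data spreads over several bands.
for word, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(word, rank, k_band(rank, band_size=2))
```

Ranking on raw word forms like this would undercount inflected Russian words, so a real version would probably need to rank lemmas instead.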
Update 2 - I still have to confirm that the Russian National Corpus is available for download and local use. I anticipate this being the case, but this is why the plan and progress reports are so important.
I am confident that the tools and methods I intend to use will give a fairly clear picture of the percentages of vocabulary items targeted in each textbook/level/topic used by the University of Pittsburgh, of their respective frequencies, and of how those frequencies, as reflected in the Russian National Corpus, speak to their utility as targeted items in pedagogical material.
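To make the "percentages" idea concrete, here is a hedged Python sketch of a vocabulary profile. It reports what share of a textbook word list falls into each frequency band and counts words missing from the frequency list as "off-list". The function name `band_profile`, the 1,000-rank band size, and the toy ranks and chapter vocabulary are all assumptions for illustration, not real corpus figures.

```python
from collections import Counter


def band_profile(vocab, ranks, band_size=1000):
    """Return the percentage of vocab items per band; unknown words are 'off-list'."""
    if not vocab:
        return {}
    bands = Counter()
    for word in vocab:
        rank = ranks.get(word)
        if rank is None:
            bands["off-list"] += 1
        else:
            # Ranks 1-1000 fall in band 1, 1001-2000 in band 2, and so on.
            bands[(rank - 1) // band_size + 1] += 1
    total = len(vocab)
    return {band: 100 * count / total for band, count in bands.items()}


# Hypothetical ranks and chapter vocabulary, invented for this example.
toy_ranks = {"и": 1, "в": 2, "дом": 1500, "книга": 2400}
chapter_vocab = ["и", "дом", "книга", "самовар"]
profile = band_profile(chapter_vocab, toy_ranks)
print(profile)
```

A chapter with a large "off-list" or high-band share would be a candidate for a closer look at whether its targeted items are worth the students' time.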
- Digitize vocabulary and begin data wrangling/manipulation process
  - How to encode Cyrillic
  - How to convert PDF data to plain text, etc.
- Get access to Russian National Corpus
- Begin creating frequency dictionary and k-band function
- Start working on data investigation
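
For the Cyrillic-encoding item in the plan above, one possible starting point is sketched below in Python. It assumes digitized text may arrive either as UTF-8 or as Windows-1251 (a common legacy Cyrillic encoding), and that textbook vocabulary may carry stress marks written as a combining acute accent (U+0301). Both are assumptions about the data, and the helper names `decode_cyrillic` and `normalize_word` are hypothetical.

```python
import unicodedata


def decode_cyrillic(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to Windows-1251 (assumed legacy encoding)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1251")


def normalize_word(word: str) -> str:
    """Strip combining acute stress marks, recompose to NFC, and lowercase."""
    decomposed = unicodedata.normalize("NFD", word)
    # Remove only U+0301 so the marks that form й and ё are kept.
    without_stress = decomposed.replace("\u0301", "")
    return unicodedata.normalize("NFC", without_stress).lower()


legacy_bytes = "привет".encode("cp1251")
print(decode_cyrillic(legacy_bytes))
print(normalize_word("Кни\u0301га"))
```

Normalizing before lookup matters because a stressed or differently composed form would otherwise fail to match its entry in the frequency dictionary.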