This repo contains Python code I used to download Form 10-K filings from the SEC EDGAR database and then heuristically extract the MDA (Management's Discussion and Analysis) section from the downloaded filings.
I used Python 3.6.

    # Python 3.6
    pip install -r requirements.txt

Specify the start year, the end year, and the directory to save outputs.
By default, indices, forms, and MDAs are saved to `./data`.

    # Downloads filings for 2016, quarters 1 and 2, parses the MDA sections, and saves everything to ./data/
    python edgar.py --start_year 2016 --end_year 2016 --quarters 1 2 --data_dir ./data/

usage: edgar.py [-h] -s START_YEAR -e END_YEAR [-q QUARTERS [QUARTERS ...]]
                [-d DATA_DIR] [--overwrite] [--debug]
optional arguments:
  -h, --help            show this help message and exit
  -s START_YEAR, --start_year START_YEAR
                        year to start
  -e END_YEAR, --end_year END_YEAR
                        year to end
  -q QUARTERS [QUARTERS ...], --quarters QUARTERS [QUARTERS ...]
                        quarters to download for start to end years
  -d DATA_DIR, --data_dir DATA_DIR
                        path to save data
  --overwrite           If True, overwrites downloads and processed files.
  --debug               Debug mode

The code runs the extraction in the following steps:

1. Downloads the indices for Form 10-K filings to `./data/index`
2. Combines all indices into a single CSV, `./data/index/combined.csv`
3. Using the combined CSV from step 2, downloads all Form 10-K filings to `./data/form10k`
4. Parses the HTML forms with BeautifulSoup to `./data/form10k.parsed`
5. Parses the MDA section to `./data/mda`
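
As a rough illustration of how the first steps map years and quarters to index files and output folders, here is a minimal sketch. The URL pattern and the helper names `index_urls` and `output_dirs` are assumptions made for illustration and are not taken from `edgar.py`; only the directory names come from the steps above.

```python
from pathlib import Path

# Assumption: EDGAR publishes one form index per quarter under this URL pattern.
# The repository may fetch a different index file or use a different URL.
INDEX_URL = "https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{quarter}/form.idx"


def index_urls(start_year, end_year, quarters=(1, 2, 3, 4)):
    """Return (year, quarter, url) for every requested quarterly index."""
    return [
        (year, quarter, INDEX_URL.format(year=year, quarter=quarter))
        for year in range(start_year, end_year + 1)  # end year is inclusive
        for quarter in quarters
    ]


def output_dirs(data_dir="./data"):
    """Map each pipeline stage to the path named in the steps above."""
    root = Path(data_dir)
    return {
        "index": root / "index",
        "combined_index": root / "index" / "combined.csv",
        "form10k": root / "form10k",
        "form10k_parsed": root / "form10k.parsed",
        "mda": root / "mda",
    }
```

For example, `index_urls(2016, 2016, quarters=(1, 2))` yields two entries, one per requested quarter, which mirrors the example command shown earlier.
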
- The MDA section is parsed heuristically, and this may not work for all forms. You'll probably need to modify the `find_mda_from_text` function to improve coverage.
- You might also need to modify the `normalize_text` function for MDA parsing.
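
To give a feel for what a heuristic of this kind can look like, here is a minimal sketch. It is not the repo's `find_mda_from_text` or `normalize_text`; the names `find_mda_sketch` and `normalize_text_sketch` are hypothetical. It assumes the MDA starts at an "Item 7" heading and ends at the next "Item 7A" or "Item 8" heading, which will not hold for every filing.

```python
import re
import unicodedata


def normalize_text_sketch(text):
    """Hypothetical normalizer: unify unicode forms and collapse spaces."""
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> space
    text = text.replace("\u2019", "'")          # curly apostrophe -> straight
    return re.sub(r"[ \t]+", " ", text)         # keep newlines, squeeze spaces


# Assumption: the section starts at "Item 7" and ends at "Item 7A" or "Item 8".
START = re.compile(r"^\s*item\s+7\s*[.:\-]?\s*management's\s+discussion", re.I | re.M)
END = re.compile(r"^\s*item\s+(7a|8)\b\s*[.:\-]?", re.I | re.M)


def find_mda_sketch(text):
    """Return the longest Item 7 candidate, or None if no heading is found."""
    text = normalize_text_sketch(text)
    candidates = []
    for start in START.finditer(text):
        end = END.search(text, start.end())
        stop = end.start() if end else len(text)
        candidates.append(text[start.start():stop].strip())
    # The table of contents also lists "Item 7", but that hit is short,
    # so keeping the longest candidate tends to skip it.
    return max(candidates, key=len) if candidates else None
```

Filings that word the heading differently or split it across lines would return `None` here, which is the kind of case where the patterns need adjusting.
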