Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean factorsSPGMI and stocksCRSP data sets #83

Merged
merged 1 commit into from
Mar 26, 2022
Merged

clean factorsSPGMI and stocksCRSP data sets #83

merged 1 commit into from
Mar 26, 2022

Conversation

spinnj
Copy link
Contributor

@spinnj spinnj commented Mar 25, 2022

factorsSPGMI and stocksCRSP had the following issues:

  • certain sectors incorrectly named (misspelled),
  • 7 securities in the sample of 300 were in sectors that were to be removed (as per Doug Martin, requested no securities in Financials, Utilities, or Real Estate sectors),
  • per Doug Martin, the CapGroup assignments were incorrectly mapped (our comparison suggested about 75% accuracy) - we remapped CapGroup assignments on a point in time basis using an outside data set and the CRSP percentile breakpoints of 0.7, 0.85, 0.98 to divide Large, Mid, Small, and Micro cap stocks and also created a new variable "CapGroupL" which is the ending group assignment on 2015-12-31 for each stock,

Additionally, certain other changes were made at the request of Doug Martin:

  • The sector names were modified slightly for brevity,
  • "TickerLast" was renamed to "TickerL"

Incorporating this pull request would:

  1. Modify factorsSPGMI and stocksCRSP as described (80868 rows vs. previously 82800) with new CapGroupL variable (1 extra column).
  2. Add a folder within the Sandbox directory titled "Vestcor Data Cleaning" that has an R file with some rough code used to clean the data, and the MarketCapPercentiles.csv file historical data used to reclassify stocks CapGroup assignments.

We can be reached at [email protected] and [email protected] for comments/questions/concerns.

@braverock braverock marked this pull request as ready for review March 26, 2022 14:54
@braverock braverock merged commit 2550984 into braverock:master Mar 26, 2022
@JustinMShea
Copy link
Collaborator

@spinnj Thanks for your PR. It is nice to see folks contributing to the software who appear to be using both for work and teaching.

Concerning your comments above.

Additionally, certain other changes were made at the request of Doug Martin:

The sector names were modified slightly for brevity,
"TickerLast" was renamed to "TickerL"

I noticed that you changed some of the original source data names, which is highly inadvisable. This is something that Doug Martin insists on doing, and has ignored both Brian and I's experience and advice on the matter. Therefore, please revert the sector names to their original format asap.

Thank you,
Justin

@spinnj
Copy link
Contributor Author

spinnj commented Mar 26, 2022

@spinnj Thanks for your PR. It is nice to see folks contributing to the software who appear to be using both for work and teaching.

Concerning your comments above.

Additionally, certain other changes were made at the request of Doug Martin:

The sector names were modified slightly for brevity,
"TickerLast" was renamed to "TickerL"

I noticed that you changed some of the original source data names, which is highly inadvisable. This is something that Doug Martin insists on doing, and has ignored both Brian and I's experience and advice on the matter. Therefore, please revert the sector names to their original format asap.

Thank you, Justin

OK. Checking to make sure I understand exactly what you want me to revert - is the concern the renaming of TickerLast column to TickerL or is it the modified Sectors (changing row contents for each security)?

@braverock
Copy link
Owner

I'm tired of trying to impose best practices, but varying from vendor names is almost never a good idea, because then the analyst (or student) can't look up what things mean online, and they can't use other people's code that assumes the vendor data will be unmolested.

If we want the parsed R objects to have different names, I guess that is arguably OK fropm a pedantic point of view, but even then, use factor labels, and provide a function to change the change the labels for the factor levels, don't change the source data.

Changing the names of the input data sources is frankly dumb, because then you can never go back to the original data source and reconcile. You can also never merge new data from the input data source. We've already seen merge problems when Doug wanted to get updated data from S&P or further adjusted things. If people who are supposedly highly familiar with the data set can't merge it cleanly, then how are other people going to feel if they download data from S&P?

In a production data pipeline, as I'm sure you and your team are aware, you usually ingest the source data as is, and then create processes for cleaning or labeling or fixing bad data points, outliers, etc. These things are supposed to be documented and repeatable. Your sandbox code is a start, but it isn't documented, and the documentation for the data sources in the package doesn't explain that the data has been altered, how it was altered, and where the code to apply or reverse the alterations may be found.

Both Justin and I have voiced concerns several times about renaming things. They won't match S&P's naming conventions or documentation, you can't merge new data, the list is too long to mention. Not to mentio, we are far beyond the days of fixed-width terminals, how does shortening a name make it more readable or usable? I would suggest it is far more common in practice to clean things like badly placed spaces or puntuation in names, or to provide printed names that are "pretty" and usually longer (like full company names instead of data identifiers).

Best practice is to create code to make changes, and apply that code in a systematic way. Even with students or a textbook, we should be teaching people how real data sources work, warts and all, not cleaning it all up, giving them artificially perfect toy examples, and then wondering why they fail when they get to industry and 80% of the work is understanding, parsing, and manipulating input data.

Like I said, I'm sick of arguing about it where it is clear that there is no intention to adhere to industry best practices. Our time is probably better spent fixing bugs and making improvements in the functions themselves (there are many, see the issues list).

@spinnj
Copy link
Contributor Author

spinnj commented Mar 26, 2022

I haven't been part of any arguments and so I think I've come in to the middle of a conversation here.

Overall, understood, can we simply revert everything from my pull request and move forward in your preferred way - I'm not opposed to that in any way.

@JustinMShea
Copy link
Collaborator

OK. Checking to make sure I understand exactly what you want me to revert - is the concern the renaming of TickerLast column to TickerL or is it the modified Sectors (changing row contents for each security)?

Yes, @spinnj it is both the TickerLast to TickerL and modified Sectors labels. As Brian helpfully provided a longer explanation, original names are best for a variety of reasons.

@spinnj
Copy link
Contributor Author

spinnj commented Mar 26, 2022

OK, understood. Best course of action may be to revert the whole thing, let me go back and keep things as they were and match the sector names to the GICS documentation, and I can submit another PR later?

@JustinMShea
Copy link
Collaborator

Done. Just to illustrate a related point to this conversations, I ran the R CMD check after the PR was merged and got a new error in the Examples files(See #43 on this issue).

Running examples in ‘FactorAnalytics-Ex.R’ failed
The error most likely occurred in:

> ### Name: fitFfm
> ### Title: Fit a fundamental factor model using cross-sectional regression
> ### Aliases: fitFfm coef.ffm fitted.ffm residuals.ffm
> 
> ### ** Examples
> 
> 
> # load data 
> data(stocksCRSP)
> data(factorsSPGMI)
> 
> stocks_factors <- selectCRSPandSPGMI(stocks = stocksCRSP, factors = factorsSPGMI,
+                                     dateSet = c("2006-01-31", "2010-12-31"), 
+                                     stockItems = c("Date", "TickerLast", 
+                                                    "CapGroup", "Sector", 
+                                                    "Return", "Ret13WkBill",
+                                                    "mktIndexCRSP"),
+                                     factorItems = c("BP", "LogMktCap", "SEV"), 
+                                     capChoice = "SmallCap",
+                                     Nstocks = 20)
Error in `[.data.table`(facModDat, , .SD, .SDcols = colNames) : 
  Some items of .SDcols are not column names: [TickerLast]
Calls: selectCRSPandSPGMI -> [ -> [.data.table
Execution halted

This is due to "TickerLast" being changed to "TickerL". It will impact a bunch of other files too, since we recycle many of the examples, but the Examples error only prints the first one.

The main goal is to get this on CRAN and then have it stay there for many years with minor maintenance. Thus we need to stamp out bugs and eliminate errors, warnings, and notes on build in the existing code base. We don't want to add others in the process if we can help it because we have a lot of other issues to stamp out (documented under the issues tab).

In addition, please see the project board here for the strategy I outlined to tackle existing issues and get this on CRAN. I believe your PR will also solve issue #73, which is most welcome, so thanks again.

@braverock
Copy link
Owner

This commit broke things, so reverting in #84 was the correct response for now.

It is actually very close to mergable.

I commented on the cleaning script file directly, but I will summarize my comments here to keep the discussion in the thread.

lines 7/8 of the cleaning script use absolute paths. this will break for anyone who doesn't have the code at C:/FA?FactorAnalytics. please change to relative paths from project root.

lines 102/103 rename column TickerLast to TickerL. This is a breaking change, as Justin noted above. My personal opinion is that making a breaking change to save three characters isn't necessary. Anyone familiar with the CRSP data set is used to the name TickerLast.

If we want to make breaking changes anyway, then the requirement for mergability is to refactor to make it non-breaking. There are ~18 places in the R/ directory that reference TickerLast, and about 14 places in the man/ directory. Refactoring would change the .R files first, and regenerate the relevant roxygen files in man/. If this changed all references to TickerLast and R CMD check runs without breaking on TickerLast, then I think this would be mergable.

lines 89 and 96 (reordering columns) will make future merges of vendor data break. there are already other breaking changes in the data (like changing the text names of Industry Sectors rather than changing labels for factor levels), so I don't consider this reordering to be a chnage that breaks future merges any more than other previous changes. it doesn't seem necessary, but there's no prevailing reason not to include it in the commit.

Sorry for the noise and the (minor) additional work, and thanks for contributing.

@spinnj
Copy link
Contributor Author

spinnj commented Mar 27, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants