SQLite bulk load is failing silently #115

Open
waldoj opened this issue May 29, 2020 · 3 comments

@waldoj

waldoj commented May 29, 2020

update.sh is bailing entirely at this step:

sqlite3 temp.sqlite < ../scripts/load-data.sql

Like, so entirely that the exit trap doesn't execute, which is grim.
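
For anyone following along, here's a minimal sketch of how an exit trap and an explicit status check normally fit together in a shell script. `cleanup`, `load_step`, and `run_update` are hypothetical names, not taken from update.sh, and `load_step` only simulates a failing import.

```shell
# Hypothetical stand-ins; update.sh's real functions are not shown in this issue.
cleanup() {
    echo "cleanup ran"
}

load_step() {
    # Simulates a step that fails, e.g. an import command returning non-zero.
    return 3
}

run_update() (
    # The body is a subshell, so the EXIT trap fires when that subshell ends.
    trap cleanup EXIT
    load_step
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "load step failed with status $rc" >&2
        exit "$rc"
    fi
)
```

With this shape the trap fires on an ordinary non-zero exit. One case where an EXIT trap never runs is when the shell itself is killed with SIGKILL, so that may be worth ruling out. That's a guess, not something the output here confirms.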

Here's the entirely unremarkable output:

[...]
../data/officer.csv:890289: unescaped " character
../data/officer.csv:890289: unterminated "-quoted field
../data/officer.csv:890289: expected 5 columns but found 4 - filling the rest with NULL

#

(There is a constant stream of errors emitted by the importer. This is normal.)

I don't get what's going on here.

@waldoj

waldoj commented May 29, 2020

Looking through the resulting SQLite file, it looks like only a subset of the data is being imported, across tables. (For example, only 267,367 officer records, versus an expected 802,103; only 436,591 LLC records, versus an expected 1,309,775; only 436,591 corporation records, versus an expected 1,215,425.) I'm confused about how this is possible, because load-data.sql includes a series of import statements, each of which should be atomic.
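
To make those shortfalls easier to compare, here's a small sketch that turns the counts above into percentages. The `report_ratio` helper is hypothetical; the numbers are the ones quoted in this comment.

```shell
# Imported vs. expected record counts, copied from the comment above.
report_ratio() {
    # $1 = label, $2 = imported rows, $3 = expected rows
    awk -v label="$1" -v got="$2" -v want="$3" \
        'BEGIN { printf "%s: %.2f%% imported\n", label, 100 * got / want }'
}

report_ratio officer     267367  802103
report_ratio llc         436591 1309775
report_ratio corporation 436591 1215425
```

It prints one line per table with the share of expected rows that actually landed in SQLite.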

My best guess is that this is a result of new CSV formats being emitted by the SCC. The data may be so fatally malformed that SQLite is either quietly bailing on the attempt, or rejecting a far higher number of rows than it used to.

@waldoj

waldoj commented May 31, 2020

This is suspicious:

head -10000 llc.csv |csvstat
  1. EntityID,Name,Status,Status
	<class 'str'>
	Nulls: False
	Unique values: 9730
	Max length: 8
  2. Date,Duration,IncorpDate,IncorpState,IndustryCode,Street1,Street2,City,State,Zip,PrinOffEffDate,RA-Name,RA-Street1,RA-Street2,RA-City,RA-State,RA-Zip,RA-EffDate,RA-Status,RA-Loc,StockInd,TotalShares,MergerInd,AssessInd
	<class 'NoneType'>
	Nulls: True
	Values:

Row count: 9730

Looks to me like the csvkit parser believes there to be just 2 columns.
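
As a cross-check that doesn't depend on csvkit's parser, here's a rough sketch that counts comma-separated fields per line and reports lines that disagree with the header. `field_count_outliers` is a hypothetical helper, and it splits on every comma without honoring CSV quoting, so a quoted field containing a comma will be miscounted.

```shell
# Print the line number and field count of every line whose comma-separated
# field count differs from the header's (line 1).
# Naive: ignores CSV quoting, so treat the result as a hint, not a verdict.
field_count_outliers() {
    awk -F',' '
        NR == 1    { want = NF; next }
        NF != want { print NR ": " NF " fields, expected " want }
    ' "$1"
}

# Usage (file name taken from the command above):
#   field_count_outliers llc.csv
```

If this prints nothing, every line has the same number of commas as the header, which would point away from a simple missing-column problem.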

I figured out that the problem (or a problem, anyway) is on line 1,116 of the CSV, or the 2nd line here ("MCCLUNEY KIDS"):

	S2760140  ,GIBBY F. ENTERPRISE LLC,INACTIVE  ,2015-12-31,9999-12-31,          ,VA        ,0         ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,0
	S2165480  ,MCCLUNEY KIDS  LLC,INACTIVE  ,2015-12-31,9999-12-31,          ,VA        ,0         ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,0
	S0672453  ,WINDSOR CAPITAL  L.L.C.,INACTIVE  ,2008-12-31,9999-12-31,          ,VA        ,0         ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,          ,0

That is, if I pipe the first 1,115 rows through csvstat, it recognizes all columns. But once that 1,116th row is included, it collapses to two. But...why? Whatever it is, it doesn't bother Pages.app (but CSVs vary so widely, that doesn't say much).
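
The prefix test described above can be automated with a binary search over `head -n N`. This is a sketch under assumptions: `first_bad_prefix` and `three_fields` are hypothetical helpers, the check has to fail for every prefix that includes the bad line, and the full file has to fail. `three_fields` is only a stand-in; a real run would substitute whatever command tells a good parse from a bad one.

```shell
# first_bad_prefix FILE CHECK [ARGS...]
# Finds the smallest N for which `head -n N FILE | CHECK` fails.
# Assumes shorter prefixes pass, longer ones fail, and the whole file fails.
first_bad_prefix() {
    file=$1
    shift
    lo=1
    hi=$(( $(wc -l < "$file") ))
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if head -n "$mid" "$file" | "$@" >/dev/null 2>&1; then
            lo=$(( mid + 1 ))   # prefix is fine, so the bad line is later
        else
            hi=$mid             # prefix already fails, so look earlier
        fi
    done
    echo "$lo"
}

# Stand-in check: succeeds only if every line on stdin has exactly 3 fields.
three_fields() {
    awk -F',' 'NF != 3 { bad = 1 } END { exit bad + 0 }'
}

# Usage: first_bad_prefix some.csv three_fields
```

This takes roughly log2(lines) runs of the check instead of a manual walk through the file.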

I don't get it.

@waldoj

waldoj commented May 31, 2020

If I remove line 1,116 and pipe every preceding line and the one following line through csvstat, all columns are recognized.

If I construct a file that consists of the header row and just line 1,116, csvstat recognizes all columns. 🤯
