Jonathan Scott edited this page Jun 17, 2023 · 48 revisions

Welcome to the CSVComparer wiki!

CSVComparer is a simple tool that compares two CSV-style files and reports the differences between them in a structured way.

How to run a CSV Comparison

What's new

2023-06-17

Fix bug where row numbers were out by 1 in the report: https://github.com/jscott7/CSVComparer/issues/40

2023-05-22

Replace the Queue guarded by lock statements with a ConcurrentQueue. This reduced the benchmark time for the different-file comparison from 1.5 ms to 1.28 ms.

2023-05-15

Created a NuGet package on NuGet.org https://www.nuget.org/packages/CSVComparer (version 1.0.0)

2023-05-04

  • Change naming convention from Reference/Candidate to LeftHandSide/RightHandSide

2023-04-11

  • Refactor SplitRowWithQuotes to use ReadOnlySpan

Benchmark before:

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Gen1 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| StringSplit | 60.68 ns | 0.351 ns | 0.328 ns | 0.000 | 0.00 | 0.0178 | - | 224 B | 0.000 |
| StringSplitWithQuotesControl | 339.66 ns | 0.935 ns | 0.875 ns | 0.000 | 0.00 | 0.0267 | - | 336 B | 0.000 |
| StringSplitWithQuotes | 352.94 ns | 6.950 ns | 6.501 ns | 0.000 | 0.00 | 0.0267 | - | 336 B | 0.000 |

Benchmark after:

| Method | Mean | Error | StdDev | Ratio | Gen0 | Gen1 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|
| StringSplit | 60.82 ns | 0.655 ns | 0.613 ns | 0.000 | 0.0178 | - | 224 B | 0.000 |
| StringSplitWithQuotesControl | 141.54 ns | 1.322 ns | 1.172 ns | 0.000 | 0.0267 | - | 336 B | 0.000 |
| StringSplitWithQuotes | 138.74 ns | 1.322 ns | 1.236 ns | 0.000 | 0.0267 | - | 336 B | 0.000 |

2023-03-28

  • Fix bug with splitting string containing quotes and complex delimiter, for example, using "##" delimiter:
"A##\"B contains a quote##comma\"##\"Also contains a##comma\"##D"
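The behaviour described above can be illustrated with a small sketch. It is written in Python for brevity rather than in the library's own C#, and the function name `split_with_quotes` is hypothetical. It is not the project's SplitRowWithQuotes implementation, only one way to split on a multi-character delimiter while ignoring delimiters inside double quotes.

```python
def split_with_quotes(row: str, delimiter: str) -> list[str]:
    """Split `row` on `delimiter`, ignoring delimiters inside double quotes.

    Quotes are kept in the returned fields (an assumption of this sketch).
    """
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    fields = []
    current = []
    in_quotes = False
    i = 0
    while i < len(row):
        ch = row[i]
        if ch == '"':
            # Toggle quoted state; the quote character stays in the field.
            in_quotes = not in_quotes
            current.append(ch)
            i += 1
        elif not in_quotes and row.startswith(delimiter, i):
            # A delimiter outside quotes ends the current field.
            fields.append("".join(current))
            current = []
            i += len(delimiter)
        else:
            current.append(ch)
            i += 1
    # The final field is flushed even when the row ends with a quote.
    fields.append("".join(current))
    return fields


row = 'A##"B contains a quote##comma"##"Also contains a##comma"##D'
print(split_with_quotes(row, "##"))
```

With the example row, the two quoted fields keep their embedded `##` and the row splits into four fields.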

2023-03-22

  • Change default branch from master to main

Run the following commands to update a local clone:

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

2022-12-13

  • The column(s) used for the key are included in the results table, for example Key - ABC:DEF below:
Break Type,Key - ABC:DEF,Column Name,Reference Row, Reference Value, Candidate Row, Candidate Value
ValueMismatch,B:1,AnotherColumn,3,y,3,z
ValueMismatch,B:2,AValueColumn,4,1.2,4,1.0
RowInCandidateNotInReference,C:1,,-1,,5,
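As an illustration of how rows like these could be produced, here is a minimal Python sketch (not the library's C# code). It assumes the composite key is the key column values joined with `:`, that row numbers count the header as line 1, and that an orphan's missing side is reported as row -1. The function names and sample data are hypothetical, and reference-side orphans are left out.

```python
import csv
import io


def index_by_key(text, key_columns):
    """Map composite key -> (line number, row dict) for one CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    indexed = {}
    # Assumption: the header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        key = ":".join(row[column] for column in key_columns)
        indexed[key] = (line_number, row)
    return indexed


def compare(reference_text, candidate_text, key_columns):
    """Return break tuples shaped like the results table columns."""
    reference = index_by_key(reference_text, key_columns)
    candidate = index_by_key(candidate_text, key_columns)
    breaks = []
    for key, (ref_line, ref_row) in reference.items():
        if key not in candidate:
            continue  # reference-side orphans are omitted from this sketch
        cand_line, cand_row = candidate[key]
        for column, ref_value in ref_row.items():
            if column in key_columns:
                continue
            cand_value = cand_row.get(column)
            if ref_value != cand_value:
                breaks.append(("ValueMismatch", key, column,
                               ref_line, ref_value, cand_line, cand_value))
    for key, (cand_line, _) in candidate.items():
        if key not in reference:
            # The missing reference row is reported with index -1.
            breaks.append(("RowInCandidateNotInReference", key, "",
                           -1, "", cand_line, ""))
    return breaks


REFERENCE = "ABC,DEF,AValueColumn,AnotherColumn\nA,1,1.0,x\nB,1,1.1,y\nB,2,1.2,y\n"
CANDIDATE = "ABC,DEF,AValueColumn,AnotherColumn\nA,1,1.0,x\nB,1,1.1,z\nB,2,1.0,y\nC,1,2.0,q\n"

breaks = compare(REFERENCE, CANDIDATE, ["ABC", "DEF"])
for item in breaks:
    print(",".join(str(part) for part in item))
```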

2022-10-10

  • Add CodeQL vulnerability scan to GitHub workflow
  • Refactor Orphan handling

2022-02-28

  • Refactor Console Application into new project. Use .NET 6.0 Console template

2022-02-01

  • Create output folder if it doesn't exist
  • Expand unit test coverage

2022-01-25

  • Updated to .NET 6.0

2021-09-01

  • Add support for excluding value breaks based on a regex pattern match of key
  • Fix typo in xml

2021-07-09

  • Improve logging. Example output:
Searching for comparison definition for C:\temp\ReferenceDirectory\File.3456.csv
Found Comparison Definition. ID = Test3
Exact file match for reference: 'File.3456.csv' not found. Search using pattern: '^File.[a-zA-Z0-9]*.csv'
Comparing C:\temp\ReferenceDirectory\File.3456.csv with C:\temp\CandidateDirectory\File.1234.csv
Reference: C:\temp\ReferenceDirectory\File.3456.csv
Candidate: C:\temp\CandidateDirectory\File.1234.csv
No differences found.
Saving results to C:\temp\testresults\Reconciliation-Results-Test3.csv
Comparison took 25ms

Finished

2021-07-05

  • Delimit summary data by comma instead of colon. This will split summary into different cells if opened in a spreadsheet

2021-04-26

  • Add ability to exclude Orphans based on a Regex pattern matching the key
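A minimal sketch of key-based exclusion follows, in Python rather than the library's C#. The function name `exclude_by_key`, the dictionary layout of a break, and the sample pattern are all hypothetical. Whether the library anchors the pattern or searches anywhere in the key is not stated here, so the use of `search` is an assumption.

```python
import re


def exclude_by_key(breaks, key_pattern):
    """Drop breaks whose key matches `key_pattern` (a regular expression)."""
    pattern = re.compile(key_pattern)
    # Assumption: a match anywhere in the key excludes the break.
    return [item for item in breaks if not pattern.search(item["key"])]


orphans = [
    {"type": "RowInCandidateNotInReference", "key": "C:1"},
    {"type": "RowInCandidateNotInReference", "key": "TMP:7"},
]

# Exclude orphans whose key starts with "TMP:".
remaining = exclude_by_key(orphans, r"^TMP:")
print(remaining)
```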

2021-04-06

2021-04-01

  • Set up Azure Pipeline for CI
  • Process output correctly reports filenames when they end with .BREAKS.csv

2021-03-25

  • Perform tolerance based comparison on numeric values when they are enclosed by quotes
| COL A | COL B | COL C |
|---|---|---|
| "ROW 1" | "SOME VALUE" | "42.1" |
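The idea can be sketched in Python (not the library's C#) as follows. The function name `values_match` is hypothetical, and treating the tolerance as an absolute difference is an assumption; the configuration may define it differently.

```python
import math


def values_match(left: str, right: str, tolerance: float) -> bool:
    """Compare two cell values, using a tolerance when both are numeric.

    Enclosing double quotes are stripped before the numeric check, so
    '"42.1"' is treated like 42.1.
    """
    left_text = left.strip().strip('"')
    right_text = right.strip().strip('"')
    try:
        left_number = float(left_text)
        right_number = float(right_text)
    except ValueError:
        # At least one side is not numeric: fall back to text equality.
        return left_text == right_text
    # Assumption: the tolerance is an absolute difference.
    return math.isclose(left_number, right_number, rel_tol=0.0, abs_tol=tolerance)


print(values_match('"42.1"', '"42.15"', 0.1))
print(values_match('"42.1"', '"42.5"', 0.1))
print(values_match('"SOME VALUE"', '"SOME VALUE"', 0.1))
```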

2021-03-18

  • Add support for quotes within column fields. This now follows RFC 4180 (CSV), which states:

       If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"
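The escaping rule can be checked with Python's standard `csv` module, which applies the same doubled-quote convention by default. This only illustrates the RFC rule; it is not how CSVComparer itself parses rows.

```python
import csv
import io

line = '"aaa","b""bb","ccc"'

# Reading: the doubled quote inside the second field becomes one quote.
fields = next(csv.reader(io.StringIO(line)))
print(fields)  # ['aaa', 'b"bb', 'ccc']

# Writing: quoting every field doubles the embedded quote again.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(fields)
round_trip = buffer.getvalue().rstrip("\r\n")
print(round_trip)
```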

2021-01-11

  • In orphan report, missing row index is now -1

2020-10-01

  • Populate Date in Comparison Results

2020-09-28

  • Add support for empty CSV files. If one or both files are empty, the comparison will complete with meaningful results
  • Add Column Name to break details, for example:
| Break Type | Key | Column Name | Reference Row | Reference Value | Candidate Row | Candidate Value |
|---|---|---|---|---|---|---|
| ValueMismatch | 7 | COL B | 8 | 32.1 | 8 | 42.1 |

2020-09-22

  • Optionally exclude columns from the comparison. A list of columns to exclude can be defined in the configuration
  • Output files for comparisons with differences are saved as Filename.BREAKS.csv
  • The output path is now specified as a directory. If it doesn't exist it will be created

2020-09-15

  • Improve exception reporting when non-unique key columns are defined
  • Cosmetic improvements to logging/output file.
  • Add line counts for CSV files to output

2020-09-10

  • In a directory comparison, if a candidate file doesn't match exactly, use the file pattern to search
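A rough Python sketch of this fallback is shown below (the library itself is C#). The function name `find_candidate` is hypothetical, and returning the first file that matches the pattern is an assumption. The pattern is the one that appears in the 2021-07-09 logging example.

```python
import re


def find_candidate(reference_name, candidate_names, file_pattern):
    """Return the candidate file to compare against, or None.

    An exact file name match wins; otherwise the first candidate
    matching `file_pattern` is used (an assumption of this sketch).
    """
    if reference_name in candidate_names:
        return reference_name
    pattern = re.compile(file_pattern)
    for name in candidate_names:
        if pattern.match(name):
            return name
    return None


candidates = ["Other.csv", "File.1234.csv"]
# Note: the unescaped dots in this pattern match any character.
match = find_candidate("File.3456.csv", candidates, r"^File.[a-zA-Z0-9]*.csv")
print(match)
```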

2020-09-07

  • Use Regex to match file to configuration in directory comparisons
  • If a folder is passed as the outputfile parameter, a directory comparison saves a results file for each configuration key

2020-09-03

Added support for simple directory comparisons

2020-08-13

Change to API. ComparisonDefinition must now be applied in constructor

2020-08-12

Fix bug where SplitStringWithQuotes did not exit when a quote was the last character, for example: A,B,"C,D"

2020-06-15

Add support for single-character delimiters enclosed within quotes. For example:

A,B,"This is a comma, in a quote", D

will resolve to:

  • A
  • B
  • "This is a comma, in a quote"
  • D

2020-04-14

A single instance of the CSVComparer class can now be reused for multiple comparisons

2020-04-13

Add support to optionally save output to file

2020-04-04

Add IgnoreInvalidRows flag to configuration. If this is set to true, all rows that do not contain the same number of columns as the header row are excluded from the comparison. Typically this happens when a footer row is present.
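The effect of the flag can be sketched in a few lines of Python (the library itself is C#). The function name `drop_invalid_rows` and the sample footer row are hypothetical; only the rule of matching the header's column count comes from the description above.

```python
def drop_invalid_rows(rows, ignore_invalid_rows=True):
    """Remove rows whose column count differs from the header row.

    `rows` is a list of already-split rows, with the header first.
    """
    if not rows or not ignore_invalid_rows:
        return rows
    header, *data = rows
    return [header] + [row for row in data if len(row) == len(header)]


rows = [
    ["ID", "Value"],
    ["1", "1.0"],
    ["2", "2.0"],
    ["Total rows: 2"],  # footer row with a single column
]

filtered = drop_invalid_rows(rows)
print(filtered)
```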

2020-04-02

Improve output. Example break now reported as:

Break Type: ValueMismatch. Description Key:XY, Reference Row:100, Value:10.5 != Candidate Row:110, Value:1.5

The BreakDetail objects can also now be accessed programmatically

If none of the key columns listed in the configuration exist in the CSV files, the comparison terminates and a break description is added.

2020-03-23

Added tool to create large random test csv files

2020-03-12

Breaks include the row number

A: Row: 1 Value: 1.0 != Row: 1 Value: 1.2

2020-02-13

Renamed 'Target' to 'Candidate'. Naming convention change

2020-02-10

Added Tolerance for comparison of numeric fields