Commit 935ffc8

Revise deduplicate util to not reorder lines in the input sequence. This is important to preserve the integrity of the original input data. By comparison, the sort order doesn't matter when we compute frequencies.
1 parent 1755f73 commit 935ffc8

1 file changed: +11 -1 lines changed

log-analysis.sh

@@ -20,7 +20,17 @@ ensure_deps "gawk" "jq"
 # ####################
 
 deduplicate() {
-  sort | uniq
+  # Ref. technique as seen here: https://stackoverflow.com/a/20639730
+  # Or use awk '!x[$0]++', as seen here https://stackoverflow.com/a/11532197
+
+  # print stdin stream with line-number prefixed
+  cat -n |
+  # sort uniquely by second key (the incoming lines)
+  sort -uk2 |
+  # sort numeric by first key (line number)
+  sort -nk1 |
+  # select second field to the end
+  cut -f2-
 }
 
 frequencies() {
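
The effect of this change can be checked on a small sample. The following is a minimal sketch and not part of the commit: the names `dedupe_pipeline`, `dedupe_awk`, and `sample` are hypothetical, and the pipeline's first-occurrence behavior assumes GNU `sort`, where `-u` keeps the first line of each run of equal keys.

```shell
# Hypothetical helper names, for illustration only.

# Order-preserving dedupe using the same pipeline as the diff:
# number the lines, keep one line per distinct content, restore the
# original order by line number, then strip the number column.
# Assumes GNU sort, where -u keeps the first line of an equal run.
dedupe_pipeline() {
  cat -n | sort -uk2 | sort -nk1 | cut -f2-
}

# Alternative mentioned in the diff's comment: print a line only the
# first time its content is seen.
dedupe_awk() {
  awk '!x[$0]++'
}

# Sample input with repeated lines.
sample() {
  printf '%s\n' banana apple banana cherry apple
}

sample | sort | uniq      # previous behavior: apple, banana, cherry (reordered)
sample | dedupe_pipeline  # with GNU sort: banana, apple, cherry (first-seen order)
sample | dedupe_awk       # banana, apple, cherry (first-seen order)
```

The `awk` variant holds every distinct line in memory but needs only one pass and does not depend on which duplicate `sort -u` retains. The pipeline variant relies on `sort`, which can spill to temporary files for large inputs.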