Commit d4f3f66

kaizen: add -q option for quoted fields (#29)
* kaizen: add -q option for quoted fields; addresses #28 and #27
  Signed-off-by: Tim Bray <[email protected]>
* add missing test data
  Signed-off-by: Tim Bray <[email protected]>

Signed-off-by: Tim Bray <[email protected]>
1 parent 8c59c94 commit d4f3f66

File tree

10 files changed

+491
-48
lines changed

INSTALLING.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
# Installing Topfew
2+
3+
Each Topfew [release](https://github.com/timbray/topfew/releases) comes with binaries built for both the x86 and ARM
4+
flavors of Linux, macOS, and Windows.
5+
6+
Topfew comes with an uncomplicated Makefile. Typing `make` creates an executable named `tf`,
7+
built by `go build` with no options, in the `./bin` directory.
8+
9+
## Arch Linux
10+
11+
Topfew [is available](https://aur.archlinux.org/packages/topfew) in the
12+
[Arch User Repository](https://wiki.archlinux.org/title/Arch_User_Repository) (AUR).
13+
If you have an AUR pacman wrapper installed, you can use it to install Topfew directly. Otherwise, to install Topfew as an Arch package:
14+
```
15+
git clone https://aur.archlinux.org/topfew.git
16+
cd topfew
17+
makepkg -i
18+
```

README.md

Lines changed: 37 additions & 0 deletions
@@ -70,6 +70,13 @@ If no fieldlist is provided, **tf** treats the whole input record as a single fi
7070
Provides a regular expression that is used as a field separator instead of the default white space.
7171
This is likely to incur a significant performance cost.
7272
73+
`-q, --quotedfields`
74+
75+
Some files, for example Apache httpd logs, use space-separation but also
76+
allow spaces within fields which are delimited by `"`. The -q/--quotedfields
77+
argument allows **tf** to process these correctly. It is an error to specify both
78+
-p and -q.
79+
7380
`-g regexp`, `--grep regexp`
7481
7582
The initial **g** suggests `grep`.
@@ -114,6 +121,36 @@ Records are separated by newlines, fields within records by white space, defined
114121
115122
The field separator can be overridden with the --fieldseparator option.
116123
124+
## Case study: Apache access_log
125+
126+
Here is a line from an Apache httpd `access_log` file. For readability, the fields are
127+
separated by line-breaks and numbered. Note that the fields are mostly space-separated, but that field 6,
128+
summarizing the request and its result, is delimited by quote characters `"`.
129+
130+
```
131+
1. 202.113.19.244
132+
2. -
133+
3. -
134+
4. [12/Mar/2007:08:04:39
135+
5. -0800]
136+
6. "GET /ongoing/picInfo.xml?o=http://www.tbray.org/ongoing/When/200x/2007/03/10/Beautiful-Code HTTP/1.1"
137+
7. 200
138+
8. 137
139+
9. "http://www.tbray.org/ongoing/When/200x/2007/03/10/Beautiful-Code"
140+
10. "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2"
141+
```
142+
143+
The fetch of `picInfo.xml` signals that this is an actual browser request, likely signifying that
144+
a human was involved; the URL following the `o=` is the resource the human looked at. Here is a
145+
**tf** invocation that yields a list of the top 5 URLs that were fetched by a human:
146+
147+
```shell
148+
tf -g picInfo.xml -f 6 -q -s '\?utm.*' '' -s " HTTP/..." "" -s "GET .*\/ongoing" ""
149+
```
150+
151+
Note the `-g` to select only lines with `picInfo.xml`, the `-q` to request correct processing
152+
of quote-delimited fields, and the sequence of `-s` patterns to clean up the results.
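
To see what the three `-s` patterns do to field 6 of the sample line, here is a minimal Go sketch that applies them in order. The `substitution` type and the `cleanField` and `readmeSubs` helpers are hypothetical names invented for illustration, and the sketch assumes each pattern is applied with replace-all semantics; it is not taken from the **tf** source.

```go
package main

import (
	"fmt"
	"regexp"
)

// substitution is a hypothetical stand-in for one -s regexp/replacement pair.
type substitution struct {
	pattern     *regexp.Regexp
	replacement string
}

// cleanField applies each substitution in order, feeding the output of one
// into the next. Replace-all semantics are an assumption of this sketch.
func cleanField(field string, subs []substitution) string {
	for _, s := range subs {
		field = s.pattern.ReplaceAllString(field, s.replacement)
	}
	return field
}

// readmeSubs returns the three patterns from the command line above,
// each with an empty replacement.
func readmeSubs() []substitution {
	return []substitution{
		{regexp.MustCompile(`\?utm.*`), ""},
		{regexp.MustCompile(` HTTP/...`), ""},
		{regexp.MustCompile(`GET .*\/ongoing`), ""},
	}
}

func main() {
	// Field 6 of the sample line, with the surrounding quotes already removed.
	field6 := "GET /ongoing/picInfo.xml?o=http://www.tbray.org/ongoing/When/200x/2007/03/10/Beautiful-Code HTTP/1.1"
	fmt.Println(cleanField(field6, readmeSubs()))
	// prints: /When/200x/2007/03/10/Beautiful-Code
}
```

Because `.*` is greedy, the third pattern consumes everything up to the last `/ongoing`, which is why only the path after it survives.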
153+
117154
## Performance issues
118155
119156
Since the effect of topfew can be exactly duplicated with a combination of `awk`, `grep`, `sed` and `sort`, you wouldn’t be using it if you didn’t care about performance.

internal/config.go

Lines changed: 13 additions & 1 deletion
@@ -17,6 +17,7 @@ type config struct {
1717
filter filters
1818
width int
1919
sample bool
20+
quotedFields bool
2021
}
2122

2223
func Configure(args []string) (*config, error) {
@@ -75,6 +76,8 @@ func Configure(args []string) (*config, error) {
7576
}
7677
case arg == "--sample":
7778
config.sample = true
79+
case arg == "--quotedfields" || arg == "-q":
80+
config.quotedFields = true
7881
case arg == "-h" || arg == "-help" || arg == "--help":
7982
fmt.Println(instructions)
8083
os.Exit(0)
@@ -101,6 +104,9 @@ func Configure(args []string) (*config, error) {
101104
}
102105
i++
103106
}
107+
if (config.fieldSeparator != nil) && config.quotedFields {
108+
err = errors.New("only one of -p/--fieldseparator and -q/--quotedfields may be specified")
109+
}
104110

105111
return &config, err
106112
}
@@ -132,7 +138,8 @@ order of occurrences.
132138
Usage: tf
133139
-n, --number (output line count) [default is 10]
134140
-f, --fields (field list) [default is the whole record]
135-
-p, --fieldseparator (field separator regex) [default is white space]
141+
-p, --fieldseparator (field separator regex) [default is white space]
142+
-q, --quotedfields [default is false]
136143
-g, --grep (regexp) [may repeat, default is accept all]
137144
-v, --vgrep (regexp) [may repeat, default is reject none]
138145
-s, --sed (regexp) (replacement) [may repeat, default is no changes]
@@ -151,6 +158,11 @@ Fields are separated by white space (spaces or tabs) by default.
151158
This can be overridden with the --fieldseparator option, at some cost in
152159
performance.
153160
161+
Some files, for example Apache httpd logs, use space-separation but also
162+
allow spaces within fields which are quoted with ("). The -q/--quotedfields
163+
option allows tf to process these correctly. It is an error to specify both
164+
-p and -q.
165+
154166
The regexp-valued fields work as follows:
155167
-g/--grep discards records that don't match the regexp (g for grep)
156168
-v/--vgrep discards records that do match the regexp (v for grep -v)

internal/config_test.go

Lines changed: 2 additions & 0 deletions
@@ -15,10 +15,12 @@ func TestArgSyntax(t *testing.T) {
1515
{"--sed"}, {"-s", "x"}, {"--sample", "--sed", "1"},
1616
{"--width", "a"}, {"-w", "0"}, {"--sample", "-w"},
1717
{"--sample", "-p"}, {"--fieldseparator", "a["},
18+
{"--fieldseparator", "x", "-q"}, {"--quotedfields", "-f", "z"},
1819
}
1920

2021
// not testing -h/--help because it'd be extra work to avoid printing out the usage
2122
goods := [][]string{
23+
{"-q", "fname"}, {"--quotedfields"},
2224
{"--number", "1"}, {"-n", "5"},
2325
{"--fields", "1"}, {"-f", "3,5"},
2426
{"--grep", "re1"}, {"-g", "re2"},

internal/keyfinder.go

Lines changed: 149 additions & 36 deletions
@@ -21,78 +21,127 @@ const NER = "not enough bytes in record"
2121
// does mean that the contents of the field are only valid until you call getKey again, and also that
2222
// the keyFinder type is not thread-safe
2323
type keyFinder struct {
24-
fields []uint
25-
key []byte
26-
separator *regexp.Regexp
24+
fields []uint
25+
key []byte
26+
separator *regexp.Regexp
27+
quotedFields bool
2728
}
2829

2930
// newKeyFinder creates a new Key finder with the supplied field numbers; the input should be 1-based.
3031
// keyFinder is not thread-safe, you should clone it for each goroutine that uses it.
31-
func newKeyFinder(keys []uint, separator *regexp.Regexp) *keyFinder {
32+
func newKeyFinder(keys []uint, separator *regexp.Regexp, quotedFields bool) *keyFinder {
3233
kf := keyFinder{
3334
key: make([]byte, 0, 128),
3435
}
3536
for _, knum := range keys {
3637
kf.fields = append(kf.fields, knum-1)
3738
}
3839
kf.separator = separator
40+
kf.quotedFields = quotedFields
3941
return &kf
4042
}
4143

4244
// clone returns a new keyFinder with the same configuration. Each goroutine should use its own
4345
// keyFinder instance.
4446
func (kf *keyFinder) clone() *keyFinder {
4547
return &keyFinder{
46-
fields: kf.fields,
47-
key: make([]byte, 0, 128),
48-
separator: kf.separator,
48+
fields: kf.fields,
49+
key: make([]byte, 0, 128),
50+
separator: kf.separator,
51+
quotedFields: kf.quotedFields,
4952
}
5053
}
5154

5255
// getKey extracts a key from the supplied record. This is applied to every record,
5356
// so efficiency matters.
5457
func (kf *keyFinder) getKey(record []byte) ([]byte, error) {
55-
// if there are no Key-finders just return the record, minus any trailing newlines
58+
// chomp
59+
if len(record) > 0 && record[len(record)-1] == '\n' {
60+
record = record[:len(record)-1]
61+
}
62+
// if there are no Key-finders the key is the record
5663
if len(kf.fields) == 0 {
57-
if record[len(record)-1] == '\n' {
58-
record = record[0 : len(record)-1]
59-
}
6064
return record, nil
6165
}
6266
var err error
6367
kf.key = kf.key[:0]
6468
if kf.separator == nil {
65-
field := 0
66-
index := 0
67-
first := true
68-
69-
// for each field in the Key
70-
for _, keyField := range kf.fields {
71-
// bypass fields before the one we want
72-
for field < int(keyField) {
73-
index, err = pass(record, index)
69+
// no regex provided, we're doing space-separation
70+
if kf.quotedFields {
71+
// if we're doing apache httpd style access_log files, with some "-quoted fields
72+
field := 0
73+
index := 0
74+
first := true
75+
76+
// for each field in the key
77+
for _, keyField := range kf.fields {
78+
// bypass fields before the one we want
79+
for field < int(keyField) {
80+
index, err = passQuoted(record, index)
81+
if err != nil {
82+
return nil, err
83+
}
84+
// in the special case where we might have just passed a quoted field, we will
85+
// advance index past the closing quote
86+
if index < len(record) && record[index] == '"' {
87+
index++
88+
}
89+
field++
90+
}
91+
92+
// join(' ', kf)
93+
if first {
94+
first = false
95+
} else {
96+
kf.key = append(kf.key, ' ')
97+
}
98+
99+
kf.key, index, err = gatherQuoted(kf.key, record, index)
74100
if err != nil {
75101
return nil, err
76102
}
103+
// in the special case where we might have just gathered a quoted field, we will
104+
// advance index past the closing quote
105+
if index < len(record) && record[index] == '"' {
106+
index++
107+
}
77108
field++
78109
}
110+
} else {
111+
// basic space-separation
112+
field := 0
113+
index := 0
114+
first := true
79115

80-
// join(' ', kf)
81-
if first {
82-
first = false
83-
} else {
84-
kf.key = append(kf.key, ' ')
85-
}
116+
// for each field in the Key
117+
for _, keyField := range kf.fields {
118+
// bypass fields before the one we want
119+
for field < int(keyField) {
120+
index, err = pass(record, index)
121+
if err != nil {
122+
return nil, err
123+
}
124+
field++
125+
}
86126

87-
// attach desired field to Key
88-
kf.key, index, err = gather(kf.key, record, index)
89-
if err != nil {
90-
return nil, err
91-
}
127+
// join(' ', kf)
128+
if first {
129+
first = false
130+
} else {
131+
kf.key = append(kf.key, ' ')
132+
}
92133

93-
field++
134+
// attach desired field to Key
135+
kf.key, index, err = gather(kf.key, record, index)
136+
if err != nil {
137+
return nil, err
138+
}
139+
140+
field++
141+
}
94142
}
95143
} else {
144+
// regex separator provided, less code but probably slower
96145
allFields := kf.separator.Split(string(record), -1)
97146
for i, field := range kf.fields {
98147
if int(field) >= len(allFields) {
@@ -107,9 +156,10 @@ func (kf *keyFinder) getKey(record []byte) ([]byte, error) {
107156
return kf.key, err
108157
}
109158

110-
// pull in the bytes from a desired field
159+
// gather pulls in the bytes from a desired field, and leaves index positioned at the first white-space
160+
// character following the field, or at the end of the record, i.e. len(record)
111161
func gather(key []byte, record []byte, index int) ([]byte, int, error) {
112-
// eat leading space
162+
// eat leading space - if we're already at the end of the record, the loop is a no-op
113163
for index < len(record) && (record[index] == ' ' || record[index] == '\t') {
114164
index++
115165
}
@@ -118,13 +168,49 @@ func gather(key []byte, record []byte, index int) ([]byte, int, error) {
118168
}
119169

120170
// copy Key bytes
121-
for index < len(record) && record[index] != ' ' && record[index] != '\t' && record[index] != '\n' {
122-
key = append(key, record[index])
171+
startAt := index
172+
for index < len(record) && record[index] != ' ' && record[index] != '\t' {
123173
index++
124174
}
175+
key = append(key, record[startAt:index]...)
125176
return key, index, nil
126177
}
127178

179+
// same semantics as gather, but respects quoted fields that might create spaces. Leaves the index
180+
// value pointing at the closing quote
181+
func gatherQuoted(key []byte, record []byte, index int) ([]byte, int, error) {
182+
// eat leading space
183+
for index < len(record) && (record[index] == ' ' || record[index] == '\t') {
184+
index++
185+
}
186+
if index >= len(record) {
187+
return nil, 0, errors.New(NER)
188+
}
189+
190+
if record[index] == '"' {
191+
index++
192+
startAt := index
193+
for index < len(record) && record[index] != '"' {
194+
index++
195+
}
196+
key = append(key, record[startAt:index]...)
197+
// if we hit end-of-record before the closing quote, that's an error
198+
if index == len(record) {
199+
return nil, 0, errors.New(NER)
200+
}
201+
} else {
202+
startAt := index
203+
for index < len(record) && record[index] != ' ' && record[index] != '\t' {
204+
index++
205+
}
206+
key = append(key, record[startAt:index]...)
207+
}
208+
return key, index, nil
209+
}
210+
211+
// pass moves the index variable past any white space and a space-separated field,
212+
// leaving index pointing at the first white-space character after the field or
213+
// at the end of record, i.e. == len(record)
128214
func pass(record []byte, index int) (int, error) {
129215
// eat leading space
130216
for index < len(record) && (record[index] == ' ' || record[index] == '\t') {
@@ -138,3 +224,30 @@ func pass(record []byte, index int) (int, error) {
138224
}
139225
return index, nil
140226
}
227+
228+
// same semantics as pass, but for quoted fields. Leaves the index value pointing at the
229+
// closing "
230+
func passQuoted(record []byte, index int) (int, error) {
231+
// eat leading space
232+
for index < len(record) && (record[index] == ' ' || record[index] == '\t') {
233+
index++
234+
}
235+
if index == len(record) {
236+
return 0, errors.New(NER)
237+
}
238+
if record[index] == '"' {
239+
index++
240+
for index < len(record) && record[index] != '"' {
241+
index++
242+
}
243+
// if we hit end-of-record before the closing quote, that's an error
244+
if index >= len(record) {
245+
return 0, errors.New(NER)
246+
}
247+
} else {
248+
for index < len(record) && record[index] != ' ' && record[index] != '\t' {
249+
index++
250+
}
251+
}
252+
return index, nil
253+
}
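
The scanning rules shared by `passQuoted` and `gatherQuoted` (skip leading white space, read a quoted field up to its closing `"`, otherwise read up to the next white space, and treat a missing closing quote as an error) can be illustrated with a simplified splitter. This is a sketch only: `splitQuoted` and `errUnterminated` are hypothetical names, and unlike the code in this commit it returns every field as a string instead of gathering selected fields into a reusable byte buffer.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnterminated is a hypothetical error for a quoted field with no closing quote.
var errUnterminated = errors.New("unterminated quoted field")

// splitQuoted splits a record on spaces and tabs, treating a field that
// starts with '"' as running to the next '"' and dropping both quotes.
func splitQuoted(record string) ([]string, error) {
	var fields []string
	i := 0
	for {
		// eat leading white space
		for i < len(record) && (record[i] == ' ' || record[i] == '\t') {
			i++
		}
		if i >= len(record) {
			return fields, nil
		}
		if record[i] == '"' {
			i++ // step past the opening quote
			start := i
			for i < len(record) && record[i] != '"' {
				i++
			}
			if i == len(record) {
				return nil, errUnterminated
			}
			fields = append(fields, record[start:i])
			i++ // step past the closing quote
		} else {
			start := i
			for i < len(record) && record[i] != ' ' && record[i] != '\t' {
				i++
			}
			fields = append(fields, record[start:i])
		}
	}
}

func main() {
	line := `202.113.19.244 - - [12/Mar/2007:08:04:39 -0800] "GET /ongoing/picInfo.xml HTTP/1.1" 200 137`
	fields, err := splitQuoted(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(fields), fields[5])
	// prints: 8 GET /ongoing/picInfo.xml HTTP/1.1
}
```

The commit's version avoids building the full field list because `getKey` runs on every record; it only advances an index past unwanted fields and copies the bytes of the requested ones.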
