ABW

ABW stands for Awk Bag of Words.

This project has been renamed from ABoW to ABW. Update your config with this command:

cd path/to/abow
sed -i 's,abow.git,abw.git,g' .git/config

Usage

Process files

Process files to the standard output:

abw -p FILE [...]

abw --process FILE [...]

Process files to an output file using redirection:

abw -p FILE [...] > OUTPUT

abw --process FILE [...] > OUTPUT

Process files to an output file command line argument:

abw --p -w OUTPUT FILE [...]

abw --process --write-to OUTPUT FILE [...]

Process files with a list of fields:

abw -f token,count FILE [...]

abw --fields token,count FILE [...]

Process files with a list of options:

abw -o lang=pt,lower,ascii,nostopwords FILE [...]

abw --options lang=pt,lower,ascii,nostopwords FILE [...]

Where:

FILE: is a plain text file.
OUTPUT: is a tab-separated value file.

Options:

lang=pt: the language, e.g, pt
ascii: convert to ASCII characters
lower: convert to lower case characters
upper: convert to upper case characters
eol: keep end of line token (<eol>)
stopwords: keep stop-word tokens (requires lang option)
asc=token: ascending sort by token or count
desc=token: descending sort by token or count

As to the asc and desc options, the locale's sorting order does not come into play; comparisons are based on character values only.

Separate outputs

The --write-to parameter allows to write separate output files for each input file. If you want to process three files, you can pass a generic output name, so that each input file will produce a separate output file. The generic output name can use one of these two placeholders: :filedir :filename.

See these examples:

abw --process --write-to ':filedir/:filename.tsv' folder/file1.txt folder/file2.txt folder/file3.txt

ls folder/
'folder/file1.txt.tsv'
'folder/file2.txt.tsv'
'folder/file3.txt.tsv'

abw --process --write-to ':filedir/output.tsv' folder1/file.txt folder2/file.txt folder3/file.txt

ls folder*/
'folder1/output.tsv'
'folder2/output.tsv'
'folder3/output.tsv'

Where:

:filedir: is a placeholder to be replaced with the relative directory where the input file is.
:filename: is a placeholder to be replaced with the name of the input file with the extension.

List files

List files:

abw -l

abw --list

List files of a collection:

abw -l -c COLLECTION

abw --list --collection COLLECTION

List files of a collection selecting some metadata fields:

abw -l -c COLLECTION -m "uuid,collection,name,size,date"

abw --list --collection COLLECTION --meta "uuid,collection,name,size,date"

Show files

Show a text.txt file:

abw -s UUID

abw --show UUID

Show a meta.txt file:

abw -s -m UUID

abw --show --meta UUID

Show a data.tsv file:

abw -s -d UUID

abw --show --data UUID

Search in files

Search in text.txt files using a regex:

abw -g REGEX

abw --grep REGEX

Import files

Import files:

abw -i FILE [...]

abw --import FILE [...]

Import files to database:

abw -i -d FILE [...]

abw --import --database FILE [...]

Import files recursivelly:

abw -i -r DIRECTORY [...]

abw --import --recursive DIRECTORY [...]

Import files recursivelly into a collection:

abw -i -r -c COLLECTION DIRECTORY [...]

abw --import --recursive --collection COLLECTION DIRECTORY [...]

Where:

FILE: is a plain text file.
DIRECTORY: is a directory containing files with ".txt" suffix.
COLLECTION: is an arbitrary name for groups of imported files.

If no collection name is informed, the "default" collection is implied.

If a file exists, an error message is written to /dev/stderr.

You can force to import files again using the --force option.

The collection database

The imported files are inserted into a SQLite database called "collection.db" under the data folder.

CREATE TABLE text_ (
    uuid_ TEXT PRIMARY KEY, content_ TEXT
) STRICT;

CREATE TABLE meta_ (
    uuid_ TEXT PRIMARY KEY, collection_ TEXT,
    hash_ TEXT, name_ TEXT, path_ TEXT,
    mime_ TEXT, date_ TEXT, lines_ INTEGER,
    words_ INTEGER, bytes_ INTEGER, chars_ INTEGER,
    CONSTRAINT meta_fk_ FOREIGN KEY (uuid_) REFERENCES text_ (uuid_)
) STRICT;

CREATE TABLE data_ (
    uuid_ TEXT, token_ TEXT, type_ TEXT,
    count_ INTEGER, ratio_ REAL, format_ TEXT,
    case_ TEXT, length_ INTEGER, indexes_ TEXT, 
    CONSTRAINT data_pk_ PRIMARY KEY (uuid_,
    FOREIGN KEY (uuid_) REFERENCES text_ (uuid_)
) STRICT;

Where:

data_: is the table containing the bag of words.
meta_: is the table containing some properties.
text_: is the table with the contents of the input text file.

By default, the files are imported to a collection directory (the next section). To import into a SQLite database, use the switch --database.

The collection directory

The imported files are grouped in sub-folders called "collections" under the data folder.

tree data

data
└── default
    ├── 25/cc/
    │   └── 25cc123e-66c5-35ac-8b32-bc8ef803abdf
    │       ├── data.tsv
    │       ├── meta.txt
    │       └── text.txt
    └── 80/a7/
        └── 80a7c9ab-8f0c-31ab-9f72-251367cf6557
            ├── data.tsv
            ├── meta.txt
            └── text.txt

Where:

data.tsv: is a tab-separated value file containing the bag of words.
meta.txt: is a key-value file containing some properties.
text.txt: is a copy of the input text file.

And where:

default: is the default collection name.
25/cc/: is a subdirectory to support more than 21845.33 (2^16/3) per directory on VFAT file system.
25cc123e-66c5-35ac-8b32-bc8ef803abdf: is a UUIDv8 derived from the file content.

The maximum number of files per directory on VFAT is 21845.33 (2^16/3). The maximum number of entries is 2^16 on VFAT. Each file on VFAT occupies at least 3 entries: 1 entry for short name (DOS compatible), and at least 2 entries for 2 long name (up to 20 entries = 255 chars).

The `data.tsv` file

The data.tsv file is a tab-separated value file containing a bag of words.

Fields:

TOKEN: is a word, a punctuation symbol or a <EOL> symbol.
TYPE: is the token type -- the result of toascii(tolower(token)).
COUNT: is the number of occurencies of the token in the text.
RATIO: is the COUNT divided by the number of all tokens in the text.
FORMAT: is a POSIX character classes:
- 'W' for [:alpha:], matching "word"
- 'N' for [:digit:], matching "1234"
- 'P' for [:punct:], matching "!?"
- 'NA' for none of the above.
CASE: is one of these letter cases:
- 'L' for lower-case, matching "word" and "word-wOrd"
- 'U' for upper-case, matching "WORD" and "WORD-wOrd"
- 'S' for start-case, matching "Word" and "Word-wOrd"
- 'C' for camel-case, matching "wordWord" and "WordWord"
- 'NA' for none of the above.
LENGTH: is the number of characters in the token.
INDEXES: is the comma-separated list of all positions of a token in the text.

Where:

<eol>: is a symbol for the end of line.
'NA': is a missing value borrowed from R language.

The fields shown in the output is controlled by passing a list of --fields.

If a token has only 1 character and this character is an uppercase letter, then this token is treated as a capitalized word.

Only line endings (\n) and tabs (\t) separate records and fields, respectively. No quotes are used, despite Github complaints about "unclosed quoted fields". If you are curiouse aboute TSV files, read this.

This is an example of a bag of words generated by this program:

TOKEN	COUNT	RATIO	CLASS	CASE	LENGTH	INDEXES
Doutrina	1	0.013698630	A	C	8	52
professor	1	0.013698630	A	L	9	33
Itália	1	0.013698630	A	C	6	8
consultor	1	0.013698630	A	L	9	40
Congregação	2	0.027397260	A	C	11	43,49
em	2	0.027397260	A	L	2	10,69
artigo	1	0.013698630	A	L	6	59
hoje	1	0.013698630	A	L	4	48
incorpora	1	0.013698630	A	L	9	60
texto	1	0.013698630	A	L	5	61
para	1	0.013698630	A	L	4	50
(	3	0.041095890	P	NA	1	3,12,47
Franciscanos	1	0.013698630	A	C	12	24
)	3	0.041095890	P	NA	1	14,15,55
XVIII	1	0.013698630	A	U	5	29
,	5	0.068493151	P	NA	1	5,7,20,25,65
.	3	0.041095890	P	NA	1	30,56,72
público	1	0.013698630	A	L	7	71
também	1	0.013698630	A	L	6	32
Ferraris	1	0.013698630	A	C	8	2
Encyclopedia	1	0.013698630	A	C	12	64
Ofício	1	0.013698630	A	C	6	46
?	1	0.013698630	P	NA	1	13
dos	1	0.013698630	A	L	3	23
da	6	0.082191781	A	L	2	21,36,41,42,53,62
de	1	0.013698630	A	L	2	67
Santo	1	0.013698630	A	C	5	45
sacerdote	1	0.013698630	A	L	9	18
Fé	1	0.013698630	A	C	2	54
canonista	1	0.013698630	A	L	9	26
Lucius	1	0.013698630	A	C	6	1
um	1	0.013698630	A	L	2	17
domínio	1	0.013698630	A	L	7	70
Alexandria	1	0.013698630	A	C	10	6
do	2	0.027397260	A	L	2	27,44
italiano	1	0.013698630	A	L	8	19
Solero	1	0.013698630	A	C	6	4
século	1	0.013698630	A	L	6	28
1913	1	0.013698630	D	NA	4	68
ordem	1	0.013698630	A	L	5	38
Provincial	1	0.013698630	A	C	10	35
a	1	0.013698630	A	L	1	51
Catholic	1	0.013698630	A	C	8	63
Este	1	0.013698630	A	C	4	58
e	2	0.027397260	A	L	1	34,39
Ordem	1	0.013698630	A	C	5	22
Foi	1	0.013698630	A	C	3	31
publicação	1	0.013698630	A	L	10	66
sua	1	0.013698630	A	L	3	37
<EOL>	2	0.027397260	NA	NA	5	57,73
1763	1	0.013698630	D	NA	4	11
foi	1	0.013698630	A	L	3	16
falecido	1	0.013698630	A	L	8	9

The text was extracted from a random Wikipedia page in Portuguese.

Note that letter case, punctuation, end of line, and stop words are preserved. They can be transformed or removed passing a list of --options.

License

This project is Open Source software released under the MIT license.

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
bin		bin
data/default		data/default
lib/lang/pt		lib/lang/pt
samples		samples
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
abw.conf		abw.conf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ABW

Usage

Process files

Separate outputs

List files

Show files

Search in files

Import files

The collection database

The collection directory

The `data.tsv` file

License

About

Releases

Packages

Languages

License

fabiolimace/abw

Folders and files

Latest commit

History

Repository files navigation

ABW

Usage

Process files

Separate outputs

List files

Show files

Search in files

Import files

The collection database

The collection directory

The data.tsv file

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

The `data.tsv` file

Packages