Skip to content

Latest commit

 

History

History
135 lines (102 loc) · 6.48 KB

search.md

File metadata and controls

135 lines (102 loc) · 6.48 KB

xan search

Search for (or replace) patterns in CSV data. That is to say keep rows of given
CSV file if ANY of the selected column matches the given pattern or patterns.

This command has several flags to select the way to perform a match:

    * (default): matching a substring (e.g. "john" in "My name is john")
    * -e, --exact: exact match
    * -r, --regex: using a regular expression
    * -u, --url-prefix: matching by url prefix (e.g. "lemonde.fr/business")
    * -N, --non-empty: finding non-empty cells (does not need a pattern)
    * -E, --empty: finding empty cells (does not need a pattern)

Searching for rows with any column containing "john":

    $ xan search "john" file.csv > matches.csv

Searching for rows where any column has *exactly* the value "john":

    $ xan search -e "john" file.csv > matches.csv

Keeping only rows where selection is not fully empty:

    $ xan search -s user_id --non-empty file.csv > users-with-id.csv

Keeping only rows where selection has any empty column:

    $ xan search -s user_id --empty file.csv > users-without-id.csv

When using a regular expression, be sure to mind bash escape rules (prefer single
quotes around your expression and don't forget to use backslashes when needed):

    $ xan search -r '\bfran[cç]' file.csv

To restrict the columns that will be searched you can use the -s, --select flag.

All search modes (except -u/--url-prefix) can also be case-insensitive
using -i, --ignore-case.

This command is also able to search for multiple patterns at once.
To do so, you must give a text file with one pattern per line to the --patterns
flag, or a CSV file containing a column of to indicate using --pattern-column.

One pattern per line of text file:

    $ xan search --patterns patterns.txt file.csv > matches.csv

CSV column containing patterns:

    $ xan search --patterns people.csv --pattern-column name tweets.csv > matches.csv

Feeding patterns through stdin (using "-"):

    $ cat patterns.txt | xan search --patterns - file.csv > matches.csv

Feeding CSV column as patterns through stdin (using "-"):

    $ xan slice -l 10 people.csv | xan search --patterns - --pattern-column name file.csv > matches.csv

This command can also count the number of matches and report it in a new column,
using the -c/--count flag.

Finally, this command is able to replace matched values through the -R/--replace
flag and the --replacement-column flag when combined with --patterns & --pattern-column.

Cleaning thousands separators (usually commas "," in English) from numerical columns:

    $ xan search , --replace . -s 'count_*' file.csv

Replacing color names to their French counterpart:

    $ echo 'english,french\nred,rouge\ngreen,vert' | \
    $ xan search -e \
    $   --patterns - --pattern-column english --replacement-column french \
    $   -s color file.csv > translated.csv

Usage:
    xan search [options] --non-empty [<input>]
    xan search [options] --empty [<input>]
    xan search [options] --patterns <index> [<input>]
    xan search [options] <pattern> [<input>]
    xan search --help

search options:
    -e, --exact                  Perform an exact match.
    -r, --regex                  Use a regex to perform the match.
    -E, --empty                  Search for empty cells, i.e. filter out
                                 any completely non-empty selection.
    -N, --non-empty              Search for non-empty cells, i.e. filter out
                                 any completely empty selection.
    -u, --url-prefix             Match by url prefix, i.e. cells must contain urls
                                 matching the searched url prefix. Urls are first
                                 reordered using a scheme called a LRU, that you can
                                 read about here:
                                 https://github.com/medialab/ural?tab=readme-ov-file#about-lrus
    --patterns <path>            Path to a text file (use "-" for stdin), containing multiple
                                 patterns, one per line, to search at once.
    --pattern-column <name>      When given a column name, --patterns file will be considered a CSV
                                 and patterns to search will be extracted from the given column.
    --replacement-column <name>  When given with both --patterns & --pattern-column, indicates the
                                 column containing a replacement when a match occurs. Does not
                                 work with -R/--replace.
    -i, --ignore-case            Case insensitive search.
    -s, --select <arg>           Select the columns to search. See 'xan select -h'
                                 for the full syntax.
    -v, --invert-match           Select only rows that did not match
    -A, --all                    Only return a row when ALL columns from the given selection
                                 match the desired pattern, instead of returning a row
                                 when ANY column matches.
    -c, --count <column>         If given, the command will not filter rows but will instead
                                 count the total number of non-overlapping pattern matches per
                                 row and report it in a new column with given name.
                                 Does not work with -v/--invert-match.
    -R, --replace <with>         If given, the command will not filter rows but will instead
                                 replace matches with the given replacement.
                                 Does not work with --replacement-column.
    --overlapping                When used with -c/--count, return the count of overlapping
                                 matches. Note that this can sometimes be one order of magnitude
                                 slower that counting non-overlapping matches.
    -l, --limit <n>              Maximum of number rows to return. Useful to avoid downstream
                                 buffering some times (e.g. when searching for very few
                                 rows in a big file before piping to `view` or `flatten`).

Common options:
    -h, --help             Display this message
    -o, --output <file>    Write output to <file> instead of stdout.
    -n, --no-headers       When set, the first row will not be interpreted
                           as headers. (i.e., They are not searched, analyzed,
                           sliced, etc.)
    -d, --delimiter <arg>  The field delimiter for reading CSV data.
                           Must be a single character.