dfeb477 · Mar 26, 2025

History

History
49 lines (40 loc) · 2.19 KB

url-join.md

File metadata and controls

49 lines (40 loc) · 2.19 KB

xan url-join

Join a CSV file containing a column of url prefixes with another CSV file.

By default, this command performs an 'inner join', which
means only matched rows will be written to the output. Use the --left
flag if you want to perform a 'left join' and keep every row of the searched
file in the output.
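
The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the join semantics described above, not how xan implements them: the `url_join` function, its arguments and the naive `startswith` matching are all hypothetical, and emitting one output row per matching prefix is an assumption.

```python
def url_join(rows, prefixes, left=False):
    """Join (url, extra) rows with a {prefix: label} mapping.

    Hypothetical helper: uses naive string-prefix matching and emits
    one output row per matching prefix (an assumption, not xan's logic).
    """
    out = []
    for url, extra in rows:
        labels = [label for prefix, label in prefixes.items() if url.startswith(prefix)]
        if labels:
            # Matched rows are written in both inner and left joins.
            for label in labels:
                out.append((url, extra, label))
        elif left:
            # Left join: keep the unmatched row, padded with an empty cell.
            out.append((url, extra, ""))
    return out


rows = [("https://lemonde.fr/politique", "a"), ("https://example.com/", "b")]
prefixes = {"https://lemonde.fr": "Le Monde"}

inner = url_join(rows, prefixes)
outer = url_join(rows, prefixes, left=True)
```

Here `inner` only contains the row matching `https://lemonde.fr`, while `outer` also keeps the `https://example.com/` row with an empty padding cell on the right.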

The file containing urls will always be read entirely into memory
while the second one will always be streamed.

You can of course work on gzipped files if needed and feed one of the two
files from stdin by using `-` instead of a path.

Note that this command reorders the urls hierarchically and indexes them
in a prefix tree. This reordering scheme is named LRUs and you can read about
it here: https://github.com/medialab/ural#about-lrus
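
To give an intuition of why this reordering helps, here is a minimal Python sketch of a prefix tree built over LRU-like tokens. It is a simplification under stated assumptions, not the actual implementation used by xan or ural: `lru_tokens` and `PrefixTree` are hypothetical names, the `s:`/`h:`/`p:` token markers are illustrative, ports, queries and fragments are ignored, urls are assumed to carry a scheme, and returning the longest matching prefix is an assumption.

```python
from urllib.parse import urlsplit


def lru_tokens(url):
    """Reorder a url from least to most specific part.

    Simplified: scheme, then host labels reversed, then path segments.
    """
    parts = urlsplit(url)
    tokens = ["s:" + parts.scheme]
    host = parts.hostname or ""
    tokens += ["h:" + label for label in reversed(host.split(".")) if label]
    tokens += ["p:" + segment for segment in parts.path.split("/") if segment]
    return tokens


class PrefixTree:
    """Hypothetical trie keyed by LRU-like tokens."""

    def __init__(self):
        self.root = {}

    def add(self, url, value):
        node = self.root
        for token in lru_tokens(url):
            node = node.setdefault(token, {})
        node[None] = value  # the None key marks the end of an indexed prefix

    def longest_match(self, url):
        node = self.root
        best = node.get(None)
        for token in lru_tokens(url):
            node = node.get(token)
            if node is None:
                break
            if None in node:
                best = node[None]  # a deeper, more specific prefix matched
        return best


tree = PrefixTree()
tree.add("https://lemonde.fr", "Le Monde")
tree.add("https://lemonde.fr/politique", "Politique")

tokens = lru_tokens("https://www.lemonde.fr/politique/article.html")
section = tree.longest_match("https://lemonde.fr/politique/article.html")
site = tree.longest_match("https://www.lemonde.fr/sport")
missing = tree.longest_match("https://example.com/")
```

Because the host labels are reversed, `www.lemonde.fr` sits below `lemonde.fr` in the tree, so a prefix registered for the domain also matches its subdomains, which plain string-prefix matching on raw urls would not do.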

If you only need to filter rows of the second file and don't
actually need to join columns from the urls file, you should
probably use `xan search --url-prefix --patterns` instead.

Usage:
    xan url-join [options] <column> <input> <url-column> <urls>
    xan url-join --help

join options:
    -S, --simplified             Drop irrelevant parts of the urls, like the scheme,
                                 `www.` subdomains etc. to facilitate matches.
    --left                       Write every row from the input file in the output, with empty
                                 padding cells on the right when no url from the second
                                 file produced any match.
    -L, --prefix-left <prefix>   Add a prefix to the names of the columns in the
                                 searched file.
    -R, --prefix-right <prefix>  Add a prefix to the names of the columns in the
                                 patterns file.

Common options:
    -h, --help                  Display this message
    -o, --output <file>         Write output to <file> instead of stdout.
    -n, --no-headers            When set, the first row will not be interpreted
                                as headers. (i.e., They are not searched, analyzed,
                                sliced, etc.)
    -d, --delimiter <arg>       The field delimiter for reading CSV data.
                                Must be a single character.