-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathREADME
124 lines (93 loc) · 4.87 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
libCSD is a C++ library providing some different techniques for managing
string dictionaries in compressed space. These approaches are inspired on
the paper:
==========================================================================
"Compressed String Dictionaries"
Nieves R. Brisaboa, Rodrigo Cánovas, Francisco Claude,
Miguel A. Martínez-Prieto, and Gonzalo Navarro
10th Symposium on Experimental Algorithms (SEA'2011), p.136-147, 2011.
==========================================================================
The current release is a beta version providing full functionality for
string location and ID extraction. Besides, some techniques support queries
by prefix and substrings. The set of techniques is described as follows:
- "Hash": implements four different techniques which combines hashing and
Huffman / Re-Pair compression.
- "FC": implements two different techniques based on Front-Coding
compression. PFC is a straightforward byte-oriented implementation,
and HTFC performs Hu-Tucker compression over the headers while
supports Huffman / Re-Pair compression for the internal strings.
- "RPDAC": uses Re-Pair for compressing the strings and.
- "FMI" : self-indexes the dictionary and provides all types of queries using
the FM-Index facilities.
- "XBW" : obtains a compressed trie of the dictionary and transforms it to
support all types of queries in highly compressed space.
The techniques support different parameterizations to achieve varied and
competitive space/time tradeoffs.
Compiling the library
====================
1) For building the libcds library, type "make" within the libcds folder.
2) Once built libcds, type "make" within the root folder and libCSD is ready.
Building a dictionary
=====================
The library provides a simple command-line script for building dictionaries:
./Build <type> <parameters> <in> <out>
- The first parameter chooses the <type> of dictionary to be built.
- The set of <parameters> provide specific configuration values.
- The <in> parameter locates the original dictionary file.
- The <out> parameter tells the destination path.
It is worth noting that "in" dictionary file must be lexicographically sorted
and strings must be ended with the '\0' ASCII char.
Examples:
=========
./Build 1 h 10 geonames dicts/geo.10
Builds a HASH dictionary for "geonames" and stores it as "dicts/geo.10". The
dictionary uses a hash table which hash 10% more cells than strings in
"geonames" and compresses them using Huffman (h).
./Build 4 r 16 geonames dicts/geo.16
Builds a HTFC (Hu-Tucker Front-Coding) dictionary for "geonames" and stores
it as "dicts/geo.16". The dictionary uses buckets of 16 strings and
compresses them using Re-Pair.
Testing a dictionary
====================
The library also provides a command-line script for testing purposes:
./Test <mode> <opt> <in> <file>
- This script supports four different <modes>:
- 'r' is used for running the test chosen in <opt>:
- 'l' (for testing locate), 'e' (extract).
- 'pl' (prefix location), 'pe' (prefix extraction).
- 'sl' (substring location), 'pe' (substring extraction).
- 'g' is used for generating a basic testbed comprising <opt> valid strings
for locate and <opt> valid IDs for extract.
- 'p' is used for generating a prefix testbed comprising 100,000 query
patterns of some different lengths. The <opt> parameter tells the
a reference string length, and the testbed obtain different files
with patterns longer and shorter than this length.
- 's' is used for generating a substring testbed with the features described
above.
- The parameter <in> locates the file storing the compressed string dictionary.
- The last <file> parameter locates the file comprising the testbed (for 'r'
mode) or the destination path for saving the corresponding testbed (modes
'g', 'p', or 's').
Examples:
=========
./Test r l dicts/geo.10 tests/geo.strings
Uses the testbed "tests/geo.strings" for analyzing how the dictionary stored
at "dicts/geo.10" behaves for 'locate'.
./Test g 100000 dicts/geo.10 tests/geo
Uses the dictionary stored at "dicts/geo.10" for generating a basic testbed
of 100,000 strings and IDs which are stored as "tests/geo".
./Test p 16 dicts/geo.10 tests/geo
Uses the dictionary stored at "dicts/geo.10" for generating a prefix testbed
comprising patterns of lengths longer and shorter than 16 chars. The
resulting pattern set are stored at "tests/geo".
If you find bugs or have any issue with library, please ask us. Enjoy the
library and if you find it useful for your research, please cite our paper:
@article{MPBCCN16,
author = "M.A. Mart{\'{\i}}nez-Prieto and N. Brisaboa and R. C{\'a}novas
and F. Claude and G. Navarro",
title = "Practical Compressed String Dictionaries",
booktitle = "Information Systems",
year = "2016",
volume = {56},
pages = "73--108"
}