This library contains some functionality commonly used at the Data and Web Science Research Group, University of Mannheim.
If you use Maven for building your project, just add
<dependency>
<groupId>de.uni-mannheim.informatik.dws.dwslab</groupId>
<artifactId>dwslib</artifactId>
<version>2.0.0</version>
</dependency>
to your pom.xml.
dwslib is versioned according to the Semantic Versioning guidelines (http://semver.org/). This means that it is safe to upgrade to new releases that differ only in the MINOR or PATCH component without modifying your code. However, new MAJOR versions may break backward compatibility.
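The compatibility rule above can be sketched as a small check. This is only an illustration: the class SemVerCheck and its methods are hypothetical names, not part of dwslib, and the sketch assumes plain MAJOR.MINOR.PATCH version strings.

```java
// Hypothetical helper, NOT part of dwslib: illustrates the Semantic
// Versioning rule for plain MAJOR.MINOR.PATCH version strings.
class SemVerCheck {

    // Extracts the MAJOR component of a version string such as "2.0.0".
    static int major(String version) {
        return Integer.parseInt(version.split("\\.")[0]);
    }

    // True if both versions share the same MAJOR component, i.e. moving
    // from one to the other should not require code changes.
    static boolean sameMajor(String current, String candidate) {
        return major(current) == major(candidate);
    }

    public static void main(String[] args) {
        System.out.println(sameMajor("2.0.0", "2.3.1"));
        System.out.println(sameMajor("2.0.0", "3.0.0"));
    }
}
```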
dwslib is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
- Processor: Framework that uses multiple parallel threads to process a list of objects. You have to implement both the filling of objectsToProcess and the actual processing.
- DomainUtil: Collection of frequently used functions for processing URLs (e.g. get PLD, Domain, Compress based on CC)
- InputUtil: Collection of frequently used functions for reading input files (e.g. get all files in a directory, get input stream for file)
- FileUtil: Collection of frequently used functions for handling files.
- BufferedChunkingWriter: A BufferedWriter (using GZIPOutputStream) that takes care of chunking the output into multiple files.
- LoadURI: URI shortener (both directions) based on the prefix.cc list
- Query: Virtuoso SPARQL query processor based on the direct JDBC driver (not the HTTP SPARQL endpoint, so there is no 1 million line limit)
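The pattern described for Processor (one step that fills the list of objects, one step that processes each object on several threads) can be sketched with the standard java.util.concurrent classes. This is a minimal sketch under stated assumptions, not the actual dwslib API: the class ParallelProcessorSketch and the method names fillListToProcess, process and run are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, NOT the dwslib Processor API.
abstract class ParallelProcessorSketch<T> {

    // Supplies the objects to process (the "filling" step).
    protected abstract List<T> fillListToProcess();

    // Handles one object; called from several threads, so it must be thread-safe.
    protected abstract void process(T object) throws Exception;

    // Submits every object to a fixed-size thread pool and waits for all tasks.
    public void run(int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> futures = new ArrayList<>();
            for (T object : fillListToProcess()) {
                Callable<Void> task = () -> {
                    process(object);
                    return null;
                };
                futures.add(pool.submit(task));
            }
            for (Future<Void> future : futures) {
                future.get(); // blocks until the task is done
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Processing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        AtomicInteger totalLength = new AtomicInteger();
        ParallelProcessorSketch<String> sketch = new ParallelProcessorSketch<String>() {
            @Override
            protected List<String> fillListToProcess() {
                return Arrays.asList("a", "bb", "ccc");
            }

            @Override
            protected void process(String object) {
                totalLength.addAndGet(object.length());
            }
        };
        sketch.run(2);
        System.out.println(totalLength.get());
    }
}
```

The sketch waits on every Future so that a failure in any task surfaces in the calling thread instead of being lost inside the pool.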
For usage information, run the following classes from the command line without arguments or with the -h option.
- QueryCLI: Command-line tool for querying Virtuoso triple stores via JDBC and writing the result to TSV files
- SplitGZIPFile: Command-line tool for splitting large GZIP files into smaller GZIP files of a given size
- Collection: collection helper methods (sortHashMapByValue, ...)
- Counter: Counter for arbitrary objects (similar to Python's Counter)
- MyFileReader: UTF-8 file reader that reads line by line; also reads UTF-8 files separated by tabs (or any other character)
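To illustrate what a Python-like counter for arbitrary objects looks like in Java, here is a minimal sketch built on HashMap. It is an assumption-laden illustration and not the dwslib Counter API: the class CounterSketch and its methods add, get and mostCommon are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT the dwslib Counter API.
class CounterSketch<T> {

    private final Map<T, Integer> counts = new HashMap<>();

    // Increments the count of the given item by one.
    void add(T item) {
        counts.merge(item, 1, Integer::sum);
    }

    // Returns the count of the item, or 0 if it was never added.
    int get(T item) {
        return counts.getOrDefault(item, 0);
    }

    // Returns up to n entries sorted by descending count.
    // The order of entries with equal counts is unspecified.
    List<Map.Entry<T, Integer>> mostCommon(int n) {
        List<Map.Entry<T, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<T, Integer>comparingByValue().reversed());
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        CounterSketch<String> counter = new CounterSketch<>();
        for (String word : new String[] {"a", "b", "a", "c", "a", "b"}) {
            counter.add(word);
        }
        System.out.println(counter.mostCommon(2));
    }
}
```

Sorting the entries by value is also the idea behind a helper such as sortHashMapByValue, although the actual dwslib implementation may differ.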