Speed up database construction pipeline #43

Merged 51 commits on Mar 14, 2024
Commits
c0d73e3
Merge pull request #8 from stijndcl/feature/rust-chunks
stijndcl Jan 31, 2024
df55342
Merge pull request #9 from stijndcl/feature/rust-dat-parser
stijndcl Jan 31, 2024
3f8e340
Try using rust binaries
stijndcl Oct 10, 2023
4551b6d
Actually download dependencies
stijndcl Oct 10, 2023
6389a08
Don't pull files from gh
stijndcl Oct 11, 2023
d9c00ec
Update comments
stijndcl Oct 16, 2023
e09fed1
Add path to xml file
stijndcl Oct 16, 2023
c5d33ec
Use local xml file
stijndcl Oct 16, 2023
7f5417c
Fix arguments
stijndcl Oct 16, 2023
07a7b9a
List root
stijndcl Oct 16, 2023
a0091c9
Remove ls
stijndcl Oct 16, 2023
b4750c4
Add java version of downloadless
stijndcl Oct 16, 2023
4791ac0
Don't remove xml file yet
stijndcl Oct 20, 2023
da7a00f
Use rust tools
stijndcl Oct 20, 2023
215c30a
Use pigz instead of gzip
stijndcl Oct 20, 2023
4adda03
Fix zcat
stijndcl Oct 20, 2023
72646b5
Usel z4
stijndcl Oct 20, 2023
be583a8
Fix typo
stijndcl Oct 20, 2023
e1766ca
Decompress multiple filesg
stijndcl Oct 20, 2023
dddf00d
Try something
stijndcl Oct 20, 2023
2d2e0c1
Fix find command
stijndcl Oct 20, 2023
ddd6316
Fix missing lz4 extensions
stijndcl Oct 20, 2023
4906468
Add debug prints
stijndcl Oct 20, 2023
942b03d
Fix typo
stijndcl Oct 20, 2023
b7da44c
Print taxon id
stijndcl Nov 14, 2023
3b2ca13
Stash
stijndcl Nov 14, 2023
83bce27
Remove all sorts
stijndcl Nov 14, 2023
a0fb218
use two files
stijndcl Nov 14, 2023
7b6de76
Fix typo
stijndcl Nov 14, 2023
f099c11
Fix typo
stijndcl Nov 14, 2023
9c41264
Fix typo
stijndcl Nov 14, 2023
66e3f1a
Fix compression
stijndcl Nov 14, 2023
bc7c07d
Remove debugging line
stijndcl Nov 14, 2023
820e638
Parallellize
stijndcl Nov 15, 2023
38a43ac
Remove awk
stijndcl Nov 16, 2023
98755e8
Add parallellization back
stijndcl Nov 16, 2023
6e41f80
Produce actual valid output
stijndcl Jan 11, 2024
87e65a7
Optimise number-sequences and remove todo
stijndcl Jan 23, 2024
1f5bdd9
Remove binaries
stijndcl Jan 31, 2024
fec706f
Remove all non-rust code from the pipeline
stijndcl Jan 31, 2024
9fd58af
Add lz4 to db scripts
stijndcl Jan 31, 2024
8c854b0
Create script to compile and move binaries, ignore them
stijndcl Feb 6, 2024
a378d2c
Remove hardcoded list
stijndcl Feb 6, 2024
9005cf6
Merge pull request #40 from stijndcl/feature/pure-rust
pverscha Mar 6, 2024
e90e7cc
Fix changes from Pieter
stijndcl Mar 8, 2024
d963063
Fix ifs quoting
stijndcl Mar 8, 2024
c8886bb
Merge pull request #42 from stijndcl/fix/rust-changes
pverscha Mar 14, 2024
25013e9
Merge branch 'master' into feature/stijn-changes
pverscha Mar 14, 2024
f8b6b74
Merge branch 'feature/stijn-changes' of https://github.com/unipept/ma…
pverscha Mar 14, 2024
409f0d5
UniProtType no longer required
pverscha Mar 14, 2024
9ea99e5
Fix formatting errors
pverscha Mar 14, 2024
1 change: 0 additions & 1 deletion .gitignore
@@ -18,4 +18,3 @@ out
scripts/helper_scripts/parser/output
scripts/helper_scripts/parser/src/META-INF
.idea/

19 changes: 19 additions & 0 deletions scripts/build_binaries.sh
@@ -0,0 +1,19 @@
#! /usr/bin/env bash

# All references to an external script should be relative to the location of this script.
# See: http://mywiki.wooledge.org/BashFAQ/028
CURRENT_LOCATION="${BASH_SOURCE%/*}"

checkdep() {
    which $1 > /dev/null 2>&1 || hash $1 > /dev/null 2>&1 || {
        echo "Unipept database builder requires ${2:-$1} to be installed." >&2
        exit 1
    }
}
}

checkdep cargo "Rust toolchain"

# Build binaries and copy them to the /helper_scripts folder
cd $CURRENT_LOCATION/helper_scripts/unipept-database-rs
cargo build --release
find ./target/release -maxdepth 1 -type f -executable -exec cp {} .. \;
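
The `checkdep` helper above probes for a tool with `which` and `hash` and exits the script when neither finds it. As a rough sketch only, the same check can be written with the `command -v` builtin and quoted arguments. The name `checkdep_quoted` and the choice to return a status instead of exiting are assumptions made for illustration; they are not part of this pull request.

```shell
#!/usr/bin/env bash

# Hypothetical variant of checkdep (not part of the PR).
# $1: command to look for, $2: optional human-readable name for the message.
checkdep_quoted() {
    # command -v succeeds only when the shell can resolve "$1" to something runnable.
    if ! command -v "$1" > /dev/null 2>&1; then
        echo "Unipept database builder requires ${2:-$1} to be installed." >&2
        # Return instead of exit so the caller decides how to react.
        return 1
    fi
}

# Example call: "sh" is used here only because it exists on most systems.
checkdep_quoted sh "a POSIX shell" && echo "sh found"
```

A caller that wants the original behaviour could write `checkdep_quoted cargo "Rust toolchain" || exit 1`.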
308 changes: 138 additions & 170 deletions scripts/build_database.sh

Large diffs are not rendered by default.

9 changes: 9 additions & 0 deletions scripts/helper_scripts/.gitignore
@@ -0,0 +1,9 @@
# Ignore the compiled binaries that get moved here
dat-parser
functional-analysis
lcas
taxa-by-chunk
taxons-lineages
taxons-uniprots-tables
write-to-chunk
xml-parser
73 changes: 0 additions & 73 deletions scripts/helper_scripts/FunctionalAnalysisPeptides.js

This file was deleted.

Binary file not shown.
Binary file not shown.
10 changes: 0 additions & 10 deletions scripts/helper_scripts/ParallelXmlToTab.js

This file was deleted.

50 changes: 0 additions & 50 deletions scripts/helper_scripts/TaxaByChunk.js

This file was deleted.

Binary file removed scripts/helper_scripts/TaxonsUniprots2Tables.jar
Binary file not shown.
49 changes: 0 additions & 49 deletions scripts/helper_scripts/WriteToChunk.js

This file was deleted.

Binary file removed scripts/helper_scripts/XmlToTabConverter.jar
Binary file not shown.
8 changes: 4 additions & 4 deletions scripts/helper_scripts/filter_taxa.sh
@@ -14,7 +14,7 @@ mkdir -p "$TMP_DIR"

 filter_taxa() {
     QUERY=$(echo "\s$1\s" | sed "s/,/\\\s\\\|\\\s/g")
-    RESULT=$(cat "$LINEAGE_ARCHIVE" | zcat | grep "$QUERY" | cut -f1 | sort -n | uniq | tr '\n' ',')
+    RESULT=$(lz4 -dc "$LINEAGE_ARCHIVE" | grep "$QUERY" | cut -f1 | sort -n | uniq | tr '\n' ',')
     echo "$RESULT"
 }
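
The `QUERY` line in `filter_taxa` turns a comma-separated list of taxon IDs into a single grep pattern in which every ID must have whitespace on both sides. The sketch below reproduces that transformation on made-up data. `build_query` is a hypothetical helper name, the sample rows are invented, and the `\s` and `\|` operators assume GNU grep.

```shell
#!/usr/bin/env bash

# build_query is a hypothetical helper mirroring the QUERY line in filter_taxa.
build_query() {
    # Wrap the list in \s...\s, then turn every comma into \s\|\s,
    # so "1,2" becomes the alternation \s1\s\|\s2\s.
    printf '%s\n' "\\s$1\\s" | sed 's/,/\\s\\|\\s/g'
}

# Show the generated pattern.
build_query "1,2"

# Made-up rows: the first column is a row ID, the other columns are taxon IDs.
# Only rows holding 1 or 2 between two whitespace characters are kept,
# which here means the rows whose first column is 10 and 30.
printf '10\t1\t5\n20\t3\t4\n30\t2\t9\n' | grep "$(build_query "1,2")" | cut -f1
```

Because the pattern requires whitespace on both sides, an ID in the first or last column of a row would not match under this scheme.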

@@ -23,16 +23,16 @@ then
     TAXA=$(filter_taxa "$TAXA")

     # This associative array maps a filename upon the taxa that should be queried within this file
-    QUERIES=( $(echo "$TAXA" | tr "," "\n" | node "$CURRENT_LOCATION/TaxaByChunk.js" "$DATABASE_INDEX" "$TMP_DIR") )
+    QUERIES=( $(echo "$TAXA" | tr "," "\n" | $CURRENT_LOCATION/taxa-by-chunk --chunk-dir "$DATABASE_INDEX" --temp-dir "$TMP_DIR") )

     if [[ ${#QUERIES[@]} -gt 0 ]]
     then
-        parallel --jobs 8 --max-args 2 "cat {2} | zcat | sed 's/$/$/' | grep -F -f {1} | sed 's/\$$//'" ::: "${QUERIES[@]}"
+        parallel --jobs 8 --max-args 2 "lz4 -dc {2} | sed 's/$/$/' | grep -F -f {1} | sed 's/\$$//'" ::: "${QUERIES[@]}"
     fi
 else

     # If the root ID has been passed to this script, we simply print out all database items (without filtering).
-    find "$DATABASE_INDEX" -name "*.chunk.gz" | xargs zcat
+    find "$DATABASE_INDEX" -name "*.chunk.lz4" -exec lz4 -mdc {} +
 fi

 # Remove temporary files
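
The `parallel` command in this hunk appends a literal `$` to every decompressed line, filters with `grep -F -f`, and strips the `$` again. One plausible reading is that this lets fixed-string patterns ending in `$` match only at the end of a line, which plain `grep -F` cannot express. The sketch below demonstrates that technique in isolation. The function name `match_at_line_end`, the pattern file contents, and the sample rows are all assumptions; the actual pattern format written by `taxa-by-chunk` is not shown in this diff.

```shell
#!/usr/bin/env bash

# Hypothetical helper isolating the sed | grep -F | sed pipeline from filter_taxa.sh.
# $1: file with one fixed-string pattern per line; input lines come from stdin.
match_at_line_end() {
    # 1. Append a literal "$" to every line.
    # 2. Keep lines containing any fixed-string pattern from the file.
    # 3. Remove the trailing "$" that step 1 added.
    sed 's/$/$/' | grep -F -f "$1" | sed 's/\$$//'
}

# Assumed pattern format: a tab, an ID, and a literal "$" marking the line end.
patterns=$(mktemp)
printf '\t42$\n' > "$patterns"

# Only the first row ends in <tab>42; the others contain 42 elsewhere.
printf 'a\t42\nb\t420\nc\t42\tx\n' | match_at_line_end "$patterns"

rm -f "$patterns"
```

With fixed strings, grep skips regular-expression matching entirely, which is presumably why the pipeline anchors matches this way instead of using a regex with `$`.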
107 changes: 0 additions & 107 deletions scripts/helper_scripts/parser/pom.xml

This file was deleted.

This file was deleted.

This file was deleted.

This file was deleted.

This file was deleted.
