Missing MSAs in the provided raw msa data #180

Open
tpdmskim opened this issue Feb 11, 2025 · 0 comments

Hi,

I downloaded the raw MSA files you provided using the following commands:

wget https://boltz1.s3.us-east-2.amazonaws.com/rcsb_raw_msa.tar
tar -xf rcsb_raw_msa.tar
rm rcsb_raw_msa.tar

After extracting the archive, I noticed that MSAs are missing for some sequences, even though structural data for those sequences exists (in rcsb_processed_targets/structures/*.npz).

Upon checking, I found approximately 16,130 sequences that are present in the structure files but have no corresponding raw MSA data.
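
A check of this kind can be sketched roughly as follows. This is only a minimal illustration, not the exact script used for the count above. It assumes that each raw MSA file is named after its sequence ID (the part of the file name before the first dot) and that the list of expected IDs has already been collected from the structure files. The function name `find_missing_msa_ids` and the file layout are hypothetical.

```python
from pathlib import Path


def find_missing_msa_ids(expected_ids, msa_dir):
    """Return the expected sequence IDs that have no MSA file in msa_dir.

    Assumption (hypothetical layout): every MSA file is named
    "<sequence_id>.<extension>", so the ID is the text before the first dot.
    """
    msa_dir = Path(msa_dir)
    # IDs for which at least one file exists in the MSA directory.
    present = {p.name.split(".", 1)[0] for p in msa_dir.iterdir() if p.is_file()}
    # Set difference: expected by the structures, but absent from the MSA data.
    return sorted(set(expected_ids) - present)
```

Calling `find_missing_msa_ids(ids_from_structures, "rcsb_raw_msa")` would then return the IDs without a matching file, where `ids_from_structures` stands for whatever ID list was extracted from the .npz files.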

To illustrate the issue, here are some sequences that appear to be missing from the raw MSA dataset:

id,sequence
25194588de88b5cded80db552b93c98f00928c8d73fa69bc76edfce6581b8f70,GSPEFSLDVRQEELGAVVDKEMAATSAAIEDAVRRIEDMMNQARHASSGVKLEVNERILNSCTDLMKAIRLLVTTSTSLQKEIVESGRGAATQQEFYAKNSRWTEGLISASKAVGWGATQLVEAADKVVLHTGKYEELIVCSHEIAASTAQLVAASKVKANKHSPHLSRLQECSRTVNERAANVVASTKSGQEQIEDRDTMDFSGL
8d86d4a88063d08a61a75672326b45dae86206057c63052bb343a2a29a930603,MSLKDVSLSSFDAHDLDLDKFPEVVRDRLTQFLDAQELTIADIGAPVTDAVAHLRSFVLNGGKRIRPLYAWAGFLAAQGHKNSSEKLESVLDAAASLEFIQACALIHDDIIDSSDTRRGAPTVHRAVEADHRANNFEGDPEHFGVSVSILAGDMALVWAEDMLQDSGLSAEALARTRDAWRGMRTEVIGGQLLDIYLESHANESVELADSVNRFKTAAYTIARPLHLGASIAGGSPQLIDALLHYGHDIGIAFQLRDDLLGVFGDPAITGKPAGDDIREGKRTVLLALALQRADKQSPEAATAIRAGVGKVTSPEDIAVITEHIRATGAEEEVEQRISQLTESGLAHLDDVDIPDEVRAQLRALAIRSTERREGHHHHHH
88bd00693617e61f245336c86cdcaa313cb9103d8dfb1417c451eb48581b1027,MKIEEGKLVIWINGDKGYNGLAEVGKKFEKDTGIKVTVEHPDKLEEKFPQVAATGDGPDIIFWAHDRFGGYAQSGLLAEITPAAAFQDKLYPFTWDAVRYNGKLIAYPIAVEALSLIYNKDLLPNPPKTWEEIPALDKELKAKGKSALMFNLQEPYFTWPLIAADGGYAFKYENGKYDIKDVGVDNAGAKAGLTFLVDLIKNKHMNADTDYSIAEAAFNKGETAMTINGPWAWSNIDTSAVNYGVTVLPTFKGQPSKPFVGVLSAGINAASPNKELAKEFLENYLLTDEGLEAVNKDKPLGAVALKSYEEELAKDPRIAATMENAQKGEIMPNIPQMSAFWYAVRTAVINAASGRQTVDAALAAAQTNAAAMARFEDPTRRPYKLPDLCTELNTSLQDIEITCVYCKTVLELTEVFEFARKDLFVVYRDSIPHAACHKCIDFYSRIRELRHYSDSVYGDTLEKLTNTGLYNLLIRCLRCQKPLNPAEKLRHLNEKRRFHNIAGHYRGQCHSCCNRARQERLQRGSAAAESSELTFQELLGERR
022c5e46aa4d58f82ea0b1dcb834398f4f1195826519611b0447b8b5e3536ef3,MERDGCAGGGSRGGEGGRAPREGPAGNGRDRGRSHAAEAPGDPQAAASLLAPMDVGEEPLEKAARARTAKDPNTYKVLSLVLSVCVLTTILGCIFGLKPSCAKEVKSCKGRCFERTFGNCRCDAACVELGNCCLDYQETCIEPEHIWTCNKFRCGEKRLTRSLCACSDDCKDKGDCCINYSSVCQGEKSWVEEPCESINEPQCPAGFETPPTLLFSLDGFRAEYLHTWGGLLPVISKLKKCGTYTKNMRPVYPTKTFPNHYSIVTGLYPESHGIIDNKMYDPKMNASFSLKSKEKFNPEWYKGEPIWVTAKYQGLKSGTFFWPGSDVEINGIFPDIYKMYNGSVPFEERILAVLQWLQLPKDERPHFYTLYLEEPDSSGHSYGPVSSEVIKALQRVDGMVGMLMDGLKELNLHRCLNLILISDHGMEQGSCKKYIYLNKYLGDVKNIKVIYGPAARLRPSDVPDKYYSFNYEGIARNLSCREPNQHFKPYLKHFLPKRLHFAKSDRIEPLTFYLDPQWQLALNPSERKYCGSGFHGSDNVFSNMQALFVGYGPGFKHGIEADTFENIEVYNLMCDLLNLTPAPNNGTHGSLNHLLKNPVYTPKHPKEVHPLVQCPFTRNPRDNLGCSCNPSILPIEDFQTQFNLTVAEEKIIKHETLPYGRPRVLQKENTICLLSQHQFMSGYSQDILMPLWTSYTVDRNDSFSTEDFSNCLYQDFRIPLSPVHKCSFYKNNTKVSYGFLSPPQLNKNSSGIYSEALLTTNIVPMYQSFQVIWRYFHDTLLRKYAEERNGVNVVSGPVFDFDYDGRCDSLENLRQKRRVIRNQEILIPTHFFIVLTSCKDTSQTPLHCENLDTLAFILPHRTDNSESCVHGKHDSSWVEELLMLHRARITDVEHITGLSFYQQRKEPVSDILKLKTHLPTFSQED
e7c9b9684782cf14c45d492f2db59bb891e75fa420a0f9dc20006e8ee4a4f341,MGSSHHHHHHSSGLVPRGSHMRMLPSFLALLLGSGLAFNAQANTSTLKVCAASDEMPYSNKQQEGFENQLAKILADTMDRELEFVWSDKAAIFLVTEKLLKNQCDVVMGVDKGDPRVATSDPYYKSGYAFIYPADKGLDIKNWQSPALKDMSKFAIVPGSPSEVMLREIDKYEGNFNYTMSLIGFKSRRNQYVRYAPDLLVSEVVSGKADIAHIWAPEAARYVKSASVPLKMVVSEEIAPTRDGEGVRQQFEQSIAVRSDDQELLKEINTALHKADPKIKAVLKDEGIPLL
c34e68840e1b514f02e2b06a18e871b663f9aeeb29769cbcbf0ab10269e0cc12,QLQLVETGGGLVKPGGSLRLSCVVSGFTFDDYRMAWVRQAPGKELEWVSSIDSWSINTYYEDSVKGRFTISTDNAKNTLYLQMSSLKPEDTAVYYCAAEDRLGVPTINAHPSKYDYNYWGQGTQVTVSS
6df645d3a8b0f9462ae8babcb971fb4200f2d7952a30e05ca5d8bd84b246f7ca,VERDKYANFTINFTMENQIHTGMEYDNGRFIGVKFKSVTFKDSVFKECYFEDVTSSNTFFRNCTFINTVFYNTDLFEYKFVNSRLINSTFLHNKEGTSPSASGGS

Could you please confirm whether these MSAs were intentionally excluded, or whether there is an error in the dataset?

Thank you!
