generate embeddings from multiple rdf/ttl #91
Hi @ilseva, thank you for reaching out. Currently, jRDF2vec works best if you use NT files. If you have multiple NT files, you can simply combine them into a single file. If you cannot transform your files into NT format, the process you described is correct. I cannot comment in detail on your query issue. I recommend considering:
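For reference, N-Triples is a line-based format (one triple per line), so several NT files can usually be combined with a plain concatenation before generating the walks; a minimal sketch with placeholder file names:

# merge several N-Triples files into a single graph file
cat part1.nt part2.nt part3.nt > merged.nt
java -jar jrdf2vec-1.2-SNAPSHOT.jar -graph merged.nt -onlyWalks -walkDirectory ./walks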
Thanks @janothan for your considerations. We think the query results are influenced by the number of our datasets, which is lower than the number of entries in the public vocabularies. We are still working on a PoC that aims to confirm whether our approach to building a semantic search engine is right. Another question, if you can help us: do you think it is useful to add concepts that belong to the ontology (as a kind of filter) to the query, i.e. class names, object property names, ...? Thanks!
Hello, first of all thank you for such a nice solution. I used one huge dataset of around 60 GB and performed some cleaning and preprocessing. After that I created a custom ontology for my knowledge graph and built the knowledge graph using Python rdflib. I then wanted to create embeddings, so I used your jRDF2vec model, but as mentioned the file sizes are big: even in TTL format they add up to nearly 15 GB (1.4 GB + 4.2 GB + 4.2 GB + 4.8 GB = 14.6 GB).
As mentioned by @janothan I was able to create NT files too (by default jRDF2vec also creates NT files from TTL), but due to Java heap memory exceptions I cannot walk all of the files in one go: I can only provide so much heap space, which was not enough for all of the files combined. So I generated walks for each of the 4 files separately and moved all the .gz files to custom folders to avoid overwriting. The problem is that the txt files are too big: one of the mergedWalks.txt files reached 53.5 GB. Now my questions are:
Thank you for this wonderful solution and detailed explanation.
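One general note on the Java heap errors mentioned above (the value and file names below are placeholders, not taken from this thread): the heap available to the JVM can be raised with the standard -Xmx option, provided the machine has that much RAM, for example:

# allow the JVM up to 100 GB of heap for the walk generation (placeholder value)
java -Xmx100g -jar jrdf2vec-1.2-SNAPSHOT.jar -graph graph.nt -onlyWalks -walkDirectory ./walks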
I am not sure whether I get the question. If you do not find class nodes helpful, you can filter them out. If you are thinking about datatype properties ("names" provided via
Very likely this will work. The RAM requirements are significantly higher for the walk generation in order to speed up the process. The actual training step is less memory-consuming.
I do not advise doing this. These are separate embedding spaces. You could concatenate vectors etc. but I think this is quite a dirty approach. One last remark:
Please note that this leads to a different outcome than generating walks for the merged graph! I am not saying that it will not work, but the walks will only be generated for each of the 4 files separately (rather than for the complete graph). If you have an insufficient amount of memory, you could consider loading the 4 files into one HDT or one TDB store. jRDF2vec can also handle graphs stored on disk. The memory will certainly be sufficient, but the walk generation will take significantly longer.
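A rough sketch of the TDB route, assuming Apache Jena's command-line tools are installed (paths and file names are placeholders):

# load the four N-Triples files into a single on-disk TDB store (Jena TDB1 bulk loader)
tdbloader --loc=/data/tdb_store file1.nt file2.nt file3.nt file4.nt
# assumption: the -graph option also accepts the TDB directory; check the jRDF2vec README for the exact invocation
java -jar jrdf2vec-1.2-SNAPSHOT.jar -graph /data/tdb_store -onlyWalks -walkDirectory ./walks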
@janothan unfortunately the supercomputer that I have access to works via SFTP protocols and I have limited access to it. On my personal laptop I have used Jena Fuseki, but my laptop only has 8 GB of RAM, so I can't imagine how much time it will take to generate walks for data of this size with 8 GB of RAM.
Sorry that I have asked so many questions, and I understand that some of them are not even directly related to your project, but I have to complete this project within a very short period of time and generate recommendations with it, and I am new to this kind of area. Thank you again.
There is no difference. TDB builds indices. Whatever input file you use, the result will be identical. (A relational database also does not care if you upload the data via CSV or TSV.)
The method call is identical.
"While generating walks from TDB, is there any chance that jRDF2vec encounters a heap memory error?" Unlikely.
I really can't answer this question for you. This depends on various factors such as the number of unique nodes in the graph. Just try it. On a general level: Do you use the
I can try to help you, just please understand that this is not my main job and that I may take some time to answer your questions.
Thank you for your time @janothan, it means a lot to me :) That clears up a lot, and I will try all of the things you mentioned.
Thanks for your time and useful hints @janothan!
Hi,
thanks for sharing your work.
We would like to use jRDF2Vec to generate embeddings as a knowledge base for a semantic search engine.
Our starting point is a custom ontology where some of the object properties refer to public vocabularies (in RDF format), like the Frequency Vocabulary, and some others to our custom vocabularies.
The approach we follow is:
1. For each input file, generate the walks (a shell sketch of steps 1 and 2 is given after this list): java -jar jrdf2vec-1.2-SNAPSHOT.jar -graph <ttl_file|rdf_file> -onlyWalks -walkDirectory <custom_folder>
2. Move each generated walk_file_0.txt.gz into a specific folder to avoid overwriting.
3. Merge the walks: java -jar jrdf2vec-1.2-SNAPSHOT.jar -mergeWalks -walkDirectory <specific_folder> -o <merged_walks>
4. Train the embeddings: java -jar jrdf2vec-1.2-SNAPSHOT.jar -onlyTraining -light entities.txt -minCount 5 -walkDirectory <specific_folder>
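A minimal shell sketch of steps 1 and 2 (file and folder names are placeholders):

# step 1: generate walks per input graph; step 2: collect them under ./all_walks
for g in custom_ontology.ttl public_vocab.rdf; do
  java -jar jrdf2vec-1.2-SNAPSHOT.jar -graph "$g" -onlyWalks -walkDirectory ./tmp_walks
  mkdir -p ./all_walks
  # prefix the walk files with the graph name so successive runs do not overwrite each other
  for w in ./tmp_walks/*.gz; do
    mv "$w" "./all_walks/$(basename "$g")_$(basename "$w")"
  done
done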
Is this process the correct one?
If not, could you point out how to change it?
Furthermore, using the example Jupyter Notebooks in your baseline, we tried to find the most similar "concepts" in our model, but we ran into the following unclear issue: if we build the query using keys that belong both to the public vocabularies and to our individuals, the results we obtain only refer to concepts similar to those of the public vocabularies (the expected concepts similar to our individuals do not seem to be considered).
Thanks for your support.
Sevastian