My dream is if we could start a project at github/huggingface, called Mappa #2

Open
jbdatascience opened this issue May 6, 2023 · 83 comments

Comments

@jbdatascience
Copy link

In about a week I will retire from work, so from then on I can look more closely into these issues. But I think we are already on the right track!

A question: should we do all our communication via this GitHub channel? Or could we choose another one? Do you have a proposal? Please let me know.

My dream is if we could start a project at github/huggingface, called Mappa Mundi V2, run by us two and others, with the goal of doing ALL 1100 movie scripts from the IMSDb website.

We would have to replace the sentence simplification and sentence similarity portions of the old Mappa Mundi by these speedier, more precise versions.

On Wed, May 3, 2023 at 4:01 AM Jan Bours @.***> wrote:

Looks promising indeed. If you can get your DagAtlas algorithm to be speedier, then you can produce a lot more DAGs from texts, and that will make it more interesting for practical use cases, wouldn't it?



Originally posted by @rrtucci in #1 (comment)

@rrtucci
Copy link
Owner

rrtucci commented May 6, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 7, 2023 via email

@jbdatascience
Copy link
Author

Interesting sentence simplification software!

I am thinking about our project.
Here is what I intend to do, along with some points worth considering:

  • Do we have a good definition of the ultimate goal of this project?
  • Reread your white paper “Causal DAG Extraction from 3 Short Stories and 3 Movie Scripts” to be able to completely understand the principles involved. I saw you have updated that document, but what are the changes?
  • Is the example of DEFT on tic-tac-toe games still useful to consider? Is it still relevant for the purpose of our project?
  • Proposal: to be able to check whether the complete MM algorithm functions as expected, it would be good to have a ground truth (texts for which we know the causal DAGs, i.e. the ground truth) against which we can check the DAGs discovered by the MM algorithm. I am not sure how to do this yet.
  • The output from the MM algorithm is the set of discovered DAGs (the DagAtlas). How could we interpret them? And what could be a practical use case for this DagAtlas? Perhaps we could describe that in more detail. How could existing LLMs be updated by the DagAtlas so that they could represent causal knowledge (in the most general form)?
  • Are there comparable, already existing algorithms that implement the functionality that MM has? How do they compare?

Please share your thoughts on this!

@jbdatascience
Copy link
Author

I posted this quote on LinkedIn (I thought you would like it!):

https://www.linkedin.com/posts/janbours_science-causality-knowledge-activity-7061268335266541568-qF8a

The power of science is its discovery of causal law.

Bertrand Russell. Human Knowledge: Its Scope and Limits. Simon and Schuster, 1948.

https://twitter.com/scitechtalk/status/1655502709267505154?s=46&t=8nmTR-4kmZaFFBCGObS6fA

https://twitter.com/artistexyz/status/1655576091149451269?s=46&t=8nmTR-4kmZaFFBCGObS6fA

Excellent quote. Thanks for sharing.

@rrtucci
Copy link
Owner

rrtucci commented May 8, 2023 via email

@jbdatascience
Copy link
Author

jbdatascience commented May 9, 2023

OK, I will also be on the lookout for a viable solution for sentence simplification. Could you specify what we could consider a good solution (and what not)?

Also, I am wondering why the existing simplification solution is not good enough.

@jbdatascience
Copy link
Author

Also, I am wondering what would be the optimal way of subdividing the text for the purpose of creating the nodes in the DAG. Right now we are using sentence clauses, but I can imagine other possible subdivisions of the text. There are many NLP applications where chunking is used for this.

@rrtucci
Copy link
Owner

rrtucci commented May 9, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 9, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 10, 2023 via email

@jbdatascience
Copy link
Author

OK, you gave a good description of what we expect from a good simplification algorithm.

I find it interesting that ChatGPT is able to produce good simplifications! Could there already be an open-source version of ChatGPT that we could use for this purpose?
Also, I wonder whether HuggingFace has any FREE algorithms for doing this? They have a large database of models, which I haven't explored fully up till now.
I will explore those 2 alternatives in search of a solution.

What is still not clear to me is what is wrong with the simplification method you are using at the moment. In what way is it not good enough?

I also have a question about the existing MM algorithm. When I look at the output of your notebook:
https://github.com/rrtucci/mappa_mundi/blob/master/jupyter_notebooks/navigating_short_stories.ipynb

and I look at the DAGs it produces at the bottom of that notebook, I am not sure how to interpret them correctly. Could you take one example and describe it fully, so it becomes clear how to interpret it?

@rrtucci
Copy link
Owner

rrtucci commented May 10, 2023 via email

@jbdatascience
Copy link
Author

OK, so we must have a new, better method of sentence simplification! That is clear to me now.

@jbdatascience
Copy link
Author

I did an interesting experiment with GPT-4 instead of ChatGPT, doing sentence simplification on the same example you give in your white paper:

PROMPT:
Consider the following quote:
“A range of mountains takes form in the haze.
Moving closer.
The mountains are piles of TRASH. The entire surface is nothing but waste.
“..We’re gonna find adventure in the evening air...”
A silhouetted city in the distance.
What looks like skyscrapers turns into trash.
Thousands of neatly stacked CUBES OF TRASH, stories high.
Rows and rows of stacked cubes, like city avenues. They go on for miles.”

Give an enumerated list summarizing this quote.
Each list item must be a simple declarative sentence consisting of a subject phrase and a verb phrase.
The list items must be in chronological order of occurrence in the quote.

REPLY:


  1. Mountains appear in the haze.
  2. Distance closes.
  3. Mountains transform into trash piles.
  4. Evening air promises adventure.
  5. Silhouetted city emerges.
  6. Skyscrapers reveal themselves as trash.
  7. Cubes of trash form structures.
  8. Trash cube rows extend for miles.

I think the quality of these sentence simplifications is even a lot better than that produced by ChatGPT!
What do you think?

@jbdatascience
Copy link
Author

I have found this tweet about Huggingface LLM support (it is a tweet from today, May 10th 2023!):

https://twitter.com/gpt_index/status/1656091102221570048?s=20

"Today’s release highlight: Native @huggingface LLM support.
Want to run any open-source LLM from 🤗? (Camel, StableLM, alpaca, Dolly, etc.)

  • Customize model/tokenizer/chat prompts 📝
  • Full streaming support 🌊

Now you can in 🦙! (huge s/o to Logan)"

Then I looked further and found this at Huggingface:

https://huggingface.co/LLMs

https://github.com/deep-diver/LLM-As-Chatbot ("LLM as a Chatbot Service"), which comes with a complete notebook for using an LLM as a chatbot:
https://github.com/deep-diver/LLM-As-Chatbot/blob/main/notebooks/llm_as_chatbot_in_colab.ipynb
(raw version: https://raw.githubusercontent.com/deep-diver/LLM-As-Chatbot/main/notebooks/llm_as_chatbot_in_colab.ipynb)

I am trying to get this to work and will then prompt it with your sentence simplification example from your white paper.
Up till now no luck, but that is a technical issue. I will try to solve it (and search Huggingface for other FREE implementations).

But what is more important: I have tried your prompt in a number of different online ChatGPT-like LLMs, and even the simpler ones give excellent replies to your sentence simplification prompt!
See and try for yourself:

https://poe.com/Sage
https://poe.com/Claude%2B (3 FREE messages per day)
https://poe.com/GPT-4 (1 FREE message per day <----- !!!!!!!)
https://poe.com/Claude-instant
https://poe.com/ChatGPT (FREE !!!!!)
https://poe.com/Dragonfly
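
As a follow-up on the idea of searching Hugging Face for FREE implementations: here is a minimal, untested sketch of how we could push your simplification prompt through an open model with the transformers library. The choice of databricks/dolly-v2-3b is only an example (Dolly is one of the model families named in the tweet above); any open instruction-tuned checkpoint could be swapped in:

from transformers import pipeline

# Example open model; any instruction-tuned checkpoint from the Hub could be used.
generator = pipeline(
    "text-generation",
    model="databricks/dolly-v2-3b",
    device_map="auto",          # use a GPU if Colab gives us one
    trust_remote_code=True,     # dolly-v2 ships its own generation pipeline
)

prompt = (
    "Consider the following quote: 'A range of mountains takes form in the haze. ...'\n\n"
    "Give an enumerated list summarizing this quote. "
    "Each list item must be a simple declarative sentence consisting of a subject phrase "
    "and a verb phrase. The list items must be in chronological order of occurrence in the quote."
)

print(generator(prompt, max_new_tokens=200)[0]["generated_text"])

If something like this works well enough, the same call could be wrapped in a loop over all the clauses of a movie script.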

@jbdatascience
Copy link
Author

Also I am looking into this open-source ChatGPT variant:

StableVicuna: The New King of Open ChatGPTs?
https://www.youtube.com/watch?v=m_xD0algP4k

Accompanied by this Colab notebook, which uses a GPU T4 runtime (!):
YT Stable Vicuna13B 8bit in Colab.ipynb
https://colab.research.google.com/drive/1Kvf3qF1TXE-jR-N5G9z1XxVf5z-ljFt2?usp=sharing

It can handle a prompt like this:
raw_output = pipe(get_prompt('Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apple do they have?'))
parse_text(raw_output)

They started with 23 apples. After using 20 for lunch, they were left with 3 apples. Then, they bought 6 more apples, so now they have a total of 9 apples. Therefore, the cafeteria has 9 apples in total.

I think it will be a small step to get it to answer your sentence simplification prompt from your white paper, which I shall try with this notebook. (Up till now it crashes due to insufficient memory errors ...)

@rrtucci
Copy link
Owner

rrtucci commented May 10, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 10, 2023 via email

@jbdatascience
Copy link
Author

jbdatascience commented May 10, 2023

Also I am looking into this open-source ChatGPT variant:

StableVicuna: The New King of Open ChatGPTs? https://www.youtube.com/watch?v=m_xD0algP4k

Accompanied by this Colab notebook, which uses a GPU T4 runtime (!): YT Stable Vicuna13B 8bit in Colab.ipynb https://colab.research.google.com/drive/1Kvf3qF1TXE-jR-N5G9z1XxVf5z-ljFt2?usp=sharing

It can handle a prompt like this: raw_output = pipe(get_prompt('Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apple do they have?')) parse_text(raw_output)

They started with 23 apples. After using 20 for lunch, they were left with 3 apples. Then, they bought 6 more apples, so now they have a total of 9 apples. Therefore, the cafeteria has 9 apples in total.

I think it will be a small step to get it to answer your sentence simplification prompt from your white paper, which I shall try with this notebook. (Up till now it crashes due to insufficient memory errors ...)

MY QUESTION ABOUT THIS, AND THE ANSWER I RECEIVED:

I was trying to run the Colab notebook but it crashes due to insufficient memory errors. Is it ONLY possible to run this on the PRO paid version of Colab? How could I run this on a free version of Colab, or perhaps even locally on a PC in a Jupyter notebook? Perhaps using a 4-bit version of the LLM, as you mention?

@samwitteveenai (34 minutes ago):
yeah unfortunately you need a GPU with a lot of VRAM to run this so Colab free isn't going to work.

@jbdatascience
Copy link
Author

jbdatascience commented May 11, 2023

At this moment I am looking into this list of Open Source implementations of LLMs:

GitHub - eugeneyan/open-llms: 🤖 A list of open LLMs available for commercial use.

https://github.com/eugeneyan/open-llms

Trying h2oGPT from that list online with your sentence simplification prompt gives this result (which is not bad I think):

http://pseudotensor.hopto.org:37860/

PROMPT:

Consider the following quote:
“A range of mountains takes form in the haze.
Moving closer.
The mountains are piles of TRASH. The entire surface is nothing but waste.
“…We’re gonna find adventure in the evening air…”
A silhouetted city in the distance.
What looks like skyscrapers turns into trash.
Thousands of neatly stacked CUBES OF TRASH, stories high.
Rows and rows of stacked cubes, like city avenues. They go on for miles.”

Give an enumerated list summarizing this quote.
Each list item must be a simple declarative sentence consisting of a subject phrase and a verb phrase.
The list items must be in chronological order of occurrence in the quote.

OUTPUT:

  1. A range of mountains takes form in the haze.
  2. Moving closer.
  3. The mountains are piles of TRASH.
  4. The entire surface is nothing but waste.
  5. “…We’re gonna find adventure in the evening air…”
  6. A silhouetted city in the distance.
  7. What looks like skyscrapers turns into trash.
  8. Thousands of neatly stacked CUBES OF TRASH, stories high.
  9. Rows and rows of stacked cubes, like city avenues. They go on for miles.

I think this would be acceptable.

Trying this h2oGPT in a notebook on Google Colab (also on Kaggle) gives a memory error.

Maybe we could try:
hivemind/gpt-j-6B-8bit · Hugging Face

https://huggingface.co/hivemind/gpt-j-6B-8bit?text=My+name+is+Mariama%2C+my+favorite

It uses some kind of compression (8-bit arithmetic, I think) so it fits within the free Colab version's memory limits. I will try it out in this Colab notebook:

finetune-gpt-j-6B-8bit.ipynb - Colaboratory

https://colab.research.google.com/drive/1ft6wQU0BhqG5PRlwgaZJv2VukKKjU4Es#scrollTo=DuW4H6HTS82r
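
For reference, here is a generic, untested sketch of the 8-bit loading route that transformers offers via bitsandbytes, which is the usual trick for squeezing a ~6B-parameter model into free-Colab GPU memory. Note this is not necessarily what the hivemind checkpoint or the linked notebook does internally, and EleutherAI/gpt-j-6b is just an example model name:

# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"   # example 6B-parameter model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize the weights to 8 bits at load time
    device_map="auto",   # place layers on the available GPU/CPU automatically
)

inputs = tokenizer("Mountains appear in the haze.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))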

@rrtucci
Copy link
Owner

rrtucci commented May 11, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 12, 2023 via email

@jbdatascience
Copy link
Author

Sounds good. It occurs to me that this WALL-E example is not very good because it doesn't have compound sentences, or complex sentences, or compound-complex sentences. It would be nice if the LLM would split those into simple sentences. To show you what I mean, I asked ChatGPT this: What is a simple, complex, compound and complex-compound sentence, with examples? Here is its answer:

Sure, I'd be happy to provide some examples of simple, complex, compound, and complex-compound sentences.

  1. Simple sentence: A simple sentence consists of a subject and a verb and expresses a complete thought. Example: The cat sat on the mat.
  2. Complex sentence: A complex sentence consists of an independent clause and a dependent clause. Example: Although I studied hard, I failed the exam.
  3. Compound sentence: A compound sentence consists of two or more independent clauses joined by a conjunction. Example: I wanted to go to the park, but it started raining.
  4. Complex-compound sentence: A complex-compound sentence consists of two or more independent clauses and at least one dependent clause. Example: After I finish my homework, I'll go for a walk, but only if it's not too late.

Note that these are just a few examples of each type of sentence, and there can be many variations and combinations of each type.

Then we have to search for more complex texts. Do you have any examples in mind?

@jbdatascience
Copy link
Author

jbdatascience commented May 13, 2023

This looks VERY promising. https://github.com/explosion/spacy-llm

That looks promising indeed! Have you tried it already?

In the meantime, I am looking into this interesting list in my search for a good open-source LLM that ideally can be run locally on your PC or mine to perform sentence simplification or other NLP tasks. (Our main goal is to be able to generate a DagAtlas based on, e.g., a large number of texts, so if we can accomplish that by running those LLMs locally for long enough, we can reach that goal!):

List of Open Sourced Fine-Tuned Large Language Models (LLM) | by Sung Kim | Geek Culture | Mar, 2023 | Medium

An incomplete list of open-sourced fine-tuned Large Language Models (LLM) you can run locally on your computer

https://medium.com/geekculture/list-of-open-sourced-fine-tuned-large-language-models-llm-8d95a2e0dc76

@rrtucci
Copy link
Owner

rrtucci commented May 13, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 13, 2023 via email

@jbdatascience
Copy link
Author

I am searching for a way to obtain texts for which we know the ground-truth causal relations, so we can test whether the MM algorithm finds these causal relations.

Now I am considering this idea💡:
would it be possible to have LLMs generate texts with custom, predefined causal relations in them? That would give us a way to test whether the MM algorithm can correctly come up with those causal relations in a DAG.

I would like to hear your comments. Do you consider this worth pursuing?
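
To make the idea concrete, here is a tiny hypothetical sketch of what I mean. We define a toy DAG by hand, turn each edge into a template sentence (or ask an LLM to embellish those sentences into a story), and the edge list then serves as the ground truth against which the DAG discovered by MM can be scored. All names and sentences below are invented for illustration:

# Hypothetical sketch: a hand-made ground-truth causal DAG plus a text
# that expresses exactly those causal links.
ground_truth_edges = [
    ("heavy rain", "the river floods"),
    ("the river floods", "the bridge collapses"),
    ("the bridge collapses", "the village is cut off"),
]

def edges_to_text(edges):
    # Turn each causal edge into a simple "because" sentence.
    return " ".join(f"Because of {cause}, {effect}." for cause, effect in edges)

story = edges_to_text(ground_truth_edges)
print(story)

# Later: feed `story` (or an LLM-embellished version of it) to MM and compare
# the extracted DAG against ground_truth_edges, e.g. via precision/recall
# over the edge set.

An LLM could then be prompted to rewrite the generated story in a more natural narrative style while preserving exactly these causal links.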

@rrtucci
Copy link
Owner

rrtucci commented May 13, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 13, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 13, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 19, 2023

Jan, I know I have cried wolf many times before, but this time I think I have really found a solution to the sentence simplification problem. And it's pretrained, so there is no need for a GPU card.
https://github.com/dair-iitd/imojie
https://arxiv.org/abs/2005.08178

@jbdatascience
Copy link
Author

Looks interesting, give it a try! But I think you should try the newer version as they mention (which is https://github.com/dair-iitd/openie6):


NEWEST VERSION:

[2010.03147] OpenIE6: Iterative Grid Labeling and Coordination Analysis for Open Information Extraction
https://arxiv.org/abs/2010.03147

CODE:
https://github.com/dair-iitd/openie6


OLDER VERSION:

[2005.08178] IMoJIE: Iterative Memory-Based Joint Open Information Extraction (OLDER VERSION, ALREADY SUPERSEDED BY OpenIE6!)
https://arxiv.org/abs/2005.08178
May 2020

CODE:
https://github.com/dair-iitd/imojie
Neural generation model for Open Information Extraction


@rrtucci
Copy link
Owner

rrtucci commented May 20, 2023 via email

@jbdatascience
Copy link
Author

Good catch. Yes, the newest version.

On Fri, May 19, 2023 at 5:31 PM Jan Bours @.> wrote: Looks interesting, give it a try! But I think you should try the newer version as they mention (which is https://github.com/dair-iitd/openie6): …

Can you get it working? I am trying, but I am having trouble installing all the required Python packages (in Google Colab). I will try again later.

@rrtucci
Copy link
Owner

rrtucci commented May 21, 2023 via email

@jbdatascience
Copy link
Author

Sorry, I haven't tried. Will try soon. I'm a bit slow due to personal problems.

Take it easy, there is no reason to hurry. I will help you with this. I find this project extremely interesting, and now that I am retired from work I can devote a lot of time to it and would love to do that!

Over the last couple of days I have been searching for information about comparable projects that could help us with our own MM problem. I encountered quite a few, and I will compile a list of them here later.

@rrtucci
Copy link
Owner

rrtucci commented May 24, 2023

I've been trying to run both imojie and openie6 on Colab. I have failed so far; I get lots of errors loading the requirements.

This repo named openie6-revised seems interesting, because it claims to be as good as openie6 but much simpler, since it relies on HuggingFace tools that have only become available in the last year.
https://github.com/tkdrnjs0621/openie6-revised

@rrtucci
Copy link
Owner

rrtucci commented May 25, 2023

I am trying to clean up openie6-revised and train it on Colab. I started a new GitHub repo for this:
https://github.com/rrtucci/openie6-revised2
My computer doesn't have a GPU card, but Colab offers limited GPU for free.
Even if this doesn't work, it will be a good learning experience because it will teach me basic stuff about Colab, Torch, HuggingFace and transformers. I am currently a newbie in those areas but slowly learning.

@rrtucci
Copy link
Owner

rrtucci commented May 26, 2023

I've been reading the code for openie6-revised (based on BERT), googling everything I don't understand.
Hugging Face's articles on BERT are excellent. For example:

https://huggingface.co/blog/bert-101

https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb

https://huggingface.co/docs/transformers/training

According to these articles, BERT can be fine-tuned in 2-3 hours on a single GPU.
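
As a reference point for that claim, here is a minimal sketch of the standard fine-tuning loop from the Hugging Face Trainer tutorial linked above. The dataset ("imdb") and the 2-label classification head are placeholders for illustration; openie6-revised trains on an OpenIE-style dataset with its own label scheme:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("imdb")                      # placeholder dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)              # placeholder label count

args = TrainingArguments(output_dir="bert-finetuned",
                         per_device_train_batch_size=16,
                         num_train_epochs=1)

trainer = Trainer(model=model,
                  args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=tokenized["test"].select(range(500)))
trainer.train()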

@rrtucci
Copy link
Owner

rrtucci commented May 27, 2023

Openie6-revised is a BERT model fine-tuned with what is called an OpenIE dataset. I asked ChatGPT what an OpenIE dataset is, and it gave a nice answer. There are several OpenIE datasets; Openie6-revised uses the OpenIE4 dataset.

An OpenIE dataset is a collection of text data that has been annotated or labeled with open information extraction (OpenIE) outputs. It typically consists of a set of sentences or documents along with the extracted relations or facts from those texts.

The purpose of an OpenIE dataset is to provide a resource for training and evaluating OpenIE systems. These datasets enable researchers and developers to analyze and improve the performance of their OpenIE models by benchmarking them against the annotated data.

Each data point in an OpenIE dataset usually includes the original sentence or document from which the information is extracted and the corresponding extracted relations or facts. These relations are typically represented in a structured format, such as subject-predicate-object triples, where the subject, predicate, and object are the components of the extracted relation.

OpenIE datasets are valuable resources for various NLP tasks, including relation extraction, knowledge base construction, and information retrieval. They help in advancing the development of OpenIE systems by providing a standardized benchmark for evaluating their performance and facilitating the comparison of different approaches.

Several OpenIE datasets have been created and made publicly available, such as OpenIE-4, ReVerb, OIE2016, and NYTOpenIE. These datasets are often used for training and evaluating OpenIE models and can be accessed through online repositories or research platforms.
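
In other words, each record in such a dataset pairs a sentence with its extracted relations. A purely illustrative sketch (the field names are invented here; the real OpenIE4 files use their own tab-separated format):

# Illustrative only: one OpenIE-style data point as a Python structure.
data_point = {
    "sentence": "Barack Obama was born in Hawaii and served as US president.",
    "extractions": [
        ("Barack Obama", "was born in", "Hawaii"),
        ("Barack Obama", "served as", "US president"),
    ],
}

for subj, pred, obj in data_point["extractions"]:
    print(f"({subj}; {pred}; {obj})")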

@rrtucci
Copy link
Owner

rrtucci commented May 27, 2023

The same people that wrote Openie6 also wrote CaRB, which contains several OpenIE datasets:
https://github.com/dair-iitd/CaRB
https://github.com/dair-iitd/imojie
https://github.com/dair-iitd/openie6
https://github.com/dair-iitd

@rrtucci
Copy link
Owner

rrtucci commented May 27, 2023

Today, I discovered that Amazon offers a service similar to Google Colab, but with 15 GB of persistent storage:
https://studiolab.sagemaker.aws/

@jbdatascience
Copy link
Author

Today, I discovered that Amazon offers a service similar to Google Colab, but with 15 GB of persistent storage: https://studiolab.sagemaker.aws/

Interesting. I will certainly try that! I hope it circumvents the limited-memory problems of the free Google Colab, which I use most often. With the gigantic Large Language Models that I try to use in Google Colab, I reach the memory limits quite often. Will https://studiolab.sagemaker.aws/ be better at that?

@jbdatascience
Copy link
Author

jbdatascience commented May 27, 2023

For the last 2 weeks I have been experimenting with state-of-the-art FREE OPEN SOURCE Large Language Models. Not only out of pure curiosity: I also think they can contribute to the ultimate goal of our Mappa Mundi project!
It is a pity that these FREE OPEN SOURCE LLMs (there are a lot of them around these days, and nearly every day a new one appears!) often have bugs and produce a lot of errors, presumably because they are under heavy development at this moment. It did cost me a lot of time these last 2 weeks, so I hope this will get better over time.

The things I am working on now:

  • exploring what Free Open Source Large Language Models can mean for us

  • I think an important step in the existing code base of the Mappa Mundi project is the use of stemming and/or lemmatisation of the text. These techniques bring each word/term back to its most basic standard form (its "stem" or "lemma"). For example: the word "riding" will be transformed to "ride". The purpose of this step is to increase the quality of the embeddings: if you omit it, "riding" and "ride" (and other derivatives with a comparable semantic meaning) will get different tokens during tokenisation, and these different tokens will have no relation to each other. Because of that, they will get different vector embeddings during the embedding process, which gives a lower-quality embedding space, because all these different tokens should have been embedded to the SAME vector, since they are semantically the same!

My question: have you already incorporated this stemming and/or lemmatisation step in your code? It has an important effect on the quality of the vector embeddings based on it, and consequently also on the quality of the similarity calculations in the texts, because I will base these similarities on the vector embeddings of the texts!
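
As a minimal sketch of what I mean (using spaCy's lemmatizer here; NLTK's WordNetLemmatizer or a stemmer would also work, and en_core_web_sm is just the small English pipeline):

# Lemmatise a clause before tokenisation/embedding, so that "riding",
# "rides" and "rode" all map to the lemma "ride".
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatise(text: str) -> str:
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc if not token.is_punct)

print(lemmatise("He was riding while the mountains were turning into trash"))
# roughly: "he be ride while the mountain be turn into trash"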

  • Intuitively, I think causal relations in texts have something to do with textual entailments (text_1 entails text_2 if text_2 can be seen as a consequence of text_1).

See e.g.
https://en.wikipedia.org/wiki/Textual_entailment

Textual entailment (TE), also known as Natural Language Inference (NLI), in natural language processing is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h), respectively. Textual entailment is not the same as pure logical entailment – it has a more relaxed definition: "t entails h" (t ⇒ h) if, typically, a human reading t would infer that h is most likely true.[1] (Alternatively: t ⇒ h if and only if, typically, a human reading t would be justified in inferring the proposition expressed by h from the proposition expressed by t.[2]) The relation is directional because even if "t entails h", the reverse "h entails t" is much less certain.[3][4]

Determining whether this relationship holds is an informal task, one which sometimes overlaps with the formal tasks of formal semantics (satisfying a strict condition will usually imply satisfaction of a less strict conditioned); additionally, textual entailment partially subsumes word entailment.

Examples
Textual entailment can be illustrated with examples of three different relations:[5]

An example of a positive TE (text entails hypothesis) is:

text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man has good consequences.
An example of a negative TE (text contradicts hypothesis) is:

text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man has no consequences.
An example of a non-TE (text does not entail nor contradict) is:

text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man will make you a better person.

""

See also:
https://huggingface.co/EndMO/text-entailment-bert?text=I+like+you.+I+love+you

https://huggingface.co/roberta-large-mnli?text=The%2Bdog%2Bwas%2Blost.%2BNobody%2Blost%2Bany%2Banimal

https://medium.com/@zeyuanhu/recognizing-contextual-entailment-using-nneural-network-in-nlp-ea9c5f1a216a

Textual Entailment Recognition with Semantic Features from Empirical Text Representation
https://arxiv.org/abs/2210.09723

29 Textual Entailment
https://academic.oup.com/edited-volume/42643/chapter-abstract/358152055?redirectedFrom=fulltext

I think this will be important for our MM project.
Perhaps textual entailments are not the only causal relations we are after, but I do think a textual entailment is a particularly strong form of causal relation that can be present in texts! (See the small sketch below.)
Please give your vision!
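
As a minimal sketch of how such an entailment check could be scored with one of the off-the-shelf NLI models linked above (roberta-large-mnli; the label order below is the one documented on that model card, but it should be double-checked):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = "If you help the needy, God will reward you."
hypothesis = "Giving money to a poor man has good consequences."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# Label order for this checkpoint: contradiction, neutral, entailment.
for label, p in zip(["contradiction", "neutral", "entailment"], probs):
    print(f"{label}: {p:.3f}")

A high entailment probability between two sentence pieces could then be treated as (strong) evidence for a causal link between the corresponding nodes.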

  • I am searching for good texts to use for MM. By good I mean texts for which we can check whether MM extracts the correct causal relations. That means good texts have a known ground truth: we know the causal relations in them beforehand, before we let MM try to extract them. The goal is to test the workings of MM and check whether MM can indeed extract the correct causal relations.

  • Are there any other NLP applications that can perform the same function(s) that MM aims to perform?
    Here I think, e.g., CausalBERT could be comparable (or at least may be made to do comparable things).
    I am in the process of researching that.

See:
https://github.com/rpryzant

Pytorch implementation of "Adapting Text Embeddings for Causal Inference" (!):
https://github.com/rpryzant/causal-bert-pytorch
This seems a very interesting approach to me, because of the adapted embeddings for causal inference!

Tensorflow 2 implementation of Causal-BERT
https://github.com/rpryzant/causal-text-embeddings-tf2

https://raw.githubusercontent.com/rpryzant/causal-bert-pytorch/master/CausalBert.py

See also:
Find text features that are most related to an outcome, controlling for confounds.
https://github.com/rpryzant/deconfounded-lexicon-induction

Causal Effects of Linguistic Properties
Package for computing causal effects of text (as treatment)
https://github.com/rpryzant/causal-text

https://github.com/rpryzant/awesome-causality-algorithms#learning-causal-effects

From its "Natural Language Processing" table (Name / Paper / Code):

  • Causal Text Embeddings: Veitch, Victor, Dhanya Sridhar, and David M. Blei. "Using Text Embeddings for Causal Inference." arXiv preprint arXiv:1905.12741 (2019). Code: Python
  • Handling Missing/Noisy Treatment: Wood-Doughty, Zach, Ilya Shpitser, and Mark Dredze. "Challenges of Using Text Classifiers for Causal Inference." In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4586-4598. 2018. Code: Python
  • Conditional Treatment-adversarial Learning Based Matching: Yao, Liuyi, Sheng Li, Yaliang Li, Hongfei Xue, Jing Gao, and Aidong Zhang. "On the estimation of treatment effect with text covariates." In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 4106-4113. AAAI Press, 2019. Code: N/A
  • Causal Inferences Using Texts: Egami, Naoki, Christian J. Fong, Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. "How to make causal inferences using texts." arXiv preprint arXiv:1802.02163 (2018). Code: N/A
  • Causal FS for text classification: Michael J. Paul. "Feature selection as causal inference: experiments with text classification." Conference on Computational Natural Language Learning (CoNLL), Vancouver, Canada. August 2017. Code: N/A

@rrtucci
Copy link
Owner

rrtucci commented May 27, 2023

This textual entailment stuff sounds nice, but it is quite different from causal inference. It produces no DAGs and no DAG atlas, and these are the main objectives of MM!!

For now, my immediate goal is very well defined and reachable. I want to:

  1. Replace the simplification and sentence similarity operations of Mappa Mundi by something quicker and more efficient.
  2. Get the causal maps for all the IMSDb movie scripts.

I think BERT is good enough to achieve 1 and 2.

@jbdatascience
Copy link
Author

jbdatascience commented May 28, 2023

When we have divided up the original texts into pieces of sentences, we can do 2 things:

1 - After embedding the sentence pieces (for which we already have the open-source method from HuggingFace (HF)!), we can do the similarity searches using cosine similarity (as I showed in an earlier Python notebook posted here; see also the sketch at the end of this comment). I think this will be much more accurate than the NLTK method and a lot speedier too! Therefore I expect this will enable us to get the causal maps for many more, or even all, of the IMSDb movie scripts!

2 - My intention with textual entailment is to compare it against the results of step 1, because if 2 pieces of text show this entailment, then there is an unequivocal causal relationship between the 2! My goal with this is to test whether the method in step 1 discovers these same causally related pairs: does it discover all of them, just a few of them, or none at all? It seems a good testing method to me.

If in step 2 there are causally related pairs which we did not discover in step 1, I think we should decide what to do: add them to the set of discovered causally related pairs from step 1? I think we should, because they are causally related pairs for sure! Why should we not do that?
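
Here is the kind of thing I have in mind for step 1: a minimal sketch using the sentence-transformers library from HuggingFace. The model name all-MiniLM-L6-v2 is just a small example model, and the 0.7 threshold is an arbitrary placeholder:

# Embed sentence pieces and find similar pairs with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # example embedding model

pieces = [
    "Mountains appear in the haze.",
    "The mountains turn out to be piles of trash.",
    "A silhouetted city emerges in the distance.",
]

embeddings = model.encode(pieces, convert_to_tensor=True)
scores = util.cos_sim(embeddings, embeddings)

for i in range(len(pieces)):
    for j in range(i + 1, len(pieces)):
        if scores[i][j] > 0.7:   # arbitrary similarity threshold
            print(f"similar ({scores[i][j]:.2f}): {pieces[i]!r} <-> {pieces[j]!r}")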

@rrtucci
Copy link
Owner

rrtucci commented May 28, 2023

Like I said, this is nice for future research. But I want to do 1 and 2 first.

@jbdatascience
Copy link
Author

jbdatascience commented May 28, 2023

I will try to implement step 1 first and then see where it brings us! I am particularly focused on making sure we can validate (some of) the causal relations extracted by the MM algorithm just by looking at the texts of the 2 nodes involved.

In my view it is crucial that we are able to verify whether (a sample of) the MM-extracted causal pairs are defensible. Can you please comment on this? I find this very important and would like us to be on the same page about it!
Do you think this is possible or not? The causal relations found should be human-interpretable if need be, don't you think?

@rrtucci
Copy link
Owner

rrtucci commented May 28, 2023

I'm not interested in working on this right now. I want to finish what I started, instead of going off in a tangential direction before finishing what I started. When I was in high school, I did some art and took some art classes. I learned from the art teacher that the best art teachers don't tell their pupils what style to paint in. They just coach you in whatever style you choose. So I am not telling you not to do this. You should do whatever you think is best. I am just telling you that I will not participate in it because at the present time I want to do 1 and 2.

@rrtucci
Copy link
Owner

rrtucci commented May 28, 2023

I am trying to get in contact with the authors of Carb, Imojie and Openie6

@jbdatascience
Copy link
Author

I'm not interested in working on this right now. I want to finish what I started, instead of going off in a tangential direction before finishing what I started. When I was in high school, I did some art and took some art classes. I learned from the art teacher that the best art teachers don't tell their pupils what style to paint in. They just coach you in whatever style you choose. So I am not telling you not to do this. You should do whatever you think is best. I am just telling you that I will not participate in it because at the present time I want to do 1 and 2.

As I said: I will try to implement step 1 first and then see where it brings us!

@jbdatascience
Copy link
Author

I am trying to get in contact with the authors of Carb, Imojie and Openie6

A good step. It could be useful!

@rrtucci
Copy link
Owner

rrtucci commented Jun 3, 2023

I think I know what the problem with installing openie6 is. The latest pip versions are more strict than the older ones. It's a little risky, but it is possible to tell pip not to check for dependency conflicts as follows:

%pip install --no-deps -r requirements.txt

https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

@jbdatascience
Copy link
Author

Are you able to install openie6 using %pip install --no-deps -r requirements.txt ?

@rrtucci
Copy link
Owner

rrtucci commented Jun 4, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented Jun 9, 2023

Unfortunately, the people who wrote Openie6 are not very eager to help. I was trying to understand their code and simplify it, but it's very hard because:

  1. They provide no docstrings, and their code is somewhat sloppy and overly complicated.
  2. I'm a newbie to this. This is the first time I have used most of the libraries they use.
  3. Many of the library versions they use (e.g., torchtext, allennlp, pytorch-lightning, transformers) are very outdated.
  4. The code was designed to compare Openie6 with many other models across many benchmarks. So if all you want to do is use Openie6, without comparing or benchmarking it, large parts of the code are unnecessary.

So I've gone to plan B, which is to use the following:
https://github.com/tkdrnjs0621/openie6-revised/

@jbdatascience
Copy link
Author

I saw you created a new github repository: https://github.com/rrtucci/openie6A/tree/master, with
https://github.com/rrtucci/openie6A/tree/master/short_stories_clean
and
https://github.com/rrtucci/openie6A/tree/master/m_scripts_clean

Should I use these new cleaned texts as input to the new, yet-to-be-developed version of the Mappa Mundi algorithm (using vector embeddings and similarities based on those embeddings, which I am working on)? Or should I use the existing ones?

I have a week off for vacation and plan to continue working on the updated version of MM from next week on.
Please keep me updated! I will do the same!

@rrtucci
Copy link
Owner

rrtucci commented Jun 9, 2023

Openie6A resulted from my trying to update the libraries of Openie6 and simplify it by throwing out the benchmarking and comparison with other models. But I have totally given up on that.

My new plan is to study Openie6-revised
https://github.com/tkdrnjs0621/openie6-revised/tree/main

very carefully, googling every line of the program that I don't understand. Hopefully, once I learn it really well, I will be able to train it. I've joined 3 online sites that give a limited amount of free GPU time: Google Colab, Kaggle and Amazon Sagemaker Studio Lab. So even though my computer has no GPU, I believe it will be possible for me to train it on one of those 3 services.

@rrtucci
Copy link
Owner

rrtucci commented Jun 12, 2023

Good news. I just received this email from my friend Anton from St. Petersburg:
https://alexeyev.github.io/

Dear Dr. Tucci,

I've finally managed to run the trained coordination analysis model by Kolluru et al. 2020 (IGL-CA) on colab. Here's the notebook.
https://colab.research.google.com/drive/1samvO-SH6Xgjf9ItlhAF1EmBZo5grBQb?usp=sharing

I wish you luck with your project, best regards,
Anton M. Alekseev.

@rrtucci
Copy link
Owner

rrtucci commented Jun 17, 2023

The latest version of Mappa Mundi uses:

  • Openie6 for SOTA sentence simplification, and
  • an sBert sentence transformer for SOTA sentence similarity.

The new notebooks for the 3 PG Wodehouse short stories and the 3 Pixar movie scripts have been run on Google Colab with GPU.
