My dream is if we could start a project at github/huggingface, called Mappa Mundi V2 #2
Comments
I think communicating via this GitHub channel has worked well so far, because it has shown, with great transparency to anyone interested in this project, what obstacles might arise. So for now, I am in favor of continuing our communications this way, except perhaps we might open more than one issue.
I also wanted to suggest that if you can convince one or more Dutch companies to sponsor us, we might be able to make a small non-profit company out of this. I suspect that there are a lot of European companies that want to compete with USA/OpenAI/Google in the LLM arena.
|
Hi Jan,
I've been learning about sentence simplification and decided to do it with
this software:
https://github.com/amanbasu/sentence-simplification
The BERT model is not bleeding edge, but it will be very easy to implement
and is free under a permissive license. If BERT proves insufficient, then
we can try a more sophisticated approach later on. Even if we trade BERT
for a more modern model in the future, we will learn a lot of practical
Transformer usage tips from the BERT experience.
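As a rough sketch of how a simplification model like this could slot into our pipeline (this is not the repo's actual interface; the checkpoint name below is a placeholder, and the "simplify: " prefix is our own hypothetical convention):

```python
# Minimal sketch, assuming a HuggingFace seq2seq checkpoint fine-tuned for
# simplification. "t5-small" is only a stand-in so the code runs; we would
# swap in whatever model we actually fine-tune.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL = "t5-small"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def simplify(sentence: str) -> str:
    # T5-style models are conditioned with a task prefix; "simplify: " is hypothetical.
    inputs = tokenizer("simplify: " + sentence, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(simplify("Although I studied hard, I failed the exam."))
```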
|
Interesting sentence simplification software! I am thinking about our project. I intend to do the following, and these points are worth considering:
- Do we have a good definition of the ultimate goal of this project?
- Reread your white paper “Causal DAG Extraction from 3 Short Stories and 3 Movie Scripts” to completely understand the principles involved. I saw you have updated that document, but what are the changes?
- Is the example of DEFT on tic-tac-toe games still useful to consider? Is it still relevant for the purpose of our project?
- Proposal: to check whether the complete MM algorithm functions as expected, it would be good to have a ground truth (texts for which we know the causal DAGs) against which we can check the DAGs discovered by the MM algorithm. I am not sure how to do this yet.
- The output of the MM algorithm is the set of discovered DAGs (the DagAtlas). How should we interpret them? And what could be a practical use case for this DagAtlas? Perhaps we could describe that in more detail. How could existing LLMs be updated by the DagAtlas so that they could represent causal knowledge (in the most general form)?
- Are there comparable, already existing algorithms that implement the functionality that MM has? How do they compare?
Please share your thoughts on this! |
I posted this quote on LinkedIn (I thought you would like it!): "The power of science is its discovery of causal law." Bertrand Russell, Human Knowledge: Its Scope and Limits, Simon and Schuster, 1948. https://twitter.com/scitechtalk/status/1655502709267505154?s=46&t=8nmTR-4kmZaFFBCGObS6fA https://twitter.com/artistexyz/status/1655576091149451269?s=46&t=8nmTR-4kmZaFFBCGObS6fA |
Excellent quote. Thanks for sharing. |
After looking more carefully at the last software I proposed we use for
sentence simplification, I realized that it doesn't quite do what we need.
It simplifies a complicated sentence, but it doesn't split it into simpler
sentences, which is what we need. So I am still looking around for a
viable solution for sentence simplification.
"Do we have a good definition of the ultimate goal of this project?"
The immediate goal is to get the 1100 movies processed, and to reduce the
sources of noise (like bad sentence simplification and bad similarity
evaluation) as much as possible. Will we get a strong signal then, or is
this algo junk? That is what we will find out.
" I saw you have updated that document, but what are the changes?"
Small changes to improve clarity. None significant. I am always trying to
improve the clarity of what I write. That's me.
If we ever finish a new, better Mappa Mundi, that will warrant a new paper
explaining the changes.
"Is the example of DEFT on tic-tac-toe games still useful to consider?"
Not really. The old paper has some good ideas, but the Mappa Mundi
algorithm is different and better. The tic-tac-toe example was simple to
consider at the time, 7 months ago, but it now feels very artificial to me,
compared to the movie scripts example.
The other questions that you ask will best be answered as we go along.
|
OK, I will also be on the lookout for a viable solution for sentence simplification. Could you specify what we could consider a good solution (and what not)? Also, I am wondering why the existing simplification solution is not good enough. |
Also I am wondering what would be the optimal way of subdividing the text for the purpose of creating the nodes in the DAG. Now we are using sentence clauses, but I can imagine other possible subdivisions of the text. There are many instances of NLP applications where chunking is used for this. |
When the time comes to evaluate the similarity of 2 nodes, if the nodes represent very long compound sentences, the nodes will not be very well separated in similarity space.
Chunking and breaking into clauses is too crude and often leads to unintelligible sets of words for each node. What is needed is to break every compound sentence into a set of simple, easily understandable, declarative sentences without loss of meaning, and call each of those simple sentences a node.
ChatGPT does this 1-to-N sentence conversion really well. I give an example in the white paper.
If we had money, we could just ask ChatGPT to do it. I think ChatGPT could do it for $10-$100 per movie.
What I would love to find is a FREE way of doing what ChatGPT does so well. I've scoured GitHub trying to find a decent solution. Some people have tried to do this in various ways, but their methods have huge failure rates, like 30%.
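To make the cost trade-off concrete, here is a minimal sketch of the 1-to-N conversion as an API call (this assumes a paid OpenAI key, which is exactly the money problem above; the prompt wording follows the white-paper example, and the 2023-era openai package interface is used):

```python
# Hedged sketch: ask a ChatGPT-class model to split a passage into simple
# declarative sentences. Requires a paid API key; not a free solution.
import openai

openai.api_key = "sk-..."  # your OpenAI key

PROMPT = (
    "Give an enumerated list summarizing this quote. "
    "Each list item must be a simple declarative sentence consisting of a "
    "subject phrase and a verb phrase. The list items must be in "
    "chronological order of occurrence in the quote.\n\nQuote: {quote}"
)

def split_into_simple_sentences(quote: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # the ChatGPT model; GPT-4 does this even better
        messages=[{"role": "user", "content": PROMPT.format(quote=quote)}],
        temperature=0,  # deterministic output for reproducible preprocessing
    )
    return response["choices"][0]["message"]["content"]
```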
|
This sounds very promising.
https://arxiv.org/pdf/2304.07774v1.pdf
|
This looks promising
https://github.com/serenayj/ABCD-ACL2021
|
OK, you gave a good description of what we expect from a good simplification algorithm. I find it interesting that ChatGPT is able to produce good simplifications! Could there already be an open-source version of ChatGPT that we could use for this purpose? Also, I wonder whether HuggingFace has any FREE algorithms for doing this. They host a large collection of models that I have not fully explored yet. I will explore those 2 alternatives in search of a solution. What is still not clear to me is what is wrong with the simplification method you are using at the moment. In what way is it not good enough? I also have a question about the existing MM algorithm. When I look at the output of your notebook, https://github.com/rrtucci/mappa_mundi/blob/master/jupyter_notebooks/navigating_short_stories.ipynb, and at the DAGs it produces at the bottom of that notebook, I am not sure how to interpret them correctly. Could you take 1 example and describe it fully so it becomes clear how to interpret it? |
"What still is not clear to me what is wrong with the simplification method
you are using at this moment? I what way is it not good enough?"
It's too crude. Ideally, you want each node to be assigned a
simple, intelligible sentence clearly describing an event. For example,
"Bob ate a green apple". All the sentences in a movie script must be
reduced to simple sentences like that.
The similarity of two sentences only works well if those two sentences are
simple.
"I look at the DAGs it produces at the bottom of that notebook, I am not
sure how I can interpret them correctly"
They don't make sense because the current method of simplification of
sentences is not very good. That introduces a lot of noise into the
calculation of the similarity of 2 nodes.
|
OK, so we must have a new, better method of sentence simplification! That is clear to me now. |
I did an interesting experiment with GPT-4 instead of ChatGPT, doing sentence simplification on the same example you give in your white paper:
PROMPT:
Consider the following quote:
“A range of mountains takes form in the haze.
Moving closer.
The mountains are piles of TRASH. The entire surface is nothing but waste.
“..We’re gonna find adventure in the evening air...”
A silhouetted city in the distance.
What looks like skyscrapers turns into trash.
Thousands of neatly stacked CUBES OF TRASH, stories high.
Rows and rows of stacked cubes, like city avenues. They go on for miles.”
Give an enumerated list summarizing this quote.
Each list item must be a simple declarative sentence consisting of a subject phrase and a verb phrase.
The list items must be in chronological order of occurrence in the quote.
REPLY:
This bot may make incorrect statements. It does not have knowledge of events after 2021. Tap to learn more.
1. Mountains appear in the haze.
2. Distance closes.
3. Mountains transform into trash piles.
4. Evening air promises adventure.
5. Silhouetted city emerges.
6. Skyscrapers reveal themselves as trash.
7. Cubes of trash form structures.
8. Trash cube rows extend for miles.
I think the quality of these sentence simplifications is even a lot better than that produced by ChatGPT! What do you think? |
I have found this tweet from HuggingFace (it is a tweet from today, May 10th 2023!): https://twitter.com/gpt_index/status/1656091102221570048?s=20
"Today’s release highlight: Native @huggingface LLM support.
Want to run *any* open-source LLM from 🤗? (Camel, StableLM, alpaca, Dolly, etc.)
- Customize model/tokenizer/chat prompts 📝
- Full streaming support 🌊
Now you can in 🦙! (huge s/o to Logan)"
Then I looked further and found this at HuggingFace:
https://huggingface.co/LLMs
https://github.com/deep-diver/LLM-As-Chatbot (LLM as a Chatbot Service), with this complete notebook for using an LLM as a chatbot:
https://github.com/deep-diver/LLM-As-Chatbot/blob/main/notebooks/llm_as_chatbot_in_colab.ipynb
I am trying to get this to work, and will then prompt it with your sentence simplification example from your white paper. Up till now no luck, but that is a technical issue. I will try to solve it (and search HuggingFace for other FREE implementations). But what is more important: I have tried your prompt in a number of different online ChatGPT-like LLMs, and even the simpler ones give excellent replies to your sentence simplification prompt! See and try for yourself:
https://poe.com/Sage
https://poe.com/Claude%2B (3 FREE messages per day)
https://poe.com/GPT-4 (1 FREE message per day <----- !!!!!!!)
https://poe.com/Claude-instant
https://poe.com/ChatGPT (FREE !!!!!)
https://poe.com/Dragonfly |
Also, I am looking into this open-source ChatGPT variant: StableVicuna: The New King of Open ChatGPTs? It is accompanied by a Colab notebook which uses a GPU T4 runtime (!). It can handle a prompt like this: They started with 23 apples. After using 20 for lunch, they were left with 3 apples. Then, they
I think it will be a small step to get it to answer your sentence simplification prompt from your white paper, which I shall try with this notebook. (Up till now it crashes due to insufficient memory errors ...) |
Yes! I agree. GPT-4 does sentence simplification better than ChatGPT!
Excellent experiment. Thanks.
|
Nice! Thanks. Sentence Vectors seem to be part of the foundation of most
LLM software, and of some "Knowledge Graph" software, and now of some
Causal Inference software. A pleasant surprise.
|
ANSWER TO MY QUESTION ABOUT THIS: I was trying to run the Colab notebook, but it crashes due to insufficient memory errors. Is it ONLY possible to run this on the PRO paid version of Colab? How could I run this on a free version of Colab, or perhaps even locally on a PC in a Jupyter notebook? Perhaps using a 4-bit version of the LLM, as you mention? @samwitteveenai |
At this moment I am looking into this list of open-source implementations of LLMs: GitHub - eugeneyan/open-llms: 🤖 A list of open LLMs available for commercial use. https://github.com/eugeneyan/open-llms
Trying h2oGPT from that list online with your sentence simplification prompt gives this result (which is not bad, I think): http://pseudotensor.hopto.org:37860/
PROMPT:
“Consider the following quote:
“A range of mountains takes form in the haze.
Moving closer.
The mountains are piles of TRASH. The entire surface is nothing but waste.
“…We’re gonna find adventure in the evening air…”
A silhouetted city in the distance.
What looks like skyscrapers turns into trash.
Thousands of neatly stacked CUBES OF TRASH, stories high.
Rows and rows of stacked cubes, like city avenues. They go on for miles.”
Give an enumerated list summarizing this quote.
Each list item must be a simple declarative sentence consisting of a subject phrase and a verb phrase.
The list items must be in chronological order of occurrence in the quote.”
OUTPUT:
“1. A range of mountains takes form in the haze.
2. Moving closer.
3. The mountains are piles of TRASH.
4. The entire surface is nothing but waste.
5. “…We’re gonna find adventure in the evening air…”
6. A silhouetted city in the distance.
7. What looks like skyscrapers turns into trash.
8. Thousands of neatly stacked CUBES OF TRASH, stories high.
9. Rows and rows of stacked cubes, like city avenues. They go on for miles.”
I think this would be acceptable. Trying this h2oGPT in a notebook on Google Colab (also on Kaggle) gives memory errors. Maybe we could try: https://huggingface.co/hivemind/gpt-j-6B-8bit?text=My+name+is+Mariama%2C+my+favorite
It uses some kind of compression (8-bit arithmetic, I think), so it fits within the memory of the free Colab version. I will try it out in this Colab notebook: finetune-gpt-j-6B-8bit.ipynb - Colaboratory, https://colab.research.google.com/drive/1ft6wQU0BhqG5PRlwgaZJv2VukKKjU4Es#scrollTo=DuW4H6HTS82r
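Before running the notebook, a minimal sketch of the generic 8-bit loading route in transformers may help (this assumes a CUDA GPU plus the accelerate and bitsandbytes packages; the hivemind checkpoint above ships its own custom loading code, so this shows the standard path instead):

```python
# Hedged sketch: load a large causal LM with 8-bit weights so it fits in
# free-tier Colab GPU memory. Assumes: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # full-precision base model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate spread layers over GPU/CPU
    load_in_8bit=True,   # quantize weights to 8 bits via bitsandbytes
)

prompt = "Give an enumerated list summarizing this quote: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
|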
Sounds good. It occurs to me that this WALL-E example is not very good because it doesn't have compound sentences, complex sentences, or complex-compound sentences. It would be nice if the LLM would split those into simple sentences.
To show you what I mean, I asked ChatGPT this:
What is a simple, complex, compound and complex-compound sentence, with examples?
Here is its answer:
Sure, I'd be happy to provide some examples of simple, complex, compound, and complex-compound sentences.
1. Simple sentence: A simple sentence consists of a subject and a verb and expresses a complete thought.
Example: The cat sat on the mat.
2. Complex sentence: A complex sentence consists of an independent clause and a dependent clause.
Example: Although I studied hard, I failed the exam.
3. Compound sentence: A compound sentence consists of two or more independent clauses joined by a conjunction.
Example: I wanted to go to the park, but it started raining.
4. Complex-compound sentence: A complex-compound sentence consists of two or more independent clauses and at least one dependent clause.
Example: After I finish my homework, I'll go for a walk, but only if it's not too late.
Note that these are just a few examples of each type of sentence, and there can be many variations and combinations of each type.
|
This looks VERY promising.
https://github.com/explosion/spacy-llm
|
Then we have to search for more complex texts. Do you have any examples in mind? |
That looks promising indeed! Have you already tried it? In the meantime, I am looking into this interesting list in my search for a good open-source LLM, ideally one that can be run locally on your or my PC, to perform sentence simplification or other NLP tasks. (Our main goal is to be able to generate a DagAtlas based on, e.g., a large number of texts, so if we can accomplish that by letting those LLMs run locally for long enough, we can reach that goal!) List of Open Sourced Fine-Tuned Large Language Models (LLM) | by Sung Kim | Geek Culture | Mar, 2023 | Medium: an incomplete list of open-sourced fine-tuned Large Language Models (LLM) you can run locally on your computer. |
You can choose a paragraph from one of the P.G. Wodehouse stories. If you
want really long sentences, I believe Moby Dick by Herman Melville has
sentences that are several pages long :) Legal documents have very long
sentences too.
|
No. I put in a comment on Twitter to the developers of this software, but
got no reply. I've been trying very hard to get some sponsors for this
project, but no luck so far.
https://twitter.com/artistexyz/status/1657109267323801600
|
I am searching for a way to have texts for which we know the ground-truth causal relations, so we could test whether the MM algorithm can find these causal relations. Now I am considering this idea💡: would it be possible to have LLMs generate texts with custom, predefined causal relations in them? That would give us a way to test whether the MM algorithm could correctly come up with those causal relations in a DAG. I would like to hear your comments. Do you consider this worth pursuing? |
These are all interesting research questions to pursue in the future, but I think getting the 1100 movie scripts analyzed should be our immediate highest priority, because it's real-world data, not simulated data.
Maybe you could start with a 500-node DAG that has a simple sentence attached to each node, and ask an LLM to produce 999 new DAGs that differ from the original DAG by rephrasings of the sentences. Then take your 1000 DAGs and "flatten" them out by transforming them into 1000 short stories. Then input those 1000 short stories into Mappa Mundi, and see if it can retrieve the original 500-node DAG.
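A minimal sketch of that experiment's scaffolding, assuming networkx (the per-node sentences are placeholders, and the LLM rephrasing step is left as a stub):

```python
# Hedged sketch: build a random ground-truth DAG, attach a sentence to each
# node, and "flatten" it into stories by emitting sentences in random
# topological orders. An LLM would supply real rephrasings.
import random
import networkx as nx

def random_dag(num_nodes, edge_prob):
    # Edges only point from lower to higher index, which guarantees acyclicity.
    dag = nx.DiGraph()
    dag.add_nodes_from(range(num_nodes))
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if random.random() < edge_prob:
                dag.add_edge(i, j)
    return dag

def random_topological_order(dag):
    # Repeatedly pick a random node with no remaining parents.
    working = dag.copy()
    order = []
    while len(working) > 0:
        frontier = [n for n in working if working.in_degree(n) == 0]
        node = random.choice(frontier)
        order.append(node)
        working.remove_node(node)
    return order

def rephrase(sentence):
    return sentence  # stub: an LLM paraphrase call would go here

def make_story(dag, sentences):
    return " ".join(rephrase(sentences[n]) for n in random_topological_order(dag))

dag = random_dag(500, edge_prob=0.01)
sentences = {n: f"Event {n} happens." for n in dag.nodes}  # placeholder sentences
stories = [make_story(dag, sentences) for _ in range(1000)]
# Feed `stories` to Mappa Mundi and compare the DAG it recovers against `dag`.
```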
|
Another thing that occurs to me is to analyze 10,000 games of chess with
Mappa Mundi, and then ask a chess expert to judge whether the resulting
DAGs make sense to him/her. The ground truth in this case is the chess
expert :)
|
Note that for the chess example, the sentences are moves of chess pieces. There is no need for sentence simplification, and sentence similarity is just equality of moves. Hence the chess example requires no NLP whatsoever. It is trivial compared to the movie scripts example.
|
Jan, I know I have cried wolf many times before, but this time I think I have really found a solution to the sentence simplification problem. And it's pretrained, so there is no need for a GPU card. |
Looks interesting, give it a try! But I think you should try the newer version, as they mention (which is https://github.com/dair-iitd/openie6):
NEWEST VERSION: [2010.03147] OpenIE6: Iterative Grid Labeling and Coordination Analysis for Open Information Extraction, https://arxiv.org/abs/2010.03147 CODE: https://github.com/dair-iitd/openie6
OLDER VERSION: [2005.08178] IMoJIE: Iterative Memory-Based Joint Open Information Extraction (May 2020; already superseded by OpenIE6!), https://arxiv.org/abs/2005.08178 CODE: https://github.com/dair-iitd/imojie (neural generation model for Open Information Extraction) |
Good catch. Yes, the newest version.
|
Can you get it working? I am trying, but I am having trouble installing all the required Python packages (in Google Colab). I will try it later. |
Sorry, I haven't tried. Will try soon. I'm a bit slow due to personal
problems.
|
Take it easy, there is no reason to hurry. I will help you with this. I find this project extremely interesting, and now that I am retired from work I can devote a lot of time to it, and I would love to do that! The last couple of days I have been searching for information about comparable projects that could help us with our own MM problem. I encountered quite a few of them, and I will compile a list of them here later. |
I've been trying to run both imojie and openie6 on Colab. Failed so far; I get lots of errors loading the requirements. This repo named openie6-revised seems interesting, because it claims to be as good as openie6 but much simpler, since it relies on HuggingFace tools that have only become available in the last year. |
I am trying to clean up openie6-revised and train it on Colab. I have started a new GitHub repo for this. |
I've been reading the code for openie6-revised (based on BERT), googling everything I don't understand. https://huggingface.co/blog/bert-101 https://huggingface.co/docs/transformers/training According to these articles, BERT can be fine-tuned in 2-3 hours with a single GPU. |
Openie6-revised is a BERT model fine-tuned with what is called an OpenIE dataset. I asked ChatGPT what an OpenIE dataset is, and it gave a nice answer. There are several OpenIE datasets. Openie6-revised uses the OpenIE4 dataset. |
The same people that wrote Openie6 also wrote Carb, which contains several OpenIE datasets. |
Today, I discovered that Amazon offers a service similar to Google Colab, but with 15 GB of persistent storage. |
Interesting. I certainly will try that! I hope it circumvents the limited-memory problems of the free Google Colab, which I use most often. With the gigantic large language models that I try to use in Google Colab, I reach the memory limits quite often. Will https://studiolab.sagemaker.aws/ be better at that? |
For the last 2 weeks I have been experimenting with state-of-the-art large FREE OPEN SOURCE language models. Not only out of pure curiosity, but also because I think they can contribute to the ultimate goal of our Mappa Mundi project! The things I am working on now:
My question: have you already incorporated this stemming and/or lemmatisation step in your code? This has an important effect on the quality of the vector embeddings that are based on it, and consequently also on the quality of the calculation of similarities in the texts, because I will base these similarities on the vector embeddings of the texts!
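For concreteness, a minimal sketch of the lemmatisation step I mean, assuming spaCy with its small English model (stemming would use something like NLTK's PorterStemmer instead):

```python
# Hedged sketch: lemmatize a sentence before computing its embedding.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatize(text: str) -> str:
    # Replace every token by its lemma, e.g. "skyscrapers" -> "skyscraper".
    return " ".join(token.lemma_ for token in nlp(text))

print(lemmatize("What looks like skyscrapers turns into trash."))
# roughly: "what look like skyscraper turn into trash ."
```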
See e.g. Wikipedia on textual entailment: Textual entailment (TE), also known as Natural Language Inference (NLI), is, in natural language processing, a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h), respectively. Textual entailment is not the same as pure logical entailment; it has a more relaxed definition: "t entails h" (t ⇒ h) if, typically, a human reading t would infer that h is most likely true. (Alternatively: t ⇒ h if and only if, typically, a human reading t would be justified in inferring the proposition expressed by h from the proposition expressed by t.) The relation is directional because even if "t entails h", the reverse "h entails t" is much less certain. Determining whether this relationship holds is an informal task, one which sometimes overlaps with the formal tasks of formal semantics (satisfying a strict condition will usually imply satisfaction of a less strict one); additionally, textual entailment partially subsumes word entailment.
An example of a positive TE (text entails hypothesis) is:
text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man has good consequences.
See also: Textual Entailment Recognition with Semantic Features from Empirical Text Representation.
I think this will be of importance to our MM project.
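A minimal sketch of scoring entailment between two node texts, assuming the transformers library and the publicly available roberta-large-mnli checkpoint (any NLI model would work the same way):

```python
# Hedged sketch: textual entailment with an off-the-shelf NLI model.
# The model outputs one of CONTRADICTION / NEUTRAL / ENTAILMENT with a score.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

premise = "If you help the needy, God will reward you."
hypothesis = "Giving money to a poor man has good consequences."

# The text-classification pipeline accepts a premise/hypothesis pair as a dict.
result = nli({"text": premise, "text_pair": hypothesis})
print(result)  # e.g. [{'label': 'ENTAILMENT', 'score': ...}]
```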
See: PyTorch implementation of "Adapting Text Embeddings for Causal Inference" (!): https://raw.githubusercontent.com/rpryzant/causal-bert-pytorch/master/CausalBert.py. There is also a TensorFlow 2 implementation of Causal-BERT.
See also: Causal Effects of Linguistic Properties, and the NLP section of https://github.com/rpryzant/awesome-causality-algorithms#learning-causal-effects :
- Causal Text Embeddings: Veitch, Victor, Dhanya Sridhar, and David M. Blei. "Using Text Embeddings for Causal Inference." arXiv preprint arXiv:1905.12741 (2019). [Python]
- Handling Missing/Noisy Treatment: Wood-Doughty, Zach, Ilya Shpitser, and Mark Dredze. "Challenges of Using Text Classifiers for Causal Inference." EMNLP 2018, pp. 4586-4598. [Python]
- Conditional Treatment-adversarial Learning Based Matching: Yao, Liuyi, Sheng Li, Yaliang Li, Hongfei Xue, Jing Gao, and Aidong Zhang. "On the estimation of treatment effect with text covariates." IJCAI 2019, pp. 4106-4113. [no code]
- Causal Inferences Using Texts: Egami, Naoki, Christian J. Fong, Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. "How to make causal inferences using texts." arXiv preprint arXiv:1802.02163 (2018). [no code]
- Causal FS for text classification: Michael J. Paul. "Feature selection as causal inference: experiments with text classification." CoNLL 2017. [no code] |
This textual entailment stuff sounds nice, but it is quite different from causal inference. It produces no DAGs and no DAG atlas, and these are the main objectives of MM!! For now, my immediate goal is very well defined and reachable. I want to
1. replace the sentence simplification portion of Mappa Mundi, and
2. replace the sentence similarity portion.
I think BERT is good enough to achieve 1 and 2. |
When we have divided up the original texts into pieces of sentences, we can do 2 things:
1. After embedding the sentence pieces (for which we already have the open-source method from HuggingFace (HF)!), we can do the similarity searches using cosine similarity (as I showed in an earlier Python notebook posted here). I think this will be much more accurate than the NLTK method of doing this, and a lot speedier too! Therefore I expect this will enable us to get the causal maps for many more, or even all, of the IMSDb movie scripts!
2. My intention with the textual entailment is comparison with the results of step 1, because if 2 pieces of text show this entailment, then there is an unequivocal causal relationship between the 2! My goal with this is to test whether the method in step 1 discovers these same causally related pairs: does it discover all of them, just a few of them, or none at all? It seems a good testing method to me. If in step 2 there are causally related pairs that we did not discover in step 1, I think we should decide what to do: add them to the set of discovered causally related pairs from step 1? I think we should, because they are causally related pairs for sure! Why should we not do that?
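A minimal sketch of step 1, assuming the sentence-transformers package (the model name is a common lightweight default, not necessarily the one from the earlier notebook):

```python
# Hedged sketch: embed simplified node sentences and score pairwise similarity.
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # common lightweight default

sentences = [
    "Mountains appear in the haze.",
    "The mountains are piles of trash.",
    "Bob ate a green apple.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity matrix between every pair of node sentences.
scores = util.cos_sim(embeddings, embeddings)
print(scores)  # high off-diagonal entries flag candidate matching nodes
```
|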
Like I said, this is nice for future research. But I want to do 1 and 2 first. |
I will try to implement step 1 first and then see where it brings us! I am particularly focused on ensuring that we can validate (some of) the extracted causal relations found by the MM algorithm just by looking at the texts of the 2 nodes involved. In my view it is crucial that we be able to verify whether (a sample of) the MM-extracted causal pairs are defensible. Can you please comment on this? I find this very important and would like us to be on the same page about it! |
I'm not interested in working on this right now. I want to finish what I started, instead of going off in a tangential direction before finishing what I started. When I was in high school, I did some art and took some art classes. I learned from the art teacher that the best art teachers don't tell their pupils what style to paint in. They just coach you in whatever style you choose. So I am not telling you not to do this. You should do whatever you think is best. I am just telling you that I will not participate in it because at the present time I want to do 1 and 2. |
I am trying to get in contact with the authors of Carb, Imojie and Openie6 |
As I said : I will try to implement step 1 first and then see where it brings us ! |
A good step. Could be useful ! |
I think I know what the problem with installing openie6 is. The latest pip versions are more strict than the older ones. It's a little risky, but it is possible to tell pip not to check for dependency conflicts as follows: %pip install --no-deps -r requirements.txt https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies |
Are you able to install openie6 using %pip install --no-deps -r requirements.txt ? |
Yes.
|
Unfortunately, the people who wrote Openie6 are not very eager to help. I was trying to understand their code and simplify it, but it's very hard because
So I've gone to plan B, which is to use the following |
I saw you created a new GitHub repository, https://github.com/rrtucci/openie6A/tree/master, with
Should I use these new cleaned texts as input to the new version of the Mappa Mundi algorithm that is to be developed (using vector embeddings and similarities based on those embeddings, which I am working on)? Or use the existing ones? I have a week off for vacation; I plan to continue working on the updated version of MM from next week on. |
Openie6A resulted from my trying to update the libraries of Openie6 and to simplify it by throwing out the benchmarking and comparisons with other models. But I have totally given up on that. My new plan is to study Openie6-revised very carefully, googling every line of the program that I don't understand. Hopefully, once I learn it really well, I will be able to train it. I've joined 3 online sites that give a limited amount of free GPU time: Google Colab, Kaggle and Amazon SageMaker Studio Lab. So even though my computer has no GPU, I believe it will be possible for me to train it on one of those 3 services. |
Good news. I just received this email from my friend Anton from St. Petersburg: "Dear Dr. Tucci, I've finally managed to run the trained coordination analysis model by Kolluru et al. 2020 (IGL-CA) on Colab. Here's the notebook. I wish you luck with your project, best regards." |
The latest version of Mappa Mundi uses Openie6 for SOTA sentence simplification and an sBERT sentence transformer for SOTA sentence similarity. The new notebooks for the 3 P.G. Wodehouse short stories and the 3 Pixar movie scripts have been run on Google Colab with a GPU. |
In about a week I will retire from work, so from then on I could look more closely into these issues. But I think we are already on the right track!
A question: should we do all our communication via this GitHub channel? Or could we choose another one? Do you have a proposal? Please let me know.
…
My dream is if we could start a project at github/huggingface, called Mappa Mundi V2, run by us two and others, with the goal of doing ALL 1100 movie scripts from the IMSDb website.
We would have to replace the sentence simplification and sentence similarity portions of the old Mappa Mundi by these speedier, more precise versions.
Originally posted by @rrtucci in #1 (comment)