My dream is if we could start a project at github/huggingface, called Mappa #2

Open
jbdatascience opened this issue May 6, 2023 · 83 comments

Comments

@jbdatascience
Copy link

In about a week I will retire from work, so from then on I can look more closely into these issues. But I think we are already on the right track!

A question: should we do all our communication via this GitHub channel? Or could we choose another one? Do you have a proposal? Please let me know.

My dream is if we could start a project at github/huggingface, called Mappa Mundi V2, run by us two and others, with the goal of doing ALL 1100 movie scripts from the IMSDb website.

We would have to replace the sentence simplification and sentence similarity portions of the old Mappa Mundi by these speedier, more precise versions.

On Wed, May 3, 2023 at 4:01 AM Jan Bours @.***> wrote:

Looks promising indeed. If you can get your DagAtlas algorithm to be speedier, then you can produce a lot more DAGs from texts, and that will make it more interesting for practical use cases, wouldn't it?



Originally posted by @rrtucci in #1 (comment)

@rrtucci
Copy link
Owner

rrtucci commented May 6, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 7, 2023 via email

@jbdatascience
Copy link
Author

Interesting sentence simplification software!

I am thinking about our project.
Here is what I intend to do, along with some points worth considering:

  • Do we have a good definition of the ultimate goal of this project?
  • Reread your white paper “Causal DAG Extraction from 3 Short Stories and 3 Movie Scripts” to be able to completely understand the principles involved. I saw you have updated that document, but what are the changes?
  • Is the example of DEFT on tic-tac-toe games still useful to consider? Is it still relevant for the purpose of our project?
  • Proposal: to be able to check whether the complete MM algorithm functions as expected, it would be good to have a ground truth (texts for which we know the causal DAGs, i.e. the ground truth) against which we can check the DAGs discovered by the MM algorithm. I am not sure how to do this yet.
  • The output from the MM algorithm is the set of discovered DAGs (the DagAtlas). How could we interpret them? And what could be a practical use case for this DagAtlas? Perhaps we could describe that in more detail. How could existing LLMs be updated by the DagAtlas so that they could represent causal knowledge (in the most general form)?
  • Are there comparable, already existing algorithms that implement the functionality that MM has? How do they compare?

Please share your thoughts on this!

@jbdatascience
Copy link
Author

I posted this quote on LinkedIn (I thought you would like it!):

https://www.linkedin.com/posts/janbours_science-causality-knowledge-activity-7061268335266541568-qF8a

The power of science is its discovery of causal law.

Bertrand Russell. Human Knowledge: Its Scope and Limits. Simon and Schuster, 1948.

https://twitter.com/scitechtalk/status/1655502709267505154?s=46&t=8nmTR-4kmZaFFBCGObS6fA

https://twitter.com/artistexyz/status/1655576091149451269?s=46&t=8nmTR-4kmZaFFBCGObS6fA

Excellent quote. Thanks for sharing.

@rrtucci
Copy link
Owner

rrtucci commented May 8, 2023 via email

@jbdatascience
Copy link
Author

jbdatascience commented May 9, 2023

OK, I will also be on the lookout for a viable solution for sentence simplification. Could you specify what we could consider a good solution (and what not)?

Also, I am wondering why the existing simplification solution is not good enough.

@jbdatascience
Copy link
Author

Also, I am wondering what would be the optimal way of subdividing the text for the purpose of creating the nodes in the DAG. Right now we are using sentence clauses, but I can imagine other possible subdivisions of the text. There are many NLP applications where chunking is used for this.

@rrtucci
Copy link
Owner

rrtucci commented May 9, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 9, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 10, 2023 via email

@jbdatascience
Copy link
Author

OK, you gave a good description of what we expect from a good simplification algorithm.

I find it interesting that ChatGPT is able to produce good simplifications! Could there already be an open-source version of ChatGPT that we could use for this purpose?
Also, I wonder whether HuggingFace has any FREE algorithms for doing this? They have a large database of models, which I haven't explored fully up till now.
I will explore those 2 alternatives in search of a solution.

What is still not clear to me is what is wrong with the simplification method you are using at the moment. In what way is it not good enough?

I also have a question about the existing MM algorithm. When I look at the output of your notebook:
https://github.com/rrtucci/mappa_mundi/blob/master/jupyter_notebooks/navigating_short_stories.ipynb

and I look at the DAGs it produces at the bottom of that notebook, I am not sure how to interpret them correctly. Could you take one example and describe it fully, so it becomes clear how to interpret it?

@rrtucci
Copy link
Owner

rrtucci commented May 10, 2023 via email

@jbdatascience
Copy link
Author

OK, so we must have a new, better method of sentence simplification! That is clear to me now.

@jbdatascience
Copy link
Author

I did an interesting experiment with GPT-4 instead of ChatGPT, doing sentence simplification on the same example you give in your white paper:

PROMPT:
Consider the following quote:
“A range of mountains takes form in the haze.
Moving closer.
The mountains are piles of TRASH. The entire surface is nothing but waste.
“..We’re gonna find adventure in the evening air...”
A silhouetted city in the distance.
What looks like skyscrapers turns into trash.
Thousands of neatly stacked CUBES OF TRASH, stories high.
Rows and rows of stacked cubes, like city avenues. They go on for miles.”

Give an enumerated list summarizing this quote.
Each list item must be a simple declarative sentence consisting of a subject phrase and a verb phrase.
The list items must be in chronological order of occurrence in the quote.

REPLY:


  1. Mountains appear in the haze.
  2. Distance closes.
  3. Mountains transform into trash piles.
  4. Evening air promises adventure.
  5. Silhouetted city emerges.
  6. Skyscrapers reveal themselves as trash.
  7. Cubes of trash form structures.
  8. Trash cube rows extend for miles.

I think the quality of these sentence simplifications is even a lot better than that produced by ChatGPT!
What do you think?

@jbdatascience
Copy link
Author

I have found this tweet about Huggingface LLM support (it is a tweet from today, May 10th 2023!):

https://twitter.com/gpt_index/status/1656091102221570048?s=20

"Today’s release highlight: Native @huggingface LLM support.
Want to run any open-source LLM from 🤗? (Camel, StableLM, alpaca, Dolly, etc.)

  • Customize model/tokenizer/chat prompts 📝
  • Full streaming support 🌊

Now you can in 🦙! (huge s/o to Logan)"

Then I looked further and found this at Huggingface:

https://huggingface.co/LLMs

https://github.com/deep-diver/LLM-As-Chatbot ("LLM as a Chatbot Service"), which comes with a complete notebook for using an LLM as a chatbot:
https://github.com/deep-diver/LLM-As-Chatbot/blob/main/notebooks/llm_as_chatbot_in_colab.ipynb
(raw version: https://raw.githubusercontent.com/deep-diver/LLM-As-Chatbot/main/notebooks/llm_as_chatbot_in_colab.ipynb)

I am trying to get this to work and will then prompt it with your sentence simplification example from your white paper.
Up till now no luck, but that is a technical issue. I will try to solve it (and search Huggingface for other FREE implementations).

But what is more important: I have tried your prompt in a number of different online ChatGPT-like LLMs, and even the simpler ones give excellent replies to your sentence simplification prompt!
See and try for yourself:

https://poe.com/Sage
https://poe.com/Claude%2B (3 FREE messages per day)
https://poe.com/GPT-4 (1 FREE message per day <----- !!!!!!!)
https://poe.com/Claude-instant
https://poe.com/ChatGPT (FREE !!!!!)
https://poe.com/Dragonfly
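
As a follow-up on the idea of searching Hugging Face for FREE implementations: here is a minimal, untested sketch of how we could push your simplification prompt through an open model with the transformers library. The choice of databricks/dolly-v2-3b is only an example (Dolly is one of the model families named in the tweet above); any open instruction-tuned checkpoint could be swapped in:

from transformers import pipeline

# Example open model; any instruction-tuned checkpoint from the Hub could be used.
generator = pipeline(
    "text-generation",
    model="databricks/dolly-v2-3b",
    device_map="auto",          # use a GPU if Colab gives us one
    trust_remote_code=True,     # dolly-v2 ships its own generation pipeline
)

prompt = (
    "Consider the following quote: 'A range of mountains takes form in the haze. ...'\n\n"
    "Give an enumerated list summarizing this quote. "
    "Each list item must be a simple declarative sentence consisting of a subject phrase "
    "and a verb phrase. The list items must be in chronological order of occurrence in the quote."
)

print(generator(prompt, max_new_tokens=200)[0]["generated_text"])

If something like this works well enough, the same call could be wrapped in a loop over all the clauses of a movie script.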

@jbdatascience
Copy link
Author

Also I am looking into this open-source ChatGPT variant:

StableVicuna: The New King of Open ChatGPTs?
https://www.youtube.com/watch?v=m_xD0algP4k

Accompanied by this Colab notebook, which uses a GPU T4 runtime (!):
YT Stable Vicuna13B 8bit in Colab.ipynb
https://colab.research.google.com/drive/1Kvf3qF1TXE-jR-N5G9z1XxVf5z-ljFt2?usp=sharing

It can handle a prompt like this:
raw_output = pipe(get_prompt('Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apple do they have?'))
parse_text(raw_output)

They started with 23 apples. After using 20 for lunch, they were left with 3 apples. Then, they bought 6 more apples, so now they have a total of 9 apples. Therefore, the cafeteria has 9 apples in total.

I think it will be a small step to get it to answer your sentence simplification prompt from your white paper, which I shall try with this notebook. (Up till now it crashes due to insufficient memory errors ...)

@rrtucci
Copy link
Owner

rrtucci commented May 10, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 10, 2023 via email

@jbdatascience
Copy link
Author

jbdatascience commented May 10, 2023

Also I am looking into this open-source ChatGPT variant:

StableVicuna: The New King of Open ChatGPTs? https://www.youtube.com/watch?v=m_xD0algP4k

Accompanied by this Colab notebook, which uses a GPU T4 runtime (!): YT Stable Vicuna13B 8bit in Colab.ipynb https://colab.research.google.com/drive/1Kvf3qF1TXE-jR-N5G9z1XxVf5z-ljFt2?usp=sharing

It can handle a prompt like this: raw_output = pipe(get_prompt('Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apple do they have?')) parse_text(raw_output)

They started with 23 apples. After using 20 for lunch, they were left with 3 apples. Then, they bought 6 more apples, so now they have a total of 9 apples. Therefore, the cafeteria has 9 apples in total.

I think it will be a small step to get it to answer your sentence simplification prompt from your white paper, which I shall try with this notebook. (Up till now it crashes due to insufficient memory errors ...)

MY QUESTION ABOUT THIS, AND THE ANSWER I RECEIVED:

I was trying to run the Colab notebook but it crashes due to insufficient memory errors. Is it ONLY possible to run this on the PRO paid version of Colab? How could I run this on a free version of Colab, or perhaps even locally on a PC in a Jupyter notebook? Perhaps using a 4-bit version of the LLM, as you mention?

@samwitteveenai (34 minutes ago):
yeah unfortunately you need a GPU with a lot of VRAM to run this so Colab free isn't going to work.

@jbdatascience
Copy link
Author

jbdatascience commented May 11, 2023

At this moment I am looking into this list of Open Source implementations of LLMs:

GitHub - eugeneyan/open-llms: 🤖 A list of open LLMs available for commercial use.

https://github.com/eugeneyan/open-llms

Trying h2oGPT from that list online with your sentence simplification prompt gives this result (which is not bad I think):

http://pseudotensor.hopto.org:37860/

PROMPT:

Consider the following quote:
“A range of mountains takes form in the haze.
Moving closer.
The mountains are piles of TRASH. The entire surface is nothing but waste.
“…We’re gonna find adventure in the evening air…”
A silhouetted city in the distance.
What looks like skyscrapers turns into trash.
Thousands of neatly stacked CUBES OF TRASH, stories high.
Rows and rows of stacked cubes, like city avenues. They go on for miles.”

Give an enumerated list summarizing this quote.
Each list item must be a simple declarative sentence consisting of a subject phrase and a verb phrase.
The list items must be in chronological order of occurrence in the quote.

OUTPUT:

  1. A range of mountains takes form in the haze.
  2. Moving closer.
  3. The mountains are piles of TRASH.
  4. The entire surface is nothing but waste.
  5. “…We’re gonna find adventure in the evening air…”
  6. A silhouetted city in the distance.
  7. What looks like skyscrapers turns into trash.
  8. Thousands of neatly stacked CUBES OF TRASH, stories high.
  9. Rows and rows of stacked cubes, like city avenues. They go on for miles.

I think this would be acceptable.

Trying this h2oGPT in a notebook on Google Colab (also on Kaggle) gives a memory error.

Maybe we could try:
hivemind/gpt-j-6B-8bit · Hugging Face

https://huggingface.co/hivemind/gpt-j-6B-8bit?text=My+name+is+Mariama%2C+my+favorite

It uses some kind of compression (8-bit arithmetic, I think) so it fits within the free Colab version's memory limits. I will try it out in this Colab notebook:

finetune-gpt-j-6B-8bit.ipynb - Colaboratory

https://colab.research.google.com/drive/1ft6wQU0BhqG5PRlwgaZJv2VukKKjU4Es#scrollTo=DuW4H6HTS82r
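
For reference, here is a generic, untested sketch of the 8-bit loading route that transformers offers via bitsandbytes, which is the usual trick for squeezing a ~6B-parameter model into free-Colab GPU memory. Note this is not necessarily what the hivemind checkpoint or the linked notebook does internally, and EleutherAI/gpt-j-6b is just an example model name:

# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"   # example 6B-parameter model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize the weights to 8 bits at load time
    device_map="auto",   # place layers on the available GPU/CPU automatically
)

inputs = tokenizer("Mountains appear in the haze.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))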

@rrtucci
Copy link
Owner

rrtucci commented May 11, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 12, 2023 via email

@jbdatascience
Copy link
Author

Sounds good. It occurs to me that this WALL-E example is not very good because it doesn't have compound sentences, or complex sentences, or compound-complex sentences. It would be nice if the LLM would split those into simple sentences. To show you what I mean, I asked ChatGPT this: What is a simple, complex, compound and complex-compound sentence, with examples? Here is its answer:

Sure, I'd be happy to provide some examples of simple, complex, compound, and complex-compound sentences.

  1. Simple sentence: A simple sentence consists of a subject and a verb and expresses a complete thought. Example: The cat sat on the mat.
  2. Complex sentence: A complex sentence consists of an independent clause and a dependent clause. Example: Although I studied hard, I failed the exam.
  3. Compound sentence: A compound sentence consists of two or more independent clauses joined by a conjunction. Example: I wanted to go to the park, but it started raining.
  4. Complex-compound sentence: A complex-compound sentence consists of two or more independent clauses and at least one dependent clause. Example: After I finish my homework, I'll go for a walk, but only if it's not too late.

Note that these are just a few examples of each type of sentence, and there can be many variations and combinations of each type.

Then we have to search for more complex texts. Do you have any examples in mind?

@jbdatascience
Copy link
Author

jbdatascience commented May 13, 2023

This looks VERY promising. https://github.com/explosion/spacy-llm

That looks promising indeed! Have you tried it already?

In the meantime, I am looking into this interesting list in my search for a good open-source LLM that ideally can be run locally on your PC or mine to perform sentence simplification or other NLP tasks. (Our main goal is to be able to generate a DagAtlas based on, e.g., a large number of texts, so if we can accomplish that by running those LLMs locally for long enough, we can reach that goal!):

List of Open Sourced Fine-Tuned Large Language Models (LLM) | by Sung Kim | Geek Culture | Mar, 2023 | Medium

An incomplete list of open-sourced fine-tuned Large Language Models (LLM) you can run locally on your computer

https://medium.com/geekculture/list-of-open-sourced-fine-tuned-large-language-models-llm-8d95a2e0dc76

@rrtucci
Copy link
Owner

rrtucci commented May 13, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 13, 2023 via email

@jbdatascience
Copy link
Author

I am searching for a way to obtain texts for which we know the ground-truth causal relations, so we can test whether the MM algorithm finds these causal relations.

Now I am considering this idea💡:
would it be possible to have LLMs generate texts with custom, predefined causal relations in them? That would give us a way to test whether the MM algorithm can correctly come up with those causal relations in a DAG.

I would like to hear your comments. Do you consider this worth pursuing?
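
To make the idea concrete, here is a tiny hypothetical sketch of what I mean. We define a toy DAG by hand, turn each edge into a template sentence (or ask an LLM to embellish those sentences into a story), and the edge list then serves as the ground truth against which the DAG discovered by MM can be scored. All names and sentences below are invented for illustration:

# Hypothetical sketch: a hand-made ground-truth causal DAG plus a text
# that expresses exactly those causal links.
ground_truth_edges = [
    ("heavy rain", "the river floods"),
    ("the river floods", "the bridge collapses"),
    ("the bridge collapses", "the village is cut off"),
]

def edges_to_text(edges):
    # Turn each causal edge into a simple "because" sentence.
    return " ".join(f"Because of {cause}, {effect}." for cause, effect in edges)

story = edges_to_text(ground_truth_edges)
print(story)

# Later: feed `story` (or an LLM-embellished version of it) to MM and compare
# the extracted DAG against ground_truth_edges, e.g. via precision/recall
# over the edge set.

An LLM could then be prompted to rewrite the generated story in a more natural narrative style while preserving exactly these causal links.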

@rrtucci
Copy link
Owner

rrtucci commented May 13, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 13, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 13, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented May 19, 2023

Jan, I know I have cried wolf many times before, but this time I think I have really found a solution to the sentence simplification problem. And it's pretrained, so there is no need for a GPU card.
https://github.com/dair-iitd/imojie
https://arxiv.org/abs/2005.08178

@jbdatascience
Copy link
Author

Looks interesting, give it a try! But I think you should try the newer version as they mention (which is https://github.com/dair-iitd/openie6):


NEWEST VERSION:

[2010.03147] OpenIE6: Iterative Grid Labeling and Coordination Analysis for Open Information Extraction
https://arxiv.org/abs/2010.03147

CODE:
https://github.com/dair-iitd/openie6


OLDER VERSION:

[2005.08178] IMoJIE: Iterative Memory-Based Joint Open Information Extraction (OLDER VERSION, ALREADY SUPERSEDED BY OpenIE6!)
https://arxiv.org/abs/2005.08178
May 2020

CODE:
https://github.com/dair-iitd/imojie
Neural generation model for Open Information Extraction


@rrtucci
Copy link
Owner

rrtucci commented May 20, 2023 via email

@jbdatascience
Copy link
Author

Good catch. Yes, the newest version.

On Fri, May 19, 2023 at 5:31 PM Jan Bours @.> wrote: Looks interesting, give it a try! But I think you should try the newer version as they mention (which is https://github.com/dair-iitd/openie6): …

Can you get it working? I am trying, but I am having trouble installing all the required Python packages (in Google Colab). I will try again later.

@rrtucci
Copy link
Owner

rrtucci commented May 21, 2023 via email

@jbdatascience
Copy link
Author

Sorry, I haven't tried. Will try soon. I'm a bit slow due to personal problems.

Take it easy, there is no reason to hurry. I will help you with this. I find this project extremely interesting, and now that I am retired from work I can devote a lot of time to it and would love to do that!

Over the last couple of days I have been searching for information about comparable projects that could help us with our own MM problem. I encountered quite a few, and I will compile a list of them here later.

@rrtucci
Copy link
Owner

rrtucci commented May 24, 2023

I've been trying to run both imojie and openie6 on Colab. I have failed so far; I get lots of errors loading the requirements.

This repo named openie6-revised seems interesting, because it claims to be as good as openie6 but much simpler, since it relies on HuggingFace tools that have only become available in the last year.
https://github.com/tkdrnjs0621/openie6-revised

@rrtucci
Copy link
Owner

rrtucci commented May 25, 2023

I am trying to clean up openie6-revised and train it on Colab. I started a new GitHub repo for this:
https://github.com/rrtucci/openie6-revised2
My computer doesn't have a GPU card, but Colab offers limited GPU for free.
Even if this doesn't work, it will be a good learning experience because it will teach me basic stuff about Colab, Torch, HuggingFace and transformers. I am currently a newbie in those areas but slowly learning.

@rrtucci
Copy link
Owner

rrtucci commented May 26, 2023

I've been reading the code for openie6-revised (based on BERT), googling everything I don't understand.
Hugging Face's articles on BERT are excellent. For example:

https://huggingface.co/blog/bert-101

https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb

https://huggingface.co/docs/transformers/training

According to these articles, BERT can be fine-tuned in 2-3 hours on a single GPU.
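
As a reference point for that claim, here is a minimal sketch of the standard fine-tuning loop from the Hugging Face Trainer tutorial linked above. The dataset ("imdb") and the 2-label classification head are placeholders for illustration; openie6-revised trains on an OpenIE-style dataset with its own label scheme:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("imdb")                      # placeholder dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)              # placeholder label count

args = TrainingArguments(output_dir="bert-finetuned",
                         per_device_train_batch_size=16,
                         num_train_epochs=1)

trainer = Trainer(model=model,
                  args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=tokenized["test"].select(range(500)))
trainer.train()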

@rrtucci
Copy link
Owner

rrtucci commented May 27, 2023

Openie6-revised is a BERT model fine-tuned with what is called an OpenIE dataset. I asked ChatGPT what an OpenIE dataset is, and it gave a nice answer. There are several OpenIE datasets; Openie6-revised uses the OpenIE4 dataset.

An OpenIE dataset is a collection of text data that has been annotated or labeled with open information extraction (OpenIE) outputs. It typically consists of a set of sentences or documents along with the extracted relations or facts from those texts.

The purpose of an OpenIE dataset is to provide a resource for training and evaluating OpenIE systems. These datasets enable researchers and developers to analyze and improve the performance of their OpenIE models by benchmarking them against the annotated data.

Each data point in an OpenIE dataset usually includes the original sentence or document from which the information is extracted and the corresponding extracted relations or facts. These relations are typically represented in a structured format, such as subject-predicate-object triples, where the subject, predicate, and object are the components of the extracted relation.

OpenIE datasets are valuable resources for various NLP tasks, including relation extraction, knowledge base construction, and information retrieval. They help in advancing the development of OpenIE systems by providing a standardized benchmark for evaluating their performance and facilitating the comparison of different approaches.

Several OpenIE datasets have been created and made publicly available, such as OpenIE-4, ReVerb, OIE2016, and NYTOpenIE. These datasets are often used for training and evaluating OpenIE models and can be accessed through online repositories or research platforms.
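
In other words, each record in such a dataset pairs a sentence with its extracted relations. A purely illustrative sketch (the field names are invented here; the real OpenIE4 files use their own tab-separated format):

# Illustrative only: one OpenIE-style data point as a Python structure.
data_point = {
    "sentence": "Barack Obama was born in Hawaii and served as US president.",
    "extractions": [
        ("Barack Obama", "was born in", "Hawaii"),
        ("Barack Obama", "served as", "US president"),
    ],
}

for subj, pred, obj in data_point["extractions"]:
    print(f"({subj}; {pred}; {obj})")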

@rrtucci
Copy link
Owner

rrtucci commented May 27, 2023

The same people that wrote Openie6 also wrote CaRB, which contains several OpenIE datasets:
https://github.com/dair-iitd/CaRB
https://github.com/dair-iitd/imojie
https://github.com/dair-iitd/openie6
https://github.com/dair-iitd

@rrtucci
Copy link
Owner

rrtucci commented May 27, 2023

Today, I discovered that Amazon offers a service similar to Google Colab, but with 15 GB of persistent storage:
https://studiolab.sagemaker.aws/

@jbdatascience
Copy link
Author

Today, I discovered that Amazon offers a service similar to Google Colab, but with 15 GB of persistent storage: https://studiolab.sagemaker.aws/

Interesting. I will certainly try that! I hope it circumvents the limited-memory problems of the free Google Colab, which I use most often. With the gigantic Large Language Models that I try to use in Google Colab, I reach the memory limits quite often. Will https://studiolab.sagemaker.aws/ be better at that?

@jbdatascience
Copy link
Author

jbdatascience commented May 27, 2023

For the last 2 weeks I have been experimenting with state-of-the-art FREE OPEN SOURCE Large Language Models. Not only out of pure curiosity: I also think they can contribute to the ultimate goal of our Mappa Mundi project!
It is a pity that these FREE OPEN SOURCE LLMs (there are a lot of them around these days, and nearly every day a new one appears!) often have bugs and produce a lot of errors, presumably because they are under heavy development at this moment. It did cost me a lot of time these last 2 weeks, so I hope this will get better over time.

The things I am working on now:

  • exploring what Free Open Source Large Language Models can mean for us

  • I think an important step in the existing code base of the Mappa Mundi project is the use of stemming and/or lemmatisation of the text. These techniques bring each word/term back to its most basic standard form (its "stem" or "lemma"). For example: the word "riding" will be transformed to "ride". The purpose of this step is to increase the quality of the embeddings: if you omit it, "riding" and "ride" (and other derivatives with a comparable semantic meaning) will get different tokens during tokenisation, and these different tokens will have no relation to each other. Because of that, they will get different vector embeddings during the embedding process, which gives a lower-quality embedding space, because all these different tokens should have been embedded to the SAME vector, since they are semantically the same!

My question: have you already incorporated this stemming and/or lemmatisation step in your code? It has an important effect on the quality of the vector embeddings based on it, and consequently also on the quality of the similarity calculations in the texts, because I will base these similarities on the vector embeddings of the texts!
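
As a minimal sketch of what I mean (using spaCy's lemmatizer here; NLTK's WordNetLemmatizer or a stemmer would also work, and en_core_web_sm is just the small English pipeline):

# Lemmatise a clause before tokenisation/embedding, so that "riding",
# "rides" and "rode" all map to the lemma "ride".
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatise(text: str) -> str:
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc if not token.is_punct)

print(lemmatise("He was riding while the mountains were turning into trash"))
# roughly: "he be ride while the mountain be turn into trash"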

  • Intuitively, I think causal relations in texts have something to do with textual entailments (text_1 entails text_2 if text_2 can be seen as a consequence of text_1).

See e.g.
https://en.wikipedia.org/wiki/Textual_entailment

Textual entailment (TE), also known as Natural Language Inference (NLI), in natural language processing is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h), respectively. Textual entailment is not the same as pure logical entailment – it has a more relaxed definition: "t entails h" (t ⇒ h) if, typically, a human reading t would infer that h is most likely true.[1] (Alternatively: t ⇒ h if and only if, typically, a human reading t would be justified in inferring the proposition expressed by h from the proposition expressed by t.[2]) The relation is directional because even if "t entails h", the reverse "h entails t" is much less certain.[3][4]

Determining whether this relationship holds is an informal task, one which sometimes overlaps with the formal tasks of formal semantics (satisfying a strict condition will usually imply satisfaction of a less strict conditioned); additionally, textual entailment partially subsumes word entailment.

Examples
Textual entailment can be illustrated with examples of three different relations:[5]

An example of a positive TE (text entails hypothesis) is:

text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man has good consequences.
An example of a negative TE (text contradicts hypothesis) is:

text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man has no consequences.
An example of a non-TE (text does not entail nor contradict) is:

text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man will make you a better person.

""

See also:
https://huggingface.co/EndMO/text-entailment-bert?text=I+like+you.+I+love+you

https://huggingface.co/roberta-large-mnli?text=The%2Bdog%2Bwas%2Blost.%2BNobody%2Blost%2Bany%2Banimal

https://medium.com/@zeyuanhu/recognizing-contextual-entailment-using-nneural-network-in-nlp-ea9c5f1a216a

Textual Entailment Recognition with Semantic Features from Empirical Text Representation
https://arxiv.org/abs/2210.09723

29 Textual Entailment
https://academic.oup.com/edited-volume/42643/chapter-abstract/358152055?redirectedFrom=fulltext

I think this will be important for our MM project.
Perhaps textual entailments are not the only causal relations we are after, but I do think a textual entailment is a particularly strong form of causal relation that can be present in texts! (See the small sketch below.)
Please give your vision!
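
As a minimal sketch of how such an entailment check could be scored with one of the off-the-shelf NLI models linked above (roberta-large-mnli; the label order below is the one documented on that model card, but it should be double-checked):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = "If you help the needy, God will reward you."
hypothesis = "Giving money to a poor man has good consequences."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# Label order for this checkpoint: contradiction, neutral, entailment.
for label, p in zip(["contradiction", "neutral", "entailment"], probs):
    print(f"{label}: {p:.3f}")

A high entailment probability between two sentence pieces could then be treated as (strong) evidence for a causal link between the corresponding nodes.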

  • I am searching for good texts to use for MM. By good I mean texts for which we can check whether MM extracts the correct causal relations. That means good texts have a known ground truth: we know the causal relations in them beforehand, before we let MM try to extract them. The goal is to test the workings of MM and check whether MM can indeed extract the correct causal relations.

  • Are there any other NLP applications that can perform the same function(s) that MM aims to perform?
    Here I think, e.g., CausalBERT could be comparable (or at least may be made to do comparable things).
    I am in the process of researching that.

See:
https://github.com/rpryzant

Pytorch implementation of "Adapting Text Embeddings for Causal Inference" (!):
https://github.com/rpryzant/causal-bert-pytorch
This seems a very interesting approach to me, because of the adapted embeddings for causal inference!

Tensorflow 2 implementation of Causal-BERT
https://github.com/rpryzant/causal-text-embeddings-tf2

https://raw.githubusercontent.com/rpryzant/causal-bert-pytorch/master/CausalBert.py

See also:
Find text features that are most related to an outcome, controlling for confounds.
https://github.com/rpryzant/deconfounded-lexicon-induction

Causal Effects of Linguistic Properties
Package for computing causal effects of text (as treatment)
https://github.com/rpryzant/causal-text

https://github.com/rpryzant/awesome-causality-algorithms#learning-causal-effects

From its "Natural Language Processing" table (Name / Paper / Code):

  • Causal Text Embeddings: Veitch, Victor, Dhanya Sridhar, and David M. Blei. "Using Text Embeddings for Causal Inference." arXiv preprint arXiv:1905.12741 (2019). Code: Python
  • Handling Missing/Noisy Treatment: Wood-Doughty, Zach, Ilya Shpitser, and Mark Dredze. "Challenges of Using Text Classifiers for Causal Inference." In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4586-4598. 2018. Code: Python
  • Conditional Treatment-adversarial Learning Based Matching: Yao, Liuyi, Sheng Li, Yaliang Li, Hongfei Xue, Jing Gao, and Aidong Zhang. "On the estimation of treatment effect with text covariates." In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 4106-4113. AAAI Press, 2019. Code: N/A
  • Causal Inferences Using Texts: Egami, Naoki, Christian J. Fong, Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. "How to make causal inferences using texts." arXiv preprint arXiv:1802.02163 (2018). Code: N/A
  • Causal FS for text classification: Michael J. Paul. "Feature selection as causal inference: experiments with text classification." Conference on Computational Natural Language Learning (CoNLL), Vancouver, Canada. August 2017. Code: N/A

@rrtucci
Copy link
Owner

rrtucci commented May 27, 2023

This textual entailment stuff sounds nice, but it is quite different from causal inference. It produces no DAGs and no DAG atlas, and these are the main objectives of MM!!

For now, my immediate goal is very well defined and reachable. I want to:

  1. Replace the simplification and sentence similarity operations of Mappa Mundi by something quicker and more efficient.
  2. Get the causal maps for all the IMSDb movie scripts.

I think BERT is good enough to achieve 1 and 2.

@jbdatascience
Copy link
Author

jbdatascience commented May 28, 2023

When we have divided up the original texts into pieces of sentences, we can do 2 things:

1 - After embedding the sentence pieces (for which we already have the open-source method from HuggingFace (HF)!), we can do the similarity searches using cosine similarity (as I showed in an earlier Python notebook posted here; see also the sketch at the end of this comment). I think this will be much more accurate than the NLTK method and a lot speedier too! Therefore I expect this will enable us to get the causal maps for many more, or even all, of the IMSDb movie scripts!

2 - My intention with textual entailment is to compare it against the results of step 1, because if 2 pieces of text show this entailment, then there is an unequivocal causal relationship between the 2! My goal with this is to test whether the method in step 1 discovers these same causally related pairs: does it discover all of them, just a few of them, or none at all? It seems a good testing method to me.

If in step 2 there are causally related pairs which we did not discover in step 1, I think we should decide what to do: add them to the set of discovered causally related pairs from step 1? I think we should, because they are causally related pairs for sure! Why should we not do that?
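
Here is the kind of thing I have in mind for step 1: a minimal sketch using the sentence-transformers library from HuggingFace. The model name all-MiniLM-L6-v2 is just a small example model, and the 0.7 threshold is an arbitrary placeholder:

# Embed sentence pieces and find similar pairs with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # example embedding model

pieces = [
    "Mountains appear in the haze.",
    "The mountains turn out to be piles of trash.",
    "A silhouetted city emerges in the distance.",
]

embeddings = model.encode(pieces, convert_to_tensor=True)
scores = util.cos_sim(embeddings, embeddings)

for i in range(len(pieces)):
    for j in range(i + 1, len(pieces)):
        if scores[i][j] > 0.7:   # arbitrary similarity threshold
            print(f"similar ({scores[i][j]:.2f}): {pieces[i]!r} <-> {pieces[j]!r}")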

@rrtucci
Copy link
Owner

rrtucci commented May 28, 2023

Like I said, this is nice for future research. But I want to do 1 and 2 first.

@jbdatascience
Copy link
Author

jbdatascience commented May 28, 2023

I will try to implement step 1 first and then see where it brings us! I am particularly focused on making sure we can validate (some of) the causal relations extracted by the MM algorithm just by looking at the texts of the 2 nodes involved.

In my view it is crucial that we are able to verify whether (a sample of) the MM-extracted causal pairs are defensible. Can you please comment on this? I find this very important and would like us to be on the same page about it!
Do you think this is possible or not? The causal relations found should be human-interpretable if need be, don't you think?

@rrtucci
Copy link
Owner

rrtucci commented May 28, 2023

I'm not interested in working on this right now. I want to finish what I started, instead of going off in a tangential direction before finishing what I started. When I was in high school, I did some art and took some art classes. I learned from the art teacher that the best art teachers don't tell their pupils what style to paint in. They just coach you in whatever style you choose. So I am not telling you not to do this. You should do whatever you think is best. I am just telling you that I will not participate in it because at the present time I want to do 1 and 2.

@rrtucci
Copy link
Owner

rrtucci commented May 28, 2023

I am trying to get in contact with the authors of Carb, Imojie and Openie6

@jbdatascience
Copy link
Author

I'm not interested in working on this right now. I want to finish what I started, instead of going off in a tangential direction before finishing what I started. When I was in high school, I did some art and took some art classes. I learned from the art teacher that the best art teachers don't tell their pupils what style to paint in. They just coach you in whatever style you choose. So I am not telling you not to do this. You should do whatever you think is best. I am just telling you that I will not participate in it because at the present time I want to do 1 and 2.

As I said: I will try to implement step 1 first and then see where it brings us!

@jbdatascience
Copy link
Author

I am trying to get in contact with the authors of Carb, Imojie and Openie6

A good step. It could be useful!

@rrtucci
Copy link
Owner

rrtucci commented Jun 3, 2023

I think I know what the problem with installing openie6 is. The latest pip versions are more strict than the older ones. It's a little risky, but it is possible to tell pip not to check for dependency conflicts as follows:

%pip install --no-deps -r requirements.txt

https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

@jbdatascience
Copy link
Author

Are you able to install openie6 using %pip install --no-deps -r requirements.txt ?

@rrtucci
Copy link
Owner

rrtucci commented Jun 4, 2023 via email

@rrtucci
Copy link
Owner

rrtucci commented Jun 9, 2023

Unfortunately, the people who wrote Openie6 are not very eager to help. I was trying to understand their code and simplify it, but it's very hard because:

  1. They provide no docstrings, and their code is somewhat sloppy and overly complicated.
  2. I'm a newbie to this. This is the first time I have used most of the libraries they use.
  3. Many of the library versions they use (e.g., torchtext, allennlp, pytorch-lightning, transformers) are very outdated.
  4. The code was designed to compare Openie6 with many other models across many benchmarks. So if all you want to do is use Openie6, without comparing or benchmarking it, large parts of the code are unnecessary.

So I've gone to plan B, which is to use the following:
https://github.com/tkdrnjs0621/openie6-revised/

@jbdatascience
Copy link
Author

I saw you created a new github repository: https://github.com/rrtucci/openie6A/tree/master, with
https://github.com/rrtucci/openie6A/tree/master/short_stories_clean
and
https://github.com/rrtucci/openie6A/tree/master/m_scripts_clean

Should I use these new cleaned texts as input to the new, yet-to-be-developed version of the Mappa Mundi algorithm (using vector embeddings and similarities based on those embeddings, which I am working on)? Or should I use the existing ones?

I have a week off for vacation and plan to continue working on the updated version of MM from next week on.
Please keep me updated! I will do the same!

@rrtucci
Copy link
Owner

rrtucci commented Jun 9, 2023

Openie6A resulted from my trying to update the libraries of Openie6 and simplify it by throwing out the benchmarking and comparison with other models. But I have totally given up on that.

My new plan is to study Openie6-revised
https://github.com/tkdrnjs0621/openie6-revised/tree/main

very carefully, googling every line of the program that I don't understand. Hopefully, once I learn it really well, I will be able to train it. I've joined 3 online sites that give a limited amount of free GPU time: Google Colab, Kaggle and Amazon Sagemaker Studio Lab. So even though my computer has no GPU, I believe it will be possible for me to train it on one of those 3 services.

@rrtucci
Copy link
Owner

rrtucci commented Jun 12, 2023

Good news. I just received this email from my friend Anton from St. Petersburg:
https://alexeyev.github.io/

Dear Dr. Tucci,

I've finally managed to run the trained coordination analysis model by Kolluru et al. 2020 (IGL-CA) on colab. Here's the notebook.
https://colab.research.google.com/drive/1samvO-SH6Xgjf9ItlhAF1EmBZo5grBQb?usp=sharing

I wish you luck with your project, best regards,
Anton M. Alekseev.

@rrtucci
Copy link
Owner

rrtucci commented Jun 17, 2023

The latest version of Mappa Mundi uses:

  • Openie6 for SOTA sentence simplification, and
  • an sBert sentence transformer for SOTA sentence similarity.

The new notebooks for the 3 PG Wodehouse short stories and the 3 Pixar movie scripts have been run on Google Colab with GPU.
