Is this function correct for calculating NERF? #25

Closed
satoshi-2000 opened this issue Dec 19, 2023 · 4 comments

@satoshi-2000
Hello!

I am trying to estimate text readability, and to do so I used the NERF formula. I first ran into issues with the LingFeat library due to a dependency problem, as outlined in this GitHub issue, so I tried the LFTK library instead, and it worked.

However, the feature names differ between LingFeat and LFTK, so I am not sure whether my correspondence between them is correct. In particular, I could not find the variable 'Constituency Parse Tree Height' in LFTK, so I computed it myself with code adapted from LingFeat. Could you please confirm whether the correspondence below is accurate, especially regarding 'Constituency Parse Tree Height'?

Thank you in advance.

import spacy
import lftk
import math
import nltk
from supar import Parser


# load models
nlp = spacy.load('en_core_web_sm')
SuPar = Parser.load('crf-con-en')

def preprocess(doc, short=False, see_token=False, see_sent_token=True):
    # note: counters start at 1, as in the original code
    n_token = 1
    n_sent = 1
    token_list = []
    sent_token_list = []
    sent_list = []  # raw string sentences

    # count tokens and sentences, and build the token lists
    for sent in doc.sents:
        n_sent += 1
        sent_list.append(sent.text)
        temp_list = []
        for token in sent:
            if token.text.isalpha():
                temp_list.append(token.text)
                if short:
                    n_token += 1
                    token_list.append(token.lemma_.lower())
                elif len(token.text) >= 3:
                    n_token += 1
                    token_list.append(token.lemma_.lower())
        # keep only sentences with more than 3 alphabetic tokens
        if len(temp_list) > 3:
            sent_token_list.append(temp_list)

    result = {"n_token": n_token,
              "n_sent": n_sent}

    if see_token:
        result["token"] = token_list
    if see_sent_token:
        result["sent_token"] = sent_token_list

    return result

def calculate_nerf(extracted_features):
    f = extracted_features
    # content words: nouns + verbs + numerals + adjectives + adverbs
    n_content = f['n_noun'] + f['n_verb'] + f['n_num'] + f['n_adj'] + f['n_adv']
    return (
        (0.04876 * f['t_kup'] - 0.1145 * f['t_subtlex_us_zipf']) / f['t_sent']
        + (0.3091 * n_content + 0.1866 * f['n_noun'] + 0.2645 * f['to_TreeH_C']) / f['t_sent']
        + (1.1017 * f['t_uword']) / math.sqrt(f['t_word'])
        - 4.125
    )


text = 'This is a simple example sentence. This is another example sentence.'
doc = nlp(text)

# initialize the LFTK extractor by passing in the doc
LFTK = lftk.Extractor(docs=doc)
LFTK.customize(stop_words=True, punctuations=False, round_decimal=3)

preprocessed_features = preprocess(doc, short=False, see_token=False, see_sent_token=True)
# retrieve() is defined in a later comment below
TrSF = retrieve(SuPar, preprocessed_features['sent_token'])
feature_keys = ['t_kup', 't_subtlex_us_zipf', 't_sent', 'n_noun', 'n_verb', 'n_adj', 'n_adv', 'n_num', 't_uword', 't_word']

extracted_features = LFTK.extract(features = feature_keys)
extracted_features.update(TrSF)

# convert to float
extracted_features = {k: float(v) for k, v in extracted_features.items()}

print(calculate_nerf(extracted_features))
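
For reference, written out as an equation (under the LFTK-to-NERF variable correspondence I am assuming above), calculate_nerf computes:

$$
\mathrm{NERF} = \frac{0.04876\, t_{kup} - 0.1145\, t_{zipf}}{t_{sent}} + \frac{0.3091\, n_{content} + 0.1866\, n_{noun} + 0.2645\, TreeH_C}{t_{sent}} + \frac{1.1017\, t_{uword}}{\sqrt{t_{word}}} - 4.125
$$

where n_content = n_noun + n_verb + n_num + n_adj + n_adv, t_zipf = t_subtlex_us_zipf, and TreeH_C = to_TreeH_C from the retrieve function.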
@brucewlee (Owner)

Hi! Thanks always for visiting my research.

TrSF = retrieve(SuPar, preprocessed_features['sent_token'])

Can I take a look at the context of this retrieve function?

@satoshi-2000 (Author)

Hi!

Here is the retrieve function, adapted from this code with some lines commented out.

Please take a look and confirm.

# original signature: def retrieve(SuPar, sent_token_list, n_token, n_sent):
def retrieve(SuPar, sent_token_list):
    # total constituency parse tree height, summed over all sentences
    to_TreeH_C = 0
    #to_FTree_C = 0
    for sent in sent_token_list:
        # parse one sentence with SuPar and convert it to an nltk.Tree
        dataset = SuPar.predict([sent], prob=True, verbose=False)
        parsed_tree = dataset.sentences
        nltk_tree = nltk.Tree.fromstring(str(parsed_tree[0]))
        to_TreeH_C += int(nltk_tree.height())
        #to_FTree_C += len(nltk_tree.flatten())
    result = {
        "to_TreeH_C": to_TreeH_C,
        #"as_TreeH_C": float(division(to_TreeH_C,n_sent)),
        #"at_TreeH_C": float(division(to_TreeH_C,n_token)),
        #"to_FTree_C": to_FTree_C,
        #"as_FTree_C": float(division(to_FTree_C,n_sent)),
        #"at_FTree_C": float(division(to_FTree_C,n_token)),
    }
    return result
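
As a quick illustration for anyone reading along: nltk's Tree.height() counts the leaf level too, so even a flat tree over words has height 2. A minimal sketch, using a hand-written bracketed parse (a hypothetical string, not actual SuPar output):

import nltk

# hypothetical parse, just to show what height() measures
tree = nltk.Tree.fromstring('(S (NP (DT the) (NN cat)) (VP (VBD sat)))')
print(tree.height())  # 4: S -> NP/VP -> DT/NN/VBD -> leaf tokens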

@brucewlee (Owner)

Yes. I think your implementation is valid.

I apologize that I couldn't include constituency parsing capabilities in LFTK. That was intentional: existing constituency parsing libraries and models are not very well maintained, and I wanted to make LFTK as maintainable and lightweight as possible, unlike LingFeat.

Back in the LingFeat era, many people had problems installing and using it because of its heavy and complicated dependencies, which is exactly what I tried to change in LFTK.

@satoshi-2000 (Author)

Thank you for confirming!

I understand. That sounds great! Indeed, LingFeat had complex dependencies.

Anyway, I appreciate your sincere comments!
