Document is not clear on 'topk' and 'mean' for lambdarank_pair_method for lambda rank pair construction #10991

Open
nsh-bay opened this issue Nov 8, 2024 · 1 comment

nsh-bay commented Nov 8, 2024

Hi team,
I have a few questions about the learning-to-rank tutorial at https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html:

  1. I couldn't find how to run exhaustive pair construction with lambdarank_pair_method='mean' or 'topk'. This is my ultimate goal. Note that the number of documents per query varies.
  2. What is the default k (lambdarank_num_pair_per_sample) for the topk and mean methods?
    When I leave it at the default, model.json shows lambdarank_num_pair_per_sample as a full 32-bit number (screenshot). Is that a bug?
  3. I assume that using topk and setting k via lambdarank_num_pair_per_sample to a very large number (e.g., -1 or 1000) would achieve the goal in question 1, but I am not sure how it behaves when lambdarank_num_pair_per_sample is larger than the number of documents in every query.
  4. The example for the mean method is confusing to me: with 3 documents we need at most C(3,2) = 3 pairs, but the example says we can generate lambdarank_num_pair_per_sample * #documents = 2*3 = 6.
    • a. Does that mean there are duplicate pairs in this case? If I set the method to mean and lambdarank_num_pair_per_sample is very large, do the duplicates significantly affect training time?
    • b. How should I set it to achieve the goal in question 1 above?
  • Here is the example quoted from the document.
    For the mean strategy, XGBoost samples lambdarank_num_pair_per_sample pairs for each document in a query list. For example, given a list of 3 documents and lambdarank_num_pair_per_sample is set to 2, XGBoost will randomly sample 6 pairs, assuming the labels for these documents are different. On the other hand, if the pair method is set to topk, XGBoost constructs about k × |query| number of pairs with |query| pairs for each sample at the top k = lambdarank_num_pair_per_sample position. The number of pairs counted here is an approximation since we skip pairs that have the same label.
  5. If I select the 'topk' method with lambdarank_num_pair_per_sample=2 and my query has 4 documents, say ranked d1-d4:
    • a. Which pairs will be constructed? (d1-d2), (d1-d3), (d1-d4), (d2-d3), (d2-d4)?
    • b. The document says it will construct k*|query| pairs, so there should be 2*4=8. How will they be constructed?
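
To make the pair counts in questions 4 and 5 concrete, here is a rough pure-Python sketch of the two strategies as the quoted tutorial text describes them. It is not XGBoost's implementation: the pairing rules in `topk_pairs` and `mean_pairs` are assumptions made for illustration, and all function names are hypothetical.

```python
import random
from itertools import combinations


def exhaustive_pairs(labels):
    """All unordered pairs with different labels (at most C(n, 2))."""
    return [(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] != labels[j]]


def topk_pairs(labels, k):
    """Assumed truncation rule: each of the top-k ranked documents is paired
    with every lower-ranked document; same-label pairs are skipped."""
    n = len(labels)
    return [(i, j) for i in range(min(k, n)) for j in range(i + 1, n)
            if labels[i] != labels[j]]


def mean_pairs(labels, k, seed=0):
    """Assumed sampling rule: for each document, draw k partners with a
    different label, with replacement, so duplicates can occur."""
    rng = random.Random(seed)
    pairs = []
    for i, label in enumerate(labels):
        candidates = [j for j, other in enumerate(labels) if other != label]
        if candidates:
            pairs.extend((i, rng.choice(candidates)) for _ in range(k))
    return pairs


labels = [3.0, 2.0, 1.0, 0.0]  # d1..d4 in ranked order, all labels distinct
print(topk_pairs(labels, k=2))           # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
print(len(exhaustive_pairs(labels)))     # 6
print(len(mean_pairs(labels[:3], k=2)))  # 6
```

Under these assumed rules, topk with k=2 on 4 documents yields 5 distinct pairs, so k*|query| = 8 reads as an upper bound rather than an exact count, and any k at or above the query length reproduces the exhaustive set. The mean sketch returns k*|query| samples, which necessarily repeat pairs once that exceeds the number of distinct pairs.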

Here is one of my GBM settings, along with my environment:

  • xgb.version: 2.1.2 (CPU only)
  • Labels are floating-point values
    'ndcg_exp_gain': False,
    'objective': 'rank:ndcg',
    'lambdarank_pair_method': 'topk',
    'lambdarank_num_pair_per_sample': 10000,
    'verbosity': 1,
    'grow_policy': 'lossguide',
    'learning_rate': 0.3,
    'max_depth': 6,
    'min_child_weight': 0.0,
    'subsample': 0.5,
    'tree_method': 'approx',
    'max_bin': 256,
    'gamma': 0,
    'reg_lambda': 1.0,
    'reg_alpha': 0.0,
    'max_leaves': 32,
    'random_state': 999,
    'n_jobs': -1

Thank you very much.

@trivialfis
Member

I couldn't find how can I run an exhaustive pairs

For now, set it to a number larger than the size of any existing group?
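
One way to pick such a number is to compute the largest group size from the query ids. A minimal sketch, assuming a per-document `qid` list; the `qid` values and the `largest_group_size` helper are hypothetical:

```python
from collections import Counter


def largest_group_size(qid):
    """Return the number of documents in the biggest query group."""
    return max(Counter(qid).values())


# Hypothetical query ids: one entry per document.
qid = [0, 0, 0, 1, 1, 2, 2, 2, 2]

params = {
    "objective": "rank:ndcg",
    "lambdarank_pair_method": "topk",
    # At least the largest group size, so no query is truncated.
    "lambdarank_num_pair_per_sample": largest_group_size(qid),
}
print(params["lambdarank_num_pair_per_sample"])  # 4
```

The resulting dictionary could then be merged with the other training parameters and passed to `xgboost.train` together with a `DMatrix` built from the same `qid`.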

What is the default k

1 for random sampling (mean), 32 for topk.

Is it a bug?

It's an internal indicator for "not-set".
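
When inspecting a saved model, the stored value can be mapped back to the k that would actually apply. This is only a sketch: it assumes the "not-set" indicator is the unsigned 32-bit maximum (the thread only calls it a full 32-bit number), and `effective_num_pair` is a hypothetical helper, not part of XGBoost.

```python
# Assumption: the "not-set" indicator is the unsigned 32-bit maximum.
NOT_SET = 2**32 - 1

# Defaults as stated in the reply above.
DEFAULT_K = {"mean": 1, "topk": 32}


def effective_num_pair(stored_value, pair_method):
    """Map the value stored in the saved model to the k actually used."""
    value = int(stored_value)  # accepts either a string or an integer
    if value == NOT_SET:
        return DEFAULT_K[pair_method]
    return value


print(effective_num_pair("4294967295", "topk"))  # 32
print(effective_num_pair("10000", "topk"))       # 10000
```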

The example with the mean method is a bit tricky

Randomly select k documents, and pair them with all other existing documents in the group.
