
handle yahoo US collector api limit issue (Fix #1953) #1970


Open
wants to merge 1 commit into
base: main


shockylove

Fix eastmoney API pagination for US stock data collection

Description

Fixed pagination issue in _get_eastmoney() function that was causing "request
error" when collecting US stock symbols. Changed from requesting 10,000
symbols per page to proper pagination with 100 symbols per page, iterating
through all pages until completion.

Changes made:

  • Changed pz parameter from 10000 to 100 (page size)
  • Added proper pagination loop with page increment
  • Added exit conditions for API failures, empty responses, or no more data
  • Added 0.01s delay between requests for rate limiting
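
The changes listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `collect_symbols`, `fetch_page`, and `fake_fetch` are hypothetical names, the HTTP request is hidden behind a callable so the loop runs without network access, and the `max_pages` cap and the early exit on a short page are extra safeguards that the description does not mention.

```python
import time
from typing import Callable, List, Optional


def collect_symbols(
    fetch_page: Callable[[int, int], Optional[List[str]]],
    page_size: int = 100,
    delay: float = 0.01,
    max_pages: int = 1000,
) -> List[str]:
    """Collect symbols page by page until the source runs dry.

    fetch_page(page, page_size) is a hypothetical callable that returns the
    symbols on the given 1-based page, or None/[] when nothing is left.
    """
    symbols: List[str] = []
    for page in range(1, max_pages + 1):  # max_pages is an assumed safety cap
        page_symbols = fetch_page(page, page_size)
        if not page_symbols:  # None or empty list: no more data
            break
        symbols.extend(page_symbols)
        if len(page_symbols) < page_size:  # a short page is the last page
            break
        time.sleep(delay)  # small pause between requests
    return symbols


def fake_fetch(page: int, page_size: int, total: int = 12095) -> Optional[List[str]]:
    """Stand-in for the API: serves `total` synthetic symbols in pages."""
    start = (page - 1) * page_size
    if start >= total:
        return None
    return [f"SYM{i}" for i in range(start, min(start + page_size, total))]


all_symbols = collect_symbols(fake_fetch, delay=0)
print(len(all_symbols))
```

Passing the fetcher in as a parameter keeps the loop testable offline; a real fetcher would issue the request with `pn=page` and `pz=page_size`.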

Motivation and Context

Related Issue: eastmoney API now limits page size to 100 symbols maximum,
but the original code was trying to fetch 10,000 symbols in a single request.

Problem: When running python collector.py download_data --source_dir ~/.qlib/stock_data/source/us_data --start 2020-01-01 --end 2020-12-31 --delay 1 --interval 1d --region US, it failed with "request error" because
len(_symbols) < 8000.

Root Cause: The eastmoney API
http://4.push2.eastmoney.com/api/qt/clist/get was returning only 100 symbols
instead of the requested 10,000, triggering the validation error.

How Has This Been Tested?

  • If you are adding a new feature, test on your own test scripts.

API Endpoint Testing:

  • Verified API returns 12,095 total US stocks
  • Tested pagination boundaries:
    • Page 1-120: 100 symbols each
    • Page 121: 95 symbols (final page)
    • Page 122+: 0 symbols (empty, properly triggers exit)
  • Confirmed total collection: 120×100 + 95 = 12,095 symbols
  • Verified no duplicate symbols in collection
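
The page boundaries above follow directly from the reported total and the page size. A quick check of that arithmetic, using only the figures quoted in this description:

```python
import math

total = 12095    # total reported by the API in the tests above
page_size = 100  # new value of the pz parameter

full_pages, remainder = divmod(total, page_size)
last_page = math.ceil(total / page_size)

print(full_pages, remainder, last_page)  # 120 95 121
```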

Test Commands Used:

# Test API response structure
curl -L "http://4.push2.eastmoney.com/api/qt/clist/get?pn=1&pz=100&fs=m:105,m:106,m:107&fields=f12" | jq '.data.total'

# Test pagination boundaries
for page in 120 121 122; do
  curl -s -L "http://4.push2.eastmoney.com/api/qt/clist/get?pn=$page&pz=100&fs=m:105,m:106,m:107&fields=f12" | jq '.data.diff | length'
done

Screenshots of Test Results (if appropriate):

1. Pipeline test: N/A (focused fix, existing pipeline tests should pass)
2. Your own tests:
  - API Total Response: {"data":{"total":12095}} ✓
  - Page 120: 100 symbols ✓
  - Page 121: 95 symbols ✓
  - Page 122: 0 symbols (null diff) ✓
  - Successfully tested retrieval of all 12,095 US stock symbols

Types of changes

- Fix bugs
- Add new feature
- Update documentation

github-actions bot added the "waiting for triage" label (Cannot auto-triage, wait for triage.) on Jul 22, 2025
@shockylove
Author

@microsoft-github-policy-service agree

    symbols.extend(page_symbols)
    page += 1
    time.sleep(0.01)
except:
Collaborator


Can we use an explicit type of exception?
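
One possible way to address this, sketched under assumptions rather than taken from the PR: catch only the errors that parsing a page can plausibly raise. `parse_page` is a hypothetical helper, and the payload shape (`data.diff` entries carrying the symbol under `f12`) is inferred from the `fields=f12` query and the `jq` paths in the test commands. If the request itself is made with the `requests` library, `requests.exceptions.RequestException` would be the matching explicit type for network failures.

```python
import json
from typing import List


def parse_page(body: str) -> List[str]:
    """Extract symbols from one response body; return [] if it is unusable.

    Assumed payload shape: {"data": {"diff": [{"f12": "..."}, ...]}}.
    """
    try:
        diff = json.loads(body)["data"]["diff"]
        return [item["f12"] for item in diff]
    except (json.JSONDecodeError, KeyError, TypeError):
        # JSONDecodeError: body is not valid JSON.
        # KeyError: an expected field is missing.
        # TypeError: "data" or "diff" is null, e.g. a page past the end.
        return []


print(parse_page('{"data": {"diff": [{"f12": "AAPL"}, {"f12": "MSFT"}]}}'))
print(parse_page('{"data": null}'))
```

Unlike a bare `except:`, this does not swallow `KeyboardInterrupt` or unrelated bugs such as a misspelled variable name.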
