handle yahoo US collector api limit issue (Fix #1953) #1970

shockylove · 2025-07-22T06:55:06Z

Fix eastmoney API pagination for US stock data collection

Description

Fixed a pagination issue in the _get_eastmoney() function that caused a
"request error" when collecting US stock symbols. The function previously
requested 10,000 symbols in a single page; it now requests 100 symbols per
page and iterates through the pages until all symbols are collected.

Changes made:

- Changed the pz (page size) parameter from 10000 to 100
- Added a pagination loop that increments the page number
- Added exit conditions for API failures, empty responses, or no more data
- Added a 0.01s delay between requests for rate limiting
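
The changes listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: collect_symbols and fetch_page are hypothetical names, the response is assumed to look like {"data": {"diff": [{"f12": ...}, ...]}} with diff as a list, and the HTTP call is injected as a callable so the loop can be exercised without network access.

```python
import time
from typing import Callable, List, Optional

PAGE_SIZE = 100  # page-size cap described in this PR


def collect_symbols(
    fetch_page: Callable[[int, int], Optional[dict]],
    page_size: int = PAGE_SIZE,
    delay: float = 0.01,
    max_pages: int = 1000,
) -> List[str]:
    """Collect symbols page by page until the API has no more data.

    fetch_page(page, page_size) is a hypothetical callable that returns the
    decoded JSON payload for one page, or None if the request failed.
    """
    symbols: List[str] = []
    for page in range(1, max_pages + 1):  # hard upper bound as a safety net
        payload = fetch_page(page, page_size)
        if not payload:  # API failure
            break
        data = payload.get("data")
        if not data:  # empty response
            break
        diff = data.get("diff")
        if not diff:  # null or empty diff: past the last page
            break
        symbols.extend(item["f12"] for item in diff)
        if len(diff) < page_size:  # short page: this was the last one
            break
        time.sleep(delay)  # small pause between requests
    return symbols
```

Stopping on a short page avoids one extra request in the common case; the empty-diff check still covers totals that are an exact multiple of the page size.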

Motivation and Context

Related Issue: eastmoney API now limits page size to 100 symbols maximum,
but the original code was trying to fetch 10,000 symbols in a single request.

Problem: Running the following command failed with "request error" because
len(_symbols) < 8000:

python collector.py download_data --source_dir ~/.qlib/stock_data/source/us_data --start 2020-01-01 --end 2020-12-31 --delay 1 --interval 1d --region US

Root Cause: The eastmoney API
http://4.push2.eastmoney.com/api/qt/clist/get was returning only 100 symbols
instead of the requested 10,000, triggering the validation error.

How Has This Been Tested?

API Endpoint Testing:

Verified API returns 12,095 total US stocks
Tested pagination boundaries:
- Page 1-120: 100 symbols each
- Page 121: 95 symbols (final page)
- Page 122+: 0 symbols (empty, properly triggers exit)
Confirmed total collection: 120×100 + 95 = 12,095 symbols
Verified no duplicate symbols in collection
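
The page boundaries above follow from integer division of the reported total by the page size. A quick check in Python, using only the figures listed above:

```python
import math

total = 12_095   # total US stocks reported by the API
page_size = 100  # symbols per page

full_pages, remainder = divmod(total, page_size)  # 120 full pages, 95 left over
last_page = math.ceil(total / page_size)          # page 121 is the final page

assert full_pages * page_size + remainder == total
```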

Test Commands Used:

# Test API response structure
curl -L "http://4.push2.eastmoney.com/api/qt/clist/get?pn=1&pz=100&fs=m:105,m:106,m:107&fields=f12" | jq '.data.total'

# Test pagination boundaries
for page in 120 121 122; do
  curl -s -L "http://4.push2.eastmoney.com/api/qt/clist/get?pn=$page&pz=100&fs=m:105,m:106,m:107&fields=f12" | jq '.data.diff | length'
done

Screenshots of Test Results (if appropriate):

1. Pipeline test: N/A (focused fix, existing pipeline tests should pass)
2. Your own tests:
  - API Total Response: {"data":{"total":12095}} ✓
  - Page 120: 100 symbols ✓
  - Page 121: 95 symbols ✓
  - Page 122: 0 symbols (null diff) ✓
  - Successfully tested retrieval of all 12,095 US stock symbols

Types of changes

- Fix bugs
- Add new feature
- Update documentation

shockylove · 2025-07-22T06:58:57Z

@microsoft-github-policy-service agree

you-n-g · 2025-07-22T15:16:03Z

scripts/data_collector/utils.py

+                symbols.extend(page_symbols)
+                page += 1
+                time.sleep(0.01) 
+            except:


Can we use an explicit type of exception?
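
One possible way to address this comment, as a sketch only and not the code in this PR: replace the bare except with a tuple of the failures a page request can plausibly produce. load_page_symbols and fetch_text are hypothetical names, and the payload shape {"data": {"diff": [{"f12": ...}]}} is an assumption. OSError is listed because both requests.RequestException and urllib.error.URLError derive from it.

```python
import json
import logging
from typing import Callable, List, Optional

logger = logging.getLogger(__name__)

# Explicit failure modes instead of a bare `except:`:
#   OSError              - network errors (requests.RequestException and
#                          urllib.error.URLError are both OSError subclasses)
#   json.JSONDecodeError - the response body is not valid JSON
#   KeyError, TypeError  - the payload lacks "data"/"diff"/"f12" or they are null
HANDLED_ERRORS = (OSError, json.JSONDecodeError, KeyError, TypeError)


def load_page_symbols(fetch_text: Callable[[], str]) -> Optional[List[str]]:
    """Return the symbols on one page, or None if the page could not be read.

    fetch_text is a hypothetical callable returning the raw response body.
    """
    try:
        payload = json.loads(fetch_text())
        return [item["f12"] for item in payload["data"]["diff"]]
    except HANDLED_ERRORS as exc:
        logger.warning("symbol page request failed: %r", exc)
        return None
```

Unlike a bare except, this lets KeyboardInterrupt and unrelated programming errors propagate instead of silently ending the pagination loop.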

handle yahoo US collecotr api limit issue (Fix microsoft#1953)

1fce024

github-actions bot added the waiting for triage Cannot auto-triage, wait for triage. label Jul 22, 2025

you-n-g reviewed Jul 22, 2025
