A multi-step data scraping and processing pipeline that collects company data from Crunchbase, merges it with static sources, and generates network-based datasets for The Wall addon.
- Node.js 21
- npm
- Crunchbase account credentials
- Install dependencies: `npm install`
- Create a `.env` file:
  [email protected]
  PASSWORD=your-password
- Configure Crunchbase Saved Searches:
  Create three saved searches in Crunchbase with these filters:
  - Geography: Headquarter Location = Israel
    - Organization Type: Companies
    - Status: Active
    - CB Rank (Company): Number input fields for rank filtering
    - Rank ranges: 1-60k, 60k-145k, 145k-290k, 290k-440k, 440k-590k, 590k-750k, 750k-920k, 920k-1M, 1M-1.1M, 1.1M-1.3M, 1.3M-1.53M, 1.53M-1.8M, 1.8M-2.05M, 2.05M-2.29M, 2.29M-2.58M, 2.58M-2.85M, 2.85M-3.18M, 3.18M-3.54M, 3.54M-9.99M
  - People/Founders: Founder Location = Israel
    - Organization Type: Companies
    - Status: Active
    - CB Rank (Company): Number input fields
    - Rank ranges: 1-150k, 150k-1.2M, 1.2M-999M
  - People/Investors: Investor Location = Israel (founder location ≠ Israel)
    - Organization Type: Companies
    - Status: Active
    - CB Rank (Company): Number input fields
    - Rank ranges: 1-50k, 50k-999M
Note: Crunchbase limits results to 1000 per query, so the scraper splits searches by rank ranges. Update cbSearchUrl in src/tasks/scrap.ts with your saved search URLs.
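The rank-range shorthand above can be turned into numeric bounds before it is typed into the CB Rank fields. The following is a minimal sketch only: `parseRank`, `parseRankRanges`, and `isContiguous` are hypothetical helpers, not functions from `scrap.ts`, and the `k`/`M` suffix handling is an assumption based on the notation used in the lists.

```typescript
// Hypothetical helper: converts shorthand such as "60k" or "1.53M" to a number.
function parseRank(label: string): number {
  const match = /^(\d+(?:\.\d+)?)([kM]?)$/.exec(label.trim());
  if (!match) throw new Error(`Unrecognized rank label: ${label}`);
  const multiplier = match[2] === "k" ? 1000 : match[2] === "M" ? 1000000 : 1;
  // Round to avoid floating-point noise from values like 1.53 * 1000000.
  return Math.round(Number(match[1]) * multiplier);
}

type RankRange = { from: number; to: number };

// Parses a comma-separated list such as "1-50k, 50k-999M".
function parseRankRanges(list: string): RankRange[] {
  return list.split(",").map((part) => {
    const bounds = part.trim().split("-");
    if (bounds.length !== 2) throw new Error(`Unrecognized range: ${part}`);
    return { from: parseRank(bounds[0]), to: parseRank(bounds[1]) };
  });
}

// Checks that every range starts where the previous one ends, so no ranks are skipped.
function isContiguous(ranges: RankRange[]): boolean {
  return ranges.every((range, i) => i === 0 || range.from === ranges[i - 1].to);
}

const investorRanges = parseRankRanges("1-50k, 50k-999M");
```

A contiguity check like this is a cheap way to catch a typo in a range list before a long scraping run.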
- Scraping (`scrap.ts`): Scrapes Crunchbase using Puppeteer. Navigates to saved searches, sets rank filters, extracts company data, handles pagination. Output: `results/1_batches/cb/*.json`
- Merge CB (`merge_cb.ts`): Consolidates all CB batches, merges duplicates by `id`, combines `reasons` arrays. Output: `results/2_merged/1_MERGED_CB.json`
- Generate Static Data (`gen_static.ts`, `gen_buyIsraeliTech.ts`): Processes BDS and BuyIsTech data. Output: `results/1_batches/static/*.json`
- Merge All (`merge_static.ts`): Combines CB + static data, deduplicates by website, applies manual overrides, normalizes URLs. Output: `results/2_merged/2_MERGED_ALL.json`
- Extract Social (`extract_social.ts`): Extracts LinkedIn, Facebook, Twitter handles using regex. Output: `results/3_networks/FLAGGED_*.json`
- Extract Websites (`extract_websites.ts`): Extracts domains, filters invalid sites. Output: `results/3_networks/WEBSITES.json`
- Generate Final DB (`final.ts`): Merges network files by `id`, adds alternatives. Output: `results/4_final/ALL.json`
- Alternatives Report (`alternatives_report.ts`): Validates that the top 10 companies have alternatives. Output: console warnings
- Copy to Addon (`copy_to_addon.ts`) [Optional]: Copies final files to `../addon/src/db/`
- Validate URLs (`validate.ts`) [Optional]: Validates URLs, reports status codes. Output: `results/2_merged/report.json`
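The Merge CB step (merging duplicates by `id` and combining `reasons` arrays) could look roughly like the sketch below. It is an illustration under assumptions, not the code in `merge_cb.ts`: the `Item` shape is reduced to three fields, the reason values are made up, and dropping repeated reasons is an assumption about the intended behavior.

```typescript
// Simplified, hypothetical item shape; the real scraped items carry many more fields.
type Item = { id: string; name: string; reasons: string[] };

// Merges batches into one list, keeping one entry per id and the union of its reasons.
function mergeById(batches: Item[][]): Item[] {
  const merged = new Map<string, Item>();
  for (const batch of batches) {
    for (const item of batch) {
      const existing = merged.get(item.id);
      if (!existing) {
        // Copy the array so later merges do not mutate the input batch.
        merged.set(item.id, { ...item, reasons: item.reasons.slice() });
      } else {
        // Assumption: repeated reasons are collapsed, preserving first-seen order.
        existing.reasons = Array.from(new Set(existing.reasons.concat(item.reasons)));
      }
    }
  }
  return Array.from(merged.values());
}

// Illustrative data only; "hq" and "founder" are placeholder reason codes.
const mergedItems = mergeById([
  [{ id: "a", name: "Example Co", reasons: ["hq"] }],
  [
    { id: "a", name: "Example Co", reasons: ["founder", "hq"] },
    { id: "b", name: "Other Co", reasons: ["founder"] },
  ],
]);
```

A `Map` keeps insertion order, so the merged list follows the order in which ids first appear across the batches.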
`npm run dev` prompts for: Scraping, URL validation, Copy to addon
Other scripts: npm run merge, npm run convert
1_batches/cb/*.json → 2_merged/1_MERGED_CB.json
↓
1_batches/static/*.json → 2_merged/2_MERGED_ALL.json
↓
3_networks/*.json (separate by network type)
↓
4_final/ALL.json (unified format)
- `ScrappedItemType`: Full company data (name, id, cbLink, reasons, social links, founderIds, investorIds, cbRank, etc.)
- `APIEndpointDomainsResult`: `{ id, selector, name, reasons, s? }`
- `FinalDBFileType`: `{ id, n, r, s?, ws?, li?, fb?, tw?, alt? }`
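To picture how a full record might shrink into the compact final shape, here is a hedged sketch. The source lists the short keys but does not define them, so the mapping is an assumption: `n` is taken to abbreviate the name, `r` the reasons, `ws` the website, and `li`/`fb`/`tw` the LinkedIn, Facebook, and Twitter handles. `s` and `alt` are left out because their meaning is not stated. `FullItem`, `CompactItem`, and `toCompact` are hypothetical names.

```typescript
// Hypothetical, reduced stand-ins for ScrappedItemType and FinalDBFileType.
type FullItem = {
  id: string;
  name: string;
  reasons: string[];
  ws?: string;
  li?: string;
  fb?: string;
  tw?: string;
};

type CompactItem = {
  id: string;
  n: string;
  r: string[];
  ws?: string;
  li?: string;
  fb?: string;
  tw?: string;
};

// Copies only the fields that are present, so absent links add no keys to the output.
function toCompact(item: FullItem): CompactItem {
  const compact: CompactItem = { id: item.id, n: item.name, r: item.reasons };
  if (item.ws !== undefined) compact.ws = item.ws;
  if (item.li !== undefined) compact.li = item.li;
  if (item.fb !== undefined) compact.fb = item.fb;
  if (item.tw !== undefined) compact.tw = item.tw;
  return compact;
}

// Illustrative record; all values are placeholders.
const compactExample = toCompact({
  id: "a",
  name: "Example Co",
  reasons: ["hq"],
  ws: "example.com",
});
```

Short keys and omitted optional fields keep the shipped JSON small, which matters for a database bundled with a browser addon.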
Defined in src/tasks/manual_resolve/duplicate.ts:
- `manualDeleteIds`: IDs to exclude
- `manualOverrides`: Field overrides by company name
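A rough sketch of how these two structures might be applied during a merge follows. The shapes are assumptions: `manualDeleteIds` is modeled as a string array and `manualOverrides` as a record keyed by company name, while `Company` and `applyManualResolution` are hypothetical names that do not come from `duplicate.ts`.

```typescript
// Hypothetical, reduced company shape for illustration.
type Company = { id: string; name: string; ws?: string; li?: string };

// Assumed shapes; the real definitions live in the manual_resolve folder.
const manualDeleteIds: string[] = ["id-2"];
const manualOverrides: Record<string, Partial<Company>> = {
  "Example Co": { ws: "example.com" },
};

// Drops excluded ids, then overlays any override fields registered for the company name.
function applyManualResolution(items: Company[]): Company[] {
  const deleted = new Set(manualDeleteIds);
  return items
    .filter((item) => !deleted.has(item.id))
    .map((item) => {
      const override = manualOverrides[item.name];
      return override ? { ...item, ...override } : item;
    });
}

// Illustrative input; ids, names, and URLs are placeholders.
const resolved = applyManualResolution([
  { id: "id-1", name: "Example Co", ws: "old-example.com" },
  { id: "id-2", name: "Removed Co" },
  { id: "id-3", name: "Untouched Co", li: "untouched" },
]);
```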
- Scraping can be time-consuming; existing batch files are skipped
- Pipeline is resumable
- Manual resolution handles edge cases
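The skip-if-exists behavior that makes scraping resumable can be sketched as a simple set difference. This is an illustration only: `pendingBatches` and the batch file names are hypothetical, and in a real run the list of existing files would come from a directory listing (for example Node's `fs.readdirSync`).

```typescript
// Returns the planned batch files that have not been written yet.
function pendingBatches(planned: string[], existingFiles: string[]): string[] {
  const done = new Set(existingFiles);
  return planned.filter((fileName) => !done.has(fileName));
}

// Hypothetical file names; the actual naming scheme is not given in this document.
const plannedFiles = ["hq_1-60k.json", "hq_60k-145k.json", "hq_145k-290k.json"];
const alreadyOnDisk = ["hq_1-60k.json"];

const remaining = pendingBatches(plannedFiles, alreadyOnDisk);
```

With this approach, deleting a batch file is enough to force that rank range to be scraped again on the next run.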
Important: There are two types of platforms:
- Platforms from Crunchbase (e.g., LinkedIn, Facebook, Twitter): Extracted automatically from scraped data
- Platforms only from manual overrides (e.g., Instagram, GitHub): Only added manually, not scraped from Crunchbase
- `@theWallProject/addonCommon`: Add `API_ENDPOINT_RULE_PLATFORM_NAME` (regex capturing username in group 1), `DBFileNames.FLAGGED_PLATFORM_NAME`, and `platformName?: string` to `FinalDBFileType`
  - Field naming: use short lowercase (e.g., Instagram → `ig`, GitHub → `gh`)
For platforms from Crunchbase:
- `src/types.ts`:
  - Add `platformName: z.string().optional()` to `ScrappedItemSchema`
  - Add `platformName: z.array(z.string()).optional()` to `ManualItemSchema`
- `src/tasks/extract_social.ts`:
  - Import `API_ENDPOINT_RULE_PLATFORM_NAME` and `DBFileNames.FLAGGED_PLATFORM_NAME`
  - Add extraction logic (follow the Facebook/Twitter pattern)
  - Add file output using `DBFileNames.FLAGGED_PLATFORM_NAME`
  - Add a log statement
- `src/tasks/validate_urls.ts`:
  - Add `"platformName"` to the `validateItemLinks` links array (since it exists on `ScrappedItemType`)
For platforms only from manual overrides:
- `src/types.ts`:
  - DO NOT add to `ScrappedItemSchema` (only in manual overrides)
  - Add `platformName: z.array(z.string()).optional()` to `ManualItemSchema`
- `src/tasks/extract_social.ts`:
  - SKIP this step (not extracted from Crunchbase)
- `src/tasks/validate_urls.ts`:
  - DO NOT add to the `validateItemLinks` links array (the field doesn't exist on `ScrappedItemType`)
For both types of platforms:
- `src/tasks/manual_resolve/manualOverrides.ts`: Add `platformName?: string | string[]` to the `ManualOverrideFields` type
- `src/tasks/merge_static.ts`:
  - Import `API_ENDPOINT_RULE_PLATFORM_NAME`
  - Add `"platformName"` to the `extractIdentifier` function (add a new case)
  - Add `"platformName"` to the `linkFields` array
  - Update the `ScrappedItemWithOverrides` type to include `platformName?: string`
  - Add a `platformName` case to the `setField` helper function
  - Protocol removal is handled automatically via `setField`
- `src/tasks/validate_urls.ts`:
  - Import `API_ENDPOINT_RULE_PLATFORM_NAME`
  - Add `"platformName"` to the `LinkField` type
  - Add `platformName?: string[]` to the `CategorizedUrls` and `OverrideWithUrls` types
  - Add detection in the `categorizeUrl` function
  - Add a `platformName` case in the `formatValue` function (both processed and unprocessed branches)
  - Update the `saveManualOverrides` template to include `platformName` in `ManualOverrideFields`
  - Add a `platformName` case in the `collectExtraUrls` categorization loop and merging logic
  - Update the `ManualOverrideValue` type to include `platformName?: string | string[]`
- `src/tasks/final.ts`: Add a case for `DBFileNames.FLAGGED_PLATFORM_NAME` returning `"platformName"` in the `keyFromFileName` function
- Addon (`theWallAddon/src/storage.ts`): Add a case to the `getSelectorKey` function mapping the domain to its selector key (e.g., `"github.com"` → `"gh"`)
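The addon-side change in the last step might look roughly like the sketch below. The real signature of `getSelectorKey` in `storage.ts` is not shown in this document, so the parameter and return types are assumptions, as are the LinkedIn, Facebook, Twitter, and Instagram domains; only the `"github.com"` → `"gh"` pair and the `ig` key come from the text above.

```typescript
// Sketch of a domain-to-selector-key mapping; the signature is an assumption.
function getSelectorKey(domain: string): string | undefined {
  // Normalize case and drop a leading "www." so variants resolve to the same key.
  const normalized = domain.toLowerCase().replace(/^www\./, "");
  switch (normalized) {
    case "linkedin.com":
      return "li";
    case "facebook.com":
      return "fb";
    case "twitter.com":
      return "tw";
    case "instagram.com":
      return "ig";
    // New platform: one added case per supported domain.
    case "github.com":
      return "gh";
    default:
      return undefined;
  }
}
```

Returning `undefined` for unknown domains lets the caller decide whether to fall back to a generic website lookup.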
Checklist:
- Regex extracts usernames correctly
- Output files created (if platform from Crunchbase)
- Manual override extraction works
- Addon updated
- No linter errors
- No `any` type hacks used
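For the first checklist item, a quick way to confirm that a rule captures the username in group 1 is to run it against a handful of URLs. The sketch below uses GitHub as the example platform: the constant name `API_ENDPOINT_RULE_GITHUB` follows the `API_ENDPOINT_RULE_PLATFORM_NAME` pattern but is hypothetical, and the regex itself is an illustration rather than the rule shipped in `@theWallProject/addonCommon`.

```typescript
// Hypothetical rule: optional protocol and "www.", then the username in capture group 1.
const API_ENDPOINT_RULE_GITHUB =
  /^(?:https?:\/\/)?(?:www\.)?github\.com\/([A-Za-z0-9-]+)(?:[/?#].*)?$/i;

// Returns the captured username, or undefined when the URL does not match the rule.
function extractUsername(url: string): string | undefined {
  const match = API_ENDPOINT_RULE_GITHUB.exec(url.trim());
  return match ? match[1] : undefined;
}

// Sample inputs with the value each one is expected to yield (placeholders only).
const samples: Array<[string, string | undefined]> = [
  ["https://github.com/example-org", "example-org"],
  ["github.com/example-org/some-repo", "example-org"],
  ["https://www.github.com/example-org/", "example-org"],
  ["https://github.com/", undefined],
  ["https://example.com/github.com/user", undefined],
];

const allSamplesPass = samples.every(
  ([url, expected]) => extractUsername(url) === expected,
);
```

Including negative cases, such as a bare domain or a lookalike path on another host, helps catch rules that are anchored too loosely.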