Handles a bunch of RSS formats because everyone implements them differently:
type Item struct {
	Title     string `xml:"title"`
	Link      string `xml:"link"`
	PubDate   string `xml:"pubDate"`   // regular rss
	Date      string `xml:"date"`      // some use this
	Published string `xml:"published"` // atom folks
	Updated   string `xml:"updated"`   // fallback
}
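Since any one of those four date fields might be the one that's filled in, something has to pick a winner. Here's a minimal sketch of how that could look. The `rawDate` helper and its priority order are assumptions for illustration, not necessarily what the real code does.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Item mirrors the struct above.
type Item struct {
	Title     string `xml:"title"`
	Link      string `xml:"link"`
	PubDate   string `xml:"pubDate"`
	Date      string `xml:"date"`
	Published string `xml:"published"`
	Updated   string `xml:"updated"`
}

// rawDate is a hypothetical helper: it returns the first non-empty date
// field, trimmed. The order (PubDate, Date, Published, Updated) is assumed.
func rawDate(it Item) string {
	for _, s := range []string{it.PubDate, it.Date, it.Published, it.Updated} {
		if s = strings.TrimSpace(s); s != "" {
			return s
		}
	}
	return ""
}

func main() {
	// An Atom-style entry that only carries <updated>.
	const entry = `<entry><title>Hello</title><updated>2024-01-02T03:04:05Z</updated></entry>`

	var it Item
	if err := xml.Unmarshal([]byte(entry), &it); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(it.Title, rawDate(it))
}
```

An empty string means the item had no date at all, which the caller has to deal with.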
RSS feeds use whatever date format they feel like. We handle:
var dateFormats = []string{
	time.RFC1123Z,             // most RSS
	time.RFC3339,              // atom's favorite
	"02 Jan 2006 15:04 -0700", // why do people use this
	"2006-01-02",              // at least it's simple
}
- Strips HTML (nobody needs that in a feed)
- Fixes entities (&amp; → &)
- Handles missing descriptions (minimalist blogs)
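The cleanup steps above could be sketched roughly like this. It's a simplification under a few assumptions: the tag stripping is a crude regex rather than a real HTML parser, and `cleanText`, `descriptionOrDefault`, and the placeholder text are hypothetical names picked for the example.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// tagPattern matches anything that looks like an HTML tag. This is a crude
// approximation and will not cope with every malformed snippet.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// cleanText strips tags, decodes entities such as &amp;, and collapses
// runs of whitespace. Tags are removed before decoding so that escaped
// text like "a &lt; b" is not mistaken for markup.
func cleanText(s string) string {
	s = tagPattern.ReplaceAllString(s, " ")
	s = html.UnescapeString(s)
	return strings.Join(strings.Fields(s), " ")
}

// descriptionOrDefault returns the cleaned description, or a placeholder
// when the feed left it empty (the placeholder text is an assumption).
func descriptionOrDefault(desc string) string {
	if c := cleanText(desc); c != "" {
		return c
	}
	return "(no description)"
}

func main() {
	fmt.Println(cleanText("<p>Tom &amp; Jerry</p>"))
	fmt.Println(descriptionOrDefault(""))
}
```

Replacing tags with a space keeps words in adjacent paragraphs from getting glued together; the whitespace collapse afterwards tidies up the extras.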
- Reads feeds from whitelist.toml
- Downloads them all at once (because waiting sucks)
- Parses XML, prays it's valid
- Cleans up the mess
- Dumps a nice markdown file in output/
The code's modular so we can add new formats when someone inevitably implements RSS wrong again. This has been so fun to troubleshoot.