
feat: add a script to bulk mark users as spam retroactively #3521

Open
tefkah wants to merge 6 commits into main from tfk/bulk-spam

@tefkah tefkah commented Mar 4, 2026

  • feat: add bulk spam labeling script
  • refactor: write analyze results incrementally

Issue(s) Resolved

Still so many spam users!

This PR adds a script that scans every. single. user in PubPub and determines whether they are likely spam.

It does this by checking a few signals, each of which increases the spam score:

  • Do all of the user's comments contain links? +2
  • Does the user have no association with any Community (no membership, no attribution) and a link in their profile? +2
  • Do they post comments with links and have no memberships? (This alone qualifies someone as spam.) +4
  • Do their comments contain common spam phrases ("I like your blog!") along with a link? +2
  • Do they have a URL in their bio? +2
  • Did they immediately add their website to their profile? +3
  • Do they have a website in their profile and no memberships? +2
  • Does their profile contain spam phrases (BUY PILLS!!)? (The score depends on how severe the phrase is.)

By default the script considers you spam if you score more than 4.
I think running it with a threshold of 6 should weed out almost everyone, but it can be increased.
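As a rough illustration of how the additive scoring above could work, here is a minimal sketch. It is not the PR's `computeUserSpamReport`: the `UserFacts` shape, the `SIGNALS` table, `scoreUser`, and `isFlagged` are all hypothetical names, "link in their profile" is assumed to mean a website or a URL in the bio, and the phrase-dependent profile score is passed in precomputed because its weights are not listed. The cutoff mirrors the `report.score < minScore` check in the diff excerpt, so a score equal to the threshold is kept.

```typescript
// Hypothetical facts gathered per user; the PR's real types are not shown here.
interface UserFacts {
    commentCount: number;
    commentsWithLinks: number;
    spamPhraseCommentsWithLinks: number;
    membershipCount: number;
    hasAttribution: boolean;
    hasWebsite: boolean;
    websiteAddedImmediately: boolean;
    bioHasUrl: boolean;
    profilePhrasePoints: number; // phrase-dependent, so supplied by the caller
}

interface Signal {
    label: string;
    points: number;
    applies: (u: UserFacts) => boolean;
}

// Assumption: "not associated" means no membership and no attribution.
const unaffiliated = (u: UserFacts): boolean => u.membershipCount === 0 && !u.hasAttribution;

// Weights copied from the list in the PR description.
const SIGNALS: Signal[] = [
    {
        label: 'all comments contain links',
        points: 2,
        applies: (u) => u.commentCount > 0 && u.commentsWithLinks === u.commentCount,
    },
    {
        label: 'unaffiliated with a link in profile',
        points: 2,
        applies: (u) => unaffiliated(u) && (u.hasWebsite || u.bioHasUrl),
    },
    {
        label: 'comments with links and no memberships',
        points: 4,
        applies: (u) => u.commentsWithLinks > 0 && u.membershipCount === 0,
    },
    {
        label: 'spam phrase plus link in comments',
        points: 2,
        applies: (u) => u.spamPhraseCommentsWithLinks > 0,
    },
    { label: 'url in bio', points: 2, applies: (u) => u.bioHasUrl },
    { label: 'website added immediately', points: 3, applies: (u) => u.websiteAddedImmediately },
    {
        label: 'website and no memberships',
        points: 2,
        applies: (u) => u.hasWebsite && u.membershipCount === 0,
    },
];

// Sum the points of every signal that applies, plus the profile phrase points.
function scoreUser(u: UserFacts): { score: number; signals: string[] } {
    const hits = SIGNALS.filter((s) => s.applies(u));
    const score = hits.reduce((sum, s) => sum + s.points, u.profilePhrasePoints);
    return { score, signals: hits.map((s) => s.label) };
}

// A user is kept in the output when the score is not below the threshold.
const isFlagged = (score: number, minScore: number): boolean => score >= minScore;
```

Keeping the weights in one table makes it easy to see why several signals overlap: a user with a website and no memberships can collect points from more than one row, which is how a profile-only spammer can exceed a threshold of 6.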

Test Plan

Screenshots (if applicable)

Optional

Notes/Context/Gotchas

Supporting Docs

"@types/mime-types": "^2.1.3",
"@types/multer": "^1.4.9",
"@types/node": "^18.11.4",
"@types/node": "^24.4.0",
Member Author

wild that we were still on such an old one

Contributor

Copilot AI left a comment

Pull request overview

Adds a two-phase bulk spam detection tool (scanSpamUsers) that analyzes all PubPub users for spam signals and can tag flagged users. Also introduces shared content analysis utilities and refactors the user spam scoring to support async, richer analysis.

Changes:

  • New scanSpamUsers tool with --analyze and --execute phases for bulk spam detection and tagging
  • Refactored userScore.ts to expose an async computeUserSpamReport that checks comments, memberships, and profile signals (previously only checked profile phrases)
  • Bumped @types/node from v18 to v24
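The shared content analysis utilities are described only at a high level here, so the following is a minimal sketch of what link extraction and spam template matching might look like. The names `extractLinks`, `SPAM_TEMPLATES`, `matchesSpamTemplate`, and `isTemplatedLinkSpam` are hypothetical stand-ins; the PR's own `extractLinksFromText` is not shown, and the URL pattern and the single template phrase (taken from the PR description) are assumptions.

```typescript
// Assumed pattern: http(s) URLs up to the next whitespace, quote, or bracket.
const URL_PATTERN = /\bhttps?:\/\/[^\s<>"')\]]+/gi;

// Return the distinct URLs found in a piece of text; empty input yields [].
function extractLinks(text: string | null | undefined): string[] {
    if (!text) return [];
    const matches = text.match(URL_PATTERN) ?? [];
    return Array.from(new Set(matches));
}

// Hypothetical template list; the real phrase list is not part of this excerpt.
const SPAM_TEMPLATES: RegExp[] = [/i like your blog/i];

function matchesSpamTemplate(text: string): boolean {
    return SPAM_TEMPLATES.some((re) => re.test(text));
}

// A comment counts toward the "spam phrase plus link" signal only if both are present.
function isTemplatedLinkSpam(text: string): boolean {
    return matchesSpamTemplate(text) && extractLinks(text).length > 0;
}
```

Requiring both a template match and a link keeps a friendly "I like your blog!" from a real reader from counting on its own.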

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tools/scanSpamUsers.ts | New bulk spam scanning/tagging CLI tool |
| server/spamTag/contentAnalysis.ts | New shared utilities for link extraction, spam template matching |
| server/spamTag/userScore.ts | Refactored spam scoring with async report including comment/membership signals |
| utils/jsonArrayWriter.ts | Utility for incrementally writing JSON arrays to disk |
| types/spam.ts | Added automatedScan field to UserSpamTagFields |
| tools/index.js | Registered new tool |
| package.json / pnpm-lock.yaml | Bumped @types/node to v24 |
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported
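To make the idea behind incrementally writing a JSON array concrete, here is a minimal sketch. It is not the contents of utils/jsonArrayWriter.ts: only `push` and `length` appear in the diff excerpt, so the class body, the `close` method, and the `Sink` callback are assumptions. The sink is a plain function so the sketch stays self-contained; on disk it could be something like `(chunk) => fs.appendFileSync(path, chunk)`.

```typescript
// Hypothetical sink: receives each chunk of output as it is produced.
type Sink = (chunk: string) => void;

class JsonArrayWriter<T> {
    private count = 0;
    private closed = false;
    private readonly write: Sink;

    constructor(write: Sink) {
        this.write = write;
        this.write('['); // open the array immediately
    }

    // Number of entries written so far.
    get length(): number {
        return this.count;
    }

    // Serialize one entry and emit it right away, so memory use stays flat.
    push(entry: T): void {
        if (this.closed) throw new Error('writer already closed');
        const separator = this.count === 0 ? '\n' : ',\n';
        this.write(separator + JSON.stringify(entry));
        this.count++;
    }

    // Emit the closing bracket; the output is valid JSON only after this call.
    close(): void {
        if (this.closed) return;
        this.write('\n]\n');
        this.closed = true;
    }
}
```

The trade-off in this design is that a crash before `close()` leaves a file without its closing bracket, but every entry written up to that point is still on disk and recoverable.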

contains an array of entries sorted by score descending.

--output <path> required. where to write the results json.
--min-score <n> minimum score to include in output. default 4.
process.exit(1);
}
const entries: AnalyzeEntry[] = JSON.parse(fs.readFileSync(inputPath, 'utf-8'));
const minScore = parseInt(parseArg('min-score') ?? '0', 10);
Comment on lines +149 to +207
while (true) {
    const users = await User.findAll({
        where: { spamTagId: { [Op.is]: null as any } },
        attributes: [
            'id',
            'fullName',
            'email',
            'slug',
            'title',
            'bio',
            'website',
            'createdAt',
            'updatedAt',
        ],
        limit: BATCH_SIZE,
        offset,
        order: [['createdAt', 'DESC']],
    });
    if (users.length === 0) break;

    for (const user of users) {
        scanned++;
        if (skipIds.has(user.id)) continue;
        try {
            const report = await computeUserSpamReport(user);
            if (report.score < minScore) continue;

            const commentInfo = await getRecentCommentsWithLinks(user.id, 5);
            const hasProfileSignal = report.signals.some(
                (s) => s.includes('website') || s.includes('bio'),
            );
            const profile = hasProfileSignal
                ? {
                      website: user.website ?? null,
                      bio: user.bio ?? null,
                      bioUrls: extractLinksFromText(user.bio),
                  }
                : null;

            writer.push({
                index: writer.length,
                userId: user.id,
                email: user.email ?? '',
                slug: user.slug,
                fullName: user.fullName,
                createdAt: String(user.createdAt),
                score: report.score,
                signals: report.signals,
                commentCount: commentInfo.total,
                commentsWithLinks: commentInfo.withLinks,
                recentComments: commentInfo.evidence,
                profile,
            });
        } catch (err) {
            console.error(`error analyzing user ${user.id}:`, err);
        }
    }
    console.log(`scanned=${scanned} flagged=${writer.length}`);
    offset += BATCH_SIZE;
