feat: add a script to bulk mark users as spam retroactively by tefkah · Pull Request #3521 · knowledgefutures/pubpub

tefkah · 2026-03-04T17:16:48Z

feat: add bulk spam labeling script
refactor: write analyze results incrementally

Issue(s) Resolved

Still so many spam users!

This PR adds a script that scans every. single. user in PubPub and determines whether they are likely spam.

It does this by checking a few signals, each of which increases the spam score:

Do all of the user's comments contain links? +2
Is the user not associated with any Community (no membership, no attribution) while having a link in their profile? +2
Do they post comments with links without having any memberships? +4 (this alone qualifies someone as spam)
Do their comments contain common spam phrases ("I like your blog!") together with a link? +2
Do they have a URL in their bio? +2
Did they add a website to their profile immediately? +3
Do they have a website in their profile and no memberships? +2
Does their profile contain spam phrases ("BUY PILLS!!")? (the weight depends on how severe the phrase is)

By default the script considers you spam if you score 4 or more (the `--min-score` default).
I think running it with a threshold of 6 should weed out almost everyone, but the threshold can be raised further.
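
To make the additive scoring concrete, here is a minimal sketch of how the listed signals could combine into a score. The `UserFacts` shape, the rule names, and the function names are hypothetical and are not taken from `userScore.ts`; the weights are the ones listed in this description, and the phrase-based profile signal is left out because its weight varies by phrase.

```typescript
// Hypothetical pre-computed facts about one user. The real report in this PR
// is built from database queries that this sketch does not model.
interface UserFacts {
	commentCount: number;
	commentsWithLinks: number;
	spamPhraseCommentsWithLinks: number;
	membershipCount: number;
	hasAttribution: boolean;
	profileHasLink: boolean;
	bioHasUrl: boolean;
	hasWebsite: boolean;
	addedWebsiteImmediately: boolean;
}

interface SpamReport {
	score: number;
	signals: string[];
}

interface Rule {
	name: string;
	weight: number;
	applies: (f: UserFacts) => boolean;
}

// Weights copied from the list in the PR description; rule names are made up.
const RULES: Rule[] = [
	{
		name: 'all-comments-have-links',
		weight: 2,
		applies: (f) => f.commentCount > 0 && f.commentsWithLinks === f.commentCount,
	},
	{
		name: 'unaffiliated-with-profile-link',
		weight: 2,
		applies: (f) => f.membershipCount === 0 && !f.hasAttribution && f.profileHasLink,
	},
	{
		name: 'link-comments-no-memberships',
		weight: 4,
		applies: (f) => f.commentsWithLinks > 0 && f.membershipCount === 0,
	},
	{
		name: 'spam-phrase-comment-with-link',
		weight: 2,
		applies: (f) => f.spamPhraseCommentsWithLinks > 0,
	},
	{ name: 'bio-url', weight: 2, applies: (f) => f.bioHasUrl },
	{ name: 'website-added-immediately', weight: 3, applies: (f) => f.addedWebsiteImmediately },
	{
		name: 'website-no-memberships',
		weight: 2,
		applies: (f) => f.hasWebsite && f.membershipCount === 0,
	},
];

// Sum the weights of every rule that applies and record which ones fired.
function scoreUser(facts: UserFacts): SpamReport {
	const hits = RULES.filter((rule) => rule.applies(facts));
	return {
		score: hits.reduce((sum, rule) => sum + rule.weight, 0),
		signals: hits.map((rule) => rule.name),
	};
}

// Mirrors the analyze loop in this PR, which skips users with score < minScore.
function meetsThreshold(report: SpamReport, minScore = 4): boolean {
	return report.score >= minScore;
}
```

Keeping the rules in a table like this makes it cheap to tweak a weight or to see which signals pushed a user over the threshold, which is what the `signals` array in the report is for.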

Test Plan

Screenshots (if applicable)

Optional

Notes/Context/Gotchas

Supporting Docs

tefkah · 2026-03-04T19:50:15Z

package.json

 		"@types/mime-types": "^2.1.3",
 		"@types/multer": "^1.4.9",
-		"@types/node": "^18.11.4",
+		"@types/node": "^24.4.0",


wild that we were still on such an old one

Copilot

Pull request overview

Adds a two-phase bulk spam detection tool (scanSpamUsers) that analyzes all PubPub users for spam signals and can tag flagged users. Also introduces shared content analysis utilities and refactors the user spam scoring to support async, richer analysis.

Changes:

New scanSpamUsers tool with --analyze and --execute phases for bulk spam detection and tagging
Refactored userScore.ts to expose an async computeUserSpamReport that checks comments, memberships, and profile signals (previously only checked profile phrases)
Bumped @types/node from v18 to v24
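
For readers who want a picture of the two-phase CLI, the flag handling could look roughly like the sketch below. `parseArg` is called in the diff but its implementation is not shown there, so this version is a guess; it takes `argv` as an explicit parameter for testability, and `hasFlag` and `selectPhase` are hypothetical helpers. Only the flag names `--analyze`, `--execute`, `--output`, and `--min-score` come from the PR.

```typescript
// Return the value following `--name`, or undefined when the flag is missing
// or is immediately followed by another flag. (Assumed behavior.)
function parseArg(argv: string[], name: string): string | undefined {
	const index = argv.indexOf(`--${name}`);
	if (index === -1) return undefined;
	const value = argv[index + 1];
	return value !== undefined && !value.startsWith('--') ? value : undefined;
}

// True when the bare flag `--name` is present.
function hasFlag(argv: string[], name: string): boolean {
	return argv.includes(`--${name}`);
}

type Phase = 'analyze' | 'execute';

// Require exactly one phase so a run never analyzes and tags at the same time.
function selectPhase(argv: string[]): Phase {
	const analyze = hasFlag(argv, 'analyze');
	const execute = hasFlag(argv, 'execute');
	if (analyze === execute) {
		throw new Error('pass exactly one of --analyze or --execute');
	}
	return analyze ? 'analyze' : 'execute';
}
```

Splitting the run this way leaves room for a human to review the analyze output before anything is tagged.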

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| tools/scanSpamUsers.ts | New bulk spam scanning/tagging CLI tool |
| server/spamTag/contentAnalysis.ts | New shared utilities for link extraction, spam template matching |
| server/spamTag/userScore.ts | Refactored spam scoring with async report including comment/membership signals |
| utils/jsonArrayWriter.ts | Utility for incrementally writing JSON arrays to disk |
| types/spam.ts | Added `automatedScan` field to `UserSpamTagFields` |
| tools/index.js | Registered new tool |
| package.json / pnpm-lock.yaml | Bumped `@types/node` to v24 |
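
The incremental JSON array writer is the least obvious piece in this list, so here is a minimal sketch of the idea. It is not the contents of `utils/jsonArrayWriter.ts`: the class name and the `TextSink` interface are assumptions, and only `push` and `length` mirror how `writer` is used in the diff. The sink is abstracted so the sketch does not depend on the file system; in the real tool it would presumably be a file write stream.

```typescript
// Anything that accepts text chunks, e.g. a file stream (assumed interface).
interface TextSink {
	write(chunk: string): void;
}

// Writes a JSON array one element at a time, so a long scan never has to hold
// every result in memory before serializing.
class JsonArrayWriter<T> {
	private readonly sink: TextSink;
	private count = 0;
	private closed = false;

	constructor(sink: TextSink) {
		this.sink = sink;
		this.sink.write('[');
	}

	// Number of elements written so far.
	get length(): number {
		return this.count;
	}

	push(item: T): void {
		if (this.closed) throw new Error('writer is already closed');
		// Every element after the first is preceded by a comma.
		const separator = this.count === 0 ? '\n' : ',\n';
		this.sink.write(separator + JSON.stringify(item));
		this.count++;
	}

	// Emit the closing bracket; until this runs the output is not valid JSON.
	close(): void {
		if (this.closed) return;
		this.sink.write('\n]\n');
		this.closed = true;
	}
}
```

One trade-off of this approach: if the process dies before `close()` is called, the file is left without its closing bracket and needs a small repair before it can be parsed.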

Files not reviewed (1)

pnpm-lock.yaml: Language not supported


tools/scanSpamUsers.ts

+  contains an array of entries sorted by score descending.
+
+  --output <path>        required. where to write the results json.
+  --min-score <n>        minimum score to include in output. default 4.


tools/scanSpamUsers.ts

+		process.exit(1);
+	}
+	const entries: AnalyzeEntry[] = JSON.parse(fs.readFileSync(inputPath, 'utf-8'));
+	const minScore = parseInt(parseArg('min-score') ?? '0', 10);
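
One thing worth noting about the line above: `parseInt` returns `NaN` for non-numeric input, and every comparison against `NaN` is false, so a mistyped `--min-score` would silently filter nothing. Below is a hedged sketch of a stricter parse plus the filter it feeds. `parseMinScore`, `selectEntries`, and the trimmed-down `AnalyzeEntryLike` type are hypothetical names, not code from this PR.

```typescript
// Only the fields this sketch needs; the real AnalyzeEntry has more.
interface AnalyzeEntryLike {
	userId: string;
	score: number;
}

// Parse the raw flag value, falling back when the flag is absent and failing
// loudly when it is present but not a non-negative integer prefix.
function parseMinScore(raw: string | undefined, fallback: number): number {
	if (raw === undefined) return fallback;
	const parsed = Number.parseInt(raw, 10);
	if (Number.isNaN(parsed) || parsed < 0) {
		throw new Error(`invalid --min-score value: ${raw}`);
	}
	return parsed;
}

// Keep only entries at or above the threshold.
function selectEntries<T extends AnalyzeEntryLike>(entries: T[], minScore: number): T[] {
	return entries.filter((entry) => entry.score >= minScore);
}
```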


tools/scanSpamUsers.ts

+	while (true) {
+		const users = await User.findAll({
+			where: { spamTagId: { [Op.is]: null as any } },
+			attributes: [
+				'id',
+				'fullName',
+				'email',
+				'slug',
+				'title',
+				'bio',
+				'website',
+				'createdAt',
+				'updatedAt',
+			],
+			limit: BATCH_SIZE,
+			offset,
+			order: [['createdAt', 'DESC']],
+		});
+		if (users.length === 0) break;
+
+		for (const user of users) {
+			scanned++;
+			if (skipIds.has(user.id)) continue;
+			try {
+				const report = await computeUserSpamReport(user);
+				if (report.score < minScore) continue;
+
+				const commentInfo = await getRecentCommentsWithLinks(user.id, 5);
+				const hasProfileSignal = report.signals.some(
+					(s) => s.includes('website') || s.includes('bio'),
+				);
+				const profile = hasProfileSignal
+					? {
+							website: user.website ?? null,
+							bio: user.bio ?? null,
+							bioUrls: extractLinksFromText(user.bio),
+						}
+					: null;
+
+				writer.push({
+					index: writer.length,
+					userId: user.id,
+					email: user.email ?? '',
+					slug: user.slug,
+					fullName: user.fullName,
+					createdAt: String(user.createdAt),
+					score: report.score,
+					signals: report.signals,
+					commentCount: commentInfo.total,
+					commentsWithLinks: commentInfo.withLinks,
+					recentComments: commentInfo.evidence,
+					profile,
+				});
+			} catch (err) {
+				console.error(`error analyzing user ${user.id}:`, err);
+			}
+		}
+		console.log(`scanned=${scanned} flagged=${writer.length}`);
+		offset += BATCH_SIZE;


tefkah added 3 commits March 4, 2026 16:41

feat: add bulk spam labeling script

536f28d

refactor: write analyze results incrementally

fa08e31

fix: increase default threshold

c38de68

tefkah mentioned this pull request Mar 4, 2026

feat: disallow links in comments #3518

Draft


tefkah requested a review from Copilot March 16, 2026 15:37

Copilot started reviewing on behalf of tefkah March 16, 2026 15:38

Copilot AI reviewed Mar 16, 2026


tefkah added 3 commits March 16, 2026 16:54

Merge branch 'main' into tfk/bulk-spam

18e43f1

feat: add cron and improve spam detection

1e5bd8b

fix: tweak values slightly

3e37271

feat: add a script to bulk mark users as spam retroactively #3521
tefkah wants to merge 6 commits into main from tfk/bulk-spam

2 participants