Whitelist / blacklist websites, robots.txt presets #5
Perhaps it would be better to chat and discuss development in "Discussions", and use this section to solve existing (already implemented :)) problems, as well as to consider user requests.
Well, for this subject I have implemented a new feature that relates to robots.txt handling. In a few words, we can now append extra robots.txt rules on top of the rules a host provides. For the whitelist/blacklist needs we don't need any new feature implementation, because we can simply disable crawling and indexing of a specific domain's pages in the host settings. And finally, to close this subject, I have created a database configuration preset where everyone can contribute propositions: https://github.com/YGGverse/YGGo/tree/main/database/yggdrasil
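As a rough illustration of the appended-rules idea, here is a minimal sketch. The constant `CRAWL_ROBOTS_POSTFIX_RULES`, the helper function, and the example rule are assumptions for illustration, not YGGo's confirmed API:

```php
<?php
// Sketch only: extra (preset) robots.txt rules appended after a host's
// own rules, so preset Disallow entries always apply on top.
// Constant and function names here are illustrative assumptions.
define('CRAWL_ROBOTS_POSTFIX_RULES', "User-agent: *\nDisallow: /search/");

function buildRobotsTxt(?string $hostRobotsTxt): string
{
    // Host rules come first; the preset rules are appended after them
    return trim(($hostRobotsTxt ?? '') . PHP_EOL . CRAWL_ROBOTS_POSTFIX_RULES);
}
```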
Just for a note: those data sets depend on the crawler configuration, so I have moved these variables to the manifest API, where each application is able to grab the data matching its specific requirements. I work on the distributed ecosystem, so for right now this option could be enabled by the node owner in the node configuration.
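For example, a consuming node might read another node's manifest and check it against its own settings before importing anything. A minimal sketch, assuming a JSON endpoint and a `crawlPageLimit` field (the URL, endpoint path, and field names are hypothetical, not a confirmed YGGo API):

```php
<?php
// Illustrative default; the real value would come from node configuration
define('CRAWL_HOST_DEFAULT_PAGES_LIMIT', 1000);

// Hypothetical manifest endpoint on a remote node
$json = file_get_contents('http://example-node.ygg/api.php?action=manifest');

// Decode only when the request succeeded
$manifest = $json !== false ? json_decode($json, true) : null;

// Import only data sets produced under compatible crawl settings
if (is_array($manifest) && ($manifest['crawlPageLimit'] ?? 0) <= CRAWL_HOST_DEFAULT_PAGES_LIMIT) {
    // proceed with import...
}
```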
So, trackers with external seeders are shit inside the network.
Nice start…
I mean this subject is for the websites we need to crawl, and maybe some mirrors we need to block or limit by the `crawlPageLimit` / `CRAWL_HOST_DEFAULT_PAGES_LIMIT` settings.
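A minimal sketch of what such a per-host limit could look like, assuming a `host` table with a `crawlPageLimit` column; the schema, DSN, default value, and host name below are illustrative assumptions, not YGGo's actual layout:

```php
<?php
// Default limit applied when a host has no explicit override;
// the value 1000 is only an illustrative assumption
define('CRAWL_HOST_DEFAULT_PAGES_LIMIT', 1000);

// Hypothetical database connection
$pdo = new PDO('mysql:host=localhost;dbname=yggo', 'user', 'password');

// Limit a known mirror to a handful of pages instead of blocking it outright
$sth = $pdo->prepare('UPDATE `host` SET `crawlPageLimit` = :limit WHERE `name` = :name');
$sth->execute([':limit' => 10, ':name' => 'mirror.example.ygg']);
```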
Ideas here, just a few relevant relations:
#1 (comment)
And I would like to ask: do we need to enable the GitHub Discussions page, or should we keep Issues to resolve problems, not to talk?