I think the search results should be divided into categories somehow... #1
Anyway, thanks. I'm still testing a YaCy node on another server, but this solution seems to take fewer disk and RAM resources. Even so, I have just 10 GB for indexes on the VPS, and that's not enough for at least 10M pages, because the crawler collects the raw (HTML-less) page text, not only metadata (for this case I added an optional meta-only mode). There is also a lot of work to do on content semantics. YaCy has the same issue: it's not possible to find anything really relevant, because it crawls navigation containers and other spam words.
It seems to me that SQLite is not very suitable for storing the database of this search engine. We have several mirrors of popular internet websites (rutracker.org, for example), which is a very large volume of pages for indexing...
Agreed. I don't know why I used SQLite instead of, for example, MySQL; maybe the goal was simple deployment without server and database setup... Anyway, SQLite has the FTS5 extension, which allows relevance-ranked full-text search without external dependencies. Maybe PostgreSQL would work too, but I'm not familiar with it. By the way, the whole data model is implemented in a single driver file, so it's no problem to implement an alternative one and add a new settings row to the config file. About web mirrors like rutracker: I think we need some blacklist in the configuration, because what is the sense of indexing a resource whose seeds/peers relate to the clearnet? P.S. At 2.5M links we have 2 GB of disk space usage, and most of them are not indexed yet. So maybe I'll stop the crawler soon, because we need a server with enough disk space. In this case, maybe SQLite is better, as we could host this file on a separate static host; I don't know if that's possible, but I suppose so.
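For illustration, a minimal FTS5 sketch in PHP (the table and column names here are made up for the example, not taken from the project's driver file):

<?php

// Minimal SQLite FTS5 demo: index page text, then run a relevance-ranked query.
// Requires pdo_sqlite built against an SQLite with the FTS5 extension enabled.
$db = new PDO('sqlite:fts-demo.db');

$db->exec('CREATE VIRTUAL TABLE IF NOT EXISTS pageIndex USING fts5(url, content)');

$insert = $db->prepare('INSERT INTO pageIndex (url, content) VALUES (?, ?)');
$insert->execute(['http://example.ygg/about', 'Yggdrasil network crawler and search engine']);

// bm25() returns a score where smaller means more relevant, so sort ascending.
$select = $db->prepare('SELECT url, bm25(pageIndex) AS rank
                          FROM pageIndex
                         WHERE pageIndex MATCH ?
                         ORDER BY rank');
$select->execute(['crawler']);

print_r($select->fetchAll(PDO::FETCH_ASSOC));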
About the YaCy experiments as an alternative: I have started a few topics here. But I see that solution is not for a small VPS, because the running Yggdrasil web directory scan takes 4 cores + 4 GB RAM, and the index takes 1.45 GB for 10k documents. Maybe I will continue improving this engine, because it's much more lightweight.
I am a little familiar with YaCy and I know that it is quite heavy and consumes a lot of resources... Taking the above into account, I think it's worth abandoning SQLite in favor of, at least, MySQL, whose database can also be stored on a separate host. MySQL also supports full-text search. And of course, this project is worth developing, especially if you are interested in it and think it would be useful :) P.S.: All of the above is the personal opinion of a user, not an expert in search engine development.
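For comparison, the MySQL equivalent would look roughly like this (hypothetical table and connection details; just a sketch of the built-in full-text support):

<?php

// MySQL full-text demo: InnoDB supports FULLTEXT indexes since MySQL 5.6.
$db = new PDO('mysql:host=127.0.0.1;dbname=search', 'user', 'password');

$db->exec('CREATE TABLE IF NOT EXISTS page (
             pageId  INT AUTO_INCREMENT PRIMARY KEY,
             url     VARCHAR(2048) NOT NULL,
             content TEXT,
             FULLTEXT KEY contentIndex (content)
           ) ENGINE=InnoDB');

// MATCH ... AGAINST returns a relevance score that can be used for ordering.
$select = $db->prepare('SELECT url, MATCH (content) AGAINST (:q1) AS score
                          FROM page
                         WHERE MATCH (content) AGAINST (:q2)
                         ORDER BY score DESC');
$select->execute([':q1' => 'crawler', ':q2' => 'crawler']);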
Imho, distributed solutions usually require more resources, but the idea of sharing disk storage over an API is cool (like the Mastodon example; it might be called a federative model). I have the same thoughts in the Roadmap draft presented in README.md. At least we need to understand how many people are able to participate in this thing before spending time on implementation (in the ygg community context), but I vote yes too.
And it also supports replication. Yes, it's clear I should rewrite it; SQLite is better for desktop/mobile apps. Maybe the concept was my fault, plus the current implementation took just about 24 hours :) even though the idea had been bugging me for a long time before the YaCy node went down.
I do this project just for fun in my free time. Of course, some donations could motivate me, but that is not my goal. Right now I'm thinking about the VPS server issue; I'm not sure my 10 GB one is enough for these ambitions.
Thank you for the interest in this project, it is <3 for me
Well, since the project is just for fun, there is no need to hurry. And it's definitely not worth burning out over. Just for fun means you have to have fun :) I think it is better not to crawl these sites at the start:
and most of them are shortened ;)
Do you mean addresses from subnets 300::/64?
wait a minute please
When I get some rest in the mental hospital I'll learn the IPv6 protocol there, but for now I still can't write a regular expression for net filtering like 200::/7 or 300::/64: https://forum.yggdrasil.link/index.php?topic=138.msg243#msg243
You can try to do something like this:

// Yggdrasil's 200::/7 means the first hextet is 0200-03ff, i.e. the address
// starts with "2xx:" or "3xx:" (an optional leading zero may also be written).
if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false
    && preg_match('/^0?[2-3][0-9a-f]{2}:/i', $ip)) {
    $Ygg_addr_OK = true;
}
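A regex-free variant of the same check (a sketch reusing the $ip and $Ygg_addr_OK names from above): inet_pton() yields the raw 16 address bytes, and 200::/7 simply means the first byte is 0x02 or 0x03, so masking out the lowest bit is enough.

// Binary check for 200::/7: a prefix length of 7 covers first bytes 0x02 and 0x03.
$bin = inet_pton($ip);

if ($bin !== false && strlen($bin) === 16 && (ord($bin[0]) & 0xFE) === 0x02) {
    $Ygg_addr_OK = true;
}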
And the forum, unfortunately, is almost dead... I recommend channels:
Sad to hear. The Alfis one?
Alfis is alive and quite popular.
:D |
IRC web front-end )
Btw, I don't mind cleaning this thread of spam )
By the way, I couldn't wait even a few minutes :) Also, it was a mistake to duplicate hostnames in the URL row; we need a hostname table plus a URI one to prevent data expansion, in the #2 context (a sketch follows below). Thanks.
Plus some thoughts from last night about page rank columns for the host table, etc.)
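Just to illustrate the normalization idea (a hypothetical layout, not the actual #2 schema):

<?php

// Store each hostname once, and reference it from the per-page URI rows,
// so the hostname string is not repeated for every crawled URL.
$db = new PDO('sqlite:crawler-demo.db');

$db->exec('CREATE TABLE IF NOT EXISTS host (
             hostId INTEGER PRIMARY KEY,
             name   TEXT UNIQUE NOT NULL
           )');

$db->exec('CREATE TABLE IF NOT EXISTS page (
             pageId INTEGER PRIMARY KEY,
             hostId INTEGER NOT NULL REFERENCES host (hostId),
             uri    TEXT NOT NULL,
             UNIQUE (hostId, uri)
           )');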
We have some updates in #3. Now I have an idea to add semantic markers to the README or another separate file, aka yggo.txt, where the owner can provide the website rubric. Of course, images and videos require another interface, but for right now the idea is just to group the websites thematically in the top tab, beside the media search interface.
Meta attributes can be inserted into HTML pages:

<meta name="description" content="My cool site">
<meta name="keywords" content="blog, programming, linux">

This can also be used to sort sites by category.
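If the crawler wanted to pick these up, a minimal PHP sketch could look like this (the $html variable and the keyword handling are assumptions, not project code):

// Extract <meta name="keywords"> from a fetched page for category sorting.
$doc = new DOMDocument();

// Suppress warnings about imperfect real-world HTML.
@$doc->loadHTML($html);

$keywords = [];

foreach ($doc->getElementsByTagName('meta') as $meta) {
    if (strtolower($meta->getAttribute('name')) === 'keywords') {
        // Split the comma-separated list into trimmed category candidates.
        $keywords = array_map('trim', explode(',', $meta->getAttribute('content')));
    }
}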
Hm, thanks, but we have a limited tray area... I was supposing rather ygg.txt or user-agent:ygg*, where we can provide super live results. P.S. I don't want to deal with ChatGPT API attempts, because I once tried to simulate isotopes before understanding that MIT has a supercomputer :)
I need to add that we have about 92 hosts crawled in the network right now (by the new model in #3). A re-index is beginning; I'm playing with CRAWL_HOST_DEFAULT_PAGES_LIMIT by increasing its value to 1k, so maybe more hosts will become available there. I just keep in mind that it would be awesome to add some extra semantic rules to our screwed project.
I think it would make sense to add something like this panel:
If I want to find sites, then such search results don't make much sense to me: