Orleans Indexing and Lucene.Net #4

Open: @KSemenenko wants to merge 5 commits into master

KSemenenko commented Sep 5, 2021

I've been waiting for this functionality for years and have thought about it a lot.
After studying the original documents and the code, I came to the conclusion that it would be better to use #3 Lucene.Net for indexing. Elasticsearch, for example, does the same. With grains, index clustering is easy to do.
There is code on GitHub that adds LINQ query support, and code for storing the index files in Azure Storage.
I think a special grain could keep track of a certain type of grain and index the necessary fields.
I could also use a service, and thus have indexes on each silo.

So as soon as I had time, I built a prototype.

What do you think about Lucene.Net? @ReubenBond @sergeybykov @philbe
If you like this idea, I can go ahead and make this code stable.

@KSemenenko
Author

KSemenenko commented Sep 5, 2021

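public async Task GrainTest()
{
    var grain = new IndexGrain();

    await grain.OnActivateAsync();

    // count is incremented from both tasks, so it is updated atomically
    // (Interlocked lives in System.Threading).
    int count = 0;
    int foundCount = 0;

    await Task.WhenAll(Task.Run(async () =>
    {
        // Writer: index 150 documents, one per value of i.
        for (int i = 0; i < 150; i++)
        {
            var doc = new GrainDocument(i.ToString());
            doc.LuceneDocument.Add(new StringField("property", $"i={i}", Field.Store.YES));
            await grain.WriteIndex(doc);
            Interlocked.Increment(ref count);
        }
    }),
    Task.Run(async () =>
    {
        // Reader: give the writer a head start, then query 300 values.
        // Only the 150 values that were written can produce a hit.
        await Task.Delay(1000);
        for (int i = 0; i < 300; i++)
        {
            var doc = await grain.QueryByField("property", $"i={i}");
            Interlocked.Increment(ref count);

            if (doc.TotalHits > 0)
            {
                foundCount += 1;
            }
        }
    }));

    await grain.OnDeactivateAsync();

    count.Should().Be(450);
    foundCount.Should().Be(150);
}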

In this test I create the index entries with Lucene.Net directly, which is not convenient.
All of this should of course be hidden behind wrapper methods, with LINQ support added for queries.
Then we'll have something even better than Elasticsearch.

KSemenenko changed the title from "Orleans Indexing and Luciene" to "Orleans Indexing and Lucene.Net" on Sep 5, 2021
@SebastianStehle

I tried to implement Lucene full-text search based on Orleans and cloud storage providers, and I more or less failed. The problems I faced were around performance:

  1. You need a central storage for your Lucene indexes. You can implement the index Directory on top of Azure Blob Storage or similar, but it is relatively slow. In my experience it was much faster to periodically back up the snapshots by putting them in an archive and sending that over. In combination with a remote disk that works as a backup, writes are not safe.
  2. Lucene is not built to commit after every document. If you want high performance, you need to make the changes in batches.

Achieving high performance and stability is very challenging, especially because Orleans applications are deployed much more often than a database. If you can achieve that, it would be great, but I lost data from time to time and therefore decided to go with Elastic or a database full-text system.
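
A minimal sketch of the batching idea from point 2, under assumptions that are not taken from this thread: `BatchingIndexWriter`, its `batchSize` parameter, and the `commit` delegate are hypothetical names, and the delegate stands in for whatever actually writes to and commits the Lucene index.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: buffers documents and hands them to a commit
// callback in batches instead of committing after every single write.
public sealed class BatchingIndexWriter<TDoc>
{
    private readonly List<TDoc> _pending = new List<TDoc>();
    private readonly int _batchSize;
    private readonly Action<IReadOnlyList<TDoc>> _commit;

    public BatchingIndexWriter(int batchSize, Action<IReadOnlyList<TDoc>> commit)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        _batchSize = batchSize;
        _commit = commit ?? throw new ArgumentNullException(nameof(commit));
    }

    // Number of documents buffered but not yet committed.
    public int PendingCount => _pending.Count;

    // Buffers one document and commits once the batch is full.
    public void Add(TDoc doc)
    {
        _pending.Add(doc);
        if (_pending.Count >= _batchSize) Flush();
    }

    // Commits whatever is buffered; does nothing when the buffer is empty.
    public void Flush()
    {
        if (_pending.Count == 0) return;
        _commit(_pending.ToArray()); // pass a copy so the callee keeps a stable snapshot
        _pending.Clear();
    }
}
```

A grain using something like this would presumably also call `Flush` from a timer and on deactivation. Documents that are still buffered when the process dies are lost, which is the durability trade-off described above. The class does no locking, on the assumption that it is only used from a single-threaded grain.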

@KSemenenko
Author

@SebastianStehle Can I ask you to share the details of your implementation?
Did you store/load the data in memory on activation and deactivation of the grain?
Did you have only one index, or did you use MultiIndex?

@SebastianStehle

Hi, my implementation has been removed now, but it is open source: https://github.com/Squidex/squidex/tree/8e088beb1c91626d1f67ec8a09f2b80740639054/backend/src/Squidex.Domain.Apps.Entities/Contents/Text/Lucene

  1. I had multiple indexes, one grain per index.
  2. The index was loaded from a central store like S3 to a local folder on activation.

I think the most important class is this one: https://github.com/Squidex/squidex/blob/8e088beb1c91626d1f67ec8a09f2b80740639054/backend/src/Squidex.Domain.Apps.Entities/Contents/Text/Lucene/IndexManager.cs

It manages the indexes in case a grain gets deactivated and the index is not committed.

@sergeybykov
Collaborator

sergeybykov commented Sep 11, 2021

@KSemenenko Conceptually, I don't see a problem with the idea. My intuition is more aligned with @SebastianStehle though. For a limited scale and load, holding and updating indices in memory will probably work. But in a production setting I'd be nervous about the lack of separation of concerns and about sharing memory/CPU resources with Lucene.Net in the same process. For production use, I'd look at offloading indexing to something like Elastic, or at least hosting the Lucene.Net indexing code in a separate process.

Disclaimer: I've never used Lucene.Net. My thoughts here are purely intuitive speculation, FWIW.

@KSemenenko
Author

That's an interesting thought, @sergeybykov.
Maybe we then need some kind of abstraction, like the one you made for storing state.

An interface for writing data, and an IQueryable interface for making queries.
Then a basic in-memory implementation, for example backed by a List.
And then implementations of the interfaces for Redis, Cosmos DB, and other databases?
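
A rough sketch of what such an abstraction could look like, assuming a string grain id as the key. None of these names (`IIndexWriter`, `IIndexReader`, `InMemoryIndex`) exist in Orleans or in this pull request; they are hypothetical and only illustrate the split into a write interface, an `IQueryable` read interface, and a List-backed in-memory implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical write side: store or remove the indexed entry of one grain.
public interface IIndexWriter<TEntry>
{
    Task WriteAsync(string grainId, TEntry entry);
    Task RemoveAsync(string grainId);
}

// Hypothetical read side: expose the entries as IQueryable for LINQ queries.
public interface IIndexReader<TEntry>
{
    IQueryable<KeyValuePair<string, TEntry>> Query();
}

// Basic in-memory implementation backed by a List.
public sealed class InMemoryIndex<TEntry> : IIndexWriter<TEntry>, IIndexReader<TEntry>
{
    private readonly object _gate = new object();
    private readonly List<KeyValuePair<string, TEntry>> _entries =
        new List<KeyValuePair<string, TEntry>>();

    public Task WriteAsync(string grainId, TEntry entry)
    {
        lock (_gate)
        {
            // Replace any previous entry for the same grain.
            _entries.RemoveAll(e => e.Key == grainId);
            _entries.Add(new KeyValuePair<string, TEntry>(grainId, entry));
        }
        return Task.CompletedTask;
    }

    public Task RemoveAsync(string grainId)
    {
        lock (_gate) { _entries.RemoveAll(e => e.Key == grainId); }
        return Task.CompletedTask;
    }

    public IQueryable<KeyValuePair<string, TEntry>> Query()
    {
        // Query a snapshot so later writes do not affect a running query.
        lock (_gate) { return _entries.ToList().AsQueryable(); }
    }
}
```

A Redis or Cosmos DB implementation would implement the same two interfaces. Translating arbitrary LINQ expressions into a database query is the hard part there, and this sketch does not attempt it.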

@KSemenenko
Author

Although, for example, I run my silos in a Kubernetes cluster, and I have no problem adding a couple of virtual machines.
Right now I use Cosmos DB to store the index. I don't really like this solution, and I would still like to build a solution with indexes.

@sergeybykov
Collaborator

Yes, an interface with pluggable implementations would be the way to go.

@SebastianStehle

When you talk about indexes, you basically have two options:

  1. Do everything in memory and use things like Dictionary or SortedDictionary in C#.
  2. Try to find a solution that also works well when the majority of the data is still on disk, e.g. B+ trees or inverted indexes.

Lucene and databases use the second approach because the goal is to work with large data sets.

I thought the goal of this project was to work on top of key-value stores and follow the first approach. If we use the database for queries, why do we need Orleans Indexing at all? It would be far easier and more efficient to use the stored state directly, perhaps with a mapping function for indexes? https://github.com/sebastienros/yessql/wiki/Tutorial#creating-mapped-index

A separate index has the big problem that it can get out of sync with the original data, especially if you do not use transactions.

@KSemenenko
Author

To give you an example, I have thousands of users who all have a geo position.
All communication with users goes through grains, because they are the only source of up-to-date data.
I have the user's location in the database, but that is only storage between grain activations.
So I want to find all the users in an area and get their grain ids.
Right now I have a table in Cosmos DB in which I store the geo position and the grain id.
Every time the geo position changes, I have to update the table in the database.

I see indexing as a convenient abstraction over the storage/database, and a fairly powerful search system. Yesterday I thought it would be cool to have the grain itself take care of the index updates. For example, we could write a post handler that writes the variables marked with an attribute to the index when a grain method has finished.

We could generate something like INotifyPropertyChanged and watch for changes to the variables. Or something like that.

In general, it's the same kind of abstraction as grain state, but for indexing.
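
A small sketch of the attribute idea, assuming plain reflection over a state object. `IndexedAttribute`, `IndexedFieldReader`, and `UserState` are hypothetical names invented for illustration, and how the collected values reach the index is left open. In Orleans, the "post handler" could plausibly be a grain call filter that runs this after the grain method returns, but that wiring is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical marker for properties that should be written to the index.
[AttributeUsage(AttributeTargets.Property)]
public sealed class IndexedAttribute : Attribute { }

public static class IndexedFieldReader
{
    // Collects name/value pairs of all public instance properties
    // marked with [Indexed] on the given state object.
    public static Dictionary<string, object> Read(object state)
    {
        if (state == null) throw new ArgumentNullException(nameof(state));

        var fields = new Dictionary<string, object>();
        var properties = state.GetType()
            .GetProperties(BindingFlags.Public | BindingFlags.Instance);

        foreach (var property in properties)
        {
            bool readable = property.CanRead && property.GetIndexParameters().Length == 0;
            if (readable && property.IsDefined(typeof(IndexedAttribute), inherit: true))
            {
                fields[property.Name] = property.GetValue(state);
            }
        }
        return fields;
    }
}

// Example state for the geo position scenario described above.
public sealed class UserState
{
    [Indexed] public double Latitude { get; set; }
    [Indexed] public double Longitude { get; set; }
    public string DisplayName { get; set; } = "";
}
```

Reading the marked properties after every call is simpler than generating change notifications, at the cost of writing to the index even when nothing changed. Comparing against the previously written values would avoid that.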

@SebastianStehle

You are talking about an abstraction for a custom grain. Then I am on your side ;)


3 participants