Sefaria's Text API #1343
Comments
First of all, thank you! This will simplify a lot of interactions with the API.
{
ref: "Genesis 1",
segments: [
{
variants: [...],
ref: "Genesis 1:1",
},
// Or for Talmud texts
{
variants: [...],
ref: "Shabbat 2a.1",
},
],
}
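For illustration, here is a minimal sketch of how a nested (jagged) array could be flattened into a segment list like the one above. The function name, the `text` key, and the `:` separator are assumptions made for this example, not part of the existing API, and Talmud-style refs such as "Shabbat 2a.1" would need a different separator.

```python
def flatten_segments(ref, node, sep=":"):
    """Yield one {'ref', 'text'} dict per leaf string, depth-first.

    `node` is either a string (a single segment) or a list of nested nodes.
    Child positions are 1-based and appended to the parent ref with `sep`.
    """
    if isinstance(node, str):
        yield {"ref": ref, "text": node}
        return
    for number, child in enumerate(node, start=1):
        yield from flatten_segments(f"{ref}{sep}{number}", child, sep)


segments = list(flatten_segments("Genesis 1", ["first verse", "second verse"]))
```

A caller can then iterate over `segments` without needing to know how deeply the source text is nested.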
|
Glad to hear! Whatever happened to the idea of the GraphQL API I started designing in #602? It solves many of the issues with the current API elegantly, extensibly, flexibly, discoverably, and in a self-documenting way. I believe that by nature it will be much easier to implement as well. See also the POC in #741. |
@bandleader Regarding GraphQL - it's really interesting, and thank you for the thought and work toward a POC. |
I don't know if this is the kind of feature you are interested in implementing, but it would be very helpful to us and potentially to others. We use the Text API extensively to link to content that we do not have natively in our app. We make a Text API call and present the content in a local popup within our app. The problem is that the text returned is often very long and not precise enough to help a person zero in on the relevant part of the reference. For internal links to our own content, besides the Page (the main entry in our database) we also include IDs for a range of phrases, a range of words, or a list of multiple phrases or words, and our internal engine returns the "Page" with those words or phrases highlighted. That way you can see what the author was referring to in the context of the whole source. In our internal texts we have IDs either on every word or on phrases. The Sefaria texts do not have IDs at that level. It would be nice if we could specify words 20-25 within a source and have those words wrapped in a tag, as a selection that we could render as we please. |
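To make the request concrete, here is a rough sketch of wrapping a word range in a tag on the client side. It assumes the segment is plain text with whitespace-delimited words and no embedded markup; the function name and the default `mark` tag are made up for this example.

```python
import re


def wrap_word_range(text, start, end, tag="mark"):
    """Wrap words start..end (1-based, inclusive) of `text` in <tag>...</tag>.

    Words are runs of non-whitespace characters; the original whitespace is
    preserved exactly.
    """
    # The capturing group keeps the whitespace runs in the token list.
    tokens = re.split(r"(\s+)", text)
    word_positions = [i for i, tok in enumerate(tokens) if tok and not tok.isspace()]
    if not (1 <= start <= end <= len(word_positions)):
        raise ValueError("word range out of bounds")
    first = word_positions[start - 1]
    last = word_positions[end - 1]
    tokens[first] = f"<{tag}>" + tokens[first]
    tokens[last] = tokens[last] + f"</{tag}>"
    return "".join(tokens)


highlighted = wrap_word_range("In the beginning God created", 2, 3)
```

Text that already contains HTML would need the word counting to skip over tags, which this sketch does not attempt.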
@EliezerIsrael Regarding GraphQL, I had responded in the original GH issue that there is no inherent security issue with GraphQL. GraphQL APIs are used by large and security-conscious companies all over the world, including by the very GitHub app we are conversing on :) As described, the one thing about GraphQL that is relevant security-wise is that queries are very flexible, and you can write queries inside other queries, therefore a user could write a query that takes a huge amount of work to execute, and it still looks like a single request, so if you're rate-limiting by number of requests, then you have a problem. However:
* Either you can count the number of sub-queries: if you ask for commentaries, every commentary counts as a 'hit' towards the user's API quota, and likewise in GraphQL, where you can request multiple texts in a single query, every text requested counts as a 'hit'.
* Or you can simply measure the amount of CPU time taken up by a query and rate-limit based on that, e.g. you can only send 5 seconds of queries every minute, and no more than 5 minutes per hour, etc. This is quite easy.
The GraphQL docs detail these things here, which in turn links here. Let me know if you need any clarification! |
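As a rough sketch of the idea, a token bucket can charge each query by its cost rather than counting requests. The cost can be the number of texts requested or the measured CPU seconds; the class and parameter names here are hypothetical and not taken from any existing Sefaria or GraphQL code.

```python
import time


class CostBudget:
    """Token bucket in which every query spends tokens equal to its cost."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def try_spend(self, cost):
        """Return True and deduct `cost` if the budget allows it, else False."""
        now = self.clock()
        elapsed = now - self.last
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True
```

With `capacity=5` and `refill_per_second=5 / 60`, this approximates "5 seconds of queries every minute" when the cost passed in is CPU time.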
@bandleader I do think it's important to note that the Sefaria.org application itself runs on these APIs, so switching to GraphQL would require the dev team to update the web application architecture to use the GraphQL runtime instead of the REST API. In addition to being a ton of work, that could come with its own set of issues (caching, etc.), so it makes sense for there to be some resistance to it even outside of the security question. Re improvements to the Text API, assuming we're keeping the same architecture: I've also noted that it's possible to make fairly large queries with our current API using the commentary flag. My suggestion was going to be that, if we want to continue to allow pulling connections outside of the links API, the texts API should have a more granular system for requesting commentaries and connections and/or just their indices (i.e. make it possible to query the text of, or metadata about, particular commentaries along with the base text, but also limit the default scope of what gets pulled into the response in some thoughtful way). I also think it might make sense to re-evaluate all the flags that feel very "coupled" with particular app behavior, along with some of the default values (a number of different people, I think, have asked why the default behavior is I like @ronshapiro's idea about requesting languages.
|
@bandleader ah, I think I misunderstood the changes you were suggesting on some level (just looked at the linked issues/POC and see that it's just an HTTP endpoint that handles the client requests written in GQL). Thanks for clarifying! |
@bandleader GraphQL does merit consideration, but I think the complexity of implementing it is more than our team can take on at the moment.
@mayerpasternak We made a design decision way back in our early days to divide texts into segments, not words. It let us move quickly, but it definitely has downsides. We run up against it ourselves.
It seems to me, given the implementation constraints, that word-level highlighting belongs a level above the bare texts API. I could imagine something at the SDK level that takes a Ref and a string of text (or text boundaries of some sort), queries the Sefaria API, then wraps the needed text. It seems like you've implemented it in-house, but I could imagine that provided at the Sefaria SDK level.
@ronshapiro You're right about iteration and the weirdness of JaggedArrays. Good time to bring that up.
And your thoughts about request format are interesting. I think we do want to allow the user to specify a list of languages (3-letter language codes, likely). The highest-priority original text will probably have a reserved word like "base". We need to define, language by language, how to request:
* One highest-priority version of a lang
* A specific version of a lang
* Multiple specific versions of a lang
* All versions of a lang
Your suggestion ticks most of those boxes. I'm wondering: have you seen anything similar in the wild? It seems like we can't easily avoid a busy syntax for this. |
Word-level highlighting could be done externally, but your website does not show word numbers, so our scholars would be working blind unless we create a new interface to your texts that exposes the number of each word. We would also need to take your response from the Text API and assign a number to each word in order to create the highlighting. We could possibly build all of this outside of your system, but I suspect that other users could benefit from it too, so it might make sense to add this as an option instead of everyone building their own system.
|
I care little about how to specify the versions as long as those options exist. I imagine the common bugs are going to be around URL-encoding version titles. You could try to do something clever with HTTP headers, but those are always a little less obvious.
Another idea about jagged arrays: make the format/structure configurable. Perhaps there are people who want jagged array structures, but make them ask for it.
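To illustrate the URL-encoding concern, here is a small sketch that builds a query string from language/version pairs. The `version` parameter name and the `lang|title` form are purely hypothetical and only serve the example; the point is that the standard library handles the escaping of awkward version titles.

```python
from urllib.parse import parse_qs, urlencode


def version_query(requests):
    """Build a query string from (lang, version_title_or_None) pairs.

    A pair with no title stands for "highest-priority version of that lang".
    """
    params = [
        ("version", lang if title is None else f"{lang}|{title}")
        for lang, title in requests
    ]
    # urlencode escapes spaces, '&', '|', ':' and non-ASCII characters.
    return urlencode(params)


query = version_query([("base", None), ("eng", "A Title, with: punctuation & more")])
```

Decoding `query` with `parse_qs` gives back the original values, so a title containing `&` or `:` survives the round trip.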
|
@mayerpasternak I hear what you're looking for. I just opened up a new issue for it, so we can workshop highlighting subsections of a segment on its own terms. |
Please provide an option to ensure that requests receive plain text alone, with no markup in the returned text. Currently HTML is returned in some versions of some texts. For example, when I simply request
This is not easily digestible by any system that does not intend to display the returned text directly in an HTML context, as it is riddled with markup. Further, stripping the tags is non-trivial: in these cases the tag contents should be deleted, whereas in other cases they should perhaps be retained. And, on a more abstract level, it interleaves secondary text with the primary text in a particular manner, which constrains its usefulness even in an HTML context. |
@bdjnk -- You can get part of the way there by tacking on a query parameter, e.g.: https://www.sefaria.org/api/texts/genesis-1:1?stripItags=1
It still leaves the markup for bold, italics, etc., but removes the footnoted content. |
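For the remaining markup, one possible client-side approach is sketched below: keep the text inside formatting tags but drop the contents of elements that carry secondary text. Which tag and class pairs count as secondary is an assumption here (`DROP` is illustrative and should be adjusted to the markup actually returned).

```python
from html.parser import HTMLParser

# Assumption: (tag, class) pairs whose whole contents should be discarded.
DROP = {("sup", "footnote-marker"), ("i", "footnote")}
# Void elements have no closing tag, so they must not affect nesting depth.
VOID = {"br", "hr", "img", "wbr"}


class _PlainTextExtractor(HTMLParser):
    def __init__(self, drop):
        super().__init__(convert_charrefs=True)
        self._drop = drop
        self._skip_depth = 0  # > 0 while inside a dropped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._skip_depth:
            self._skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if any((tag, cls) in self._drop for cls in classes):
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def to_plain_text(html, drop=DROP):
    """Return the text of `html` without tags and without dropped elements."""
    parser = _PlainTextExtractor(drop)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

This relies on well-nested markup; mismatched tags inside a dropped element would throw off the depth count.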
I would very much appreciate it if y'all would be willing to make the Texts API support Hebrew refs, e.g. searching בראשית כג:א or בראשית 23.1. I know that there is a list of Hebrew titles somewhere in the codebase, because I searched for it a few months ago. I discovered that the Python method (or lines of code, I don't remember) that would have exposed this book list was commented out, so that's an easy fix. The list is a critical part of the queries I work with, which use strong autocomplete to validate ref titles before making those queries. |
Hey @shelfgot -- the Texts API does in fact support Hebrew refs, e.g. בראשית כג:א. It does, however, require that the Hebrew text be percent-encoded (the browser does this itself, but in code you may need to do so explicitly). |
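A minimal sketch of doing the percent-encoding explicitly in Python, using the same `/api/texts/` path as the URL earlier in this thread. The helper name is made up for the example.

```python
from urllib.parse import quote, unquote

BASE = "https://www.sefaria.org/api/texts/"


def text_api_url(ref):
    """Return the Texts API URL for `ref`, with the ref percent-encoded as UTF-8."""
    # safe="" also encodes '/', so the whole ref stays within one path segment.
    return BASE + quote(ref, safe="")


url = text_api_url("בראשית כג:א")
```

The resulting URL is pure ASCII and can be passed to any HTTP client; `unquote` recovers the original ref.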
Hello Sefaria Developer Community!
We are planning on refactoring one of our main API endpoints - the Text API.
While doing so, we are interested in making its use more straightforward and also more flexible.
Currently the API always returns two versions of a requested text reference, a "Hebrew" and an "English".
In recent years, Sefaria's data has branched out to include texts that have non-Hebrew source versions (Judeo-Arabic, Aramaic and even English) and also translations of texts into multiple non-English languages (German, Spanish, etc.). We have used the current API to try to still provide that data, but this is no longer sufficient.
So we are looking to improve the way the API lets developers interact with the various languages more directly, and to give them more control over exactly what the Text API returns.
For starters, we are looking to get rid of the forced duality of text versions in the response. The user will be able to request a single version, or two specific languages, or more.
Beyond that we are interested in hearing what users of this API would find useful.
Is it the ability to get all versions of a given language?
All translations of a certain text?
Asking for a language with a fallback to a default language?
Any other suggestions or things you'd want to make use of?
Let us know in the comments!