Offer a ReadableStream/WritableStream interface to work getRawObject blobs... #4

Open
TooTallNate opened this issue Mar 29, 2011 · 9 comments

@TooTallNate

This is more of a feature request / API suggestion:

Ideally, especially for very large files (or, in the case of git, blobs), HTTP responses should be streamed back to the client. In Node, this is best done with Stream#pipe().

Currently with node-gitteh, streaming a revision of a file over a socket or to a file is impossible. I suggest that you offer a way to create both a ReadableStream and WritableStream that would work with a Repository instance's RawObjects. These streams would be very similar to their fs module counterparts.

Perhaps something like:

var readStream = repo.createRawReadStream(id);
readStream.pipe(fs.createWriteStream("file.dump"));

for a read stream would be cool. The stream would periodically emit data events. Then we could easily pipe the contents of a blob to a file, or socket, or http response, or whatever. Let me know what you think. Thanks!
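
To make the proposal a bit more concrete, here is a minimal sketch of what the consumer-facing side could look like, written against Node's stream module. createBufferReadStream is a hypothetical stand-in for the proposed repo.createRawReadStream(id), not an existing gitteh function. It wraps a Buffer that is already in memory and emits it in chunks, so it only illustrates the shape of the API, not real streaming out of libgit2.

```javascript
const { Readable } = require('stream');

// Hypothetical stand-in for the proposed repo.createRawReadStream(id).
// Assumption: the blob data is already available as a Buffer; a real
// implementation would pull chunks from the object database instead.
function createBufferReadStream(data, chunkSize = 64 * 1024) {
  let offset = 0;
  return new Readable({
    read() {
      if (offset >= data.length) {
        this.push(null); // no more data: signal end of stream
        return;
      }
      const end = Math.min(offset + chunkSize, data.length);
      this.push(data.subarray(offset, end)); // emitted as one 'data' event
      offset = end;
    },
  });
}
```

With something like this, createBufferReadStream(blob).pipe(res) would work for an HTTP response in the same way it works for fs.createWriteStream().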

@samcday
Contributor

samcday commented Mar 29, 2011

Heya,

Interesting you should bring this up, because the latest version of libgit2 - v0.11.0 (released yesterday) actually offers readable streams for objects now.

I'm still in the process of getting on top of these changes (there are quite a few), so a new version of gitteh is a few days off yet. I'll see if I can fit in the time to expose the libgit2 object streams to Node.js!

@samcday
Contributor

samcday commented Mar 29, 2011

Was looking into the possibility of streaming tonight, and I noticed a potential snag:

Opening a stream on an object in libgit2 is only possible if the object is currently stored as a "loose" object, meaning it is stored as a flat file in the objects/ directory. If the git repository has been compacted at any point, the objects are more than likely going to reside in a packfile instead, which uses zlib, deltas, and all sorts of magic that makes streaming a file more or less impossible.

If I do implement this feature, I suppose as a user of the library you'd have to try streaming first, and then fall back to a standard object read if the requested blob has been packed.
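
A rough sketch of that try-then-fall-back pattern, from the caller's side. The two callbacks, openLooseStream and readWholeObject, are assumptions made up for illustration (neither is a gitteh or libgit2 function), and the sketch assumes the streaming call throws when the object is packed.

```javascript
const { Readable } = require('stream');

// Hypothetical helper: always returns a Readable for the object `id`.
// Assumptions (not gitteh APIs):
//   openLooseStream(id)  -> Readable, throws if the object is not loose
//   readWholeObject(id)  -> Buffer with the fully unpacked object data
function createObjectStream(id, { openLooseStream, readWholeObject }) {
  try {
    return openLooseStream(id);
  } catch (err) {
    // Fallback: read the whole object, then expose it as a one-chunk stream.
    // A real implementation would only fall back for a "packed" error
    // and rethrow anything else.
    const data = readWholeObject(id);
    return Readable.from([data]);
  }
}
```

The caller gets a stream either way, so code that pipes the result does not need to know whether the blob was loose or packed. The packed case still holds the whole object in memory.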

@TooTallNate
Author

Well, how does "getRawObject" work with packed blobs then? It does the magic under the hood, right? Would it not be possible to modify the underlying function so that it returns immediately (i.e. without any blob data) and, while it does its "magic" unpacking, the JS stream emits the unpacked (smaller) buffers as 'data' events?

Wouldn't something like that be possible?


@samcday
Contributor

samcday commented Mar 29, 2011

Alas, no. Currently, raw objects are delivered by a single request to libgit2 that returns a void* pointer to the data. There is no data until the method is called, and the method does not return until the data buffer has been fully populated.

Welcome to the world of low level C libraries (:

Look, the thing is, pack files are only built by git.git when you run "git gc" in the repo, or when you push/pull a repo remotely. If you're using gitteh to serve git resources directly, then you can just require that the blob resources being served are loose. I think there's actually a git.git command that forces all pack files to be decompressed into loose files anyway. libgit2 doesn't have this right now, but it will eventually.


@TooTallNate
Author

Thanks for all the work on gitteh thus far Sam, v0.1.0 looks exciting!

This one still irks me though. It seems that libgit2 needs to provide some more low-level functions. Why couldn't the function you're talking about (the one that gets the raw object's data) be modified (in libgit2) to instead return immediately and begin filling the void* from another thread? It would be ideal to work with some sort of readiness API, so that perhaps the internal node IOWatcher code could be reused like node-serialport has done.

It occurred to me that, even though it's extremely lame, I could spawn a git show HEAD:index.html with node and have a streamable stdout to read from. But I'd really rather not have to invoke child processes if at all possible, so doing something like I described above must be possible on some level.
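
For reference, the child-process workaround could look roughly like the sketch below. spawnStream and gitShowStream are illustrative names, not part of gitteh; the only real pieces used are Node's child_process.spawn and the git show command itself.

```javascript
const { spawn } = require('child_process');

// Runs a command and returns its stdout as a Readable stream.
// stdin is closed and stderr is passed through to the parent process.
function spawnStream(command, args, options = {}) {
  const child = spawn(command, args, {
    ...options,
    stdio: ['ignore', 'pipe', 'inherit'],
  });
  // If the process cannot be started, surface that as a stream error.
  child.on('error', (err) => child.stdout.destroy(err));
  return child.stdout;
}

// Streams the contents of `filePath` at revision `rev`,
// e.g. gitShowStream('/path/to/repo', 'HEAD', 'index.html').
function gitShowStream(repoDir, rev, filePath) {
  return spawnStream('git', ['show', `${rev}:${filePath}`], { cwd: repoDir });
}
```

Note that this sketch does not check git's exit code, so a missing path simply produces an empty stream; a fuller version would listen for the child's 'close' event and report a non-zero status.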

@samcday
Contributor

samcday commented Apr 3, 2011

Hey man,

I see your point, however I still need to stress that this situation is a little different.

That serial port example you linked, if you check out the binding code, it's just opening /dev/stty or something. Either way it's opening a stream on the local filesystem. The difference with libgit2 is, the blob you're trying to open might be a simple zlib compressed object on the filesystem, OR it might be a delta stored in a pack file, which in turn is a differencing blob from another blob in ANOTHER pack file ;)

Modifying libgit2 is a possibility, however I'm not really involved in libgit2 development at all.

Regarding the way git CLI works, even though you think it's streaming on stdout, I think you'd find if you timed it, there's a short delay while git unpacks the file the blob is contained in. I'm going to do a couple of tests when I get into the office shortly to demonstrate this.

Oh, and btw, I should note that everything I'm talking about right now concerns getting blobs from pack files. I'm going to implement a streaming method for loose objects in the next release.

In the end though, we should probably just run some timing tests and see how long it takes to get a Buffer of packed blob data from libgit2, and compare it to piping from the git CLI, for example. If the difference is severe enough, we can investigate ways to improve it. The other thing is that caching could just be the solution in the context of projects like gitProvider that are surfacing git data directly to a client.
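
One possible shape for such a timing test is sketched below. timeStream is a hypothetical helper (not part of gitteh) that records time to first chunk and total time for any Readable, so the same function could be pointed at a git CLI stdout stream or at a stream wrapping a Buffer returned by libgit2.

```javascript
// Consumes a Readable and resolves with byte count and timings in ms.
// firstByteMs is null if the stream ended without emitting any data.
function timeStream(stream) {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    let firstByteNs = null;
    let bytes = 0;

    stream.on('data', (chunk) => {
      if (firstByteNs === null) {
        firstByteNs = process.hrtime.bigint() - start;
      }
      bytes += chunk.length;
    });
    stream.on('end', () => {
      const totalNs = process.hrtime.bigint() - start;
      resolve({
        bytes,
        firstByteMs: firstByteNs === null ? null : Number(firstByteNs) / 1e6,
        totalMs: Number(totalNs) / 1e6,
      });
    });
    stream.on('error', reject);
  });
}
```

Comparing firstByteMs between the two approaches would show whether the CLI really starts streaming early or also waits for the unpack to finish.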


@iamwilhelm
Contributor

Recently, I was also looking for a stream interface for blobs, since I'm trying to read large blobs from the repository. In your comment above, you said that you were going to implement a stream interface for unpacked files, but I can't seem to find it in either the source or the documentation.

Was this ever implemented?

@mildsunrise
Contributor

@ all of us Yes, this is implemented.
You can stream objects from the Object DataBase (see this method, for example).

However, this would require implementation of an ODB class.

@mildsunrise
Contributor

@samcday Please mark this as wishlist!

4 participants