Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[open] handle input files that cannot seek backward #1978

Merged
merged 3 commits into from
Aug 1, 2023

Conversation

midichef
Copy link
Contributor

Summary
Visidata drops lines from files that don't support seeking backward, like pipes, or symlinks to fifos. (v2.11 and develop)

It can be seen by using bash's <(command) operator for process substitution, which is implemented with pipes:
vd <(seq 10000)
On my system this is missing thousands of lines, because the file-guessing code consumes them. And some of the file guessing code gives errors from trying to seek().

Similarly, open_txt() has the same problem. It consumes lines when it peeks at the first lines of the file.
vd -f txt <(seq 10000)

This fix checks whether input files support seek. If they do not, it reads the input into a RepeatFile, so that the same Path p can be opened multiple times.

Limitations
However, the fix will not change the behavior on Windows. This is because seekable() gives wrong info for pipes on Windows. CPython has an open issue for this but it seems unlikely to be fixed any time soon.

Also, the way I cause the RepeatFile to be created is a messy hack:
https://github.com/midichef/visidata/blob/3c14d752b8196d5641a0147935f4fce5f81e3049/visidata/_open.py#L109
There were already a confusing number of layers of open-like functions that get called: vd.openPath()open_txt()visidata.Path.open()visidata.Path._open()FileProgress(..., fp=self._path.open(). Inserting another layer via p = Path(p.given, fp=p.open(mode='rb')) is not great. I welcome any better ideas of how to do this.

Additional changes
I changed visidata.Path.open() to give an error when trying to open in binary mode, if the path has a RepeatFile. Because RepeatFile can only hold text and can only support the kinds of seeking that text files do.

This change helps head off file-guessing functions that expect a binary file. Otherwise they open the file and fail when trying to do a binary-appropriate end-relative seek(). This change also will prevent the error messages from guess_* in #1957

And I changed RepeatFile.seek(n) to take an extra parameter, because filetype-guessing libraries expect to be able to do seek(offset, 2), and give an error when they cannot:

guess_zip: RepeatFile.seek() takes 2 positional arguments but 3 were given

I modeled the implementation after what's in _pyio.py
https://github.com/python/cpython/blob/76c26eaca4147ba7e3e8d7379c5a828f0b512a46/Lib/_pyio.py#L2443

I also took out a check for if not create which is redundant, as it is checked earlier in openPath().

@midichef
Copy link
Contributor Author

It looks like the tests are failing because the file-saving code triggers for directories since they are not seekable. I'll look into replacing seekable() with is_fifo().

@saulpw
Copy link
Owner

saulpw commented Jul 28, 2023

This is really great, @midichef. I agree, that whole Path business is pretty convoluted. Thanks for diving into it and cleaning up some bits of it, at least. Let's keep thinking of ways we can simplify it, but for now these seem like definitive improvements.

@midichef
Copy link
Contributor Author

Okay, great, I squashed the changes and didn't change anything else.

@midichef midichef marked this pull request as ready for review July 29, 2023 07:09
@anjakefala
Copy link
Collaborator

anjakefala commented Jul 31, 2023

Hi @midichef!

Once again, I appreciate your thorough write-up in the description. This is a pretty bad bug, and I am grateful you caught it.

I have a question:

On my system this is missing thousands of lines, because the file-guessing code consumes them. And some of the file guessing code gives errors from trying to seek().

Is the intent that this PR fixes this issue? Or just the issue for open_text?

@anjakefala
Copy link
Collaborator

anjakefala commented Aug 1, 2023

Okay, I see. On develop vd <(seq 10000) would not work at all, and -f txt would lose a bunch of lines. On this PR vd <(seq 10000) loses thousands of lines by vd.guessFiletype(p) , and -f txt has them all.

I am going to see if I can help with not losing lines in vd <(seq 10000).

seq 10000 | vd works fine, which is why this has not been caught before.

@anjakefala
Copy link
Collaborator

The rows dropped are weirdly always the ones between 2679 and 6776? Which is fascinating.

@saulpw
Copy link
Owner

saulpw commented Aug 1, 2023

@anjakefala see what bytes those correspond to. 64k-128k maybe?

@anjakefala
Copy link
Collaborator

anjakefala commented Aug 1, 2023

One thing I learned is that codecs.iterdecode is returning an fptext that contains all of the rows. codecs.iterdecode is not what is dropping them. By the time it enters open_txt, they are dropped.

If you add [l for l in p.rfile] in guessFiletypes, to run after every guess attempt, all of the lines are read. This is a clue for how to fix this in a cleaner way, and what is going on, but I am not quite sure how to approach it.

@midichef
Copy link
Contributor Author

midichef commented Aug 1, 2023

Ah, nice catch. It looks like tarfile.is_tarfile() is the problem:
vd --debug <(seq 100000):
guess_tar: [Errno 29] Illegal seek

if tarfile.is_tarfile(p):

@anjakefala
Copy link
Collaborator

anjakefala commented Aug 1, 2023

@midichef Oh wow, I did notice that, and thought it Must Be the culprit. But got myself into a weird pdb session, and was not able to confirm that it was the culprit.

Indeed, it does seem to be! What happens if you get yourself into a debugging hole.

@anjakefala
Copy link
Collaborator

anjakefala commented Aug 1, 2023

Yeah, it seems like what is_tarfile wants is something akin to the RepeatFile. It wants a fp, not a Path!

is_tarfile(p.fp) does seem to resolve the issue, but I want to understand it a bit better.

(is_tarfile also only accepts a filepointer for > Python 3.9.)

@anjakefala
Copy link
Collaborator

I added a test so that hopefully we would catch this in the future! If things are green, I will go ahead and merge.

@anjakefala anjakefala merged commit 007814c into saulpw:develop Aug 1, 2023
13 checks passed
@midichef midichef deleted the pipe_open branch August 4, 2023 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants