Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

for downloads less than 32MB pep_api_data_obj_read_* is triggered more than once and contains wrong size info in len attribute #604

Closed
mstfdkmn opened this issue Aug 5, 2024 · 16 comments

Comments

@mstfdkmn
Copy link

mstfdkmn commented Aug 5, 2024

python-irodsclient 2.0.1
irods 4.3.2

I am observing unnecessary pep-invokes (and irrelevant size information in the len attribute) when an object that its size is under the parallel threshold (size is less than 32BM) is downloaded. I think that is because of the extra reading (READ_BUFFER_SIZE ) for the small files here at

return iter(lambda: f.read(chunksize), b'')

Called by

for chunk in chunks(o, self.READ_BUFFER_SIZE):

My observations:

  • when a zero byte object is downloaded, pep_api_data_obj_read_* is triggered only one time but the following message is captured. Please see the value for the len serialized object.

image

  • when an object that its size is more than zero (2bytes) but less than 32 MB is downloaded, pep_api_data_obj_read_* is triggered more than one time and the value in the len varies in each invoke.
  • that is, for a 2 bytes file download, pep_api_data_obj_read_* is triggered three times and only the one in the middle invoke has the correct size information.

image
image
image
image

  • for a 20MB file download, pep_api_data_obj_read_* is triggered 5 times and none of them has the correct size information.
    image
    image
    image
    image
    image
    image

I have a kind of fix that ensures pep_api_data_obj_read_* is called only once, accurately populating the len field in the PEP. My tests haven't revealed any issues, but I'd like to get your input before submitting a pull request to confirm there won't be unintended side effects. Or might be anther nice way to fix this?

in /irods/data_object.py:

def chunks(f, chunksize=io.DEFAULT_BUFFER_SIZE):
    pos = f.tell()
    size = f.seek(0,os.SEEK_END)
    f.seek(pos,os.SEEK_SET)
    if size < 32 * ( 1024 ** 2)  - 1:
        return (size, f.read(size), b'')
    return (size, iter(lambda: f.read(chunksize), b''))

L219 in/irods/manager/data_object_manager.py:

else:
    no_chunk = chunks(o, self.READ_BUFFER_SIZE)
    if no_chunk[0] < 32 * ( 1024 ** 2):
       f.write(no_chunk[1])
       return
  for chunk in chunks(o, self.READ_BUFFER_SIZE)[1]:
       f.write(chunk)
       do_progress_updates(updatables, len(chunk))
@d-w-moore
Copy link
Collaborator

d-w-moore commented Aug 28, 2024

If I'm understanding the use case, we want the ability to accurately record the length of the overall data object (or amount of it read in total by the _download on the client side,) within the execution of one PEP. Of course it's not an expectation that could ever apply to data objects larger than the single buffer transfer size. Does it make sense to aim for a client-side solution in this case? I think I wouldn't mind accepting alternate schemes for chunking as a parameter in the get( ) and _download functions, but I don't know whether altering the scheme for all applications is called for.

@trel
Copy link
Member

trel commented Aug 28, 2024

I don't know that it's the job of an individual read to have any idea how big the entire data object is...

Can we explain the multiple read calls and the differing sizes shown in the original post?

If we can explain them, then this might be just a documentation exercise, and not a bug?

@d-w-moore
Copy link
Collaborator

d-w-moore commented Aug 28, 2024

Right, the hint to me was with zero-size object read producing a len field of 8 *1024**2 that we might be recording in this "len" field the attempted read size.

@d-w-moore
Copy link
Collaborator

I notice a bytesWritten field in @mstfdkmn 's output above. Wonder if there's the possibility of a bytesRead field being implemented.

@d-w-moore
Copy link
Collaborator

I notice a bytesWritten field in @mstfdkmn 's output above. Wonder if there's the possibility of a bytesRead field being implemented.

Hmm... I notice in pep_api_data_obj_read_post, the 4th parameter (rule_args[3]) is a BytesBuf. So ... len( rule_args[3].get_bytes()) turns out to be a really good clue about how many bytes were read. The only remaining trick is to correlate that figure across multiple reads to the data object in question. Do we have any precedent or guide for that? I've looked through dumps of the information available to the read pep, and there isn't anything to indicate what data object is being operated on.

@trel
Copy link
Member

trel commented Aug 28, 2024

i don't think it's the job of the read pep to know about the entire object.

@d-w-moore
Copy link
Collaborator

^^^ There's an l1descInx of value 3 in the irods_types.OpenedDataObjInp . Maybe that correlates to the fd opened in a data_obj_open_*

@d-w-moore
Copy link
Collaborator

i don't think it's the job of the read pep to know about the entire object.

It wouldn't seem so, but maybe l1descInx value is a unique "pointer" to a data object for the server hosting the data object being downloaded. @alanking @korydraughn ?

@alanking
Copy link
Contributor

l1descInx indeed acts as an index into the L1 descriptor table which contains information about the replica which is open for read or write. One would need to use the get_file_descriptor_info API to get that information: https://docs.irods.org/4.3.2/doxygen/get__file__descriptor__info_8h.html It should contain the size of the replica as recorded in the catalog (which, generally, is trusted).

@d-w-moore
Copy link
Collaborator

Hello @mstfdkmn - If you would like to track the total downloads from specific replicas, I think it could be done via the python rule engine plugin. The following rule illustrates the underlying calls necessary to get descriptor information, as well as how you might tally bytes read from a source replica. You would need to key on certain information derived from the l1descinx member of the OpenedDataObjInp argument to pep_api_data_obj_read_post, and use the bytesBuf parameter to determine the actual length of the byte buffer read in any one invocation.

import json,pprint
def pep_api_data_obj_read_post(rule_args, callback, rei):
     bytes_read = len(rule_args[3].get_bytes())
     ret = callback.msi_get_file_descriptor_info(rule_args[2].l1descInx,"")
     fd_info = json.loads(ret['arguments'][1])
     callback.writeLine("serverLog","<<<<\nJSON from msi loaded and translated -->\n"+
             pprint.pformat(fd_info))
     callback.writeLine("serverLog",f"{bytes_read=}\n>>>>")


@mstfdkmn
Copy link
Author

mstfdkmn commented Sep 3, 2024

Hello @d-w-moore,

If the main goal is simply to know the total downloads, then you’re right, I can use a custom rule as you suggested. However, my primary motivation for reporting this issue was:

  • To ensure a standard and consistent order of PEPs invoked, in case a custom rule is needed for any reason.
  • To gather certain information in a standardized way from the audit plugin.
    That said, I’ve decided not to include the byte size read/downloaded in our auditing.

Thanks.

@d-w-moore
Copy link
Collaborator

Is there more to do on this issue, whether for the 2.1.1 milestone or otherwise? We know now that a rule can be used to help track bytes downloaded per invocation of the data object READ API but that (or, as @mstfdkmn mentions, the standard and consistent invocation order of PEPs) probably won't directly affect our client-side work.

@trel
Copy link
Member

trel commented Oct 4, 2024

I think the current behavior has been described and explained.

@alanking ?

@trel
Copy link
Member

trel commented Oct 4, 2024

And should the bug label be removed? 'question' instead?

@alanking
Copy link
Contributor

alanking commented Oct 4, 2024

Is there more to do on this issue, whether for the 2.1.1 milestone or otherwise? We know now that a rule can be used to help track bytes downloaded per invocation of the data object READ API but that (or, as @mstfdkmn mentions, the standard and consistent invocation order of PEPs) probably won't directly affect our client-side work.

I think the current behavior has been described and explained.

@alanking ?

I think that the definition for "standard and consistent" will be difficult to pin down as PEPs are server-side and merely react to API invocations from various clients maintained by various developers and even teams. Since we have understood and explained the behavior reported in this issue, I think we can consider it resolved.

And should the bug label be removed? 'question' instead?

Yes, agreed.

@trel
Copy link
Member

trel commented Oct 4, 2024

marking as such. and closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants