for downloads less than 32MB `pep_api_data_obj_read_*` is triggered more than once and contains wrong size info in `len` attribute #604

mstfdkmn · 2024-08-05T17:35:05Z

python-irodsclient 2.0.1
irods 4.3.2

I am observing unnecessary pep-invokes (and irrelevant size information in the len attribute) when an object that its size is under the parallel threshold (size is less than 32BM) is downloaded. I think that is because of the extra reading (READ_BUFFER_SIZE ) for the small files here at

python-irodsclient/irods/data_object.py

Line 20 in bf46679

return iter(lambda: f.read(chunksize), b'')

Called by

python-irodsclient/irods/manager/data_object_manager.py

Line 219 in bf46679

for chunk in chunks(o, self.READ_BUFFER_SIZE):

My observations:

when a zero byte object is downloaded, pep_api_data_obj_read_* is triggered only one time but the following message is captured. Please see the value for the len serialized object.

when an object that its size is more than zero (2bytes) but less than 32 MB is downloaded, pep_api_data_obj_read_* is triggered more than one time and the value in the len varies in each invoke.
that is, for a 2 bytes file download, pep_api_data_obj_read_* is triggered three times and only the one in the middle invoke has the correct size information.

for a 20MB file download, pep_api_data_obj_read_* is triggered 5 times and none of them has the correct size information.

I have a kind of fix that ensures pep_api_data_obj_read_* is called only once, accurately populating the len field in the PEP. My tests haven't revealed any issues, but I'd like to get your input before submitting a pull request to confirm there won't be unintended side effects. Or might be anther nice way to fix this?

in /irods/data_object.py:

def chunks(f, chunksize=io.DEFAULT_BUFFER_SIZE):
    pos = f.tell()
    size = f.seek(0,os.SEEK_END)
    f.seek(pos,os.SEEK_SET)
    if size < 32 * ( 1024 ** 2)  - 1:
        return (size, f.read(size), b'')
    return (size, iter(lambda: f.read(chunksize), b''))

L219 in/irods/manager/data_object_manager.py:

else:
    no_chunk = chunks(o, self.READ_BUFFER_SIZE)
    if no_chunk[0] < 32 * ( 1024 ** 2):
       f.write(no_chunk[1])
       return
  for chunk in chunks(o, self.READ_BUFFER_SIZE)[1]:
       f.write(chunk)
       do_progress_updates(updatables, len(chunk))

The text was updated successfully, but these errors were encountered:

d-w-moore · 2024-08-28T16:24:04Z

If I'm understanding the use case, we want the ability to accurately record the length of the overall data object (or amount of it read in total by the _download on the client side,) within the execution of one PEP. Of course it's not an expectation that could ever apply to data objects larger than the single buffer transfer size. Does it make sense to aim for a client-side solution in this case? I think I wouldn't mind accepting alternate schemes for chunking as a parameter in the get( ) and _download functions, but I don't know whether altering the scheme for all applications is called for.

trel · 2024-08-28T16:29:53Z

I don't know that it's the job of an individual read to have any idea how big the entire data object is...

Can we explain the multiple read calls and the differing sizes shown in the original post?

If we can explain them, then this might be just a documentation exercise, and not a bug?

d-w-moore · 2024-08-28T16:36:58Z

Right, the hint to me was with zero-size object read producing a len field of 8 *1024**2 that we might be recording in this "len" field the attempted read size.

d-w-moore · 2024-08-28T16:38:55Z

I notice a bytesWritten field in @mstfdkmn 's output above. Wonder if there's the possibility of a bytesRead field being implemented.

d-w-moore · 2024-08-28T17:14:40Z

I notice a bytesWritten field in @mstfdkmn 's output above. Wonder if there's the possibility of a bytesRead field being implemented.

Hmm... I notice in pep_api_data_obj_read_post, the 4th parameter (rule_args[3]) is a BytesBuf. So ... len( rule_args[3].get_bytes()) turns out to be a really good clue about how many bytes were read. The only remaining trick is to correlate that figure across multiple reads to the data object in question. Do we have any precedent or guide for that? I've looked through dumps of the information available to the read pep, and there isn't anything to indicate what data object is being operated on.

trel · 2024-08-28T17:17:35Z

i don't think it's the job of the read pep to know about the entire object.

d-w-moore · 2024-08-28T17:17:59Z

^^^ There's an l1descInx of value 3 in the irods_types.OpenedDataObjInp . Maybe that correlates to the fd opened in a data_obj_open_*

d-w-moore · 2024-08-28T17:20:49Z

i don't think it's the job of the read pep to know about the entire object.

It wouldn't seem so, but maybe l1descInx value is a unique "pointer" to a data object for the server hosting the data object being downloaded. @alanking @korydraughn ?

alanking · 2024-08-28T17:35:59Z

l1descInx indeed acts as an index into the L1 descriptor table which contains information about the replica which is open for read or write. One would need to use the get_file_descriptor_info API to get that information: https://docs.irods.org/4.3.2/doxygen/get__file__descriptor__info_8h.html It should contain the size of the replica as recorded in the catalog (which, generally, is trusted).

d-w-moore · 2024-08-29T19:14:48Z

Hello @mstfdkmn - If you would like to track the total downloads from specific replicas, I think it could be done via the python rule engine plugin. The following rule illustrates the underlying calls necessary to get descriptor information, as well as how you might tally bytes read from a source replica. You would need to key on certain information derived from the l1descinx member of the OpenedDataObjInp argument to pep_api_data_obj_read_post, and use the bytesBuf parameter to determine the actual length of the byte buffer read in any one invocation.

import json,pprint
def pep_api_data_obj_read_post(rule_args, callback, rei):
     bytes_read = len(rule_args[3].get_bytes())
     ret = callback.msi_get_file_descriptor_info(rule_args[2].l1descInx,"")
     fd_info = json.loads(ret['arguments'][1])
     callback.writeLine("serverLog","<<<<\nJSON from msi loaded and translated -->\n"+
             pprint.pformat(fd_info))
     callback.writeLine("serverLog",f"{bytes_read=}\n>>>>")

mstfdkmn · 2024-09-03T13:17:45Z

Hello @d-w-moore,

If the main goal is simply to know the total downloads, then you’re right, I can use a custom rule as you suggested. However, my primary motivation for reporting this issue was:

To ensure a standard and consistent order of PEPs invoked, in case a custom rule is needed for any reason.
To gather certain information in a standardized way from the audit plugin.
That said, I’ve decided not to include the byte size read/downloaded in our auditing.

Thanks.

d-w-moore · 2024-10-04T14:15:52Z

Is there more to do on this issue, whether for the 2.1.1 milestone or otherwise? We know now that a rule can be used to help track bytes downloaded per invocation of the data object READ API but that (or, as @mstfdkmn mentions, the standard and consistent invocation order of PEPs) probably won't directly affect our client-side work.

trel · 2024-10-04T14:17:52Z

I think the current behavior has been described and explained.

@alanking ?

trel · 2024-10-04T14:18:30Z

And should the bug label be removed? 'question' instead?

alanking · 2024-10-04T14:24:34Z

Is there more to do on this issue, whether for the 2.1.1 milestone or otherwise? We know now that a rule can be used to help track bytes downloaded per invocation of the data object READ API but that (or, as @mstfdkmn mentions, the standard and consistent invocation order of PEPs) probably won't directly affect our client-side work.

I think the current behavior has been described and explained.

@alanking ?

I think that the definition for "standard and consistent" will be difficult to pin down as PEPs are server-side and merely react to API invocations from various clients maintained by various developers and even teams. Since we have understood and explained the behavior reported in this issue, I think we can consider it resolved.

And should the bug label be removed? 'question' instead?

Yes, agreed.

trel · 2024-10-04T14:31:46Z

marking as such. and closing.

korydraughn added the consortium-member label Aug 5, 2024

korydraughn assigned d-w-moore Aug 5, 2024

korydraughn added the bug label Aug 5, 2024

alanking added this to the 2.1.1 milestone Aug 5, 2024

trel added question resolved/invalid and removed bug labels Oct 4, 2024

trel closed this as completed Oct 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

for downloads less than 32MB `pep_api_data_obj_read_*` is triggered more than once and contains wrong size info in `len` attribute #604

for downloads less than 32MB `pep_api_data_obj_read_*` is triggered more than once and contains wrong size info in `len` attribute #604

mstfdkmn commented Aug 5, 2024

d-w-moore commented Aug 28, 2024 •

edited

Loading

trel commented Aug 28, 2024

d-w-moore commented Aug 28, 2024 •

edited

Loading

d-w-moore commented Aug 28, 2024

d-w-moore commented Aug 28, 2024

trel commented Aug 28, 2024

d-w-moore commented Aug 28, 2024

d-w-moore commented Aug 28, 2024

alanking commented Aug 28, 2024

d-w-moore commented Aug 29, 2024

mstfdkmn commented Sep 3, 2024

d-w-moore commented Oct 4, 2024

trel commented Oct 4, 2024

trel commented Oct 4, 2024

alanking commented Oct 4, 2024

trel commented Oct 4, 2024

for downloads less than 32MB pep_api_data_obj_read_* is triggered more than once and contains wrong size info in len attribute #604

for downloads less than 32MB pep_api_data_obj_read_* is triggered more than once and contains wrong size info in len attribute #604

Comments

mstfdkmn commented Aug 5, 2024

d-w-moore commented Aug 28, 2024 • edited Loading

trel commented Aug 28, 2024

d-w-moore commented Aug 28, 2024 • edited Loading

d-w-moore commented Aug 28, 2024

d-w-moore commented Aug 28, 2024

trel commented Aug 28, 2024

d-w-moore commented Aug 28, 2024

d-w-moore commented Aug 28, 2024

alanking commented Aug 28, 2024

d-w-moore commented Aug 29, 2024

mstfdkmn commented Sep 3, 2024

d-w-moore commented Oct 4, 2024

trel commented Oct 4, 2024

trel commented Oct 4, 2024

alanking commented Oct 4, 2024

trel commented Oct 4, 2024

for downloads less than 32MB `pep_api_data_obj_read_*` is triggered more than once and contains wrong size info in `len` attribute #604

for downloads less than 32MB `pep_api_data_obj_read_*` is triggered more than once and contains wrong size info in `len` attribute #604

d-w-moore commented Aug 28, 2024 •

edited

Loading

d-w-moore commented Aug 28, 2024 •

edited

Loading