adds read_npz and write_npz, convenience wrappers #46

Open
wants to merge 1 commit into
base: master
from

Conversation

jonathanstrong

these work like read_npy and write_npy but read and write compressed .npz files instead. I wanted this functionality for writing ephemeral array files, so I can check something later if needed without taking up too much disk space.

compared to read_npy/write_npy, there is one major difference: since an .npz file can contain multiple named arrays, write_npz picks a default name for the single array it writes, while read_npz lets the user specify the name of the array to extract. this may not be the best choice, but not allowing a name in read_npz seemed less than ideal, and I wanted write_npz to remain as simple as possible.

I picked the default name for write_npz based on what numpy does in savez_compressed ("arr_0.npy"). however, I think there is a divergence there. using np.load, you will get a dict-like object that allows you to access the arrays without the .npy extension (i.e. at key arr_0). however, using NpzReader, you need to use the full arr_0.npy name to retrieve the same array. just wanted to flag as this tripped me up a bit.
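
To make the naming divergence concrete, here is a minimal std-only sketch of the mapping between the entry name stored in the archive (arr_0.npy) and the key np.load exposes (arr_0). Both `numpy_key` and `entry_name` are hypothetical helpers written for illustration; they are not part of ndarray-npy or NumPy.

```rust
/// Hypothetical helper (not part of ndarray-npy): maps the name of a file
/// stored inside an .npz archive to the key under which `np.load` exposes it.
fn numpy_key(entry_name: &str) -> &str {
    // NumPy drops a trailing ".npy" from the archive entry name.
    entry_name.strip_suffix(".npy").unwrap_or(entry_name)
}

/// Hypothetical inverse: builds the archive entry name for a NumPy-style key,
/// appending ".npy" only if it is not already present.
fn entry_name(key: &str) -> String {
    if key.ends_with(".npy") {
        key.to_owned()
    } else {
        format!("{}.npy", key)
    }
}
```

With helpers like these, a caller could pass either form of the name and still look up the full arr_0.npy entry name that NpzReader expects, as described above.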

thanks for your consideration of this pull request.

… and `NpzReader` similar to `read_npy` and `write_npy`
@jturner314
Owner

jturner314 commented Mar 5, 2021

using np.load, you will get a dict-like object that allows you to access the arrays without the .npy extension (i.e. at key arr_0). however, using NpzReader, you need to use the full arr_0.npy name to retrieve the same array. just wanted to flag as this tripped me up a bit.

Thanks for pointing this out. I've created #48 to track this issue.

Thanks also for the PR. There are a few things about the proposed API which are unsatisfying to me:

  • The inconsistency regarding the name parameter in read_npz/write_npz. I'd prefer for either both to accept a name parameter or neither to accept a name. I think it would be better for both to accept a name.
  • The names read_npz/write_npz would IMO be more appropriate for functions which read/write general .npz files, rather than functions which read/write .npz files containing only a single array. For these functions, I'd prefer names more like read_npz_array/write_npz_array_compressed.
  • It would be simpler for the name parameter to have type &str rather than N: Into<String>.
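
As a rough illustration of the last point, the std-only sketch below contrasts the two parameter styles. `describe_generic` and `describe_str` are hypothetical stand-ins for a function that takes an array name; they are not the functions proposed in this PR.

```rust
/// Hypothetical stand-in for a function whose name parameter is generic.
/// Each distinct argument type gets its own instantiation, and the name is
/// converted into an owned `String` even if it is only read.
fn describe_generic<N: Into<String>>(name: N) -> String {
    let name: String = name.into();
    format!("array stored as {}", name)
}

/// The same function taking `&str`: no generic parameter, and the name is
/// only borrowed.
fn describe_str(name: &str) -> String {
    format!("array stored as {}", name)
}
```

A caller holding a `String` can still use the `&str` version by passing `&owned`, so the simpler signature gives up very little flexibility when the function only needs to read the name.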

Creating a .npz archive for a single array seems somewhat awkward. I wonder if you'd be happier using a single-file compression format (such as .gz, .xz, .bz2, or .zst) applied to a .npy file instead of using a .zip/.npz archive. This would avoid the problem of choosing a name for the array in the archive and would avoid the complexity of the .zip format. For example, to write/read a .npy.gz file using ndarray-npy, you could do this:

use flate2::{bufread::GzDecoder, write::GzEncoder, Compression};
use ndarray::{array, Array2};
use ndarray_npy::{ReadNpyError, ReadNpyExt, WriteNpyError, WriteNpyExt};
use std::fs::File;
use std::io::{BufReader, BufWriter, Write};
use std::path::Path;

fn write_npy_gz<P, T>(path: P, array: &T) -> Result<(), WriteNpyError>
where
    P: AsRef<Path>,
    T: WriteNpyExt,
{
    // Note: I'm not sure if the `BufWriter` actually helps or not.
    let mut writer = GzEncoder::new(BufWriter::new(File::create(path)?), Compression::default());
    array.write_npy(&mut writer)?;
    writer.finish()?.flush()?;
    Ok(())
}

fn read_npy_gz<P, T>(path: P) -> Result<T, ReadNpyError>
where
    P: AsRef<Path>,
    T: ReadNpyExt,
{
    // Note: I'm not sure if the `BufReader` actually helps or not.
    T::read_npy(GzDecoder::new(BufReader::new(File::open(path)?)))
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let arr1 = array![[1, 2, 3], [4, 5, 6]];

    // Write the array.
    write_npy_gz("foo.npy.gz", &arr1)?;

    // Read it back.
    let arr2: Array2<i32> = read_npy_gz("foo.npy.gz")?;

    println!("arr1:\n{}", arr1);
    println!("arr2:\n{}", arr2);
    assert_eq!(arr1, arr2);

    Ok(())
}

To read it with NumPy, you could do this:

import numpy as np
import gzip

def load_npy_gz(path):
    with gzip.open(path) as f:
        return np.load(f)

arr = load_npy_gz('foo.npy.gz')
print(arr)

(You could also decompress .npy.gz files at the command line using gunzip.)
