Reading files into NArray #192

Open
railsmk opened this issue Jul 7, 2021 · 10 comments

Comments

@railsmk

railsmk commented Jul 7, 2021

Hello,

I have asked a question at stackoverflow concerning reading files into Numo::NArray and dynamically inserting rows/data.
Would you care to show me the way or tell me if the thing I'm trying to do is even possible to accomplish with this library?

https://stackoverflow.com/questions/68282417/reading-files-into-ruby-numonarray

@kojix2
Contributor

kojix2 commented Jul 7, 2021

Hello.
I think it depends on the size of the data.

If the data is small, you can read a text file and convert the resulting Ruby array to a Numo::NArray with the cast method.
For example:

Input file:

3 5
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Ruby script:

require 'numo/narray'

na = nil

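File.open("input.txt", "r") do |f|
  # First line holds the shape, e.g. "3 5" -> 3 rows, 5 columns.
  rows, cols = f.gets.split.map(&:to_i)
  # Second line holds the values, separated by whitespace.
  arr = f.gets.split.map(&:to_i)
  # UInt8 is enough here because every value fits in 0..255.
  na = Numo::UInt8.cast(arr).reshape(rows, cols)
end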

p na

#132

Of course, you can also use the CSV library.

If the input is in binary format, use from_string to convert a binary string to NArray.
#140

A Numo::NArray cannot change its size in place. Methods such as hstack, vstack, and dstack look as if they change the shape, but they actually generate a new NArray. Inserting rows means allocating new memory, so you would need to create a new NArray each time the number of rows grows.

If you want to deal with really huge data, Apache Arrow may be useful.

Currently, Apache Arrow is the fastest way to convert CSV files to Ruby data, with much better performance than the standard CSV library.

See
https://github.com/red-data-tools/red-arrow-numo-narray

@kojix2
Contributor

kojix2 commented Jul 7, 2021

Or do you need an alternative to numpy's fromfile?
https://numpy.org/doc/stable/reference/generated/numpy.fromfile.html

@railsmk
Author

railsmk commented Jul 7, 2021

Thank you for your reply.

Yes, I'm searching for a way to load huge data in a Rails/Ruby application. Small data would also be used, as it all depends on user input. The user can upload any file type; it gets split and the data is loaded as binary.
The data must be loaded into a matrix (every file in a different row), and then I need to perform a few operations (multiplications and inversions) with smaller helper matrices. The crucial thing is that I have to modify the library so it works with Galois fields. I have done this with the old NArray. Performance of the math operations was OK, but the time it took to load the data was unacceptable.

I have never used numpy, but I think there would be a problem, as I have to combine a few files into one array/matrix.
I will try the red-arrow lib and see if it works.

Thank you once again. Now that I have specified my goal, I would appreciate it if you could think of anything else to recommend.

@railsmk
Author

railsmk commented Jul 7, 2021

I can also specify that the files a user can upload won't be larger than 8-16 GB. Will the Ruby script you posted at the beginning be enough to achieve fast results? I guess that's not considered "large" data.

In my first approach to the algorithm I tried loading the data with the built-in Ruby array, and it wasn't fast enough, so I guess I will have to use something else.

@kojix2
Contributor

kojix2 commented Jul 7, 2021

I see.
Big files, over several gigabytes...
If so, it would be better to create a binary string somehow and use the store_binary or from_binary method to create the NArray.
If your input file is TSV or CSV, Apache Arrow + red-arrow-numo-narray is a good choice. Even for other formats, you may be able to speed up loading if you find a way to create a binary string from your files. You don't need to use a Ruby script to create the binary string; a fast native executable will do it more quickly. That's about all I know about it. I guess others can answer in more detail.
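
As an illustration of the "create a binary string" step, here is a minimal sketch in plain Ruby using Array#pack. It assumes the text file contains only whitespace-separated integers in the range 0..255. `text_to_bytes` is a hypothetical helper name. It is written in Ruby only for consistency with the rest of the thread, so a native tool may well be faster, as noted above.

```ruby
# Hypothetical helper: turn a text file of whitespace-separated integers
# into a raw byte file, one byte per value.
def text_to_bytes(text_path, bin_path)
  File.open(bin_path, 'wb') do |out|
    # Process one line at a time so the text file is never fully in memory.
    File.foreach(text_path) do |line|
      values = line.split.map { |token| Integer(token, 10) }
      unless values.all? { |v| v.between?(0, 255) }
        raise ArgumentError, 'value out of range 0..255'
      end
      out.write(values.pack('C*')) # 'C' packs an unsigned 8-bit integer
    end
  end
end
```

The resulting file can then be read back as a binary string with File.binread.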

@railsmk
Author

railsmk commented Jul 7, 2021

Thank you for sharing your knowledge. It's the first time I'm working with data this big, so you helped a lot, as I didn't know anything about it. I'm going to try your recommendations in the following days.

@kojix2
Contributor

kojix2 commented Jul 8, 2021

Good luck with your work.
If you are familiar with the C language, you may want to create C extensions to read the files.
I don't know much about C, but I think a library called magro might be helpful.
https://github.com/yoshoku/magro

https://github.com/yoshoku/magro/blob/2ed598d02f0d9cc52baead28415dbdb8c6883101/ext/magro/imgrw.c#L116-L122

  VALUE nary;
  uint8_t* nary_ptr;
  nary = rb_narray_new(numo_cUInt8, n_dims, shape);
  nary_ptr = (uint8_t*)na_get_pointer_for_write(nary);


  for (y = 0; y < height; y++) {
    row_ptr = row_ptr_ptr[y];
    memcpy(nary_ptr + y * width * n_ch, row_ptr, width * n_ch);
  }
  return nary;

@kojix2
Contributor

kojix2 commented Jul 22, 2021

@railsmk
Author

railsmk commented Jul 26, 2021

Thank you for constantly bringing new ideas to the table. Overall the library works fine. Performance is OK for now; in particular, loading directly from binary data works great, better than I expected. I will try the fromfile gem soon. I am not yet sure whether matrix multiplication performance will be sufficient to bring the product to market, but that part can be done in a different way. I appreciate your involvement, take care.

@kojix2
Contributor

kojix2 commented Jul 27, 2021

@railsmk
That's good to know.
narray-fromfile is a library created by a university student as a hobby.
The implementation is helpful, but you might not want to use it in a production environment.
Good luck!
