Add optional encoding argument to set output character encoding #47

Merged (2 commits) on Jun 13, 2024

SimonBrazell

What's this PR do?

Adds an optional encoding argument to #text, #html and .read for setting the output character encoding, which is passed to Tika as the --encoding option. This value is validated against Ruby's Encoding.name_list, raising an ArgumentError if it isn't included in the list.
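
The validation described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the helper name `validate_encoding!` and the error message are assumptions made for illustration; only `Encoding.name_list` and `ArgumentError` come from the description.

```ruby
# Hypothetical helper (name and message are assumptions, not Henkei's code).
# Returns the encoding name unchanged when Ruby knows it, nil when none was
# given, and raises ArgumentError otherwise.
def validate_encoding!(encoding)
  return nil if encoding.nil?
  return encoding if Encoding.name_list.include?(encoding)

  raise ArgumentError, "unsupported encoding: #{encoding.inspect}"
end

validate_encoding!('UTF-8') # passes, 'UTF-8' is in Encoding.name_list
```

Because `include?` is an exact string match, a sketch like this only accepts names spelled the way `Encoding.name_list` reports them.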

Why is it needed?

So we can set the output encoding via Henkei / Tika instead of having to do it ourselves afterwards.

Where should the reviewer start?

  • lib/henkei.rb:222

How should this be manually tested?

require 'henkei'

henkei = Henkei.new 'sample.pages'
utf_8_text = henkei.text(encoding: 'UTF-8')
utf_8_text.encoding
# => #<Encoding:UTF-8>

lib/henkei.rb Outdated
stdout.binmode
stdout.set_encoding encoding unless encoding.nil?

stdin.puts data
Owner

capture2 uses write at this stage. It is also wrapped in a begin/rescue block catching the Errno::EPIPE error.

Looking at the documentation/source, they do slightly different things, including puts writing an extra newline.
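
The newline difference is easy to see in isolation. A small illustration using `StringIO` in place of the real pipe (the variable names are made up for this example):

```ruby
require 'stringio'

data = 'hello'

with_puts = StringIO.new
with_puts.puts data   # appends "\n" because data does not already end with one

with_write = StringIO.new
with_write.write data # writes the string exactly as given

p with_puts.string    # "hello\n"
p with_write.string   # "hello"
```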

stdout.set_encoding encoding unless encoding.nil?

stdin.puts data
out_reader = Thread.new { stdout.read }
Owner

capture2 also sets up the read thread before writing to the input pipe.

Author

SimonBrazell commented Jun 11, 2024

@abrom I modified the Open3.popen2 call to more closely match the capture2 source.

https://github.com/ruby/open3/blob/b8909222051b4103a19eba19506727faece252e7/lib/open3.rb#L775
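
For readers following the thread, the shape being discussed is roughly the one below. This is a sketch under assumptions, not the merged Henkei code: `run_with_encoding` is a hypothetical name, and a small Ruby child process that echoes stdin stands in for the Tika command. It combines the points raised in review: start the reader thread before writing, use `write` rather than `puts`, and rescue `Errno::EPIPE`.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper, loosely following the structure of Open3.capture2.
def run_with_encoding(cmd, data, encoding: nil)
  Open3.popen2(*cmd) do |stdin, stdout, wait_thr|
    stdin.binmode
    stdout.binmode
    stdout.set_encoding(encoding) unless encoding.nil?

    # Start reading before writing so a full output pipe cannot stall the child.
    out_reader = Thread.new { stdout.read }

    begin
      stdin.write data # write, not puts, so no trailing newline is added
    rescue Errno::EPIPE
      # The child closed its stdin early; keep whatever it already printed.
    end
    stdin.close

    [out_reader.value, wait_thr.value]
  end
end

# Stand-in child process that copies stdin to stdout unchanged.
echo = [RbConfig.ruby, '-e', '$stdout.binmode; $stdout.write($stdin.read)']
output, status = run_with_encoding(echo, "caf\u00e9", encoding: 'UTF-8')
```

Here `output` holds the child's stdout tagged with the requested encoding, and `status` is the child's `Process::Status`.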

SimonBrazell requested a review from abrom on June 11, 2024 00:20
Owner

abrom left a comment

Nice one @SimonBrazell 👍

abrom merged commit ad26994 into abrom:main on Jun 13, 2024
4 checks passed
SimonBrazell deleted the add-tika-encoding-option branch on June 14, 2024 11:01