[pull] master from huginn:master #56

Merged 6 commits on Oct 30, 2024
4 changes: 4 additions & 0 deletions CHANGES.md
@@ -2,6 +2,10 @@

 | DateOfChange | Changes |
 |----------------|--------------------------------------------------------------------------------------------------------------|
+| Oct 27, 2024 | WebsiteAgent can output raw XPath values by enabling the `raw` option, effectively obsoleting the `array` option which is now called `single_array`. [#3457](https://github.com/huginn/huginn/pull/3457) |
+| Oct 26, 2024 | Fix huginn_agent gems for the Zeitwerk loader. [#3451](https://github.com/huginn/huginn/pull/3451) |
+| Oct 26, 2024 | LiquidOutputAgent supports a new option `line_break_is_lf`. [#3456](https://github.com/huginn/huginn/pull/3456) |
+| Oct 26, 2024 | LiquidOutputAgent supports ETag. [#3455](https://github.com/huginn/huginn/pull/3455) |
 | May 01, 2024 | Web requesting agents can handle invalid characters by replacing them with U+FFFD. [#3336](https://github.com/huginn/huginn/pull/3336) |
 | May 01, 2024 | Allow setting ActionMailer read and open timeouts. [#3361](https://github.com/huginn/huginn/pull/3361) |
 | May 01, 2024 | Add support for `ttl` and `monospace` to PushoverAgent. [#3389](https://github.com/huginn/huginn/pull/3389) |
78 changes: 65 additions & 13 deletions app/models/agents/website_agent.rb
@@ -62,7 +62,9 @@ class WebsiteAgent < Agent

       Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions, all namespaces are stripped from the document unless the top-level option `use_namespaces` is set to `true`.

-      For extraction with `array` set to true, all matches will be extracted into an array. This is useful when extracting list elements or multiple parts of a website that can only be matched with the same selector.
+      For extraction with `raw` set to true, each value will be returned as is without any conversion instead of stringifying them. This is useful when you want to extract a number, a boolean value, or an array of strings.
+
+      For extraction with `single_array` set to true, all matches will be extracted into an array. This is useful when extracting list elements or multiple parts of a website that can only be matched with the same selector.

       # Scraping JSON
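
The difference that `single_array` makes to the collected values can be sketched in plain Ruby. This is only an illustration under stated assumptions: `collect_values` is a hypothetical helper, and the literal array stands in for the per-node values that the agent obtains from Nokogiri.

```ruby
# Hypothetical helper (not part of Huginn) that mimics how matches are
# accumulated with and without `single_array`.
def collect_values(matches, single_array:)
  values = []
  if single_array
    # All matches are pushed as ONE element: an array of matches.
    values << matches
  else
    # Each match becomes its own element.
    matches.each { |match| values << match }
  end
  values
end

# Stand-in for the text of two nodes matched by the same selector.
matches = ['Apple', 'Banana']

flat   = collect_values(matches, single_array: false)
nested = collect_values(matches, single_array: true)

puts flat.inspect
puts nested.inspect
```

With `single_array` the whole match list arrives as a single value, which is the behaviour the old `array` option had before the rename.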

@@ -295,6 +297,15 @@ def validate_extract_options!
       case extraction_type
       when 'html', 'xml'
         extract.each do |name, details|
+          details.each do |name,|
+            case name
+            when 'css', 'xpath', 'value', 'repeat', 'hidden', 'raw', 'single_array'
+              # ok
+            else
+              errors.add(:base, "Unknown key #{name.inspect} in extraction details")
+            end
+          end
+
           case details['css']
           when String
             # ok
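
The new key check amounts to an allow-list. A minimal standalone sketch, assuming a hypothetical `unknown_key_errors` helper and an `ALLOWED_KEYS` constant (neither exists in Huginn; the agent adds to `errors` inline instead):

```ruby
# Keys accepted in an HTML/XML extraction entry, copied from the diff.
ALLOWED_KEYS = %w[css xpath value repeat hidden raw single_array].freeze

# Hypothetical helper: returns one message per unrecognised key, using the
# same wording as the validation added in this hunk.
def unknown_key_errors(details)
  details.keys
         .reject { |key| ALLOWED_KEYS.include?(key) }
         .map { |key| "Unknown key #{key.inspect} in extraction details" }
end

# An entry that still uses the pre-rename `array` key is now flagged.
puts unknown_key_errors('css' => '.title', 'array' => 'true').inspect
puts unknown_key_errors('css' => '.title', 'single_array' => 'true').inspect
```

Because the old `array` key no longer passes validation, existing agents need their options rewritten, which appears to be the purpose of the migration included in this pull request.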
@@ -627,22 +638,63 @@ def extract_xml(doc)
             else
               raise '"css" or "xpath" is required for HTML or XML extraction'
             end
+
           log "Extracting #{extraction_type} at #{xpath || css}"
-          case nodes
-          when Nokogiri::XML::NodeSet
-            stringified_nodes = nodes.map do |node|
-              case value = node.xpath(extraction_details['value'] || '.')
-              when Float
-                # Node#xpath() returns any numeric value as float;
-                # convert it to integer as appropriate.
-                value = value.to_i if value.to_i == value
+
+          expr = extraction_details['value'] || '.'
+
+          handle_float = ->(value) {
+            case
+            when value.nan?
+              'NaN'
+            when value.infinite?
+              if value > 0
+                'Infinity'
+              else
+                '-Infinity'
               end
-              value.to_s
+            when value.to_i == value
+              # Node#xpath() returns any numeric value as float;
+              # convert it to integer as appropriate.
+              value.to_i
+            else
+              value
             end
-            if boolify(extraction_details['array'])
-              values << stringified_nodes
+          }
+          jsonify =
+            if boolify(extraction_details['raw'])
+              ->(value) {
+                case value
+                when nil, true, false, String, Integer
+                  value
+                when Float
+                  handle_float.call(value)
+                when Nokogiri::XML::NodeSet
+                  value.map(&jsonify)
+                else
+                  value.to_s
+                end
+              }
+            else
+              ->(value) {
+                case value
+                when Float
+                  handle_float.call(value).to_s
+                else
+                  value.to_s
+                end
+              }
+            end
+
+          case nodes
+          when Nokogiri::XML::NodeSet
+            node_values = nodes.map { |node|
+              jsonify.call(node.xpath(expr))
+            }
+            if boolify(extraction_details['single_array'])
+              values << node_values
             else
-              stringified_nodes.each { |n| values << n }
+              node_values.each { |value| values << value }
             end
           else
             raise "The result of HTML/XML extraction was not a NodeSet"
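
The conversion rules in this hunk can be tried without Nokogiri. The sketch below is an approximation under stated assumptions: it uses plain methods instead of the diff's lambdas, the method names are hypothetical, and `Array` stands in for `Nokogiri::XML::NodeSet`.

```ruby
# Float handling as in the diff: non-finite floats become strings,
# integral floats become Integers, other floats are left alone.
# NaN and infinity are checked first because Float#to_i raises on them.
def handle_float(value)
  if value.nan?
    'NaN'
  elsif value.infinite?
    value > 0 ? 'Infinity' : '-Infinity'
  elsif value.to_i == value
    value.to_i
  else
    value
  end
end

# `raw` mode: keep JSON-friendly types, recurse into collections.
# ASSUMPTION: Array replaces Nokogiri::XML::NodeSet for illustration.
def jsonify_raw(value)
  case value
  when nil, true, false, String, Integer then value
  when Float then handle_float(value)
  when Array then value.map { |item| jsonify_raw(item) }
  else value.to_s
  end
end

# Default mode: everything ends up as a String.
def jsonify_string(value)
  value.is_a?(Float) ? handle_float(value).to_s : value.to_s
end

puts jsonify_raw(3.0).inspect
puts jsonify_raw(true).inspect
puts jsonify_string(3.0).inspect
puts jsonify_string(true).inspect
```

In `raw` mode a numeric XPath result such as `count(...)` would therefore reach the event payload as a number and a boolean expression as `true` or `false`, while the default mode keeps the previous all-strings behaviour.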
@@ -0,0 +1,29 @@
+class WebsiteAgentRenameArrayToSingleArray < ActiveRecord::Migration[6.1]
+  def up
+    Agents::WebsiteAgent.find_each do |agent|
+      case extract = agent.options['extract']
+      when Hash
+        extract.each_value do |details|
+          if details.is_a?(Hash) && details.key?('array')
+            details['single_array'] = details.delete('array')
+          end
+        end
+        agent.save(validate: false)
+      end
+    end
+  end
+
+  def down
+    Agents::WebsiteAgent.find_each do |agent|
+      case extract = agent.options['extract']
+      when Hash
+        extract.each_value do |details|
+          if details.is_a?(Hash) && details.key?('single_array')
+            details['array'] = details.delete('single_array')
+          end
+        end
+        agent.save(validate: false)
+      end
+    end
+  end
+end
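
The rename performed by `up` and `down` can be exercised on a plain options hash. This is a sketch only: `rename_extract_key!` is a hypothetical helper, and no ActiveRecord model or database is involved.

```ruby
# Hypothetical helper mirroring the migration's loop: rename `from` to `to`
# inside every Hash-valued entry of options['extract'], in place.
def rename_extract_key!(options, from, to)
  extract = options['extract']
  return options unless extract.is_a?(Hash)

  extract.each_value do |details|
    next unless details.is_a?(Hash) && details.key?(from)

    details[to] = details.delete(from)
  end
  options
end

options = {
  'extract' => {
    'titles' => { 'css' => 'h2', 'value' => 'string(.)', 'array' => 'true' },
    'link'   => { 'css' => 'a', 'value' => '@href' }
  }
}

rename_extract_key!(options, 'array', 'single_array')   # like `up`
puts options['extract']['titles'].inspect

rename_extract_key!(options, 'single_array', 'array')   # like `down`
puts options['extract']['titles'].inspect
```

Entries without the key, and `extract` values that are not hashes, are left untouched, matching the guards in the migration.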