Commit e9cea8e

ignore training subject sets for counters and selection (zooniverse#3140)
* ignore training sets for workflow counts
  sets marked as training are not included for workflow retired and classification counts
* add non_training_subject_sets relation to workflow resource
  the workflow can report which subject sets are marked for training data
* use new workflow non training sets method
* ignore training sets on subject selection by default
  ignore training subject sets in selection unless the user specifies a training subject set on a grouped workflow
* shift the training sets specs to the correct file
* remove relation comment about migration and index
* use the metadata config for spec setup
* clarify variable and comment wording
* avoid hitting the db for spec setup
  stub the non training set relation behaviour as it's tested on the workflow resource
* use workflow configuration attr training_set_ids
  consolidate on one way to configure the API and selector systems to determine a workflow's training sets
* always load configuration json attr
  whitelist workflow configuration attribute as it's used in subject selection to determine training set ids
* filter the set relation lookups for known fk values
* add docs on training selector configuration
* use reject instead of invert select
1 parent b099c80 commit e9cea8e

7 files changed: +134 -14 lines changed

apiary.apib

Lines changed: 12 additions & 0 deletions
@@ -2129,6 +2129,18 @@ the strategy. There are 2 valid criteria:
 If retirement is left blank Panoptes defaults to the `classification_count`
 strategy with 15 classifications per subject.

+A Workflow has a _configuration_ object that is used to store client configuration details.
+Three attributes in this object are reserved: _training_set_ids_, _training_chances_ and _training_default_chance_.
+_training_set_ids_ are used by the subject selectors to choose training data from,
+and to tell the counting systems which subjects to ignore, as training subjects do not retire.
+_training_default_chance_ is the fallback chance that a user will see a training subject; it also applies if _training_chances_ is not set.
+_training_chances_ is used to determine when to show a training subject; each index corresponds to the number of subjects the user has already seen.
+E.g. [ 1, 0.5, 0.1 ] means that when I have seen 0 subjects on your workflow (first element) I will have a 100% chance of selecting a training subject.
+After seeing 1 subject I will have a 50% chance, after seeing 2 subjects I will have a 10% chance, and after that the default chance applies.
+
+Normally when setting training subject selection you would set the retirement
+config to "never_retire" and use another system (e.g. Caesar) to handle retirement.
+
 + Request

 + Headers
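
The `training_chances` lookup described in the added docs can be sketched in plain Ruby. This only illustrates the documented semantics and is not the Panoptes selector code: the method name `training_chance` is hypothetical, and falling back to 0.0 when `training_default_chance` is missing is an assumption made for the example.

```ruby
# Hypothetical helper illustrating the documented lookup; not Panoptes code.
# configuration: a Hash shaped like a workflow's configuration object.
# seen_count: how many subjects the user has already seen on the workflow.
def training_chance(configuration, seen_count)
  chances = Array(configuration["training_chances"])
  # Assumption: a missing default means training data is never shown.
  default_chance = configuration.fetch("training_default_chance", 0.0)
  # Array#fetch returns the fallback once seen_count runs past the list.
  chances.fetch(seen_count, default_chance)
end

config = {
  "training_set_ids" => [1],
  "training_chances" => [1, 0.5, 0.1],
  "training_default_chance" => 0.05
}

[0, 1, 2, 3].each do |seen|
  puts "seen #{seen}: chance #{training_chance(config, seen)}"
end
```

With this configuration the first three lookups come from the list and the fourth falls through to the default.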

app/counters/workflow_counter.rb

Lines changed: 7 additions & 3 deletions
@@ -17,12 +17,16 @@ def sws_query
     SubjectWorkflowStatus
       .where(workflow_id: workflow.id)
       .joins("INNER JOIN set_member_subjects ON set_member_subjects.subject_id = subject_workflow_counts.subject_id")
-      .where(set_member_subjects: { subject_set_id: subject_set_ids })
+      .where(
+        set_member_subjects: {
+          subject_set_id: non_training_subject_set_ids_scope
+        }
+      )
   end

   private

-  def subject_set_ids
-    workflow.subject_sets.pluck(:id)
+  def non_training_subject_set_ids_scope
+    workflow.non_training_subject_sets.select(:id)
   end
 end

app/models/workflow.rb

Lines changed: 10 additions & 1 deletion
@@ -12,6 +12,10 @@ class Workflow < ActiveRecord::Base
   has_many :subject_workflow_statuses, dependent: :destroy
   has_many :subject_sets_workflows, dependent: :destroy
   has_many :subject_sets, through: :subject_sets_workflows
+  has_many :non_training_subject_sets,
+    ->(workflow) { where.not(subject_sets_workflows: { subject_set_id: workflow.training_set_ids }) },
+    through: :subject_sets_workflows,
+    source: :subject_set
   has_many :set_member_subjects, through: :subject_sets
   has_many :subjects, through: :set_member_subjects
   has_many :classifications, dependent: :restrict_with_exception
@@ -38,7 +42,7 @@ class Workflow < ActiveRecord::Base
     'options' => {'count' => 15}
   }.freeze

-  JSON_ATTRIBUTES = %w(tasks retirement aggregation configuration strings steps).freeze
+  JSON_ATTRIBUTES = %w(tasks retirement aggregation strings steps).freeze

   # Used by HttpCacheable
   scope :private_scope, -> { where(project_id: Project.private_scope) }
@@ -159,4 +163,9 @@ def finished?
   def retirement_config
     RetirementValidator.new(self).validate
   end
+
+  def training_set_ids
+    config_training_set_ids = Array.wrap(configuration.dig("training_set_ids"))
+    config_training_set_ids.reject { |id| id.to_i.zero? }
+  end
 end
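
The sanitising done by `training_set_ids` can be tried outside Rails. This is a minimal sketch, assuming a plain Hash in place of the workflow's `configuration` and core Ruby's `Array()` standing in for ActiveSupport's `Array.wrap` (the two differ for some inputs, such as hashes, but agree for the values shown here). The method name `sanitized_training_set_ids` is hypothetical.

```ruby
# Hypothetical plain-Ruby version of the sanitising step; not the model code.
def sanitized_training_set_ids(configuration)
  # Array() stands in for Array.wrap: nil becomes [], a scalar becomes [scalar].
  ids = Array(configuration["training_set_ids"])
  # Non-numeric strings convert to 0 via to_i, so they are rejected.
  ids.reject { |id| id.to_i.zero? }
end

p sanitized_training_set_ids({ "training_set_ids" => ["1", "test"] }) # => ["1"]
p sanitized_training_set_ids({ "training_set_ids" => 7 })             # => [7]
p sanitized_training_set_ids({ "training_set_ids" => "test" })        # => []
p sanitized_training_set_ids({})                                      # => []
```

Surviving values keep their original type, so `"1"` stays a String. That matches the model spec in this commit, which expects `["1"]` back.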

lib/subjects/postgresql_selection.rb

Lines changed: 8 additions & 4 deletions
@@ -26,10 +26,14 @@ def selection_strategy

     def available
       query = Subjects::SetMemberSubjectSelector.new(workflow, user).set_member_subjects
-      if workflow.grouped
-        query = query.where(subject_set_id: opts[:subject_set_id])
-      end
-      query
+      subject_set_ids = if workflow.grouped
+        # respect the user if they want to select from a training set
+        opts[:subject_set_id]
+      else
+        # default mode: do not select from training sets
+        workflow.non_training_subject_sets.pluck(:id)
+      end
+      query.where(subject_set_id: subject_set_ids)
     end

     def limit
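
The branch in `available` reduces to a choice of subject set ids. The sketch below shows that decision in plain Ruby, using a hypothetical `FakeWorkflow` struct and in-memory ids in place of the ActiveRecord relations. The names `FakeWorkflow`, `non_training_subject_set_ids` and `selectable_set_ids` are illustrative only.

```ruby
# Hypothetical stand-in for the Workflow model; the real code queries the database.
FakeWorkflow = Struct.new(:grouped, :subject_set_ids, :training_set_ids) do
  # Mirrors the idea of the non_training_subject_sets relation using array difference.
  def non_training_subject_set_ids
    subject_set_ids - training_set_ids
  end
end

def selectable_set_ids(workflow, opts)
  if workflow.grouped
    # Grouped workflows use the requested set as given, even a training set.
    opts[:subject_set_id]
  else
    # Default mode: never select from training sets.
    workflow.non_training_subject_set_ids
  end
end

ungrouped = FakeWorkflow.new(false, [10, 11, 12], [10])
grouped = FakeWorkflow.new(true, [10, 11, 12], [10])

p selectable_set_ids(ungrouped, {})                   # => [11, 12]
p selectable_set_ids(grouped, { subject_set_id: 10 }) # => 10
```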

spec/counters/workflow_counter_spec.rb

Lines changed: 28 additions & 0 deletions
@@ -41,6 +41,20 @@
           expect(counter.classifications).to eq(2)
         end
       end
+
+      context "with subject sets marked as training data" do
+        let(:workflow) { create(:workflow_with_subjects, num_sets: 2) }
+        let(:training_set) { workflow.subject_sets.first }
+        let(:real_set) { workflow.subject_sets.last }
+        before do
+          real_set_ar_collection_proxy = SubjectSet.where(id: real_set)
+          allow(workflow).to receive(:non_training_subject_sets).and_return(real_set_ar_collection_proxy)
+        end
+
+        it "should return non training data classification count only" do
+          expect(counter.classifications).to eq(2)
+        end
+      end
     end
   end

@@ -74,6 +88,20 @@
         workflow.subject_sets = []
         expect(counter.retired_subjects).to eq(0)
       end
+
+      context "with subject sets marked as training data" do
+        let(:workflow) { create(:workflow_with_subjects, num_sets: 2) }
+        let(:training_set) { workflow.subject_sets.first }
+        let(:real_set) { workflow.subject_sets.last }
+        before do
+          real_set_ar_collection_proxy = SubjectSet.where(id: real_set)
+          allow(workflow).to receive(:non_training_subject_sets).and_return(real_set_ar_collection_proxy)
+        end
+
+        it "should return non training data retired count only" do
+          expect(counter.retired_subjects).to eq(2)
+        end
+      end
     end
   end
 end

spec/lib/subjects/postgresql_selection_spec.rb

Lines changed: 27 additions & 5 deletions
@@ -10,20 +10,19 @@ def update_sms_priorities

   describe "selection" do
     let(:user) { User.first }
-    let(:workflow) { Workflow.first }
+    let(:workflow) { create(:workflow_with_subject_sets) }
     let(:sms) { SetMemberSubject.all }
     let(:opts) { {} }
     let(:sms_count) { 25 }
+    let(:uploader) { create(:user) }
     subject { Subjects::PostgresqlSelection.new(workflow, user, opts) }

     before do
-      uploader = create(:user)
-      created_workflow = create(:workflow_with_subject_sets)
-      create_list(:subject, sms_count, project: created_workflow.project, uploader: uploader).each do |subject|
+      create_list(:subject, sms_count, project: workflow.project, uploader: uploader).each do |subject|
         create(:set_member_subject,
           setup_subject_workflow_statuses: true,
           subject: subject,
-          subject_set: created_workflow.subject_sets.first
+          subject_set: workflow.subject_sets.first
         )
       end
     end
@@ -35,6 +34,29 @@ def update_sms_priorities
           SetMemberSubject.all
         end
       end
+
+      context "with a training set and a real set with data" do
+        let(:sms_count) { 2 }
+        let(:training_set) { workflow.subject_sets.first }
+        let(:real_set) { workflow.subject_sets.last }
+
+        before do
+          workflow.configuration['training_set_ids'] = training_set.id
+          create(:subject, project: workflow.project, uploader: uploader) do |subject|
+            create(:set_member_subject,
+              setup_subject_workflow_statuses: true,
+              subject: subject,
+              subject_set: real_set
+            )
+          end
+        end
+
+        it "should not include training subject sets in the results" do
+          result_ids = subject.select
+          non_training_subject_ids = real_set.subjects.pluck(:id)
+          expect(result_ids).to match_array(non_training_subject_ids)
+        end
+      end
     end

     context "grouped selection" do

spec/models/workflow_spec.rb

Lines changed: 42 additions & 1 deletion
@@ -39,11 +39,12 @@

   describe "::find_without_json_attrs" do
     let(:workflow) { create(:workflow) }
+    let(:whitelist_json_attrs) { %w(configuration) }
     let(:json_attrs) do
       col_information = Workflow.columns_hash.select do |name, col|
         /\Ajson.*/.match?(col.sql_type)
       end
-      col_information.keys
+      col_information.keys - whitelist_json_attrs
     end

     it "should load the workflow without the json attributes" do
@@ -422,4 +423,44 @@
       end
     end
   end
+
+  describe "#training_subject_sets" do
+    let(:training_ids) { ["1"] }
+
+    it "should return the data in the config object" do
+
+      workflow.configuration["training_set_ids"] = training_ids
+      expect(workflow.training_set_ids).to match_array(training_ids)
+    end
+
+    it "should sanitize the return values to known integer values" do
+      workflow.configuration["training_set_ids"] = training_ids | ["test"]
+      expect(workflow.training_set_ids).to match_array(training_ids)
+    end
+  end
+
+  describe "#non_training_subject_sets" do
+    let(:workflow) { create(:workflow_with_subject_sets) }
+    let(:training_set) { workflow.subject_sets.first }
+    let(:real_set) { workflow.subject_sets.last }
+
+    it "should only return subject sets that are not marked as training" do
+      workflow.configuration["training_set_ids"] = [training_set.id]
+      expect(workflow.non_training_subject_sets).to match_array([real_set])
+    end
+
+    it "should always return all real sets with empty training sets config" do
+      workflow.configuration["training_set_ids"] = []
+      expect(workflow.non_training_subject_sets).to match_array(workflow.subject_sets)
+    end
+
+    it "should always return all real sets with an unknown set id" do
+      workflow.configuration["training_set_ids"] = "test"
+      expect(workflow.non_training_subject_sets).to match_array(workflow.subject_sets)
+    end
+
+    it "should always return all real sets with no training sets config" do
+      expect(workflow.non_training_subject_sets).to match_array(workflow.subject_sets)
+    end
+  end
 end
