Danfei Xu, Yuke Zhu, Christopher B. Choy, Li Fei-Fei
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5410-5419
we propose a novel end-to-end model that generates such structured scene representation from an input image.
the model uses standard RNNs to solve the scene graph inference problem, and learns to iteratively improve its predictions via message passing.
by using a joint inference model, it benefits from richer contextual cues about objects and their relationships
Intro
scene graph is a visually-grounded graph over the object instances in an image, where the edges depict their pairwise relationships.
goal : build a model that automatically generates a scene graph from an image
object instance : characterized by a bounding box with an object category label
relationship : characterized by a directed edge btw two bounding boxes with a relationship predicate
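The two definitions above can be sketched as plain data structures; class and field names here are illustrative, not from the paper's code:

```python
# Minimal sketch of the scene-graph components described above.
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectInstance:
    bbox: tuple          # (x1, y1, x2, y2) bounding box
    category: str        # object category label, e.g. "cup"

@dataclass(frozen=True)
class Relationship:
    subject: int         # index of the subject ObjectInstance
    predicate: str       # relationship predicate, e.g. "on"
    object: int          # index of the object ObjectInstance

# A scene graph is just the two collections together.
objects = [ObjectInstance((10, 20, 50, 60), "cup"),
           ObjectInstance((0, 50, 200, 120), "table")]
relations = [Relationship(subject=0, predicate="on", object=1)]
```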
major challenge : reasoning about relationships
local prediction reduces scene graph generation to independently predicting the relationship between each pair of objects, but this ignores the surrounding context
⇒ instead of inferring each component of a scene graph in isolation, the model passes messages containing contextual information btw a pair of bipartite sub-graphs of the scene graph, and iteratively refines its predictions using RNNs.
Scene graph generation
densely connected graph inference → expensive
inference is formulated with a CRF, but to achieve greater flexibility, a GRU (Gated Recurrent Unit) is used as the RNN unit.
at each iteration, every GRU takes its previous hidden state and an incoming message as input and produces a new hidden state as output
⇒ this lets the model pass messages among the GRU units along the scene graph topology
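The per-iteration GRU update (previous hidden state + pooled incoming message → new hidden state) can be sketched as follows; weight shapes and initialization are illustrative, not the paper's:

```python
# Hedged sketch of one GRU step as used in iterative message passing.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, msg, W, U, b):
    """W, U, b each stack parameters for the reset (r), update (z),
    and candidate (n) gates along the first axis."""
    Wr, Wz, Wn = W
    Ur, Uz, Un = U
    br, bz, bn = b
    r = sigmoid(Wr @ msg + Ur @ h_prev + br)          # reset gate
    z = sigmoid(Wz @ msg + Uz @ h_prev + bz)          # update gate
    n = np.tanh(Wn @ msg + Un @ (r * h_prev) + bn)    # candidate state
    return (1 - z) * n + z * h_prev                   # new hidden state

d = 4
rng = np.random.default_rng(0)
W = rng.normal(size=(3, d, d))
U = rng.normal(size=(3, d, d))
b = np.zeros((3, d))
h = np.zeros(d)
incoming = rng.normal(size=d)    # pooled message from neighboring units
h_new = gru_step(h, incoming, W, U, b)
```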
we formulate two disjoint sub-graphs that are essentially the dual graph to each other.
defines channels for msgs to pass from…
primal graph : edge GRUs → node GRUs.
dual graph : node GRUs → edge GRUs
⇒ with the primal-dual formulation … can improve inference efficiency by iteratively passing msgs btw sub-graphs instead of through a densely connected graph.
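The alternating primal/dual schedule can be sketched as below; mean pooling stands in for the paper's learned message pooling, and the update rule is a simplified stand-in for the GRU:

```python
# Sketch of primal-dual message passing: node states are refined from
# their incident edge states (primal step), then edge states from their
# two endpoint nodes (dual step), for a few iterations.
import numpy as np

def refine(node_h, edge_h, edges, n_iters=2):
    for _ in range(n_iters):
        # primal step: each node pools messages from its incident edges
        new_node = node_h.copy()
        for i in range(len(node_h)):
            inc = [edge_h[k] for k, (s, o) in enumerate(edges) if i in (s, o)]
            if inc:
                # stand-in update: blend with the mean incoming message
                new_node[i] = 0.5 * node_h[i] + 0.5 * np.mean(inc, axis=0)
        # dual step: each edge pools messages from its endpoint nodes
        new_edge = edge_h.copy()
        for k, (s, o) in enumerate(edges):
            new_edge[k] = 0.5 * edge_h[k] + 0.25 * (new_node[s] + new_node[o])
        node_h, edge_h = new_node, new_edge
    return node_h, edge_h

node_h = np.eye(3)           # 3 nodes with 3-d states
edge_h = np.zeros((2, 3))    # edges (0 -> 1) and (1 -> 2)
nodes_out, edges_out = refine(node_h, edge_h, edges=[(0, 1), (1, 2)])
```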
Experiments
goal : analyze our model on datasets with both sparse & dense relationship annotations
dataset : VisualGenome(sparse), NYU Depth v2(dense)
semantic scene graph generation
setup : localize a set of objects, classify their category labels, predict relationships btw each pair of the objects.
predicate classification
scene graph classification
scene graph generation
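These setups are typically scored with recall@K over predicted <subject, predicate, object> triplets; a minimal sketch (data and helper name illustrative):

```python
# Minimal triplet recall@K: the fraction of ground-truth triplets that
# appear among the top-K most confident predictions.
def triplet_recall_at_k(pred_triplets, gt_triplets, k):
    """pred_triplets must be sorted by descending confidence."""
    top_k = set(pred_triplets[:k])
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / len(gt_triplets)

gt = [("cup", "on", "table"), ("man", "holding", "cup")]
preds = [("cup", "on", "table"),      # most confident first
         ("man", "near", "table"),
         ("man", "holding", "cup")]
r = triplet_recall_at_k(preds, gt, k=2)   # 1 of 2 GT triplets in top-2 -> 0.5
```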
results ⇒
performance of our model vs. the baselines : shows that learning to modulate the info from other hidden states enables the network to extract more relevant information and yields superior performance.
predicate classification performance of our models trained with diff # of iterations : degrades after two iterations (noisy msgs start to permeate through the graph and hamper the final prediction ?)
per-type predicate recall : gap btw models expands for less frequent predicates
our model uses contextual info to cope with the uneven distribution in the relationship annotations, whereas the baseline model makes predictions in isolation and so suffers more
support relation prediction
results ⇒
having contextual information further improves support relation prediction
incorrect predictions typically occur in ambiguous supports
Geometric structures that have weak visual features also cause failures
visual uncertainty may be resolved by having additional depth info?
Conclusion
we addressed the problem of automatically generating a visually grounded scene graph from an img by a novel end-to-end model
it performs iterative msg passing btw primal and dual sub-graph along the topological structure of a scene graph → improves the quality of node and edge predictions by incorporating informative contextual cues.