Text2Scene-bib

A list of materials related to text2scene

Dataset investigation

Abstract Scene Dataset (CVPR-2013 & PAMI-2015) #thorough

Abstract scene generation

Learning the Visual Interpretation of Sentences (ICCV-2013)
- Target: text -> Cartoon-like
- Method: Statistical learning - Conditional Random Field (CRF)
- Dataset: Abstract Scene Dataset
- Supplementary Material
Text2Scene: Generating Compositional Scenes from Textual Descriptions (CVPR-2019) #thorough
- Target: text -> Cartoon-like scenes & Object layouts & Synthetic scenes
- Method: End2end deep learning (Recurrent CNN + attention); Unified framework
- Dataset: Abstract Scene Dataset; COCO
- Code: Text2Scene

Learn commonsense spatiotemporal knowledge

Predicting Object Dynamics in Scenes (CVPR-2014)
- Target: scene -> next scene
- Dataset: Abstract Scene Dataset
Visual Abstraction for Zero-Shot Learning (ECCV-2014)
- Target: learn concepts involving individual poses and interactions between two people
- Dataset: Abstract scenes depicting fine-grained interactions between two people
- Webpage
Learning common sense through visual abstraction (ICCV-2015)
- Target: Assess the plausibility of the interaction in a scene
- Dataset: Second Generation Abstract Scene Dataset
- Webpage

3D scene generation

Stanford NLP group

Learning Spatial Knowledge for Text to 3D Scene Generation (EMNLP-2014) #thorough
- Target: Text -> 3D scene - room layout
- Method: Mostly rule-based + Bayesian + NLP
- Dataset: Collected dataset of spatial relation descriptions
- Learned spatial relation mapping
Interactive Learning of Spatial Knowledge for Text to 3D Scene Generation (ACL-2014-Workshop)
- Target: Text -> 3D scene - room layout
- Method: Interactive learning
Text to 3D Scene Generation with Rich Lexical Grounding (ACL-2015) #thorough
- Target: Text -> 3D scene - room layout
- Method: Mostly rule-based + Supervised learning -> learning lexical grounding (i.e. object match)
- Dataset: Collected dataset of scene-description pairs

3D shape generation

Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings (CVPR-2018)
- Target: text -> colored 3D shapes of tables and chairs
- Method: Joint metric learning to capture many-to-many relations between text and properties of 3D shapes
- Dataset: ShapeNet and manually collected text descriptions
- Code: text2shape

Photorealistic image synthesis

GAN conditioned on text (degrades on general images)

Generative Adversarial Text to Image Synthesis (ICML-2016)
- Target: text -> photographic image (Bird & flower)
- Method: Both generator and discriminator conditioned on text feature
- Dataset: CUB dataset of bird images; Oxford-102 dataset of flower images
- Code: Generative Adversarial Text-to-Image Synthesis
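
One common way to condition a generator on a text feature, as in the method line above, is to concatenate the text embedding with the noise vector before the first generator layer. The following is a minimal NumPy sketch under that assumption; the function name `conditioned_generator_input` and the dimensions used in the example are illustrative and are not taken from the linked code.

```python
import numpy as np


def conditioned_generator_input(noise, text_embedding):
    """Concatenate noise and text features along the feature axis.

    noise:          array of shape (batch, noise_dim)
    text_embedding: array of shape (batch, text_dim)
    Returns an array of shape (batch, noise_dim + text_dim).
    """
    noise = np.asarray(noise, dtype=np.float64)
    text_embedding = np.asarray(text_embedding, dtype=np.float64)
    if noise.shape[0] != text_embedding.shape[0]:
        raise ValueError("noise and text_embedding must share the batch size")
    return np.concatenate([noise, text_embedding], axis=1)


# Illustrative sizes only: 4 samples, 100-d noise, 128-d text feature.
z = np.random.default_rng(0).normal(size=(4, 100))
phi = np.ones((4, 128))
g_in = conditioned_generator_input(z, phi)
```

The discriminator can be conditioned in the same spirit by joining the text feature with its image features before the final real/fake decision.
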
StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks (ICCV-2017)
- Target: text -> photographic image (Bird & flower)
- Method: two-stage GANs
- Dataset: CUB; Oxford-102; MS-COCO
- Code: StackGAN

Utilize pixel-wise semantic labels

Parallel multiscale autoregressive density estimation (ICML-2017)

Semantic layout as intermediate representation OR Retrieval from object database

Semi-parametric Image Synthesis (CVPR-2018)
- Target: Semantic layout -> Photographic image
- Method:
  - Parametric + Non-parametric (segment retrieval)
  - Segment database -> retrieve -> composite -> resolve occlusion -> post-process
- Dataset: Cityscapes; NYU; ADE20K (See Datasets in this paper)
- Code: SIMS
- Demo: Semi-parametric Image Synthesis
Image Generation from Scene Graphs (CVPR-2018)
- Target: Scene graphs -> Photographic images
- Method:
  - ground-truth object positions -> scene graphs
  - Graph processing: graph convolution network
  - symbolic graph -> scene layout: bounding box & segmentation prediction
  - scene layout -> image: cascaded refinement network (CRN)
  - image -> realistic image: adversarial training
- Dataset:
  - Visual Genome: Human annotated scene graphs provided
  - COCO-Stuff: COCO with pixel-level stuff annotations
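
A scene graph, the input format of the entry above, can be pictured as a list of objects plus (subject, predicate, object) relations between them. The following is a minimal sketch of such a structure; the class name `SceneGraph` and its methods are hypothetical and do not mirror the paper's code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Objects are nodes; each triple is (subject index, predicate, object index)."""

    objects: List[str]
    triples: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        # Reject indices that do not refer to an existing object.
        for idx in (subj, obj):
            if not 0 <= idx < len(self.objects):
                raise IndexError(f"object index {idx} out of range")
        self.triples.append((subj, predicate, obj))

    def describe(self) -> List[str]:
        # Render every relation as a short phrase.
        return [f"{self.objects[s]} {p} {self.objects[o]}" for s, p, o in self.triples]


graph = SceneGraph(objects=["sheep", "grass", "sky"])
graph.add_relation(0, "standing on", 1)
graph.add_relation(2, "above", 1)
phrases = graph.describe()
```

Storing relations as index triples keeps two objects of the same category (for example two sheep) distinct, which a plain list of phrases would not.
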
Inferring semantic layout for hierarchical text-to-image synthesis (CVPR-2018)
- Target: text -> photographic image
- Method: text -> semantic layout (box layout & shape) -> image
- Dataset: COCO

Image query

Image Ranking and Retrieval based on Multi-Attribute Queries (CVPR-2011)
Image Retrieval Using Scene Graphs (CVPR-2015)
- Target: Textual query -> Semantically related image
- Method: Scene graph; Conditional random field
- Dataset: real-world scene graphs: manually labeled YFCC100m & COCO images

Video generation

Generating Videos with Scene Dynamics (NIPS-2016)
- Target: video generation (unlabeled)
- Method: Scene decomposition model: Foreground + Background + Mask (GAN*3)
- Dataset: A large amount of unlabeled video downloaded from Flickr
- Code: videoGAN
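
The foreground + background + mask decomposition listed in the entry above can be illustrated as a mask-weighted blend. The following is a minimal NumPy sketch, assuming a per-pixel composition `mask * foreground + (1 - mask) * background` with a static background repeated over time; the function name `composite_video` and the array shapes are hypothetical and not taken from the videoGAN code.

```python
import numpy as np


def composite_video(foreground, mask, background):
    """Blend a moving foreground over a static background.

    foreground: (T, H, W, C) frames of the moving content
    mask:       (T, H, W, 1) values in [0, 1]; 1 selects the foreground
    background: (H, W, C) single still image shared by all frames
    Returns an array of shape (T, H, W, C).
    """
    foreground = np.asarray(foreground, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    background = np.asarray(background, dtype=np.float64)
    # Add a time axis to the background so it broadcasts over all T frames.
    return mask * foreground + (1.0 - mask) * background[None, ...]


# Illustrative example: frame 0 shows only foreground, frame 1 only background.
fg = np.ones((2, 3, 3, 3))
bg = np.zeros((3, 3, 3))
m = np.zeros((2, 3, 3, 1))
m[0] = 1.0
video = composite_video(fg, m, bg)
```
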
Visual Dynamics: Stochastic Future Generation via Layered Cross Convolutional Networks (NIPS-2016)
- Target: frame -> next frame
- Method: Probabilistic
To Create What You Tell: Generating Videos from Captions (ACM MM-2017)
- Target: caption -> video
- Method: Conditional GAN: LSTM caption encoder + Convolutional generator + 3 discriminators conditioned on the caption
- Dataset:
  - Synthesized videos of handwritten digits bouncing
  - Video snippets from YouTube about cooking
Imagine This! Scripts to Compositions to Videos (ECCV-2018)
- Target: text -> scene video
- Method: Entity & Background retrieval + Layout composer
- Dataset: FLINTSTONES: richly-annotated video-caption dataset
- Demo: CRAFT
Video Generation from Text (AAAI-2018)
- Target: text -> video
- Method: text -> gist -> video (VAE + GAN)
- Dataset: Videos crawled from YouTube along with titles and descriptions
MoCoGAN: Decomposing Motion and Content for Video Generation (CVPR-2018)
- Target: video generation
- Method: Content + motion GAN
TFGAN: Improving Conditioning for Text-to-Video Synthesis (2018) #Withdrawn
Generating Animated Videos of Human Activities from Natural Language Descriptions (NIPS-2018)
- Target: text -> a sequence of 3D human skeletal poses
- Method:
  - Autoencoder: Learn a representation of human motions without text
  - Seq2seq: map text into motion representation
- Dataset: The KIT Motion-Language Dataset
Language2Pose: Natural Language Grounded Pose Forecasting (2019)
- Target: text -> pose animation
- Method: learn a joint embedding of text and pose using curriculum learning
- Dataset: The KIT Motion-Language Dataset

Visual storytelling

A Pipeline for Creative Visual Storytelling (2018)
- Target: video -> a sequence of text (Pipeline proposed)
Video Storytelling (2018)
- Target: video -> a sequence of coherent and succinct text
- Method:
  - Contextual multimodal embedding: Residual Bidirectional RNN
  - Narrator: Reinforcement learning
