"All are Worth Words: A ViT Backbone for Diffusion Models. (arXiv:2209.12152v3 [cs.CV] UPDATED)" — A simple and general ViT-based architecture for image generation with diffusion models which is capable of unconditional and class-conditional image generation, as well as text-to-image generation tasks.