"Fine-grained Cross-modal Fusion based Refinement for Text-to-Image Synthesis. (arXiv:2302.08706v1 [cs.CV])" — Another text-to-image generation approach. Instead of producing the final image by progressively denoising a noisy image, it first generates a low-resolution image from the input text, then uses a GAN (Generative Adversarial Network) in a second stage to refine that image into the final output.
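
To make the two-stage data flow concrete, here is a minimal toy sketch in plain Python. It is not the paper's architecture: every function name (`embed_text`, `stage1_generate`, `stage2_refine`, `synthesize`) is hypothetical, the "networks" are untrained stand-ins, and the discriminator and adversarial training are omitted entirely. It only illustrates the coarse-then-refine shape of the pipeline, with the text conditioning both stages.

```python
import random


def embed_text(text, dim=8):
    """Hypothetical text encoder: a deterministic pseudo-random vector
    seeded by the text (a stand-in for a real learned encoder)."""
    rng = random.Random(text)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def stage1_generate(text_vec, size=4):
    """Stage 1 (assumed form): map the text vector to a coarse
    size x size grayscale image with values in [0, 1]."""
    n = len(text_vec)
    return [
        [0.5 + 0.5 * text_vec[(r * size + c) % n] for c in range(size)]
        for r in range(size)
    ]


def upsample2x(img):
    """Nearest-neighbour 2x upsampling of a 2-D list."""
    out = []
    for row in img:
        wide = [p for p in row for _ in (0, 1)]  # duplicate each pixel
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out


def stage2_refine(coarse, text_vec):
    """Stage 2 (stand-in for the GAN generator): upsample the coarse
    image, add a small text-conditioned adjustment, clamp to [0, 1]."""
    n = len(text_vec)
    return [
        [min(1.0, max(0.0, p + 0.1 * text_vec[(r + c) % n]))
         for c, p in enumerate(row)]
        for r, row in enumerate(upsample2x(coarse))
    ]


def synthesize(text):
    """Run both stages and return (coarse image, refined image)."""
    vec = embed_text(text)
    coarse = stage1_generate(vec)
    return coarse, stage2_refine(coarse, vec)


if __name__ == "__main__":
    coarse, final = synthesize("a small red bird on a branch")
    print(len(coarse), "x", len(coarse[0]), "->", len(final), "x", len(final[0]))
```

In a real system both stages would be learned networks, and the second-stage generator would be trained against a discriminator; the sketch keeps only the interface between the stages.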