"Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment. (arXiv:2302.00902v1 [cs.LG])" — A modification of VQ-VAE that learns to align text-image data in an unsupervised manner by leveraging pretrained language models (e.g., BERT, RoBERTa).