"Multimodal Chain-of-Thought Reasoning in Language Models. (arXiv:2302.00923v1 [cs.CL])" — Proposes Multimodal Chain-of-Thought, a decoupled training framework that separates rationale generation and answer inference into two stages and incorporates vision features in both.
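
The two-stage data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `TwoStagePipeline`, `toy_rationale_model`, and `toy_answer_model` are hypothetical, the "models" are toy stand-ins for separately trained networks, and vision features are assumed to be a plain list of floats. It shows only the inference-time wiring, in which stage 1 produces a rationale from text and vision features, and stage 2 receives the text with the rationale appended plus the same vision features.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical aliases for illustration; not names taken from the paper.
VisionFeatures = Sequence[float]
Model = Callable[[str, VisionFeatures], str]


@dataclass
class TwoStagePipeline:
    rationale_model: Model  # stage 1: (text, vision) -> rationale
    answer_model: Model     # stage 2: (text + rationale, vision) -> answer

    def run(self, question: str, vision: VisionFeatures) -> Tuple[str, str]:
        # Stage 1: generate a rationale conditioned on text and vision.
        rationale = self.rationale_model(question, vision)
        # Stage 2: append the rationale to the text input, then infer the
        # answer, conditioning on the vision features a second time.
        augmented = f"{question}\nRationale: {rationale}"
        answer = self.answer_model(augmented, vision)
        return rationale, answer


# Toy stand-ins (assumptions) so the sketch runs without any ML library.
def toy_rationale_model(text: str, vision: VisionFeatures) -> str:
    return f"saw {len(vision)} vision features"


def toy_answer_model(text: str, vision: VisionFeatures) -> str:
    return "A" if "Rationale:" in text and len(vision) > 0 else "unknown"


pipeline = TwoStagePipeline(toy_rationale_model, toy_answer_model)
rationale, answer = pipeline.run("Which object is magnetic?", [0.1, 0.2, 0.3])
```

Keeping the two stages as separate callables mirrors the decoupling in the summary: each stage could be trained or swapped independently, as long as the second accepts the rationale-augmented text.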