1. Introduction
With the rapid development of generative AI, cross-modal content generation has become an active research area in computer vision and natural language processing (NLP).
2. Method
Our framework consists of three core modules: (1) a Transformer-based text encoder; (2) a cross-modal alignment module; and (3) a conditional diffusion model.
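The data flow through the three modules can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`encode_text`, `align`, `denoise_step`, `generate`), the vector dimension, and the step count are all hypothetical, and each learned component is replaced by a hand-written stand-in. In particular, the denoiser here simply pulls the sample toward the conditioning vector, whereas a real conditional diffusion model would use a trained noise predictor.

```python
import math
import random


def encode_text(prompt, dim=8):
    """Hypothetical stand-in for the Transformer-based text encoder.

    Maps a prompt to a deterministic unit-norm vector.
    """
    rng = random.Random(prompt)  # seeding with the string makes this repeatable
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def align(text_emb, weights):
    """Hypothetical stand-in for the cross-modal alignment module.

    Applies a linear projection from text space to conditioning space.
    """
    return [sum(w * x for w, x in zip(row, text_emb)) for row in weights]


def denoise_step(x, cond, t, num_steps):
    """Toy conditional update, NOT a learned denoiser.

    Moves the sample toward the conditioning vector; the step size grows
    so that the last step (t == num_steps - 1) lands on the condition.
    """
    alpha = 1.0 / (num_steps - t)
    return [xi + alpha * (ci - xi) for xi, ci in zip(x, cond)]


def generate(prompt, num_steps=10, dim=8, seed=0):
    """Run the three stages in order: encode, align, iteratively denoise."""
    rng = random.Random(seed)
    # Identity matrix as a placeholder for learned projection weights.
    weights = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    cond = align(encode_text(prompt, dim), weights)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    for t in range(num_steps):
        x = denoise_step(x, cond, t, num_steps)
    return x, cond


if __name__ == "__main__":
    sample, cond = generate("a red bird on a branch")
    print(len(sample), len(cond))
```

The point of the sketch is the ordering of the stages: the text embedding is computed once, projected into the conditioning space, and then reused at every step of the iterative generation loop.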
3. Experimental Results
Extensive experiments on MS-COCO, CUB-200, and ImageNet show that our method improves FID (Fréchet Inception Distance, where lower is better) by 12.3% over the baselines.
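The text does not say how the percentage is computed. One common reading, assumed here, is a relative reduction in FID with respect to the baseline. The helper name `relative_fid_improvement` and the two FID values below are hypothetical and chosen only to make the arithmetic concrete; they are not results from the experiments.

```python
def relative_fid_improvement(baseline_fid, method_fid):
    """Relative FID reduction in percent; positive means a lower (better) FID."""
    if baseline_fid <= 0:
        raise ValueError("baseline FID must be positive")
    return 100.0 * (baseline_fid - method_fid) / baseline_fid


# Hypothetical values for illustration only, not measurements from the paper.
baseline, ours = 20.0, 17.54
print(round(relative_fid_improvement(baseline, ours), 1))  # prints 12.3
```

Under this reading, the same percentage corresponds to different absolute FID gaps on different datasets, so reporting the raw FID values alongside it makes the comparison easier to interpret.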
4. Conclusion
This paper proposes a cross-modal digital asset generation method that combines the semantic understanding of large language models (LLMs) with high-quality diffusion-based generation.