Advances in AI-Generated Images and Videos.

Authors

  • Hessen Bougueffa, Univ. Polytechnique Hauts-de-France.
  • Mamadou Keita, Univ. Polytechnique Hauts-de-France.
  • Wassim Hamidouche, Univ. Rennes.
  • Abdelmalik Taleb-Ahmed, Univ. Polytechnique Hauts-de-France.
  • Helena Liz López, Universidad Politécnica de Madrid.
  • Alejandro Martín, Universidad Politécnica de Madrid.
  • David Camacho, Universidad Politécnica de Madrid.
  • Abdenour Hadid, Sorbonne University Abu Dhabi.

DOI:

https://doi.org/10.9781/ijimai.2024.11.003
Supporting Agencies
This work has been partially supported by the project PCI2022-134990-2 (MARTINI) of the CHIST-ERA IV Cofund 2021 program; by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR through the XAI-Disinfodemics grant (PLEC2021-007681); by the European Commission under IBERIFIER Plus - Iberian Digital Media Observatory (DIGITAL-2023-DEPLOY-04-EDMO-HUBS 101158511); by the TUAI Project (HORIZON-MSCA-2023-DN-01-01, Proposal number: 101168344); by EMIF, managed by the Calouste Gulbenkian Foundation, through the project MuseAI; and by the Comunidad Autónoma de Madrid, CIRMA-CM Project (TEC-2024/COM-404). Abdenour Hadid is funded by the TotalEnergies collaboration agreement with Sorbonne University Abu Dhabi.

Abstract

In recent years, generative AI models and tools have grown significantly, especially techniques for generating synthetic multimedia content such as images and videos. These methodologies open up a wide range of possibilities; however, they also pose several risks that should be taken into account. In this survey we describe in detail different techniques for generating synthetic multimedia content, and we analyse the most recent techniques for detecting it. A key requirement for both tasks is the availability of datasets, so we also describe the main datasets available in the state of the art. Finally, from our analysis we extract the main trends for the future, such as transparency and interpretability, the generation of multimodal multimedia content, the robustness of models, and the increased use of diffusion models. We also identify a roadmap of deep challenges, including temporal consistency, computational requirements, generalizability, ethical aspects, and the need for constant adaptation.

Published

2024-12-01

How to Cite

Bougueffa, H., Keita, M., Hamidouche, W., Taleb Ahmed, A., Liz López, H., Martín, A., … Hadid, A. (2024). Advances in AI-Generated Images and Videos. International Journal of Interactive Multimedia and Artificial Intelligence, 9(1), 173–208. https://doi.org/10.9781/ijimai.2024.11.003