A comparative study of biological images generated by selected generative artificial intelligence
Code: G-1579
Authors: Hassan Shojaee-Mend ℗, Reza Mohebbati *, Elina Saffarzadeh
Schedule: Not Scheduled!
Tag: Intelligent Virtual Assistant
Abstract:
Background and aims: Generative AI models for text-to-image (T2I) generation have found wide application in scientific and medical fields in recent years because they can produce diverse and realistic images from text descriptions. These models can be useful tools for education and research when producing biological images such as cellular structures, organs, and biological processes. However, their output must be evaluated to ensure its scientific validity. This study aimed to evaluate and compare the accuracy of text-to-image AI models in generating biological images.
Method: Four generative AI models (DALL-E 3, accessed via ChatGPT; Grok; Gemini; and Stable Diffusion) were selected based on accessibility and technical diversity. A standard set of text descriptions was designed for three biological subjects (kidney, brain, eye) at three levels of complexity (low, medium, high). Nine images were generated by each AI model. Three experts scored the images on a scale of 0 to 5 points. Data were analyzed using the Friedman test (to compare models), the Kruskal-Wallis test (to compare complexity levels), and the intraclass correlation coefficient (ICC) for expert agreement.
Results: The mean scores of the models were Gemini (2.93), Grok (2.44), DALL-E 3 (2.22), and Stable Diffusion (1.59). The Friedman test (statistic: 27.53, p < 0.05) showed that the difference between the models was statistically significant. Gemini performed better on the brain (4.56) and Grok at the high complexity level (3.00), while Stable Diffusion performed very poorly on the kidney (0.00). The Kruskal-Wallis test (statistic: 4.29, p = 0.117) did not show a significant difference between the complexity levels, although the mean scores at the low level (3.25) were higher than those at the medium (1.78) and high (1.69) levels. The ICC of 0.989 indicated a very high level of agreement among the experts.
Conclusion: In biological image generation, Gemini performed best, Grok and DALL-E 3 performed moderately, and Stable Diffusion performed poorly. The high level of agreement among the experts supports the validity of the evaluations. These findings can provide guidance for the use of artificial intelligence models in biological image generation for education and research.
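The model comparison described in the Method can be illustrated with a minimal sketch of the Friedman test, written here in plain Python. This is not the authors' analysis code. The score table is hypothetical (the abstract does not publish per-image scores), and the function names `average_ranks`, `friedman_statistic`, and `chi2_sf` are introduced only for this illustration. The sketch assumes each row is one rated image prompt and each column is one model, and it uses the usual chi-square approximation with k - 1 degrees of freedom for the p-value.

```python
import math


def average_ranks(values):
    """Rank values from 1..k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Tie-corrected Friedman chi-square for n blocks (rows) x k treatments."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    tie_term = 0.0
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
        for v in set(row):
            t = row.count(v)
            tie_term += t ** 3 - t
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("every block is fully tied; the statistic is undefined")
    return q / correction


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total, term = 0.0, math.sqrt(x)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 / math.pi) * math.exp(-half) * total


# HYPOTHETICAL scores, for illustration only (not data from the study).
# Columns: Gemini, Grok, DALL-E 3, Stable Diffusion; rows: rated images.
scores = [
    [3, 2, 2, 1],
    [4, 3, 2, 0],
    [5, 3, 3, 2],
    [2, 2, 1, 1],
    [3, 4, 2, 1],
]
stat = friedman_statistic(scores)
p_value = chi2_sf(stat, df=len(scores[0]) - 1)
print(f"Friedman statistic = {stat:.2f}, p = {p_value:.4f}")
```

The same `chi2_sf` helper applies to the Kruskal-Wallis statistic, which is also referred to a chi-square distribution: with 2 degrees of freedom for three complexity levels, `chi2_sf(4.29, 2)` gives about 0.117, in line with the reported p-value. Assuming 3 degrees of freedom for the four models, `chi2_sf(27.53, 3)` is far below 0.05.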
Keywords
Generative AI, Text-to-Image Generation, Biological Image