A Peek into the Future of Visual Data Interpretation

Image: apps on a phone screen, with ChatGPT highlighted. Alyssa Hwang, a Ph.D. candidate in the School of Engineering and Applied Science, developed a new framework for evaluating how well large language models analyze images. Ahead of the release of the new ChatGPT-Vision, Hwang used the framework to run a battery of tests assessing its ability to describe scientific images. (Image: iStock/Robert Way)

In the last year, large language models (LLMs) have risen to prominence for an ever-expanding suite of capabilities, including text generation, image production, and, more recently, highly descriptive image analysis. The integration of artificial intelligence (AI) into image analysis represents a significant shift in how people understand and interact with visual data, a task that has historically relied on vision to see and knowledge to contextualize.

Now, new AI tools present a paradigm that allows more and more people to interact with images. The descriptions these tools generate could not only assist the visually impaired but also inform lay audiences about the contents of a scientific figure.

Associate professor Chris Callison-Burch, assistant professor Andrew Head and Ph.D. candidate Alyssa Hwang of the Department of Computer and Information Science in the School of Engineering and Applied Science at the University of Pennsylvania have developed a framework for gauging the efficacy of vision-based AI features by conducting a battery of tests on OpenAI’s ChatGPT-Vision ahead of its release earlier this month. The team primarily assessed the LLM’s competency at identifying scientific images and documented their findings in a research paper.

Hwang shares some of her observations with Penn Today, offering a glimpse into the future of AI-powered technologies and the promise they hold for interpreting complex images.

What the AI does and how the team tested it

Hwang says that vision-based LLMs like GPT-Vision can take both images and text as input and use them to answer a wide range of requests. The team’s set of test images included diagrams, graphs, and tables, as well as screenshots of code, mathematical equations, and full pages of text, all chosen to gauge how well the LLM could describe them.

Scientific images contain complex information, Hwang says, so the team selected 21 images from a diverse set of scientific papers. “We prioritized breadth in our qualitative analysis, which we based on existing methods in the social sciences, and we discovered many interesting patterns,” she says.

This story was written by Nathi Magubane. To read the full article, please visit Penn Today.