OpenAI’s ChatGPT can now talk and process images

ChatGPT's voice and image capabilities could open new doors for the wider use of generative AI in daily life

OpenAI is rolling out new voice and image capabilities for its generative AI platform, ChatGPT. (AP)

By Team Lounge

LAST PUBLISHED 26.09.2023  |  06:15 PM IST

In a bid to make artificial intelligence (AI) as human-like as possible, OpenAI is adding voice and image capabilities to its popular generative AI platform, ChatGPT. Announcing the update on X (previously Twitter), OpenAI wrote that “ChatGPT can now see, hear, and speak.”

Since its launch, ChatGPT has been limited to written prompts. With the new updates, which will roll out over the next two weeks, paying users will be able to hold voice conversations with the AI and show it what they are talking about through photos, OpenAI’s statement explains. “Troubleshoot why your grill won’t start, explore the contents of your fridge to plan a meal, or analyze a complex graph for work-related data,” it added in the post.

The voice capability is powered by a new text-to-speech model, which can generate human-like audio from text and a few seconds of sample speech. OpenAI collaborated with professional voice actors to create each of the voices, the statement added. It also uses Whisper, an open-source speech recognition system, to transcribe spoken words into text.

While this opens new possibilities, it also comes with the risk of harmful impersonation. To address these concerns, OpenAI is limiting the technology to voice chat, with voices created with voice actors it has worked with directly. Much like talking to Alexa or Google Assistant, users ask a question aloud; ChatGPT converts the speech into text, feeds it into the large language model, and gets an answer, which is then converted back to speech and read out, a report by The Verge said. OpenAI is also working with the popular audio streaming platform Spotify to translate podcasts into different languages while retaining the sound of the podcaster’s voice.

Image understanding, meanwhile, is powered by multimodal GPT-3.5 and GPT-4. According to the statement, these models apply their language reasoning skills to a wide range of images, such as photographs, screenshots, and documents containing both text and images.

While Google, Meta, and Microsoft seem to be continuously rolling out AI updates, OpenAI has waited almost a year to launch big updates to ChatGPT. Commenting on that in the statement, the company said, “We believe in making our tools available gradually, which allows us to make improvements and refine risk mitigations over time while also preparing everyone for more powerful systems in the future. This strategy becomes even more important with advanced models involving voice and vision.”
