DALL-E: How does the image-generating AI work?

This artificial intelligence tool that can create original images has just started beta testing. We break down how it really works

An image created by DALL-E based on the prompt ‘An astronaut riding a horse in a photorealistic style’. Photo: OpenAI blog

If you’ve been wondering what those absurd pictures on your Twitter feed are, like an avocado-shaped teapot or a polar bear playing a guitar, you’ve stumbled upon DALL-E. This text-to-image AI is the latest among technologies like voice assistants and self-driving cars that are growing more sophisticated by the day. OpenAI, the artificial intelligence research firm that created DALL-E, has just begun large-scale beta tests of the tool. As debates around AI’s sentience and its role in the future of art grow, we explain what DALL-E is and how it works.

What is DALL-E?

It’s an AI research project with the aim of creating ‘realistic images and art from a description in natural language.’ Basically, if you input a string of words describing an object or event, DALL-E will create an original image based on those words.

So how is this different from searching for an image on Google?

Unlike Google Images, which searches a database of existing images, DALL-E can create completely new images. So, for example, you could enter absurd prompts like ‘an illustration of a baby daikon radish in a tutu walking a dog’ and DALL-E would come up with something like this:

Images that DALL-E created based on the prompt ‘an illustration of a baby daikon radish in a tutu walking a dog’. Photo: OpenAI blog

Interesting, so does it create new images from images it already knows, like a more sophisticated collage?

Somewhat. DALL-E does use images it already knows but these are only used as a basis for future creations. DALL-E is a ‘neural network’, a mathematical system that learns by analyzing huge amounts of data. The AI is fed many images along with their captions that describe what the image depicts.

Since these are computers we are talking about, they can’t identify objects the way we can. Without training, a computer would not be able to pick out a banana in a picture of a bowl of fruit. What it can do is read the distinct RGB values of groups of pixels, so it would be able to differentiate the ‘yellowness’ of the banana, quantified as an RGB value, from the ‘redness’ of an apple.
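
To make the idea of ‘yellowness’ as a number concrete, here is a minimal Python sketch. It is a toy illustration, not how DALL-E itself works: the function name `mean_rgb` and the pixel values are made up for this example.

```python
# Toy illustration only: real systems do not tell objects apart by colour alone.

def mean_rgb(pixels):
    """Average the red, green and blue channels over a group of pixels."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))

# Hypothetical pixel patches; each pixel is an (R, G, B) triple from 0 to 255.
banana_patch = [(255, 225, 53), (250, 220, 60), (245, 215, 50)]
apple_patch = [(200, 30, 40), (190, 25, 35), (210, 35, 45)]

banana_colour = mean_rgb(banana_patch)
apple_colour = mean_rgb(apple_patch)

# Yellow mixes a lot of red AND green; the red apple has very little green.
print("banana:", banana_colour)
print("apple:", apple_colour)
```

The two patches end up with clearly different averages, especially in the green channel, which is all the computer needs to separate them by colour.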

Okay, but what if you have two objects that are both the same color?

Good point. The computer does not just analyze colour but hundreds of features of objects, including shape, texture and size. All of these features get quantified in a process called ‘embedding’. Thus, the AI has something like a set of coordinates, where each object has a specific location based on where it falls on the scale of hundreds of parameters like roundness, redness or shininess.

When it analyzes millions of images in this way, the AI creates a multi-dimensional coordinate space where objects with similar features sit close together: think of a region for round things like balls and globes. As features overlap, the AI builds more specific reference points for distinct objects.
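
A rough Python sketch of that coordinate idea follows. Everything in it is an assumption for illustration: the three features, the hand-picked scores and the helper `nearest` are invented, whereas a real model learns hundreds of dimensions from data rather than having them typed in.

```python
import math

# Hypothetical hand-made scores from 0 to 1: (roundness, redness, shininess).
# A trained model would learn its own coordinates; these are made up.
embeddings = {
    "ball": (0.95, 0.50, 0.40),
    "globe": (0.98, 0.10, 0.50),
    "apple": (0.85, 0.90, 0.60),
    "banana": (0.10, 0.05, 0.30),
}

def nearest(name):
    """Return the other object whose coordinates lie closest to `name`."""
    target = embeddings[name]
    others = (k for k in embeddings if k != name)
    # math.dist gives the straight-line (Euclidean) distance between two points.
    return min(others, key=lambda k: math.dist(target, embeddings[k]))

print(nearest("ball"))
print(nearest("globe"))
```

With these made-up numbers, the ball and the globe come out as each other’s nearest neighbour, which is the ‘region for round things’ in miniature.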

The idea of mapping objects based on their features makes sense, but what about more intangible ideas? How would DALL-E know how to depict the daikon ‘walking’, as in the earlier example?

That’s right, it’s not just objects that are analyzed, but also ideas. For example, verbs like ‘riding’ and ‘walking’, which can be used in different contexts (riding a bicycle vs riding a horse, walking a dog vs walking to work), are also analyzed by the AI. After quantifying millions of images, the AI will be able to identify what kind of riding or walking the user means in their text prompt.

So now that it has this ‘map’ that numerically describes features of different objects, how does DALL-E make a new picture from this information?

The AI translates this information back into pixels using a process called diffusion. In diffusion, you take a piece of data like an image and add ‘noise’ to it until the original image is no longer recognizable. Then the computer tries to reconstruct the original image and, in the process, learns how to generate images. DALL-E uses this process to combine different objects and ideas from the text input, based on its multi-dimensional coordinate system, to create a new image.
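
The ‘adding noise’ half of that process can be sketched in a few lines of Python. This is a simplified illustration using a common textbook noising rule, not OpenAI’s actual code: the function `add_noise`, the noise strength `beta` and the four-pixel ‘image’ are all assumptions made for the example. The harder half, training a network to undo the noise, is not shown.

```python
import math
import random

def add_noise(pixels, steps, beta=0.1, seed=0):
    """Blend a little random noise into the pixels, `steps` times over.

    Each step shrinks the current values slightly and mixes in fresh
    Gaussian noise, so the original signal fades as steps accumulate.
    `beta` (noise per step) is an assumed value chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    x = list(pixels)
    for _ in range(steps):
        x = [math.sqrt(1 - beta) * v + math.sqrt(beta) * rng.gauss(0, 1)
             for v in x]
    return x

# A made-up four-pixel "image" with brightness values between -1 and 1.
image = [0.9, 0.8, -0.7, -0.9]

slightly_noisy = add_noise(image, steps=1)
very_noisy = add_noise(image, steps=50)

print("original:      ", image)
print("after 1 step:  ", slightly_noisy)
print("after 50 steps:", very_noisy)
```

A generator of this kind is trained on many such noisy versions to predict and remove the noise, then starts from pure noise and works backwards to a picture.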

DALL-E sounds like a fun tool for playing around with images, but what was it made for?

DALL-E was created as part of research to test the limits of AI. It can also help people brainstorm creative ideas quickly. For example, if a graphic designer wants to conceptualise an idea quickly, they can type in a phrase based on their vision and then use the image DALL-E creates as a blueprint to develop their project. Despite its amazing capabilities, the makers of DALL-E are wary about rolling out the tool for public use because it makes it far easier to create disinformation and images that could support fake news. As of now, DALL-E is not a consumer product but an academic research project, and it is still undergoing testing. An independent developer has created a more rudimentary version called DALL-E Mini that is available online for anyone to use.
