When OpenAI revealed its picture-making neural network DALL-E in early 2021, the program’s human-like ability to combine different concepts in new ways was striking. The images that DALL-E produced on demand were surreal and cartoonish, but they showed that the AI had learned key lessons about how the world fits together. DALL-E’s avocado armchairs had the essential features of both avocados and chairs; its dog-walking daikons in tutus wore the tutus around their waists and held the dogs’ leashes in their hands.

Today the San Francisco-based lab announced DALL-E’s successor, DALL-E 2. It produces much better images, is easier to use, and—unlike the original version—will be released to the public (eventually). DALL-E 2 may even stretch current definitions of artificial intelligence, forcing us to examine that concept and decide what it really means.

“The leap from DALL-E to DALL-E 2 is reminiscent of the leap from GPT-2 to GPT-3,” says Oren Etzioni, CEO at the Allen Institute for Artificial Intelligence (AI2) in Seattle. GPT-3 was also developed by OpenAI.

Image-generation models like DALL-E have come a long way in just a few years. In 2020, AI2 showed off a neural network that could generate images from prompts such as “Three people play video games on a couch.” The results were distorted and blurry, but just about recognizable. Last year, Chinese tech giant Baidu improved on the original DALL-E’s image quality with a model called ERNIE-ViLG. 

DALL-E 2 takes the approach even further. Its creations can be stunning: ask it to generate images of teddy-bear scientists, astronauts on horses, or sea otters in the style of Vermeer—pretty much anything you can put into words—and it can do so with near photorealism. The examples that OpenAI has made available (see below), as well as those I saw in a demo the company gave me last week, will have been cherry-picked. Even so, the quality is often remarkable.

“Teddy bears mixing sparkling chemicals as mad scientists, steampunk” / “A macro 35mm film photography of a large family of mice wearing hats cozy by the fireplace”

“One way you can think about this neural network is transcendent beauty as a service,” says Ilya Sutskever, cofounder and chief scientist at OpenAI. “Every now and then it generates something that just makes me gasp.”

DALL-E 2’s better performance is down to a complete redesign. The original version was more or less an extension of GPT-3. In many ways, GPT-3 is like a supercharged autocomplete: start it off with a few words or sentences and it carries on by itself, predicting the next several hundred words in the sequence. DALL-E worked in much the same way, but swapped words for pixels. When it received a text prompt, it “completed” that text by predicting the string of pixels that it guessed was most likely to come next, producing an image.  
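
To make the autocomplete idea concrete, here is a minimal sketch of next-token prediction. It is not how GPT-3 or DALL-E is built: a table of bigram counts stands in for the neural network, and the function names `train_bigrams` and `complete` are invented for illustration. The tokens could just as well be pixel values as words.

```python
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count, for every token, which tokens followed it in the training data."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def complete(counts, prompt, length):
    """Extend the prompt one token at a time with the most frequent follower."""
    out = list(prompt)
    if not out:
        return out  # nothing to continue from
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last token was never seen with a follower
        out.append(followers.most_common(1)[0][0])
    return out


# Toy "training data": a repeating pattern of three tokens.
model = train_bigrams("a b c a b c a b".split())
print(complete(model, ["a"], 4))
```

Each predicted token is appended to the sequence and becomes the context for the next prediction, which is the sense in which the original DALL-E “completed” a text prompt with pixels.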

DALL-E 2 is not based on GPT-3. Under the hood, it works in two stages. First, it uses OpenAI’s language model CLIP, which can pair written descriptions with images, to translate the text prompt into an intermediate form that captures the key characteristics that an image should have to match that prompt (according to CLIP). Second, DALL-E 2 runs a type of neural network known as a diffusion model to generate an image that satisfies CLIP.

Diffusion models are trained on images that have been completely distorted with random pixels. They learn to convert these images back into their original form. In DALL-E 2, there are no existing images. So the diffusion model takes the random pixels and, guided by CLIP, converts them into a brand new image, created from scratch, that matches the text prompt.
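
The noise-then-denoise loop can be sketched in a few lines. This is a toy under heavy assumptions, not OpenAI’s implementation: the “image” is a short list of numbers, the noise schedule is arbitrary, and the trained, CLIP-guided network is replaced by a hypothetical `oracle_noise_estimate` that already knows the target image. The update rule follows the deterministic style of DDIM sampling.

```python
import math
import random


def make_alpha_bars(steps, beta_start=0.01, beta_end=0.3):
    """alpha_bars[t] is the fraction of the original image left after t noising steps."""
    alpha_bars = [1.0]  # t = 0: untouched image
    for i in range(steps):
        beta = beta_start + (beta_end - beta_start) * i / max(steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def add_noise(image, alpha_bar, rng):
    """Forward process used in training: blend the image with Gaussian noise."""
    return [math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
            for p in image]


def oracle_noise_estimate(noisy, alpha_bar, target):
    """ASSUMPTION: stand-in for the trained network. It cheats by knowing the target."""
    return [(x - math.sqrt(alpha_bar) * p) / math.sqrt(1.0 - alpha_bar)
            for x, p in zip(noisy, target)]


def generate(target, steps=50, seed=0):
    """Reverse process: start from random pixels and remove noise step by step."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure random pixels
    for t in range(steps, 0, -1):
        ab, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = oracle_noise_estimate(x, ab, target)
        # Current best guess at the clean image, given the estimated noise.
        x0 = [(xi - math.sqrt(1.0 - ab) * e) / math.sqrt(ab) for xi, e in zip(x, eps)]
        # Step to the slightly less noisy level t - 1.
        x = [math.sqrt(ab_prev) * c + math.sqrt(1.0 - ab_prev) * e for c, e in zip(x0, eps)]
    return x


target = [0.0, 0.5, 1.0, 0.25]
result = generate(target)
print([round(v, 3) for v in result])
```

In a real system the oracle would be a neural network that has never seen the target and is steered only by the text prompt’s representation. The sketch shows just the mechanics of walking from random pixels back to an image.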

The diffusion model allows DALL-E 2 to produce higher-resolution images more quickly than DALL-E. “That makes it vastly more practical and enjoyable to use,” says Aditya Ramesh at OpenAI.


By: Will Douglas Heaven
Title: This horse-riding astronaut is a milestone in AI’s ability to make sense of the world
Sourced From: www.technologyreview.com/2022/04/06/1049061/dalle-openai-gpt3-ai-agi-multimodal-image-generation/
Published Date: Wed, 06 Apr 2022 14:04:11 +0000