Machines That Learn Language More Like Kids Do

November 1, 2018 | MIT

Connecting the Dots

The expression with the most closely matching representations for objects, humans, and actions becomes the most likely meaning of the caption. Initially, the expression may refer to many different objects and actions in the video, but the set of possible meanings serves as a training signal that helps the parser continuously winnow down possibilities. “By assuming that all of the sentences must follow the same rules, that they all come from the same language, and seeing many captioned videos, you can narrow down the meanings further,” Barbu says.

In short, the parser learns through passive observation: To determine whether a caption is true of a video, the parser must first identify the highest-probability meaning of the caption. “The only way to figure out if the sentence is true of a video [is] to go through this intermediate step of, ‘What does the sentence mean?’ Otherwise, you have no idea how to connect the two,” Barbu explains. “We don’t give the system the meaning for the sentence. We say, ‘There’s a sentence and a video. The sentence has to be true of the video. Figure out some intermediate representation that makes it true of the video.’”
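The selection step described above can be illustrated with a toy sketch. This is not the researchers' parser: the predicate strings, the confidence values, the floor value for undetected predicates, and the function names `score_meaning` and `most_likely_meaning` are all hypothetical, chosen only to show the idea of scoring each candidate meaning of a caption against what is detected in a video and keeping the best match.

```python
from math import prod

# Hypothetical floor confidence for a predicate the video detectors never saw.
MISSING_CONFIDENCE = 0.01


def score_meaning(meaning, detections):
    """Multiply the detector confidences of every predicate in one candidate meaning."""
    return prod(detections.get(p, MISSING_CONFIDENCE) for p in meaning)


def most_likely_meaning(candidates, detections):
    """Return the candidate meaning whose predicates best match the video."""
    return max(candidates, key=lambda m: score_meaning(m, detections))


# Made-up detector output for one video: predicate -> confidence.
detections = {
    "person": 0.9,
    "apple": 0.8,
    "pick_up(person, apple)": 0.7,
    "put_down(person, apple)": 0.1,
}

# Made-up candidate meanings for the caption "The woman picks up the apple".
candidates = [
    ("person", "apple", "pick_up(person, apple)"),
    ("person", "apple", "put_down(person, apple)"),
    ("person", "pear", "pick_up(person, pear)"),
]

best = most_likely_meaning(candidates, detections)
print(best)
```

In this toy setup the first candidate wins because all of its predicates are detected with high confidence. The system described in the article goes further: it learns the grammar and word meanings themselves from many captioned videos, which this sketch does not attempt.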

Training produces a syntactic and semantic grammar for the words the parser has learned. Given a new sentence, the parser no longer requires videos; it uses its grammar and lexicon to determine sentence structure and meaning.

Ultimately, this process is learning “as if you’re a kid,” Barbu says. “You see [the] world around you and hear people speaking to learn meaning. One day, I can give you a sentence and ask what it means and, even without a visual, you know the meaning.”

“This research is exactly the right direction for natural language processing,” says Stefanie Tellex, a professor of computer science at Brown University who focuses on helping robots use natural language to communicate with humans. “To interpret grounded language, we need semantic representations, but it is not practicable to make it available at training time. Instead, this work captures representations of compositional structure using context from captioned videos. This is the paper I have been waiting for!”

In future work, the researchers are interested in modeling interactions, not just passive observations. “Children interact with the environment as they’re learning. Our idea is to have a model that would also use perception to learn,” Ross says.

This work was supported, in part, by the CBMM, the National Science Foundation, a Ford Foundation Graduate Research Fellowship, the Toyota Research Institute, and the MIT-IBM Brain-Inspired Multimedia Comprehension project.
