Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech

Learning to understand grounded language (language that occurs in the context of, and refers to, the broader world) is a well-known area of research in robotics. The majority of current work in this area still operates on textual data, which limits the ability to deploy agents in realistic environments.

Digital analysis of end-user speech (or raw speech) is a vital part of robotics. Image credit: Kaufdex via Pixabay, free license

A recent paper published on arXiv.org proposes to acquire grounded language directly from end-user speech using a relatively small number of data points instead of relying on intermediate textual representations.

The paper presents a detailed analysis of natural language grounding from raw speech to robotic sensor data of everyday objects using state-of-the-art speech representation models. The analysis of audio and speech qualities of individual participants shows that learning directly from raw speech improves performance on users with accented speech compared to relying on automatic transcriptions.

Learning to understand grounded language, which connects natural language to percepts, is a critical research area. Prior work in grounded language acquisition has focused primarily on textual inputs. In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs. This will allow interactions in which language about novel tasks and environments is learned from end users, reducing dependence on textual inputs and potentially mitigating the effects of demographic bias found in widely available speech recognition systems. We leverage recent work in self-supervised speech representation models and show that learned representations of speech can make language grounding systems more inclusive towards specific groups while maintaining or even increasing general performance.

Research paper: Youssouf Kebe, G., Richards, L. E., Raff, E., Ferraro, F., and Matuszek, C., “Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech”, 2021. Link: https://arxiv.org/abs/2112.13758