Language learning requires physically existing in the world (not being just a brain in a jar). A child doesn’t learn language from reading Wikipedia. They build rich models of the world by seeing and doing. They understand other humans as socially intelligent and cooperative agents with whom to speak and from whom to learn. Yet natural language “understanding” systems have focused on training in an impoverished, disembodied, text-only setting: downloading as much text as possible from the internet and then copying patterns. The silly example I like to use is: “Is an orange more like a baseball or a banana?” Your answer depends a lot on whether you’ve ever seen, held, or squished each of them. Maybe fruits are more alike, or maybe spherical things, maybe bright colors, maybe things that deform when you squeeze them, and so on. This rich experiential meaning is necessary for building the natural language processing systems of the future. We want systems that understand novel requests like “will the oranges be ok if I pack them under the baseballs?”
To study this space, we have to break it down (we cannot simply build a terminator that we raise as our own child for a few years). Instead, we think about: 1. How can web-scale vision and language systems build compatible representations of the world? 2. How can we build instruction-following agents in simplified simulated worlds? 3. How can we transfer models trained in simulation to the real world? In all three, we always assume a human is present, either as a user or as a teacher, pushing the system to do something new that it hasn’t seen before. Simulators are rapidly advancing, and every design decision has implications for which aspects of language a system can hope to acquire. For example, a simulator that lets you move or roll objects around exposes new information about shape similarity, size, and mass, but may still hide details of deformation. This means we have to work closely with colleagues in vision and robotics, both to leverage new advances and to guide functionality.
Butler robots: many of us grew up with Rosey from The Jetsons, which planted the dream of a robot that can help with household chores, serve as our sous chef, and flexibly learn new tasks. While this won’t happen anytime soon, as robots and IoT devices become ubiquitous, we need to be able to interact with them as we would with other people. Yelling at your Roomba currently makes you feel better but doesn’t actually change its behavior. The goal is a robot aide that can co-exist in human spaces and perform tasks we can’t (or perhaps just don’t want to).
Our CLAW (Connecting Language to Actions and the World) Lab focuses on this core idea of grounding and embodiment. Several collaborators and I wrote up this vision to provide a bit of a roadmap for ourselves and the field. That also means looking at how language research can help, or learn from, other communities. A key aspect of meaning that is only indirectly observable is that of mental states. Once we believe we understand the world, we can ask whether others share our understanding and, if our world models differ, how we can reach consensus. This research into building a theory of mind and social intelligence is even more in its infancy than the other topics I’ve mentioned here.
I have always been interested in languages (including just learning them), which has guided the reading groups and programs I’ve participated in. Eventually, this led to doing my PhD at UIUC on unsupervised grammar induction. I wanted to understand how much about language we can learn from just “overhearing” it (e.g. adjectives come before nouns in English but after them in Spanish). Once it became clear how much form and how little substance could be learned this way, I started to pivot towards understanding where meaning comes from. It’s taken a while to get really comfortable in this new space, acknowledging the limitations of my own previous work and searching out a new path.
My website: https://www.YonatanBisk.com
My Twitter handle: @ybisk