Last week, I was a speaker at two conferences in the conversational AI space: Conversation Design Festival and L3-AI. I talked about the gap between conversation design and engineering, and I introduced a new modeling language, colang, which we’ve developed at RoboSelf as a potential solution. In this article, we will look together at the motivation and need for a common language for Conversational AI.


“Dialogue is sort of an ‘AI-complete’ problem. You would have to solve all of AI to solve dialogue, and if you solve dialogue, you’ve solved all of AI.” - Stephen Roller, Facebook

Conversational AI is challenging. So much so that Stephen Roller, a senior research engineer at Facebook, calls it an “AI-complete problem.” A core challenge is dialog management, i.e., the ability of a bot to engage in a conversation by exchanging turns and keeping the context across those turns. We can divide current approaches to dialog management into two camps: the state machine camp and the machine learning camp. The first enables the designer/developer to control what should happen next at any point in time by explicitly modeling a state machine, a flow diagram, or a combination of the two. For goal-oriented bots, gathering data from the user is a prevalent task, so most conversational AI frameworks include special components for handling forms. The drawbacks of the state machine approach are its rigidity and the exploding complexity when developing a complex bot. Machine learning approaches, on the other hand, scale well, but they come with a data annotation effort and a loss of control over exactly what the bot says at what point in time.
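To make the state machine camp concrete, here is a minimal Python sketch of a form-style bot driven by an explicit transition table. The states, intents, and replies are hypothetical and are not taken from any particular framework (nor from colang); the sketch only illustrates the control, and the rigidity, described above.

```python
from dataclasses import dataclass

# Hypothetical transition table for a tiny booking bot:
# (current state, user intent) -> (next state, bot reply)
TRANSITIONS = {
    ("start", "greet"): ("ask_city", "Hi! Which city are you travelling to?"),
    ("ask_city", "inform_city"): ("ask_date", "Got it. On what date?"),
    ("ask_date", "inform_date"): ("done", "Thanks, your trip is booked."),
}


@dataclass
class StateMachineBot:
    state: str = "start"

    def handle(self, intent: str) -> str:
        """Advance the machine by one user turn and return the bot's reply."""
        key = (self.state, intent)
        if key not in TRANSITIONS:
            # Anything the designer did not anticipate ends up here,
            # and the state does not change.
            return "Sorry, I didn't understand that."
        self.state, reply = TRANSITIONS[key]
        return reply
```

Every turn the designer wants to support needs its own entry in the table, for every state in which it may occur. That is where the complexity starts to explode as the bot grows.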


Colang Code Sample


However, these approaches are two extremes of a spectrum on which we have seen very little in between. There is a point where we, as humans, can no longer produce the logic/algorithm for a very complex problem, and that’s where machine learning takes over. Yet we can build highly complex systems today with the techniques we have developed in software engineering. As a reference point, the Linux kernel, the most used operating system kernel globally, has over 27 million lines of code contributed by over 11 thousand people in over 1 million commits over the last three decades. If you compare this with the typical dialog complexity of a chatbot on the market, the difference is multiple orders of magnitude. So, the question I’m asking is this: “How far can we go in handling the complexity of dialog management before we hand it over to ML algorithms? And what do we need to get there?”


When discussing the shift to conversational interfaces, the most used analogy is the shift from web to mobile and mobile to voice. We’ve come a long way in creating exceptional web interfaces that people use every day for, well, everything. To a certain degree, we’ve done the same for mobile. Let’s look at web applications for a minute. In its most basic form, every web application is a bunch of HTML, CSS, and JavaScript running in a browser, at least on what is called the client side. However, nowadays, nobody builds a web app from scratch. There are ready-made templates and components for the vast majority of things you want (authentication, forms, galleries, etc.). With no-code tools, you can build a decently complex app just by stitching together pre-built components. This efficiency in building web apps is possible because of the standards in place, including the three languages mentioned, the HTTP protocol, and several others. Many frontend frameworks have emerged over time, and the same goes for the backend. A very healthy ecosystem of packages has evolved, like npm for Node.js (over 1.6 million packages) and PyPI for Python (over 300k packages). In terms of job roles and skills, web designer, frontend engineer, and backend engineer became standard.

Now, if we look at the conversational AI landscape, there is very little standardization, and there’s a lot of reinventing the wheel. According to Gartner, already in 2019, there were as many as 1,000 to 1,500 Conversational AI tech providers. Personally, I have gathered a list of over 300 players in the space, each implementing their own Conversational AI stack. In one way or another, they each claim to be the best at something. From a business perspective, this makes sense as it’s a massive market with room for many players. And given the “local” component of the language, we can expect to see many platforms for each language. However, there’s zero interoperability between all these platforms.

Companies like Conversation Design Institute lead the way in standardizing conversation design as an industry in terms of job roles, skills required, training courses, and methodology. However, when moving from design to implementation, we still have to adapt to the wide variety of platforms out there. There is no way to take a component implemented for one platform and reuse it on another. I would argue that the main thing missing is a language (or set of languages) that we can use to model conversational systems. What is the HTML/CSS/JavaScript for Conversational AI?


But is it possible? If we look at the concepts and abstractions used by most platforms, we will find a substantial overlap. They all use bots, users, states, paths/sequences/flows/diagrams, utterances, intents, entities, context, devices, text, rich elements, quick replies, etc. I would argue there’s a solid conceptual common ground to create a standard modeling language. However, there’s something unique about conversations that we did not have with other interfaces: their very open-ended nature. After all, the user can say anything they want. This variety already makes the natural language understanding layer difficult. Managing a conversation across ten turns, when the user can say anything at every turn, is hard! So, what type of modeling language could potentially handle this? In my opinion, the best answer comes from the field of study dedicated explicitly to human conversations: Conversation Analysis. The Natural Conversation Framework developed at IBM Research gives a comprehensive overview of the types of patterns we see in conversations, adapted from Conversation Analysis. Without going into details, the fundamental idea is that conversations are structured in basic sequences, and these sequences are expandable. So, any modeling language we choose should have the ability to represent sequences of turns and expand on them. These expansions have many names, like digressions, fallbacks, or repair paths, and, without proper support, they make conversational interfaces extremely hard to model.
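As a rough illustration of "basic sequences that are expandable," the Python sketch below models a base sequence of turns plus a set of expansions that can occur at any step. After an expansion, the bot returns to the pending step of the base sequence. The class name, intents, and replies are all hypothetical, and this is not colang syntax or how any particular engine implements it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Sequence:
    # Ordered (bot prompt, expected user intent) pairs: the base sequence.
    steps: list[tuple[str, str]]
    # Expansions: intents that may occur at any step, with the bot's answer.
    expansions: dict[str, str] = field(default_factory=dict)
    position: int = 0

    def prompt(self) -> str | None:
        """Return the prompt of the pending step, or None when finished."""
        if self.position < len(self.steps):
            return self.steps[self.position][0]
        return None

    def handle(self, intent: str) -> list[str]:
        """Return the bot's utterances in response to one user turn."""
        if self.position >= len(self.steps):
            return []
        expected = self.steps[self.position][1]
        if intent == expected:
            # The base sequence advances to its next step.
            self.position += 1
            nxt = self.prompt()
            return [nxt] if nxt is not None else ["All done."]
        if intent in self.expansions:
            # Expansion: answer the side question, then resume the base sequence.
            return [self.expansions[intent], self.steps[self.position][0]]
        # Simple repair path: signal the problem and repeat the pending prompt.
        return ["Sorry, I didn't get that.", self.steps[self.position][0]]
```

For example, with the steps `[("Which city?", "inform_city"), ("What date?", "inform_date")]` and the expansion `{"ask_why": "I need it to find flights."}`, a user who asks why at the first step gets the explanation followed by the same question again, without the designer having to wire that digression into every state by hand.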


Colang Code Sample

Today, I’m excited to introduce you to colang, a modeling language with the ambition to solve some of the issues mentioned above. Developing great conversational experiences requires expertise from multiple fields. So, we set out to create a clean and intuitive language that designers, linguists, and copywriters can use without a steep learning curve, but at the same time powerful enough to enable the modeling of complex scenarios by advanced developers. We wanted a language that reads naturally, with minimal artificial syntax, and which is extensible. The last attribute is particularly essential. Our understanding of how we should build conversational systems is evolving every day, and the language needs to evolve with it.

Seeing is believing. We’ve also built a playground that you can use to learn the language and run models written in colang. The conversational AI stack powering the playground uses state-of-the-art components. You get access to a hybrid natural language understanding engine (grammar-based and machine-learning-based), a conversational flows engine implementing the full semantics of colang, an in-memory graph database, and a natural language generation component supporting context variables, widgets, and contextual responses. We are in private beta, but if you’re interested and request access, you should hear from us within 24 hours.

So, what can you do with a colang model once you’ve created one? If you want to deploy it in production using the same stack as the one powering the playground, let us know. We’re also currently testing the import of colang models from various conversation design tools and the export of a model into specific conversational AI technologies like Rasa or Dialogflow. However, given the strong semantics of the language, and the sometimes very simplistic models adopted by some platforms, it will not be possible to export every colang model to every platform. Sometimes only a partial export will be possible, as certain features will not have a corresponding equivalent. For example, colang allows us to add a specific behavior when a digression finishes, e.g., to repeat a question up to 3 times. Very few platforms enable such behavior, and even then, it’s hard-coded and not customizable for many of them. The colang semantics will also allow us to compare the capabilities of different conversational AI platforms to choose the right one for the job. Soon, we’ll post a follow-up article on this.
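To show what "repeat a question up to 3 times when a digression finishes" can mean in practice, here is a small Python sketch of that behavior. It is not colang syntax; the class, the method name, and the fallback reply are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class PendingQuestion:
    text: str
    max_repeats: int = 3  # the "up to 3 times" limit from the example
    repeats: int = 0

    def on_digression_finished(self) -> str:
        """Decide what the bot says once a digression is over."""
        if self.repeats < self.max_repeats:
            self.repeats += 1
            return self.text
        # Hypothetical fallback once the repeat budget is used up.
        return "Let's come back to that later."
```

The point of making `max_repeats` a parameter is that the limit becomes part of the model. On a platform where this behavior is hard-coded, a setting like this may have no equivalent in the exported model.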

Conversation means collaboration. I believe it’s high time we collaborate more on building great conversational experiences. It is early days for colang, and even version 1.0 of the language specification has not been finalized, but we wanted to get it out there and have you give it a try. It’s a long way to creating an industry-wide standard language, but every marathon starts with the first step.

Razvan Dinu

Co-founder & CEO @ RoboSelf

I'm a tech entrepreneur with a passion for building technology-focused products. Right now, I'm focused on developing the next-generation conversational AI platform for building digital assistants for business.