Software Design - virtual voice assistant

Introduction

New people joining our company from a different culture and country usually experience a great deal of confusion. There is an overwhelming number of questions, ranging from visas and city registration to finding an apartment or even where to buy various products. Answering these questions is a big challenge and workload for MobiLab. Our internal documentation in Confluence helps a lot, but people still prefer to extract specific information by simply asking a question. As a result, employees frequently approach back-office staff even though the information they need is readily available.

To handle this challenge and simplify the process, we suggested MobiTalk, a virtual voice assistant primarily focused on Q&A for MobiLabbers. In this blog post, I will briefly summarize the ideas and experiences from this project. We will go over the basic components of the system and take a short look at how we prepare data for machine learning.

Designing the Software

Establishing Requirements

At the outset, it was evident that the virtual voice assistant would have to be multilingual, since we employ many people from a variety of countries, and a native-language voice assistant would encourage them to try it out. For the initial launch we decided to go with English and German, because everybody at MobiLab speaks at least one of these two languages.

Furthermore, the application should be easy to use. Answers may be provided either directly in detail (if possible) or as a link to an existing Confluence page. This way, the user receives a proper answer if the system can address the question accurately enough, or at least a nudge in the right direction if it can’t figure out a precise response. To further increase MobiTalk’s ease of use, requests should be repeatable in cases where the software does not understand the spoken input right away.

Lastly, both text and voice input would need to be supported, since users might want to use the application silently in situations where they can’t talk to their smartphone. With all this in mind, we went on to design the basic components.

Basic Components

At this point we had to figure out how to translate our ideas into reality. We decided on the following two main components:

The first component is the user interface: a responsive website that works the same way on both mobile and PC. Using JS libraries, the web application is also in charge of recording audio and providing Text-To-Speech (TTS). In short, the TTS takes text input in a specific language and generates a voice that reads it out loud. The tone and quality of this voice are adjustable.

The second component is a central server with various libraries and scripts that analyze the incoming voice and send the answers back to the user’s device. For this purpose we work with several open-source libraries: KALDI for Speech-To-Text (STT), fastText for text classification, and spaCy for Named Entity Recognition (NER).

Software Design - Server

Here’s how they all play together: the STT produces a written transcript of the spoken words in the audio recording. Classifying that text then tells us which topic the spoken sentence relates to. In the next step, we use NER to extract important keywords such as names of places or people. After that come a few post-processing steps… and we are done!
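To make this flow a bit more concrete, here is a minimal sketch of the classification and NER steps on the server. The model file names and the analyze function are hypothetical, and it assumes a fastText classifier has already been trained and the spaCy English model has been downloaded:

```python
import fasttext
import spacy

# Load the text classifier and the spaCy pipeline (hypothetical model paths).
classifier = fasttext.load_model("model.bin")
nlp = spacy.load("en_core_web_sm")

def analyze(transcript: str):
    """Classify the transcribed question and extract named entities."""
    # fastText returns the top label(s) and their probabilities.
    labels, probabilities = classifier.predict(transcript, k=1)
    topic = labels[0].replace("__label__", "")

    # spaCy's NER picks out keywords such as place or person names.
    doc = nlp(transcript)
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    return topic, probabilities[0], entities

topic, confidence, entities = analyze("Where do I register my apartment in Cologne?")
print(topic, confidence, entities)
```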

For now, users are meant to press and hold the microphone icon and release it when they are finished talking. The spoken language also has to be specified by the user beforehand. In case the voice recognition unit cannot interpret any text, it sends back a flag indicating that the voice was not recognized. As previously stated, users are then asked to repeat their recording. Preparing the data for the fastText model is a different topic entirely, which I’ll go over in the next part.
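As a rough illustration of how the server could report that flag back to the UI, here is a small sketch of a request handler. Flask, the endpoint name, and the speech_to_text and build_answer helpers are all assumptions made for this example, not the actual implementation:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/ask", methods=["POST"])
def ask():
    # The client sends the recorded audio plus the language chosen beforehand.
    audio = request.files["audio"].read()
    language = request.form.get("language", "en")

    transcript = speech_to_text(audio, language)  # hypothetical KALDI wrapper
    if not transcript:
        # Flag telling the UI to ask the user to repeat the recording.
        return jsonify({"recognized": False})

    topic, confidence, entities = analyze(transcript)  # see the sketch above
    return jsonify({"recognized": True,
                    "answer": build_answer(topic, entities)})  # hypothetical helper
```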

Data Preparation for Machine Learning

Since the primary goal of MobiTalk is to be an internal system for fellow MobiLabbers, we decided to use data from our own company Confluence and Slack. In some cases we also needed to gather data from outside sources (e.g. specific information about cities like Berlin or Cologne).

We considered both supervised and unsupervised learning. In this rundown, I focus on the supervised approach, for which we extracted sentences from the paragraphs of all Confluence pages wherever possible. To keep the implementation simple, we used the title and subtitle of the Confluence page under which a sentence appears as its machine-learning features. Doing this, we end up with a table of training data like this:

Software Design - Table 1
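To illustrate how such a table can be turned into training data, here is a minimal sketch that writes the rows into the line-based format fastText expects, where each line starts with one or more __label__ tags. The sentences, labels, and file name are made up for this example:

```python
# Each row: (sentence extracted from a Confluence paragraph, [title, subtitle]).
rows = [
    ("You have to register your address within two weeks of moving.",
     ["Relocation", "City Registration"]),
    ("The office keys can be picked up at the reception desk.",
     ["Office", "Keys"]),
]

with open("confluence_train.txt", "w", encoding="utf-8") as f:
    for sentence, features in rows:
        # fastText supervised format: "__label__A __label__B the sentence text"
        labels = " ".join("__label__" + feat.replace(" ", "_") for feat in features)
        f.write(f"{labels} {sentence.lower()}\n")
```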

To also be able to produce sentences from data that is formatted differently, we wrote a script that generates them from tables on the Confluence pages as well. For example, here’s a table with team members, their roles, and the number of years they have been working at MobiLab so far:

Software Design - Table 2

This data allows us to construct the phrase “Ali has been working as a Data Scientist at MobiLab for 2 years”. We can also generate different versions of these sentences to make the training more robust, for example “As a data scientist, Ali has been with MobiLab for two years”. This sentence is just a variation of the previous one, but it helps the model train more accurately. Both are added to the training table as well. In this particular case, the machine-learning features could be “Team Members” and “Roles”.
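Here is a small sketch of how such sentences and their variations could be generated from a table like the one above. The second row and the exact templates are illustrative, not taken from our real data:

```python
# Illustrative rows from a Confluence table: (name, role, years at MobiLab).
team_table = [
    ("Ali", "Data Scientist", 2),
    ("Maria", "Backend Developer", 4),
]

# A few sentence templates so each row yields several training variations.
templates = [
    "{name} has been working as a {role} at MobiLab for {years} years.",
    "As a {role}, {name} has been with MobiLab for {years} years.",
]

training_rows = []
for name, role, years in team_table:
    for template in templates:
        sentence = template.format(name=name, role=role.lower(), years=years)
        # The features for this table could be "Team Members" and "Roles".
        training_rows.append((sentence, ["Team Members", "Roles"]))

for sentence, features in training_rows:
    print(features, sentence)
```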

With some minor alterations, the generated tables can be handed over to fastText for the text classification I mentioned earlier. We are then able to produce proper answers based on the classification as well as the additional information from NER. Ultimately, we wanted the Q&A to be available in German too, which is why we ended up translating every sentence from English automatically using the Python transformers library and pre-trained models.
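The following sketch shows both steps under these assumptions: the training file from the earlier example, default-ish fastText hyperparameters, and Helsinki-NLP/opus-mt-en-de as the pre-trained translation model (a common choice, not necessarily the one we used):

```python
import fasttext
from transformers import pipeline

# Train the supervised classifier on the file generated from the Confluence data
# (see the earlier sketch); the hyperparameters here are only reasonable defaults.
model = fasttext.train_supervised(input="confluence_train.txt", epoch=25, wordNgrams=2)
model.save_model("model.bin")
print(model.predict("how do i register my address in cologne"))

# Translate English training sentences to German with a pre-trained model.
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
english = "Ali has been working as a Data Scientist at MobiLab for 2 years."
print(translator(english)[0]["translation_text"])
```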

Conclusion

So far, we have implemented the first version of the voice assistant and set up the server to answer requests from the UI. Going forward, the recognition still needs more training data, which can be gathered from the questions that end users ask. We may also integrate the UI with Slack, which would unify interactions across our company, since Slack is already used on a daily basis. Another idea is adding grammar correction and a spell checker, as these areas could still use some improvement.

While working on this project, it was quite a challenge for me to scope tasks so that they were feasible for my team members, as they were already involved in other ongoing work and this was just a side project for them. Overall, it was a great experience to lead a motivated team of developers and bring different technologies together.