Below is an easy guide to getting started once you’ve created your free account. You can either start your chatbot testing with your own NLP model, or use our sample model (which is for a simple customer banking chatbot).

The idea is to identify and fix one problem at a time. Because QBox will fully test your model in a matter of minutes, it’s easy to test, change, validate, and repeat.

With this step-by-step approach, you’ll start to understand how to fix common issues, and what types of solutions a specific provider responds well to.

In the next 5 minutes, you’ll learn:

  • How to export the model training data from your provider

  • How to test this training data on your provider using your free QBox account

  • How to identify the confusion in your model’s understanding

  • How to fix your training data

  • How to validate your changes to the training data to ensure they’ve been positive

Improve your chatbot in 5 steps

  • Export

  • Test

  • Identify

  • Fix

  • Validate


Step 1: Export

Follow the instructions provided by your NLP provider to download your AI bot's training data (your utterances, samples, examples, and entities).

Each provider uses slightly different terminology; use the guides below to find out how to export the training data for your chosen NLP provider.
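The exact export format varies by provider, but the result is usually a structured file mapping intents to example utterances. As a rough illustration (the schema below is an assumption for this sketch, not any specific provider's format), you could inspect an export like this:

```python
import json

# Assumed, simplified schema: intent name -> list of training utterances.
# Real exports (Dialogflow ZIPs, LUIS JSON, etc.) each use their own layout.
export_json = """
{
  "account_balance": ["how much money do i have", "what is my balance"],
  "spend_category": ["what have i spent my money on"]
}
"""

def utterance_counts(training_data):
    """Count training utterances per intent; lopsided counts are an early warning sign."""
    return {intent: len(utterances) for intent, utterances in training_data.items()}

training_data = json.loads(export_json)
print(utterance_counts(training_data))
# {'account_balance': 2, 'spend_category': 1}
```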

For the following steps, we will be using our sample model training data (download link below). Feel free to follow along using our sample model; the same principles apply to your own.

Note: If you are using our example training files (above) to follow this tutorial, then the rest of this article will make more sense if you first import the file into your chosen NLP provider, so it behaves as if it were a chatbot you were managing.


Step 2: Run a test

Once you’ve logged in to QBox, select Create job from the main menu on the left-hand side of the page. Tests are known as jobs in QBox.

For this example, we’ll call the project “Consumer Bank.”

We’ll also need to give our job a name. This just needs to be a name that means something to you, so you can easily refer to it later. We’ll call our example job “Initial test.”

Next, in the training data section, upload the training data file you exported in step 1.

Click Create job to start the test.

It is as simple as that! QBox will now analyze the performance of your training data for your chosen provider.

Note: You can ignore the Confidence Threshold options for now. QBox automatically detects your provider from the format of the file you upload, but it also lets you test your model against multiple providers, so you can compare them and choose the best one.


Step 3: Identify problems

After a minute or two, we'll see the results of our test. The three scores (Correctness, Confidence, and Stability) at the top left are the KPIs for your model.

The histogram shows that we have several poor-performing intents; they're shown in red. Select the Intents tab to see which intents are performing poorly.


Here, you can see a list of intents. Each intent has the same three scores as the overall model. We can see that the spend_category intent has a poor score (as do several others). As mentioned earlier, it is all about fixing one problem at a time.

Let's click on the spend_category intent to find out why it performs so poorly.


The intent details page will show any pieces of training data that, when tested:

  • did not return the expected intent (poor Correctness)

  • returned the expected intent, but with a low confidence score (poor Confidence)

  • returned the expected intent, but with a confidence score close to another intent's (poor Stability)

Click on the training data “what have i spent my money on” to see what might have caused this confusion.

Note: Confidence and Stability are not supported by all providers.
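To make the three KPIs concrete, here is a minimal sketch of how such scores could be computed for a single test utterance. These are illustrative definitions based on the descriptions above, not QBox's actual formulas:

```python
def score_utterance(expected, predictions):
    """Score one test utterance against ranked (intent, confidence) predictions.

    Illustrative definitions, inspired by the three KPIs described above:
      - correct: the top predicted intent is the expected one
      - confidence: the confidence returned for the expected intent
      - stability: the gap between the expected intent's confidence and the
        strongest competing intent's (small or negative gaps mean instability)
    """
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    top_intent, _ = ranked[0]
    correct = top_intent == expected
    confidence = dict(ranked).get(expected, 0.0)
    others = [conf for intent, conf in ranked if intent != expected]
    stability = round(confidence - max(others), 2) if others else confidence
    return correct, confidence, stability

# "what have i spent my money on" came back behind two stronger intents:
result = score_utterance(
    "spend_category",
    [("account_balance", 0.41), ("statement", 0.38), ("spend_category", 0.35)],
)
print(result)  # (False, 0.35, -0.06)
```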


This will show us a color-coded view of the training data and the top intents predicted by the model: our expected intent (spend_category) and the two other intents that were returned with a higher confidence (account_balance and statement).

We can see a near-identical utterance in both the statement intent and our expected spend_category intent (“what have i spent my money on?” and “what have i spent my money on”). This looks like a mistake; we will remedy it in the next step.

We can also see that the phrase “how much” is used a lot more in the account_balance intent. This may cause our NLP engine to think that this phrase is significant for this intent.

Because this phrase also applies to the spend_category intent, we need to fix the problem by adding some additional training data in the next step.
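One way to spot this kind of confusion is to look for phrases that appear in more than one intent's training data. QBox surfaces this analysis for you; as a rough sketch of the idea (assuming the same simplified intent-to-utterances schema as before):

```python
def bigrams(utterance):
    """Split an utterance into lowercase two-word phrases."""
    words = utterance.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def shared_phrases(training_data):
    """Find two-word phrases that occur in more than one intent's training data.

    A phrase dominated by one intent, like "how much" in account_balance,
    can pull the model's predictions toward that intent.
    """
    intents_per_phrase = {}
    for intent, utterances in training_data.items():
        for utterance in utterances:
            for phrase in set(bigrams(utterance)):
                intents_per_phrase.setdefault(phrase, set()).add(intent)
    return {p: sorted(i) for p, i in intents_per_phrase.items() if len(i) > 1}

data = {
    "account_balance": ["how much money do i have", "how much is in my account"],
    "spend_category": ["how much have i spent this month"],
}
print(shared_phrases(data))
# {'how much': ['account_balance', 'spend_category']}
```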

Step 4: Modify training data

Now that we understand where the problems in our training data are, we can make appropriate changes to it. In this case, we are going to remove a training sample from the statement intent, and add two samples to the spend_category intent, like so:

Intent          | Training data                     | Action
statement       | what have i spent my money on?    | Remove
spend_category  | how much have i spent this month  | Add
spend_category  | how much did i spend this month   | Add

Alternatively, download one of the files we have modified for you:

Note: Log in to your NLP provider and make these changes to the training data; see your provider's documentation for how to modify the training data in your model.
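If you prefer to edit the exported file programmatically before re-importing it, the table's actions could be applied like this (a sketch assuming the same simplified intent-to-utterances schema used earlier):

```python
def apply_changes(training_data, removals, additions):
    """Return a copy of the training data with utterances removed and added.

    Assumed schema: intent name -> list of utterances. `removals` and
    `additions` are lists of (intent, utterance) pairs, mirroring the
    Remove/Add actions in the table above.
    """
    updated = {intent: list(utterances) for intent, utterances in training_data.items()}
    for intent, utterance in removals:
        updated[intent].remove(utterance)
    for intent, utterance in additions:
        updated.setdefault(intent, []).append(utterance)
    return updated

data = {
    "statement": ["send me a statement", "what have i spent my money on?"],
    "spend_category": ["what have i spent my money on"],
}
updated = apply_changes(
    data,
    removals=[("statement", "what have i spent my money on?")],
    additions=[
        ("spend_category", "how much have i spent this month"),
        ("spend_category", "how much did i spend this month"),
    ],
)
print(updated["statement"])  # ['send me a statement']
```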

Step 5: Validate

Now repeat step 2: run another job on QBox with your modified sample model to find out whether your changes have resulted in any improvement.

As with the first test, it will take a minute or two.

Note: If your training data is particularly large, results will take longer to process. We recommend you enable notifications, so you’ll know as soon as your test results are in.

Results time


QBox always compares your latest job to the previous one to identify any improvements or regressions. As we can see here, we have improved the spend_category and statement intents.

We can see our changes also had a positive effect on some other intents. This is because when we clarify the meaning of intents, it can clarify the meaning of other (slightly overlapping) intents.

Note: If you use our sample model, you may see slight variations in the scores.

We recommend you repeat this process until all your intents have a score of at least 75%. A great model will score roughly 90-95%.

We hope you enjoyed this tutorial. If you have any feedback or questions, feel free to contact us.

Let me have a look

Get started with QBox free today and get your chatbot performance assessment in the next two minutes.


Give me a demo

Book a demo so we can show you how QBox will improve your chatbot testing.
