Microsoft Power BI

2023

Evaluating usability of natural language prompts feature for generating mathematical formulas

Collaborators
Colette Chen
Maomao Ding
Swati

The Quick Measure Suggestions feature enabled users to quickly create new measures using natural language prompts instead of writing formulas in DAX (Data Analysis Expressions).

As Microsoft integrated OpenAI’s GPT models across its product suite, the Power BI team wanted to evaluate the usability of the Quick Measure Suggestions feature before launching to general availability.

Purpose of Study
Gather insights on the usability of Power BI’s Quick Measure Suggestions feature

Evaluate discoverability of the feature in workplace settings

Identify challenges in submitting a natural language prompt

Discover best practices to set user expectations for the feature

Target Users

Power BI users with experience in DAX

A critical requirement was that users have some experience with DAX so that we could validate their expectations against the feature's output.

Participant recruitment

In collaboration with the Power BI team, we recruited six participants who used Power BI daily for personal or professional work.

The recruitment process, including screening, signing a Non-Disclosure Agreement with Microsoft, and scheduling, was handled through the User Interviews platform. All participants were based in the United States because of data regulations in specific geographic regions.

STUDY

Prompting to Prompt - Designing tasks

Our tasks were designed to first give participants an idea of the dataset they would use during the study. We then checked the discoverability of the feature, and finally asked participants to perform calculations of varying difficulty in Power BI.

When writing our script, we had to iterate several times on our wording, because participants tended to reuse the exact words from our task descriptions when prompting the Quick Measure Suggestions feature.

FINDING 01

Not really natural language

Only prompts typed in certain ways led to correct results, and those phrasings were not always natural language.

Language has to be specific to the data, and the feature is not intelligent enough to recognize variations of a value that a user might naturally type. For instance, the user has to enter “United States of America” and not “U.S.” or “United States”.

FINDING 02

Confusing interaction with the input box

There is a blue underline. But what does that mean?

The blue line indicates a match with an existing field in the dataset, but participants interpreted it as other things, such as auto-suggestions or a filtering system for specific terms.

FINDING 03

Expectations inspired by ChatGPT

Because this feature is launching after people have become used to ChatGPT, participants expected more than just calculations.

This feature is limited to providing formulas, but people want directions on getting to their final outcome, which could be a graph or a refined calculation.

FINDING 04

Interface shortcomings

Participants wanted to name the measure they were creating before adding the calculation as a card to their dashboards, but they could not find a way to do that. They missed the formula bar at the top left, where the measure name can actually be edited.

Some participants also expected a typo recognizer that would help them avoid mistakes when writing a natural language prompt.

FINDING 05

Poor discoverability of additional suggestions

Five of six participants did not notice the variations of suggested measures shown below the first expanded suggestion card, even though they appeared multiple times throughout the test.

Furthermore, participants reported that the variations looked similar to each other and that they could not easily distinguish them.

FINDING 06

Unexpected output

Every output provides a “Preview value”, which is not optimal for certain types of prompts and leads to confusion.

The preview is confusing for prompts with multiple variables, especially categorical or time/trend-related ones. The current design is most suitable for displaying singular, text-based answers.

The “Preview value” is an intermediate step, but participants expect the final output (an analysis or a visualization).

Measuring Usability

System Usability Scale

We used the System Usability Scale (SUS), a list of ten statements that participants rated as “Strongly Agree”, “Agree”, “Neutral”, “Disagree”, or “Strongly Disagree”.

60.7

SUS Score
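For readers unfamiliar with how a SUS score is derived, the standard scoring procedure can be sketched as follows. This is a minimal illustration, not the study's actual analysis script: the function names and the mapping of the five labels to the numbers 1 to 5 are assumptions, and the study's raw participant responses are not reproduced here.

```python
# Standard SUS scoring: ten items rated 1-5.
# Odd-numbered items are positively worded and contribute (rating - 1).
# Even-numbered items are negatively worded and contribute (5 - rating).
# The summed contributions (0-40) are multiplied by 2.5 to give a 0-100 score.

# Assumed mapping from the response labels to numeric ratings.
LIKERT = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}


def sus_score(ratings):
    """Return the SUS score (0-100) for one participant's ten ratings."""
    if len(ratings) != 10:
        raise ValueError("SUS requires exactly ten ratings")
    total = 0
    for item_number, rating in enumerate(ratings, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("each rating must be between 1 and 5")
        if item_number % 2 == 1:
            total += rating - 1   # positively worded item
        else:
            total += 5 - rating   # negatively worded item
    return total * 2.5


def mean_sus(all_ratings):
    """Average the per-participant SUS scores to get the study-level score."""
    scores = [sus_score(r) for r in all_ratings]
    return sum(scores) / len(scores)


def labels_to_ratings(labels):
    """Convert a list of response labels into numeric ratings."""
    return [LIKERT[label] for label in labels]
```

The reported score is the mean of the per-participant scores. A score of 68 is commonly cited as the average benchmark for SUS.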

AI Trust Score

Because the feature relies so heavily on AI, we also measured usability using Microsoft’s AI Trust Score, which uses six questions to evaluate the trust that enterprise users place in an AI system.

The AI Trust Score for this project cannot be shared here.

Good Things

What went well

The feature has high discoverability: users followed three unique flows to reach the Quick Measure Suggestions natural language input box.

No participant had any difficulty starting to write a natural language prompt. However, there were challenges in understanding the suggestions and other interface elements once they started typing.

Furthermore, all participants (especially those new to or unfamiliar with DAX) were highly positive about the potential of the feature, and all said they would be willing to use it in their everyday work.