Would you like to try something using Web Speech API?! How about scrolling a web page using voice commands?

Before we start that here's a brief introduction to Web Speech API.

What is the Web Speech API?

Web Speech API allows you to add voice capabilities into your web app. It has two components: SpeechSynthesis and SpeechRecognition.

The SpeechSynthesis is used to convert text into speech and SpeechRecognition is used to convert speech into text.

For the purpose of this demonstration we'll be focusing only on the SpeechRecognition part of the Web Speech API.

Speech recognition works, in simpler terms, by matching the sounds we produce when we speak (phonemes) with written words.

Let's start

In the following sections we will start to code a simple HTML page with a long scrollable text and the JavaScript code need to make this page scroll via speech commands.

HTML

Create a simple page with lot's of text to scroll through.

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <title>Page Scroller using Web Speech API</title>
        <meta name="description" content="Page Scroller using Web Speech API">
        <meta name="author" content="Cloudoki">
        
        <!-- Mobile Specific Metas
        –––––––––––––––––––––––––––––––––––––––––––––––––– -->
        <meta name="viewport" content="width=device-width, initial-scale=1">
        
        <!-- CSS
        –––––––––––––––––––––––––––––––––––––––––––––––––– -->
        <link rel="stylesheet" href="styles/custom.css">
                
        <!-- Favicon
        –––––––––––––––––––––––––––––––––––––––––––––––––– -->
        <link rel="icon" type="image/png" href="images/icon.png">
    </head>
    <body>
		<div class="container">
            <h1>Page Scroller using Web Speech API</h1>

            <h6>click the scroller button then say:</h6>
            <h6>"scroll + [up, down, top, bottom]"</h6>
            <h6>click again to stop</h6>

            <div class="lipsum">
                [LARGE TEXT HERE]
            </div>

            <button class="scroller">SCROLLER</button>
		</div>

		<script src="scripts/index.js"></script>  
    </body>
</html>

JAVASCRIPT

In our script, let's first see if the browser supports the Web Speech API.

try {
  var SpeechRecognition = SpeechRecognition || webkitSpeechRecognition || null;
}
catch(err) {
  console.error('Starting Web Speech API Error:', err.message);
  var SpeechRecognition = null;
}

function init () {
  // initialize speechRecognition if supported
  if (SpeechRecognition === null) {
    alert('Web Speech API is not supported.');
  } else {
    console.log('Web Speech API is supported.');
  }
}

window.addEventListener('load', function() {
  init();
}, false);

If the SpeechRecognition variable is null then the browser does not support the Web Speech API. Otherwise, we can use it to transcribe our speech.

Note:

At the time of writing this article the browsers that supported Web Speech API were limited. We recommend using Google Chrome for this.

Starting the speech recognition

To start the recogniser, we initialise it by creating a new SpeechRecognition object instance var recognizer = new SpeechRecognition(); and set its properties.

SpeechRecognition.continuous

Controls whether continuous results are returned for each recognition, or only a single result. Defaults to single (false)

SpeechRecognition.interimResults

Controls whether interim results should be returned (true) or not (false). Meaning we get results that are not the final result.

SpeechRecognition.lang

Returns and sets the language of the current SpeechRecognition. If not set, it uses the HTML lang attribute value or the user agent language. It's a good practice to set this value.

SpeechRecognition.maxAlternatives

Sets the maximum number of alternatives provided per result.

  if (recognizer.continuous) {
    recognizer.continuous = true;
  }
  recognizer.interimResults = true; // we want partial result
  recognizer.lang = 'en-US'; // set language
  recognizer.maxAlternatives = 2; // number of alternatives for the recognized speech

When we click the scroller button the recogniser should start by using the recognizer.start() and by clicking it again it should stop using the method recognizer.stop(). We'll also use a state variable to monitor if the recogniser is listening and has really started.

var scrollerClass = '.scroller'

// try to get SpeechRecognition
try {
  var SpeechRecognition = SpeechRecognition || webkitSpeechRecognition || null;
}
catch(err) {
  console.error('Starting Web Speech API Error:', err.message);
  var SpeechRecognition = null;
}

/**
* Initialize the Speech Recognition functions
*/
function startSpeechRecognier () {
  // state used to to start and stop the detection
  var state = {
    'listening': false,
    'started': false,
  };
  var scroller = document.querySelector(scrollerClass); // button to start and stop the recognizer
  var recognizer = new SpeechRecognition();

  // set recognizer to be continuous
  if (recognizer.continuous) {
    recognizer.continuous = true;
  }
  recognizer.interimResults = true; // we want partial result
  recognizer.lang = 'en-US'; // set language
  recognizer.maxAlternatives = 2; // number of alternatives for the recognized speech

  recognizer.onstart = function () {
    // listening started
    state.started = true;
    scroller.innerHTML = 'listening';
    console.log('onstart');
  };

  scroller.onclick = function () {
    if (state.listening === false) {
      try {
        state.listening = true;
        // start recognizer
        recognizer.start();
        console.log('start clicked');
        // if after 3 seconds it doesn't start stop and show message to user
        setTimeout(function () {
          if(!state.started && state.listening) {
            scroller.click();
            alert('Web Speech API seems to not be working. Check if you gave permission to access the microphone or try with another browser.');
          }
        }, 3000)
      } catch(ex) {
        console.log('Recognition error: ' + ex.message);
        alert('Failed to start recognizer.');
      }
    } else {
      state.listening = false;
      state.started = false;
      // stop recognizer
      recognizer.stop();
      scroller.innerHTML = 'scroller';
      console.log('stop clicked');
    }
  }

}

function init () {
  // initialize speechRecognition if supported
  if (SpeechRecognition === null) {
    alert('Web Speech API is not supported.');
  } else {
    startSpeechRecognier();
    console.log('initialized...');
  }
}

window.addEventListener('load', function() {
  init();
}, false);

Getting and handling the results

After starting the recogniser there are several events that will occur that we can use to get the results or information (like the event presented in the code above recognizer.onstart that is triggered when the service starts to listen to voice inputs). The method we are going to use to get the results is the SpeechRecognition.onresult, fired every time we get a successful result.

So inside the startSpeechRecognizer function we'll add:

recognizer.onresult = function (event) {
    // got results
    // the event holds the results
    if (typeof(event.results) === 'undefined') {
        // something went wrong...
        recognizer.stop();
        return;
    }

    for (var i = event.resultIndex; i < event.results.length; ++i) {
      if(event.results[i].isFinal) {
        // get all the final detected text into an array
        var results = [];
        for(var j = 0; j < event.results[i].length; ++j) {
          // how confidente (between 0 and 1) is the service that the transcription is correct
          var confidence = event.results[i][j].confidence.toFixed(4);
          // the resuting transcription
          var transcript = event.results[i][j].transcript;
          results.push({ 'confidence': confidence, 'text': transcript });
        }

        console.log('Final results:', results);
      } else {
        // got partial result
        console.log('Partial:', event.results[i][0].transcript, event.results[i].length);
      }
    }
  };

The event parameter contains the array (SpeechRecognitionResultList) of results of the service. That array may contain the property isFinal, this means that we'll be able to know if it's a final result or a partial one and it contains another array of objects with the properties confidence and transcript. The confidence values (between 0 and 1) allow us to know how much the service is sure that the speech matches the transcription we received. The transcript is the text that the service generates based on what it understands from our voice input.

We'll also use the events onend, onspeechend and onerror to determine when the service stopped so we can to start to listen again, when the service detected that our speech stopped so we know it's not listening to new sentences and if an error occurred respectively.

recognizer.onend = function () {
    // listening ended
    console.log('onend');
    if (state.listening) {
      recognizer.start();
    }
  };

  recognizer.onerror = function (error) {
    // an error occurred
    console.log('onerror:', error);
  };

  recognizer.onspeechend = function () {
    // stopped detecting speech
    console.log('Speech has stopped being detected');
    scroller.innerHTML = 'wait';
  };

Scrolling based on the transcript

In order to scroll we'll need to add the following code to the onresult event:

// scroll according to detected command
var scroll = sortByConfidence(results).shift();
console.log('Final results:', results, scroll);
autoScroll(scroll);

This will sort the results by confidence values, get the best one and send it to be executed if it's a command matching our trigger words. Adding the next bit of code will achieve the behaviour explained.

/**
* Returns an list ordered by confidence values, descending order.
* @param {array} list - A list of objects containing the confidence and transcript values.
* @return array - Ordered list
*/
function sortByConfidence(list) {
  list.sort(function(a, b) {
    return a.confidence - b.confidence;
  }).reverse();
  var sortedResult = list.map(function(obj) {
    return obj.text;
  });
  return sortedResult;
}

/**
 * Execute the command if it matches the inputed.
 * 
 * @param {String} speech The command to evaluate
 */
function autoScroll (speech) {
  var body = document.body,
    html = document.documentElement;
  var pageHeight = Math.max(body.scrollHeight, body.offsetHeight, 
    html.clientHeight, html.scrollHeight, html.offsetHeight);
  var currentHeight = Math.max(body.scrollTop, html.scrollTop, window.pageYOffset)
  
  if (typeof speech === 'string' || speech instanceof String) {

    if (speech.indexOf('up') > -1) {
      console.log('Scrolling up...')
      window.scrollTo({
        top: currentHeight - 250,
        behavior: 'smooth'
      })
    } else if (speech.indexOf('down') > -1) {
      console.log('Scrolling down...')
      window.scrollTo({
        top: currentHeight + 250,
        behavior: 'smooth'
      })
    } else if (speech.indexOf('top') > -1) {
      console.log('Scrolling top...')
      window.scrollTo({
        top: 0,
        behavior: 'smooth'
      })
    } else if (speech.indexOf('bottom') > -1) {
      console.log('Scrolling bottom...')
      window.scrollTo({
        top: pageHeight,
        behavior: 'smooth'
      })
    }
  }
}

And with this we should now be able to scroll up, down, to the top or to the bottom of our web page just be aware that the transcriptions are not always a perfect match to what is being said but it can help out with the accessibility in your page. This as you can see, can be easily adapted to follow more instructions and create your own simple web assistant.

You can check the full demo code here.

Video Demo