I have implemented a skill that plays short SSML audio clips.
However, a few of the audio clips contain Alexa phrase suggestions.
One of the clips includes the phrase 'Alexa, stop'.
To my surprise, it looks as if Alexa 'listens' to itself in this scenario: the skill then exits instead of following the intended workflow.
Is there anything I can do about this?
You can try to mangle the word "Alexa" until it's no longer recognized, but it's possible that this will change in the future and the word will start to be recognized again. Generally, the suggestion is not to use the word "Alexa" in a response at all. What is your goal for including this phrase?
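If you do try mangling it, one possible sketch (untested; the IPA string below is my own guess, and you'd have to experiment on a device to find a variant the wake-word detector ignores) is to shift the pronunciation with the SSML phoneme tag:

<speak>
    To leave the skill, just say
    <phoneme alphabet="ipa" ph="ɑˈlɛksɑ">Alexa</phoneme>, stop.
</speak>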
I am working on an Alexa Skill and am having trouble getting Alexa to understand my voice input. The utterances are therefore not properly matched with the required slots, and Alexa keeps re-asking or getting stuck.
Here are some examples:
affirm: f.m., a from
Speedbird: Speedboard, speaker, speed but, speed bird, spirit, speedbath
wind: windies (wind is), when is home (wind is calm)
runway 03: runway sarah three
takeoff: the cough
Is there any way to train Alexa to properly understand me? Or should I just add all these "false" utterances as samples so Alexa will match my intents properly?
Thanks for any help!
There is no way to train Alexa's underlying language understanding yourself.
Yes, as you wrote: I would just add these false utterances as matches for your intent.
This also seems to be what Amazon recommends:
...might show you that sometimes Alexa misunderstands the word "mocha" as "milk." To mitigate this issue, you can map an utterance directly to an Alexa intent to help improve Alexa's understanding within your skill. ...
...two common ways to improve ASR accuracy are to map an intent value or a slot value to a failing utterance...
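Concretely, mapping the failing utterances onto an intent just means adding them as extra samples in the interaction model. A sketch, with a hypothetical intent name:

{
  "name": "TakeoffIntent",
  "samples": [
    "request takeoff",
    "request the cough",
    "speedbird requesting takeoff",
    "speedboard requesting takeoff"
  ]
}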
You might also ask another person to try it, to see whether their speech is recognized the same way as yours.
Word-Only Slots
If you're still struggling with this, try adding more variations to your slot values (synonyms are an option if you have specific misinterpretations that keep repeating). Consider adding synonyms like speed bird for Speedbird (and take off for takeoff). Non-standard words in slots will not resolve as accurately as common words; by breaking Speedbird into two words, Alexa should recognize the slot more reliably. Information about synonyms is here:
https://developer.amazon.com/en-US/docs/alexa/custom-skills/define-synonyms-and-ids-for-slot-type-values-entity-resolution.html#sample-slot-type-definition-and-intentrequest
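As a sketch, a slot type with those synonyms might look like this in the interaction model (the type name CallsignType is hypothetical):

{
  "name": "CallsignType",
  "values": [
    {
      "id": "SPEEDBIRD",
      "name": {
        "value": "Speedbird",
        "synonyms": ["speed bird", "speedboard", "speed but"]
      }
    }
  ]
}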
Once you've done this, you'll want to grab the canonical value of the slot, not the interpreted value (e.g. you want Speedbird not speedboard).
To see an example, scroll to the very last JSON code block. The scenario described in this request is that the user said the word track, which is a synonym for the slot value song. You'll see the MediaType value is track (what the user said), but if you look at the resolutions object, inside the values array, the first value object is the actual slot value song (what you want) associated with the synonym.
This StackOverflow answer goes into a little more detail on how to get that value:
How do I get the canonical slot value out of an Alexa request
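As a rough sketch of that lookup in a Node.js handler (the slot name and the single-authority assumption are mine):

// Prefer the canonical (resolved) value; fall back to the raw transcription.
const slot = handlerInput.requestEnvelope.request.intent.slots.Callsign;
const authority = slot.resolutions
  && slot.resolutions.resolutionsPerAuthority
  && slot.resolutions.resolutionsPerAuthority[0];
const value = (authority && authority.status.code === 'ER_SUCCESS_MATCH')
  ? authority.values[0].value.name   // canonical: "Speedbird"
  : slot.value;                      // raw: possibly "speedboard"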
Word and Number Slots
In the case of the "runway 03" example, consider breaking this into two different slots, e.g. {RunwaySlot : Custom} {Number : AMAZON.NUMBER}. You'll have better luck with these more complex slots. The same is true for an example like "red airplane": you'll want to break it into two slots, {Color : AMAZON.Color} {VehicleSlot : Custom}.
https://developer.amazon.com/en-US/docs/alexa/custom-skills/slot-type-reference.html#number
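A sketch of what that two-slot intent might look like in the interaction model (the intent, slot, and type names are hypothetical):

{
  "name": "RunwayIntent",
  "slots": [
    { "name": "RunwaySlot", "type": "RunwayType" },
    { "name": "Number", "type": "AMAZON.NUMBER" }
  ],
  "samples": [
    "cleared for takeoff {RunwaySlot} {Number}",
    "line up {RunwaySlot} {Number}"
  ]
}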
I have a skill in Alexa, Cortana, and Google, and in each case there is a concept of terminating the flow after speaking the result or keeping the mic open to continue the conversation. The skill mostly consists of an HTTP API call that returns the information to speak and display, plus a flag indicating whether to continue the conversation or not.
In Alexa, the flag returned from the API call and passed to Alexa is called shouldEndSession. In Google Assistant, the flag is expect_user_response.
So in my code folder, the API is called from the JavaScript file and returns a JSON object containing three elements: speech (the text to speak, possibly SSML); displayText (the text to display to the user); and shouldEndSession (true or false).
The action calls the JavaScript code with type Search and a collect segment. It then outputs the JSON object mentioned above. This all works fine, except I don't know how to handle shouldEndSession. Is this done in the action, perhaps with the validate segment?
For example, "Hi Bixby, ask car repair about changing my tires" would respond with the answer and be done. But something like "Hi Bixby, ask car repair about replacing my alternator". In this case, the response may be "I need to know what model car you have. What model car?". The user would then say "Toyota" and then Bixby would complete the dialog with the answer or maybe ask for more info.
I'd appreciate some pointers.
Thanks
I think this can easily be done in Bixby with an input prompt when a required input is missing. You can also build an input-view to enhance the user experience.
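As a sketch only (the model names are hypothetical and I haven't run this), the idea in the action definition is that any input marked Required that the utterance didn't supply makes Bixby prompt for it and keep the conversation open, which replaces the explicit shouldEndSession flag:

action (GetRepairInfo) {
  type (Search)
  collect {
    input (repairTopic) {
      type (RepairTopic)
      min (Required) max (One)
    }
    input (carModel) {
      type (CarModel)
      // Missing from "replace my alternator", so Bixby prompts
      // "What model car?" and the mic stays open.
      min (Required) max (One)
    }
  }
  output (RepairInfo)
}

An input-view on CarModel would then control how that prompt is presented.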
To start building the capsule, I would suggest the following:
Learn more about Bixby on https://bixbydevelopers.com/dev/docs/dev-guide
Try some sample capsules and watch some tutorial videos on https://bixbydevelopers.com/dev/docs/sample-capsules
If you have a Bixby enabled Samsung device, check our marketplace for ideas and inspirations.
I have a handler that is invoked when the skill receives an intent request named HelloWorldIntent, and I want HelloWorldIntent's utterances to include the crying sound of a baby. How do I put the crying sound in the utterances?
You cannot trigger Alexa without the wake word (generally "Alexa") unless you are already in a skill session. In both cases, intents are mapped according to the utterances given in the interaction model.
Unless you can convert the "baby-crying" sound into words, you won't be able to trigger a custom intent with it. Also, as of now there is no option to upload sample utterances as audio files.
One thing you can try is AMAZON.FallbackIntent, which gets triggered when Alexa is unable to find a proper intent match. When you are in a skill session and make a "baby-crying" sound, AMAZON.FallbackIntent might get triggered, but there is no guarantee.
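A minimal sketch of such a handler with the ASK SDK v2 for Node.js (the response wording is mine):

const FallbackIntentHandler = {
  canHandle(handlerInput) {
    const { request } = handlerInput.requestEnvelope;
    return request.type === 'IntentRequest'
      && request.intent.name === 'AMAZON.FallbackIntent';
  },
  handle(handlerInput) {
    // The unmatched sound may (or may not) have been the baby crying.
    return handlerInput.responseBuilder
      .speak("Sorry, I didn't catch that. Can you say it in words?")
      .reprompt('What would you like to do?')
      .getResponse();
  },
};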
Responding with custom sounds
If you want to respond with a "baby-crying" sound, you have to use SSML to add an audio source to your response. You can add an mp3 source of "baby crying" in an audio tag like this:
<speak>
The baby is about to cry.
<audio src='https://yoursoundsource.com/path/to/baby_crying.mp3'/>
</speak>
Alexa Skills Kit Sound Library
Luckily for you, there is a built-in library of sounds for Alexa, and a "baby crying" sound is already there, so you don't have to upload one. Just use the audio source in your response SSML.
The following sounds are listed under Human Sounds.
baby big cry (1)
<audio src='soundbank://soundlibrary/human/amzn_sfx_baby_big_cry_01'/>
baby cry (1)
<audio src='soundbank://soundlibrary/human/amzn_sfx_baby_cry_01'/>
baby cry (2)
<audio src='soundbank://soundlibrary/human/amzn_sfx_baby_cry_02'/>
baby fuss (1)
<audio src='soundbank://soundlibrary/human/amzn_sfx_baby_fuss_01'/>
In case you want to upload your own, make sure that your audio file meets the criteria.
More on the SSML audio tag here.
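For instance, a handler could return one of these library clips directly (a sketch with the ASK SDK v2 for Node.js; the surrounding sentence is mine, and speak() wraps the SSML in <speak> tags automatically):

return handlerInput.responseBuilder
  .speak("<audio src='soundbank://soundlibrary/human/amzn_sfx_baby_cry_01'/> Sounds like the baby is crying.")
  .getResponse();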
I am developing an Alexa skill where I have a slot for names of fruits. However, if I say something like "What is apple's cost", where the slot value has an apostrophe, Alexa does not seem to recognize the apostrophe. A workaround is to say something like "What is the cost of an apple", but that would not be the best customer experience.
How can I make Alexa understand slot values with apostrophes? Any help is appreciated.
I think this is what you are looking for.
Create Intents, Utterances, and Slots (Rules for Sample Utterances)
If the word for a slot value may have apostrophes indicating the possessive, or any other similar punctuation (such as periods or hyphens), include those within the brackets defining the slot. Do not add 's after the closing bracket. For example: ...
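In other words, with a hypothetical {Fruit} slot whose type values themselves carry the possessive (apple's, banana's):

what is {Fruit} cost       (correct: the apostrophe lives inside the slot values)
what is {Fruit}'s cost     (wrong: do not add 's after the closing bracket)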
My friend, whether the apostrophe gets parsed depends on the voice recognition system internally, but you should not count on receiving one in real time.
Good news though: you don't need the apostrophe. Think about it: Alexa only returns what the customer said, without capital letters or special characters. So if the customer says "What is apple's cost", Alexa will recognize it as "what is apples cost". This is a problem best handled server-side, because you only need to understand what the customer meant. You could implement a server-side string-matching function using the Levenshtein algorithm.
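A sketch of that matching step in plain Node.js (the fruit list and function names are mine):

// Classic dynamic-programming Levenshtein distance.
function levenshtein(a, b) {
  const d = Array.from({ length: a.length + 1 },
    (_, i) => [i, ...new Array(b.length).fill(0)]);
  for (let j = 1; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,                                    // deletion
        d[i][j - 1] + 1,                                    // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)   // substitution
      );
    }
  }
  return d[a.length][b.length];
}

// Map the heard value ("apples") to the closest known fruit ("apple").
function closestFruit(heard, fruits) {
  return fruits.reduce((best, fruit) =>
    levenshtein(heard, fruit) < levenshtein(heard, best) ? fruit : best);
}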
Greetings, StackOverflow community,
Is it possible to take what a user says or enters (like the numbers 1 - 9) and, instead of the text-to-speech engine reading the numbers back to the user, play a prerecorded audio clip so it sounds like our voiceover person instead of the robot?
Can you do this dynamically based on what the user inputs?
All I'm really asking for is a prod in the correct direction of how to start figuring this out.
You can. A long time ago I wrote logic that takes the desired phrase and a list of available clips and finds the largest segments (clips often contained multiple phrases) that could be used to assemble the audio. It tends to sound very choppy, but it is possible if you have enough prerecorded audio. In my case the content was in a niche, and 95% coverage could be achieved with only a couple thousand recordings.
In the end, it was just basic search logic to find clips. If you do this at the word level, you could name each clip after the word, split the input, and generate the audio tags: <audio src='the.wav'/><audio src='quick.wav'/><audio src='brown.wav'/><audio src='fox.wav'/>...
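A sketch of that word-level assembly in JavaScript (the clip host is a hypothetical placeholder; note that Alexa's SSML audio tag requires MP3 rather than WAV):

const CLIP_BASE = 'https://example.com/clips';  // hypothetical host

// Turn "the quick brown fox" into a chain of <audio> tags inside <speak>.
function phraseToSsml(phrase) {
  const tags = phrase
    .toLowerCase()
    .split(/\s+/)
    .map(word => `<audio src='${CLIP_BASE}/${word}.mp3'/>`)
    .join('');
  return `<speak>${tags}</speak>`;
}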