I am trying to make a chatbot. All the chatbots I have seen are built on structured data. I looked at Rasa, IBM Watson, and other well-known bots. Is there any way to convert unstructured data into some sort of structure that can be used for bot training? Let's consider the paragraph below:
Packaging unit
A packaging unit is used to combine a certain quantity of identical items to form a group. The quantity specified here is then used when printing the item labels so that you do not have to label items individually when the items are not managed by serial number or by batch. You can also specify the dimensions of the packaging unit here and enable and disable them separately for each item.
It is possible to store several EAN numbers per packaging unit since these numbers may differ for each packaging unit even when the packaging units are identical. These settings can be found on the Miscellaneous tab:
There are also two more settings in the system settings that are relevant to mobile data entry:
When creating a new item, the item label should be printed automatically. For this reason, we have added the option ‘Print item label when creating new storage locations’ to the settings. When using mobile data entry devices, every item should be assigned to a storage location, where an item label is subsequently printed that should be applied to the shelf in the warehouse to help identify the item faster.
How can I make a bot from such data? Any lead would be highly appreciated. Thanks!
Will this idea, shown in the picture, work? (Just a thought.)
The data you are showing seems to be a good candidate for a passage search. Basically, you would like to answer user questions with the most relevant paragraph found in your training data. This use case is handled by the Watson Discovery service, which can analyze unstructured data like the text you are providing; you can then query the service with input text, and the service answers with the closest passage found in the data.
From my experience, you can also get good results by implementing your own custom TF/IDF algorithm tailored to your use case (TF/IDF is a nice similarity measure that handles things like stopwords for you).
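As an illustration, here is a minimal sketch of such a TF/IDF passage search using scikit-learn; the paragraphs and query are placeholders taken from your example, and TfidfVectorizer handles tokenization and stopword removal for you:

```python
# Minimal TF/IDF passage search sketch (assumes scikit-learn is installed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder "training data": one entry per documentation paragraph.
paragraphs = [
    "A packaging unit is used to combine a certain quantity of identical items...",
    "It is possible to store several EAN numbers per packaging unit...",
    "When creating a new item, the item label should be printed automatically...",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(paragraphs)

def best_passage(question: str) -> str:
    """Return the paragraph most similar to the user's question."""
    query_vec = vectorizer.transform([question])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    return paragraphs[scores.argmax()]

print(best_passage("How many EAN numbers can a packaging unit have?"))
```

In a real bot you would index every paragraph of your documentation and return the top passage (or the top few) as the answer.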
Now, if your goal is to bootstrap a rule-based chatbot, this kind of data is not ideal. For a rule-based chatbot, the best data would be actual conversations between users asking questions about the target domain and answers from a subject matter expert. Using such data, you might at least be able to do some analysis to pinpoint the relevant topics and domains the chatbot should handle; however, I think you will have a hard time using this data to bootstrap a set of intents (questions the users will ask) for a rule-based chatbot.
TL;DR
If I wanted to use a Watson service, I would start with Watson Discovery. Alternatively, I would implement my own search algorithm, starting with TF/IDF (which maps rather nicely to your proposed solution).
Related
I have seen many e-commerce websites that provide a search box to search products, and most of these search fields offer auto-complete: if you enter a letter in the field, it shows data containing that letter as suggestions from the database. I know the basics of developing that functionality.
But what if the database contains a huge amount of data?
For example, e-commerce websites like Flipkart and Amazon have a lot of products in their databases. So if a user enters a letter in the search field, the site has to search all the data in the database for entries containing that letter, match them, and display them as suggestions. These websites do it in a fraction of a second. I wonder how they achieve that functionality; I can't work out what technologies they are using.
As a learner, I want to know the functional flow and, if possible, see a demo of that feature.
I think your question can be divided into two parts: 1) how to design the database for the search, and 2) how to implement an effective search. Both belong to the field of search engine technology.
About Q1: you can create a table to store the keywords for search, and in that table you should design a column (or a similar mechanism) to describe the "search weight" of each keyword. A view is also a practical way to accelerate access to the data.
About Q2: search engine techniques are no longer mysterious; open-source projects such as Apache Lucene provide the features of a search engine.
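As a rough sketch of the table-plus-weight idea (the table and column names are my own invention, and SQLite stands in for whatever database you use):

```python
# Sketch of a keyword table with a "search weight" column (SQLite stand-in).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE search_keywords (
        keyword TEXT PRIMARY KEY,
        search_weight INTEGER NOT NULL DEFAULT 0   -- e.g. popularity/click count
    )
""")
conn.executemany(
    "INSERT INTO search_keywords VALUES (?, ?)",
    [("iphone", 950), ("iphone case", 400), ("ipad", 700), ("ipod", 120)],
)

def suggest(prefix: str, limit: int = 5) -> list[str]:
    """Return the highest-weighted keywords starting with the prefix."""
    rows = conn.execute(
        "SELECT keyword FROM search_keywords "
        "WHERE keyword LIKE ? ORDER BY search_weight DESC LIMIT ?",
        (prefix + "%", limit),
    )
    return [r[0] for r in rows]

print(suggest("ip"))  # ['iphone', 'ipad', 'iphone case', 'ipod']
```

A prefix index on the keyword column makes the LIKE 'prefix%' lookup fast even on large tables.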
More discussion:
In particular, in your front-end system, e.g. an ASP/JSP or even a simple HTML page, you should use scripts such as Ajax to pop up the drop-down of suggestions. Of course, plain DOM JavaScript plus a DIV can achieve it too, but jQuery or another library makes it easier. Here is an example.
Here is a back-end system example.
To reduce the load on the host and the demands on network bandwidth, the front-end JavaScript should activate the autocomplete feature only after the user has typed more than three characters.
Keep in mind that in a real application your server has limited computing capacity and the client page usually has many elements; both can hurt user-friendliness, so do not make the request and response too complex.
An alternative is to add FIFO logic: keep the most common search keywords in a cache or in a temporary table/view (see the sketch below), so the amount of data to scan is reduced.
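A minimal sketch of that FIFO cache idea (the size and keywords are made up):

```python
# Sketch of a small FIFO cache for frequent autocomplete queries.
from collections import OrderedDict

class FifoSuggestionCache:
    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self._cache: OrderedDict[str, list[str]] = OrderedDict()

    def get(self, prefix):
        return self._cache.get(prefix)

    def put(self, prefix, suggestions):
        if prefix not in self._cache and len(self._cache) >= self.max_size:
            self._cache.popitem(last=False)  # evict the oldest entry (FIFO)
        self._cache[prefix] = suggestions

cache = FifoSuggestionCache(max_size=2)
cache.put("ip", ["iphone", "ipad"])
cache.put("sa", ["samsung"])
cache.put("no", ["nokia"])      # evicts "ip", the oldest entry
print(cache.get("ip"))          # None -> fall back to the database
```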
There are many possible solutions; these are the tricks I can think of at the moment.
Regards
My understanding is that Amazon ASK still does not provide:
The raw user input
An option for a fallback intent
An API to dynamically add possible options from which Alexa can be better informed to select an intent
Is this right or am I missing out on knowing about some critical capabilities?
Actions on Google w/ Dialogflow provides:
raw user input for analysis: request.body.result.resolvedQuery
fallback intents: https://dialogflow.com/docs/intents#fallback_intents
An API to dynamically add user expressions (aka sample utterances): PUT /intents/{id}
These tools give developers the ability to check whether the identified intent is correct and, if not, to fix it.
I know there have been a lot of questions asked previously, just a few here:
How to add slot values dynamically to alexa skill
Can Alexa skill handler receive full user input?
Amazon Alexa dynamic variables for intent
I have far more users on my Alexa skill than my AoG app simply because of Amazon's dominance to date in the market - but their experience falls short of a Google Assistant user experience because of these limitations. I've been waiting for almost a year for new Alexa capabilities here, thinking that after Amazon's guidance to not use AMAZON.LITERAL there would be improvements coming to custom slots. To date it still looks like this old blog post is still the only guidance given. With Google, I dynamically pull in utterance options from a db that are custom for a given user following account linking. By having the user's raw input, I can correct the choice of intent if necessary.
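For reference, here is a rough sketch of that dynamic update against the Dialogflow V1 REST API; the token, intent ID, and exact payload shape are placeholders, so check the V1 docs before relying on it (V2 later changed this API):

```python
# Rough sketch: dynamically adding sample utterances to a Dialogflow V1 intent.
# The token, intent ID, and utterances are placeholders; a PUT replaces the
# intent, so the real payload would need the intent's other fields as well.
import requests

DEV_TOKEN = "YOUR_DEVELOPER_ACCESS_TOKEN"   # placeholder
INTENT_ID = "your-intent-id"                # placeholder

def add_user_expressions(utterances):
    payload = {
        "userSays": [{"data": [{"text": u}]} for u in utterances],
    }
    resp = requests.put(
        f"https://api.dialogflow.com/v1/intents/{INTENT_ID}",
        headers={"Authorization": f"Bearer {DEV_TOKEN}"},
        json=payload,
    )
    resp.raise_for_status()

# e.g. utterances pulled from a per-user database table after account linking
add_user_expressions(["play my morning playlist", "start my workout mix"])
```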
If you've wanted these capabilities but have had to move forward without them, what tricks do you have to get accurate intent handling with Amazon when you don't know what the user will say?
EDIT 11/21/17:
In September, Amazon announced the Alexa Skill Management API (SMAPI), which does provide the 3rd bullet above.
Actually, this would be better as a comment, but I haven't written enough on Stack Overflow yet to be able to comment. I am with you on all of this.
But Amazon's Alexa also has one very big advantage:
The intent schema seems to directly influence the voice-to-text recognition. By the way, can someone confirm whether this is correct?
On Google Home this does not seem to be the case.
So matching unusual names is even more complicated than on Alexa.
And sometimes it just recognizes complete nonsense.
Not sure which I prefer currently.
My feeling is that for small apps Alexa is much better, because it matches the intent phrases better when it has fewer choices.
But with large intent schemas it gets into real trouble, and in my tests some of the intents were not matched correctly at all.
Here Google Home and the Actions SDK probably win: speech-to-text seems to happen first, and then string-pattern matching against the intent schema takes place. So this is probably more robust for larger schemas?
To give something like an answer to your questions:
You can try to add as many things as possible that can be said to a slot, and then match the result from the Alexa request against your database via Jaro-Winkler or some other string distance (see the sketch below).
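Here is a minimal sketch of that matching step; difflib's ratio stands in for Jaro-Winkler (a dedicated implementation such as the jellyfish package could be swapped in), and the product names are made up:

```python
# Sketch: match a heard slot value against known database entries by string
# similarity. difflib's ratio stands in for Jaro-Winkler here; swap in a real
# Jaro-Winkler implementation (e.g. the jellyfish package) if preferred.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(heard: str, known_values: list[str], threshold: float = 0.7):
    """Return the closest known value, or None if nothing is close enough."""
    best = max(known_values, key=lambda v: similarity(heard, v))
    return best if similarity(heard, best) >= threshold else None

# e.g. Alexa heard "tomato" but our database has the full module names
print(best_match("tomato", ["Tomato Soup", "Potato Salad", "Tofu"]))
```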
What I tried for Alexa was to find phrases that are close to what the user says, and I added these as phrases to fill a slot.
So a module on our web page became an intent in the schema, and I then asked the user to say what exactly should be done in that module (this was the slot-filling request); the answer was the slot-filling utterance.
For me that worked slightly better than the regular intent schema, but it requires more talking, so I don't like it as much.
Let me go straight to answering your 3 questions:
1) Alexa does provide the raw input via the slot type AMAZON.LITERAL, but it's now deprecated and you're advised to use AMAZON.SearchQuery for free-form capture. However, if instead of using SearchQuery you define a custom slot type and provide samples (training data), the ASR will work better.
2) Alexa has supported FallbackIntent since, I believe, May 2018. It works by automatically generating a model for your skill in which out-of-domain requests are routed to the fallback intent. It works well.
3) Dynamically adding slot type values is not feasible, since when you provide samples you're really providing training data for a model that will then be able to process similar values beyond the ones you defined. Notice that when you provide a voice interaction model schema, you have to build the model (in this step the training data provided in the samples is used to create the model). For example, if you define a custom slot of type "Car" and provide the samples "Toyota", "Jeep", "Chevrolet" and "Honda", the system will also route to the same intent if the user says "Ford".
Note: SMAPI does allow you to get and update the interaction model, so technically you could download the model via the API, modify it with new training data, upload it again, and rebuild the model. This is kind of awkward, though; a rough sketch of the round trip is below.
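In that sketch, the skill ID, access token, and the "Car" slot type are placeholders, and the endpoint path follows the SMAPI v1 interaction model API:

```python
# Rough sketch of the SMAPI round trip: download the interaction model,
# add slot values, and upload it again. Skill ID and token are placeholders.
import requests

BASE = "https://api.amazonalexa.com/v1/skills"
SKILL_ID = "amzn1.ask.skill.your-skill-id"   # placeholder
TOKEN = "YOUR_LWA_ACCESS_TOKEN"              # placeholder (Login with Amazon)
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
MODEL_URL = f"{BASE}/{SKILL_ID}/stages/development/interactionModel/locales/en-US"

# 1. Download the current model.
model = requests.get(MODEL_URL, headers=HEADERS).json()

# 2. Add new training values to a custom slot type (here a made-up "Car" type).
for slot_type in model["interactionModel"]["languageModel"]["types"]:
    if slot_type["name"] == "Car":
        slot_type["values"].append({"name": {"value": "Ford"}})

# 3. Upload the modified model; Alexa then rebuilds it asynchronously.
resp = requests.put(MODEL_URL, headers=HEADERS, json=model)
resp.raise_for_status()
```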
I feel like I should almost give a synopsis of this lengthy set of questions.
I apologize if all of these questions have been answered in a previous question/answer post, but I have been unable to locate any that specifically address all of the following queries.
This question involves data extraction from the web (i.e. web scraping, data mining, etc.). I have spent almost a year researching these fields and how they can be applied to a certain industry. I have also familiarized myself with PHP and MySQL/phpMyAdmin.
In a nutshell, I am looking for a way to extract information from a site (probably several gigabytes' worth) as quickly and efficiently as possible. I have tried web scraping programs like Scrapy and WebHarvey, and I have also experimented with programs like HTTrack. All have their strengths and weaknesses. I have found that WebHarvey works pretty well, yet it has its limitations when scraping images that are stored in gallery widgets. I also find that many of the sites I am extracting from use other methods that make mining the data a pain; it would take months to extract the data using WebHarvey. Not that I can complain, given that I'd be extracting millions of rows' worth of data exported in CSV format into Excel. But again, images and certain Ajax widgets throw the program off when it tries to extract image files.
So my questions are as follows:
Are there any quicker ways to extract said data?
Is there any way around WebHarvey's image limitations (i.e. only being able to extract one image within a gallery widget, and not being able to follow sub-page links on sites with odd embedding or overly clever coding)?
Are there any ways to bypass site search form parameters that limit the number of search results (i.e. obtaining all business listings within an entire state instead of being limited to one county at a time by the search form's restrictions)?
Also, this is public information, so it cannot be copyrighted; anybody can take it :) (case in point: Feist Publications v. Rural Telephone Service). Extracting information is extracting information. It's legal to extract as long as we are talking about facts/public information.
So with that said, wouldn't the most efficient method (grey area here) of extracting this "public" information, assuming vulnerabilities existed, be through the use of SQL injection? If one were so inclined? :)
As a side question, just how effective is Tor at obscuring one's IP address?
Any help, feedback, suggestions or criticism would be greatly appreciated. I am by no means an expert in any of the above-mentioned fields; I am just a motivated individual with a growing interest in programming and automation who has a lot of crazy ideas. Thank you.
You may be better off writing your own Linux command-line scraping program using either a headless browser library like PhantomJS (JavaScript), or a test framework like Selenium WebDriver (Java).
Once you have your scrape program completed, you can then scale it up by installing it on a cloud server (e.g. Amazon EC2, Linode, Google Compute Engine or Microsoft Azure) and duplicating the server image to as many servers as required.
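For example, a skeleton of such a scrape program in Python with Selenium and headless Chrome (the URL and CSS selectors are placeholders):

```python
# Skeleton of a command-line scraping program using Selenium with headless
# Chrome. The target URL and CSS selectors are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/listings")   # placeholder URL
    # Gallery widgets often load images via JavaScript; a real browser
    # engine renders them, which plain HTTP scrapers cannot do.
    for img in driver.find_elements(By.CSS_SELECTOR, ".gallery img"):
        print(img.get_attribute("src"))
finally:
    driver.quit()
```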
I am looking for a project (application) that makes use of an ontology (for an academic course). Everybody is talking about health-care applications; I want to work on a different project. Any suggestion would help.
I think that (almost) everything can be represented through an ontology. The idea behind it is to embed semantic meaning into the data you are putting into it.
Take Swoogle, for example: it's a search engine that looks into several ontologies to retrieve information.
In the same way you can use it for any purpose:
Tourism: travel information, retrieving meaningful suggestions for your clients
Documents: searching for topics that are related, not just by keywords but by the meaning of those keywords
Shopping store
FAQs
etc
The list goes on: if you can use it as a search engine, you can use it anywhere.
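To make this concrete, here is a tiny tourism-flavored sketch using Python's rdflib; the namespace and the facts are invented:

```python
# Tiny sketch of embedding semantic meaning in data with rdflib.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/tourism#")   # invented namespace
g = Graph()

# Describe a city and a hotel with typed, linked facts.
g.add((EX.Paris, RDF.type, EX.City))
g.add((EX.HotelLumiere, RDF.type, EX.Hotel))
g.add((EX.HotelLumiere, EX.locatedIn, EX.Paris))
g.add((EX.HotelLumiere, EX.pricePerNight, Literal(120)))

# Query by meaning ("hotels located in Paris"), not by keyword.
query = """
    SELECT ?hotel WHERE {
        ?hotel a ex:Hotel ;
               ex:locatedIn ex:Paris .
    }
"""
for row in g.query(query, initNs={"ex": EX}):
    print(row.hotel)
```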
Screen scraping seems like a useful tool - you can go onto someone else's site and steal their data - how wonderful!
But I'm having a hard time with how useful this could be.
Most application data is pretty specific to that application even on the web. For example, let's say I scrape all of the questions and answers off of StackOverflow or all of the results off of Google (assuming this were possible) - I'm left with data that is not very useful unless I either have a competing question and answer site (in which case the stolen data will be immediately obvious) or a competing search engine (in which case, unless I have an algorithm of my own, my data is going to be stale pretty quickly).
So my question is, under what circumstances could the data from one app be useful to some external app? I'm looking for a practical example to illustrate the point.
It's useful when a site publicly provides data that is (still) not available as an XML service. I had a client who used scraping to pull flight tracking data into one of his company's intranet applications.
The technique is also used for research. I had a client who wanted to compare the contents of several online dictionaries by part of speech, and all of these sites had to be scraped.
It is not a technique for "stealing" data. All ordinary usage restrictions apply. Many sites implement CAPTCHA mechanisms to prevent scraping, and it is inappropriate to work around these.
A good example is StackOverflow - no need to scrape data as they've released it under a CC license. Already the community is crunching statistics and creating interesting graphs.
There's a whole bunch of popular mashup examples on ProgrammableWeb. You can even meet up with fellow mashupers (O_o) at events like BarCamps and Hack Days (take a sleeping bag). Have a look at the wealth of information available from Yahoo APIs (particularly Pipes) and see what developers are doing with it.
Don't steal and republish, build something even better with the data - new ways of understanding, searching or exploring it. Always cite your data sources and thank those who helped you. Use it to learn a new language or understand data or help promote the semantic web. Remember it's for fun not profit!
Hope that helps :)
If the site has data that would benefit from being accessible through an API (and it would be free and legal to do so), but they just haven't implemented one yet, screen scraping is a way of essentially creating that functionality for yourself.
Practical example -- screen scraping would allow you to create some sort of mashup that combines information from the entire SO family of sites, since there's currently no API.
Well, to collect data from a mainframe. That's one reason why some people use screen scraping. Mainframes are still in use in the financial world and often it's running software that has been written in the previous century. The people who wrote it might already be retired and since this software is very critical for these organizations, they really hate it when some new code needs to be added. So, screenscraping offers an easy interface to communicate with the mainframe to collect information from the mainframe and then send it onwards to any process that needs this information.
Rewrite the mainframe application, you say? Well, software on mainframes can be very old. I've seen software on mainframes that was over 30 years old, written in COBOL. Often, those applications work just fine and companies don't want to risk rewriting parts because it might break some code that had been working for over 30 years! Don't fix things if they're not broken, please. Of course, additional code could be written but it takes a long time for mainframe code to be used in a production environment. And experienced mainframe developers are hard to find.
I myself had to use screen scraping in a software project, too. This was a scheduling application that had to capture the console output of every child process it started. That's the simplest form of screen scraping, actually; many people don't even realize that redirecting the output of one application to the input of another is still a kind of screen scraping. :)
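In Python, that simplest form looks roughly like this (the child command is just a placeholder):

```python
# Minimal sketch of the simplest form of "screen scraping": capturing the
# console output of a child process. The command is a placeholder.
import subprocess

result = subprocess.run(
    ["df", "-h"],              # placeholder child process
    capture_output=True,
    text=True,
    check=True,
)
for line in result.stdout.splitlines():
    print("captured:", line)
```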
Basically, screen scraping allows you to connect one (web) application with another one. It's often a quick solution, used when other solutions would cost too much time. Everyone hates it, but the amount of time it saves still makes it very efficient.
Let's say you wanted to get scores from a popular sports site that did not offer the information available with an XML feed or API.
For one project we found a (cheap) commercial vendor that offered translation services for a specific file format. The vendor didn't offer an API (it was, after all, a cheap vendor) and instead had a web form to upload and download from.
With hundreds of files a day, the only way to do this was to use WWW::Mechanize in Perl: screen-scrape its way through the login and upload boxes, submit the file, and save the returned file. It's ugly and definitely fragile (if the vendor changes the site in the least, it could break the app), but it works. It's been working now for over a year.
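A rough Python equivalent of that WWW::Mechanize flow, with invented URLs and form fields, might look like:

```python
# Rough Python equivalent of the WWW::Mechanize flow: log in, upload a file,
# save the translated file that comes back. All URLs and field names are
# invented; a real vendor form would need to be inspected first.
import requests

session = requests.Session()
session.post(
    "https://vendor.example.com/login",                  # invented URL
    data={"username": "user", "password": "secret"},     # invented fields
)

with open("input.dat", "rb") as f:
    resp = session.post(
        "https://vendor.example.com/translate",          # invented URL
        files={"upload": f},
    )

with open("output.dat", "wb") as out:
    out.write(resp.content)   # save the returned (translated) file
```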
One example from my experience.
I needed a list of major cities throughout the world with their latitude and longitude for an iPhone app I was building. The app would use that data along with the geolocation feature on the iPhone to show which major city each user of the app was closest to (so as not to show exact location), and plot them on a 3D globe of the earth.
I couldn't easily find an appropriate list in XML/Excel/CSV-type format anywhere, but I did find this Wikipedia page with (roughly) the info I needed. So I wrote up a quick script to scrape that page and load the data into a database.
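The quick script looked roughly like the following sketch; the URL and the assumed column order are placeholders, since the real page's table layout would need checking:

```python
# Sketch of scraping a Wikipedia table of cities into SQLite. The URL and
# the assumed column order (city, latitude, longitude) are guesses.
import sqlite3
import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/List_of_cities").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

conn = sqlite3.connect("cities.db")
conn.execute("CREATE TABLE IF NOT EXISTS cities (name TEXT, lat TEXT, lon TEXT)")

for row in soup.select("table.wikitable tr")[1:]:   # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) >= 3:
        name, lat, lon = cells[0], cells[1], cells[2]  # assumed column order
        conn.execute("INSERT INTO cities VALUES (?, ?, ?)", (name, lat, lon))

conn.commit()
```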
Any time you need a computer to read the data on a website. Screen scraping is useful in exactly the same instances that any website API is useful. Some websites, however, don't have the resources to create an API themselves; screen scraping is the developer's way around that.
For instance, in the earlier days of Stack Overflow, someone built a tool to track changes to your reputation over time, before Stack Overflow itself provided that feature. The only way to do that, since Stack Overflow has no API, was to screen scrape.
The obvious case is when a web service doesn't offer reverse search. You can implement that reverse search over the same data set, but it requires scraping the entire dataset.
This may be fair use if the reverse search also requires significant pre-processing, e.g. because you need to support partial matching. The data source may not have the technical skills or computing resources to provide the reverse search option.
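As a toy sketch of that pre-processing, here is a reverse index with prefix (partial) matching over an invented dataset:

```python
# Toy sketch: build a reverse index over a scraped dataset so that partial
# terms map back to records. The records here are invented.
from collections import defaultdict

records = {
    1: "blue cotton shirt",
    2: "red cotton dress",
    3: "blue denim jeans",
}

# Index every prefix of every word so partial matches are supported.
index = defaultdict(set)
for rec_id, text in records.items():
    for word in text.split():
        for i in range(1, len(word) + 1):
            index[word[:i]].add(rec_id)

def reverse_search(term: str) -> list[str]:
    return [records[i] for i in sorted(index.get(term.lower(), set()))]

print(reverse_search("cot"))   # ['blue cotton shirt', 'red cotton dress']
```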
I use screen scraping daily. I run some e-commerce sites and have screen-scraping scripts running daily to gather product lists automatically from my suppliers' wholesale sites. This lets me keep up-to-date information on all the products available to me from several suppliers and flag non-economical margins caused by price changes.