Difference between Forced Glossary and Parallel Corpus in IBM Watson Language Translator - ibm-watson

I've read the documentation about the possibilities of using a forced glossary and a parallel corpus, but I can't quite understand the difference. From the examples provided in the Watson Language Translator docs, I see that we use a forced glossary when we want to control how specific terms and phrases are translated between languages. With a parallel corpus, however, we provide whole translated sentences in the other language.
So is a forced glossary only meant for small amounts of data, or is there more to it?

A forced glossary acts like a simple lookup, so that a word or phrase, e.g. StackOverflow, doesn't get translated by the service. Instead, a simple lookup-and-replace action is performed. No machine learning is involved.
A parallel corpus is used as additional training data to teach the service how to translate niche or industry-specific terminology and language. The translations are still performed by the machine learning service; it just has an improved understanding of how to perform them.
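If it helps to see the distinction in code, here is a minimal sketch assuming the ibm-watson Python SDK (LanguageTranslatorV3); the credentials, service URL, and TMX file names are placeholders. The only difference between the two customizations is which parameter the training file is passed to:

    from ibm_watson import LanguageTranslatorV3
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    translator = LanguageTranslatorV3(
        version='2018-05-01',
        authenticator=IAMAuthenticator('YOUR_APIKEY'))  # placeholder credentials
    translator.set_service_url('YOUR_SERVICE_URL')      # placeholder URL

    # Forced glossary: a small file of exact term replacements, applied as a lookup.
    with open('glossary.tmx', 'rb') as glossary:
        translator.create_model(base_model_id='en-es',
                                forced_glossary=glossary,
                                name='en-es-glossary').get_result()

    # Parallel corpus: full sentence pairs used as training data to adapt the model.
    with open('corpus.tmx', 'rb') as corpus:
        translator.create_model(base_model_id='en-es',
                                parallel_corpus=corpus,
                                name='en-es-corpus').get_result()

So yes, the glossary is meant to be a relatively small list of exact term mappings, while the parallel corpus should contain enough full sentence pairs for the service to actually learn from; check the service documentation for the current size limits on each.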

Related

comments in multilingual support

I am making a website application to support factory automation, which will have users from various countries who know different languages. I have internationalized all the strings in the website so it is understandable by all users. However, users have to write comments on the website related to factory operations, which they will write in their own languages, and these may not be understandable by users in other countries.
I wanted to know what the best practices are to help with this scenario.
One way I was thinking of is to not let users write free-form comments; rather, I provide possible comments in a drop-down which they can select, and I can internationalize those options. But this is not an elegant solution, since the 'possible comments' may not be comprehensive.
There isn't really a solid no-fail solution available for this kind of problem, but here are some possibilities:
Leverage a translation engine and computer-translate the comments. How well this works depends on the engine used and the language, but it gives the reader a gist of the meaning. This solution loses much of its usefulness when a lot of technical or proprietary terms are used. A lot of international webshops actually use this technique (see the sketch after this list).
Encourage your users to post comments in a common language, or a language that most of your users will know, like English, Chinese, or Spanish, depending on your markets.
Employ translators to regularly translate essential comments.
The solution you mentioned is also pretty decent when the possible text is limited; otherwise it will spin out of control very fast.
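For the first option, the translation call itself is usually a one-liner against whichever engine you pick. A minimal sketch, assuming the ibm-watson Python SDK from the question above (credentials, service URL, and the example comment are placeholders):

    from ibm_watson import LanguageTranslatorV3
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    translator = LanguageTranslatorV3(
        version='2018-05-01',
        authenticator=IAMAuthenticator('YOUR_APIKEY'))  # placeholder credentials
    translator.set_service_url('YOUR_SERVICE_URL')      # placeholder URL

    # Machine-translate a user comment on the fly and show the gist to the reader.
    comment = 'Maschine 3 braucht neue Filter.'
    result = translator.translate(text=comment, model_id='de-en').get_result()
    print(result['translations'][0]['translation'])

Storing the original comment alongside the machine translation keeps the gist available without losing the author's exact wording.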

NLP libraries for simple POS tagging

I'm a student who's working on a summer project in NLP. I'm fairly new to the field, so I apologize if there's a really obvious solution. The project is in C, both due to my familiarity with it and due to the computationally intensive nature of the project (my corpus is a plaintext dump of Wikipedia).
I'm working on an approach to relationship extraction, exploiting the consistency principle to try to learn (to within some error threshold) a set of rules dictating which clusters of grammar objects imply a connection between those objects.
One of the first steps in the algorithm involves finding the set of all possible grammar objects a given word can refer to (POS disambiguation is done implicitly by the algorithm at a later step). I've looked at several parsers, but they all seem to do the disambiguation step themselves, which (from my end) is counterproductive. I'm looking for something off the shelf that (ideally) gives me a one-command way to turn up this information.
Does such a thing exist? If not, is there an existent dictionary containing this information that's trivially machine parseable?
Thank you for your help.
Look at CMU Sphinx, an open source NLP project. I think it's in C++, but you can integrate it or at least get an idea of how to go about things.
What about calling an external POS tagger as a shell script, or wrapping it in an HTTP service if you feel frisky?
Java and Python have the vast majority of NLP libraries, so it makes sense to take advantage of that. If you can use NLTK in a script to tag text and call that script from C, it makes things much easier.
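To make that concrete, here is a minimal Python sketch (assuming NLTK with its Brown corpus and the universal tagset downloaded) that builds the set of every tag a word has been observed with, i.e. the "all possible grammar objects for a word" lookup rather than a disambiguated tag; the C side could call it over a pipe or wrap it in a small HTTP service:

    import sys
    from collections import defaultdict

    import nltk
    from nltk.corpus import brown

    # One-time downloads (assumption: the resources are not already present).
    nltk.download('brown', quiet=True)
    nltk.download('universal_tagset', quiet=True)

    # Map each word form to every POS tag it appears with in the corpus:
    # no disambiguation, just the set of possibilities.
    tags_by_word = defaultdict(set)
    for word, tag in brown.tagged_words(tagset='universal'):
        tags_by_word[word.lower()].add(tag)

    if __name__ == '__main__':
        for w in sys.argv[1:]:
            print(w, sorted(tags_by_word.get(w.lower(), [])))

Run as python possible_tags.py run, it would print something like run ['NOUN', 'VERB'], which is trivial for C code to read back from a pipe.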

What is the best approach of creating a talking bot?

When creating an AI talking bot, what kind of design methods should I use? Should it be one function or multiple modules, and should it have classes?
Understanding language is complicated, so the goal you need to determine first is what aspect of language you want to understand.
An AI must be able to understand what the person says to it, then relate it to what it already knows, and then generate a legitimate response.
These three steps can all be thought of as nearly independent, so you need to address each on its own.
The brain, the world's best language processor, uses a Neural Network, but that's not likely to work well for you.
A logic-based proof-solving system, where facts that follow from other facts are derived, would probably work best, and I know of at least one system that uses this approach fairly effectively.
I'd start with an existing AI program (like the famous Eliza) and run its output through a speech synthesizer.
Some source for Eliza is available here. One open source speech synthesizer is FreeTTS.
If you're using a language other than Java, there are similar candidate AI bots and text-to-speech programs out there.
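To make the "start with something Eliza-like" suggestion concrete, here is a minimal Python sketch of the pattern-and-response idea behind such bots; the rules are invented for illustration and are not taken from the original Eliza script:

    import random
    import re

    # A tiny, hypothetical Eliza-style rule set: regex pattern -> canned responses.
    # Captured groups can be echoed back into the reply.
    RULES = [
        (r'\bI need (.+)', ['Why do you need {0}?', 'Would it really help to get {0}?']),
        (r'\bI am (.+)',   ['How long have you been {0}?', 'Why do you think you are {0}?']),
        (r'\bbecause\b',   ['Is that the real reason?']),
    ]
    DEFAULT = ['Please tell me more.', 'I see. Go on.']

    def reply(text):
        for pattern, responses in RULES:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                return random.choice(responses).format(*match.groups())
        return random.choice(DEFAULT)

    if __name__ == '__main__':
        while True:
            line = input('> ')
            if line.lower() in ('quit', 'bye'):
                break
            print(reply(line))

The output of a loop like this can then be piped into a speech synthesizer such as FreeTTS, or any text-to-speech tool available in your language of choice.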
I've started to do some work in this space using this open source project called Talkify:
https://github.com/manthanhd/talkify
It is a bot framework intended to help orchestrate the flow of information between bot providers like Microsoft (Skype), Facebook (Messenger), etc. and your backend services. The framework doesn't really provide implementations for the bot providers yet, but it does provide hooks into its natural language recognition engine.
The built-in natural language recognition library can be used to classify sentences into topics, which you can then map to skill functions.
Give it a try! I'd really like people's input to see how it can be improved.

Lemon power or not?

For grammar parsing, I used to "play" with Bison, which has its pros and cons.
Last week, I noticed on the SQLite site that the engine is built with another parser generator: Lemon.
It sounds great after reading the thin documentation.
Do you have any feedback about this parser?
I cannot really find pertinent information on Google or Wikipedia (just a few examples and the same tutorials), and it doesn't seem very popular (there is no lemon tag on Stack Overflow [ed: there is now :P]).
Reasons we are using Lemon in our firmware project are:
Small size of generated code and memory footprint. It produces the smallest parser I found (I compared parsers of similar complexity generated by flex, bison, ANTLR, and Lemon);
Excellent support for embedded systems: Lemon doesn't depend on the standard library, you can specify external memory management functions, and debug logging is removable.
Public domain license. There is a separate fork of Lemon licensed under GPLv2 that is not suitable for our needs because of its viral license, so we take the latest SQLite sources and compile Lemon out of them (it consists of only two files);
Pull parsing. It makes the code more straightforward to understand and maintain than Flex/Bison parsing code; thread safety is an additional bonus I appreciate.
Simple integration with tokenizers. The nature of our project requires tokenizing a binary stream with variable token sizes. It was quite easy to implement a tokenizer and integrate it with a parser API of only three functions and one feedback context variable. We investigated ways of integrating Lemon with re2c and Ragel and found them also quite easy to implement.
Very simple syntax that is fast to learn.
Lemon explicitly separates development of the tokenizer and the parser. My development flow starts with designing the parser grammar; at this first stage I can already check complex rules by feeding a token sequence directly through several Parser(...) calls. The tokenizer is implemented afterwards.
Of course, Lemon is not a silver bullet; it has a limited area of application. Among the disadvantages:
Lemon requires you to write more rules than Bison because of its simplified syntax: no repetitions or optionals, one action per rule, etc.
The complete set of LALR(1) parser limitations.
Only the C language.
Weigh the pros and cons before making your choice. I've done mine ;-)
Interesting find! I haven't actually used it, so the commentary is based on reading the documentation.
The redesign so that the lexical analysis is done separately from the parsing immediately seems to have merit. In particular, it has the potential to simplify operations such as handling multiple or nested source files. The Lex-based yywrap() mechanism is less than ideal. That it avoids all global variables and has careful memory allocation and deallocation control should count in its favour (that it allows the choice of allocator and deallocator greatly helps too - at least for the environments where I work, where memory allocation is always an issue).
The rethinking on how the rules are organized and how the terminals are identified is a good idea.
All in all, it looks like a well thought out redesign of Bison.
It is in the public domain according to the referenced web pages.

What programming language is used to IMPLEMENT google algorithm?

It is known that Google has the best searching and indexing algorithms.
They also have good relevancy.
They are also quick at getting the latest results in.
All that's fine.
What programming language (C, C++, Java, etc.) and database (Oracle, MySQL, etc.) have they used in achieving this (since they have to manipulate large volumes of data quickly and effectively)?
Though I'm not looking for their in-depth architecture (in case that violates their company policies), an overview of all such things could be useful.
Could anybody please add their valuable suggestions and insight on this?
Google internally uses C++, Java, and Python. See Rhino on Rails:
One of the (hundreds of) cool things about working for Google is that they let teams experiment, as long as it's done within certain broad and well-defined boundaries. One of the fences in this big playground is your choice of programming language. You have to play inside the fence defined by C++, Java, Python, and JavaScript.
Google's search algorithm is essentially MapReduce, which stems from functional programming techniques, implemented in C++.
Google has its own storage mechanism for this called the Google File System.
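For readers who have not met the model before, here is a minimal, purely illustrative Python sketch of the map/reduce programming idea (the classic word count); it has nothing to do with Google's actual C++ implementation, where the map and reduce steps run in parallel across many machines:

    from collections import defaultdict

    # Map phase: turn each document into (key, value) pairs.
    def map_words(document):
        for word in document.split():
            yield (word.lower(), 1)

    # Shuffle phase: group the emitted values by key.
    def shuffle(pairs):
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    # Reduce phase: combine the values collected for each key.
    def reduce_counts(grouped):
        return {key: sum(values) for key, values in grouped.items()}

    if __name__ == '__main__':
        documents = ['the quick brown fox', 'the lazy dog', 'the fox']
        pairs = (pair for doc in documents for pair in map_words(doc))
        print(reduce_counts(shuffle(pairs)))  # {'the': 3, 'quick': 1, ...}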
Mainly pigeons:
PigeonRank's success relies primarily on the superior trainability of the domestic pigeon (Columba livia) and its unique capacity to recognize objects regardless of spatial orientation. The common gray pigeon can easily distinguish among items displaying only the minutest differences, an ability that enables it to select relevant web sites from among thousands of similar pages.
The relevance of search results is governed by the quality of the information retrieval algorithms they use, not by the programming language.
But C++ is what most of their backend code is written in (for most services).
They don't use any off-the-shelf RDBMS products for data storage. All of that is written in-house.
Check out Bigtable.
