How to detect if a text is in a given language? - spam-prevention

I have a kind of Q&A site (very approximately) where users enter questions to be answered by our staff. I am quite concerned about users posting non-questions, which are an annoyance. The best idea I've had so far is a system to detect whether the text is in Italian (our users' language), and if it is, to check that it isn't a copypasta by matching it against a list of common copypastas.
So, long story short: users will input some text, I have to make sure it's a proper question in Italian and not random characters.

Not sure what language you'll be writing this in, but checking whether the input string (the question) contains any forbidden word would be one way to go at it. For substring checks see, for example, http://www.easywayserver.com/blog/java-string-contains-example/ (Java) or the Stack Overflow question "How do I check if a string contains a specific word in PHP?".
Pseudo code:

forbiddenWords = list of common copypastas

if language(input) is Italian:
    if input contains none of forbiddenWords:
        // it's fine
    else:
        // don't spam
else:
    // you're not Italian
I'm not quite sure what the best way is to check whether a string is written in a specific language, though; a sketch of one option follows.
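If you end up doing this in Python, here is a minimal sketch of the whole check, assuming the third-party langdetect package (pip install langdetect); the forbidden-phrase list is a hypothetical placeholder:

# a minimal sketch, assuming langdetect (pip install langdetect);
# FORBIDDEN_PHRASES is a hypothetical placeholder list
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

FORBIDDEN_PHRASES = ["copypasta one", "copypasta two"]  # hypothetical

def is_acceptable_question(text):
    try:
        language = detect(text)  # ISO 639-1 code, e.g. "it" for Italian
    except LangDetectException:
        return False  # empty or featureless input cannot be detected
    if language != "it":
        return False  # "you're not Italian"
    lowered = text.lower()
    return not any(phrase in lowered for phrase in FORBIDDEN_PHRASES)

Note that statistical detectors are unreliable on very short strings, so you may want to require a minimum length before trusting the verdict.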

You can use Rosoka's language detection if you want a commercial option.
You can try it out on Rosoka Cloud for about $1/hour with all of the features. The language ID is available as a standalone library, so you can feed it example inputs you are concerned about and see whether it gives back what you want.
Random text like "jgujqkwfjpihoujlkfa" will be flagged as ROMANIZATION, or with a tag based on the underlying code blocks that were used if it is non-ASCII. In other words, input that is not a language will not be tagged as a language.

There are many free language detection libraries. One popular example is libexttextcat from LibreOffice. There are many clones and ports and variants if you don't want a C library; see e.g. http://odur.let.rug.nl/vannoord/TextCat/competitors.html for an (incomplete, slightly dated) list of pointers.

A similar question was asked here a while ago and the answers listed a number of language-detection API solutions. One of the answers points to detectlanguage.com, which offers a limited free language-detection service.
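A hedged sketch of calling such a service over plain HTTP with Python's requests; the endpoint, parameter names, and response shape below are assumptions to verify against detectlanguage.com's current documentation (they also publish official client libraries):

# a hedged sketch; endpoint, parameters, and response shape are assumptions
# to check against the current detectlanguage.com API docs
import requests

API_KEY = "your-api-key"  # hypothetical placeholder

def detect_language(text):
    response = requests.post(
        "https://ws.detectlanguage.com/0.2/detect",
        data={"q": text},
        headers={"Authorization": "Bearer " + API_KEY},
    )
    response.raise_for_status()
    detections = response.json()["data"]["detections"]
    return detections[0]["language"] if detections else None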

Related

how to set language in PayPal ExpressCheckout

I'm working on a multi-lingual online store for customers in Ireland who speak English, Latvian and Russian.
PayPal is available in English and Russian (and others) but not Latvian.
I would like to have my form sent to PayPal setting it to show in English by default, or Russian if the customer is reading the store in Latvian or Russian.
The problem is that the API code for this, LOCALECODE, requires both the country and the language. So for example, ie_EN would be Ireland-localised English (which PayPal doesn't support), ie_LV would be Ireland-localised Latvian (again not supported) and ie_RU would be Ireland-localised Russian (again, etc).
Is there a generic way of saying "just use the language, please", without needing to hard-code a list of available languages?
You used to be able to send LOCALECODE=EN, but that has now been superseded by the more traditional LOCALECODE=en_US.
If it's just two languages you're worried about, I'm not sure what the problem is with a simple if-statement to set the correct language?
I don't know whether you have already figured this issue out, but you can do the following: append "locale.x=<language value allowed by PayPal>" to the URL.
For example: https://www.paypal.com/cgi-bin/webscr?locale.x=zh_HK
I followed this approach and it works.
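If you want the "just use the language" behavior without hard-coding it at every call site, a small lookup table with a fallback is enough. A minimal sketch in Python (the mapping is illustrative, not PayPal's official list of supported locales):

# a minimal sketch of mapping the store language to a PayPal locale;
# the mapping is illustrative, not PayPal's official list
SUPPORTED = {
    "en": "en_US",  # English store -> English checkout
    "lv": "ru_RU",  # Latvian isn't supported, fall back to Russian
    "ru": "ru_RU",
}
DEFAULT_LOCALE = "en_US"

def paypal_locale(store_language):
    return SUPPORTED.get(store_language, DEFAULT_LOCALE)

url = "https://www.paypal.com/cgi-bin/webscr?locale.x=" + paypal_locale("lv")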

Autocompletion in joe or jed editor

I want to use joe or jed to code in C language. Any way to have autocompletion?
joe and jed are two completely different programs, but the answer is "not really" for both:
jed: theoretically yes, but practically no. The home page claims that jed is "extensible in the C-like S-Lang language making the editor completely customizable". You might be able to hack in autocompletion this way, but it'd probably be more work than learning to use a better text editor.
joe's documentation claims that you can "complete word in edit buffer by hitting ESC Enter (uses other words in buffer for dictionary)". Which is something, but it's not full autocompletion.
OK, in the last few hours I have learned the following (for jed only):
Looking at Guido Gonzato's excellent Quick Reference Guide, we find:
Completion is the automatic expansion of partially-typed words. If your text contains the word 'frobnication', typing 'fro' followed by M-/ (M-X dabbrev) will either expand the whole word, or cycle among all possible completions.
It works for previously typed keywords.
Better autocompletion can be obtained through complete mode; for that you will need to install modes from Jedmodes.

How to automatically excerpt user generated content?

I run a website that allows users to write blog posts. I would really like to summarize the written content and use it to fill the <meta name="description" .../> tag, for example.
What methods can I employ to automatically summarize/describe the contents of user generated content?
Are there any (preferably free) methods out there that have solved this problem?
(I've seen other websites just copy the first 100 or so words but this strikes me as a sub-optimal solution.)
Think of the task of summarization as a challenge to 'select the most important sentences' from the document.
The Automatic Creation of Literature Abstracts by H.P. Luhn (1958) describes a naive method that actually performs quite well; try giving it a shot.
If your website is in Python, coding this algorithm using NLTK (the Natural Language Toolkit) is a fun task; a simplified sketch follows.
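A simplified sketch of Luhn's idea with NLTK: rank sentences by how many frequent, non-stopword ("significant") words they contain. It assumes pip install nltk plus the punkt and stopwords data downloads; the top-10 cutoff is an arbitrary choice:

# a simplified sketch of Luhn's method; requires nltk plus
# nltk.download("punkt") and nltk.download("stopwords") on first use
from collections import Counter
import nltk
from nltk.corpus import stopwords

def summarize(text, max_sentences=2):
    sentences = nltk.sent_tokenize(text)
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    stop = set(stopwords.words("english"))
    freq = Counter(w for w in words if w not in stop)
    significant = {w for w, _ in freq.most_common(10)}  # arbitrary cutoff

    def score(sentence):
        return sum(1 for w in nltk.word_tokenize(sentence)
                   if w.lower() in significant)

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(s for s in sentences if s in top)  # keep original order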
Make it predictable.
From a user's perspective, simply using the first paragraph is not bad at all.
Any automation is bound to fall flat in some cases, so I suggest displaying the first paragraph (perhaps truncated at some point) as the summary and offering the ability to override it with an optional field; see the sketch below.
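A minimal sketch of that fallback, assuming paragraphs are separated by blank lines and that 160 characters is roughly what a meta description can hold:

# a minimal sketch: author override first, else the first paragraph,
# truncated at a word boundary; the 160-character limit is an assumption
def excerpt(post_body, author_summary=None, limit=160):
    if author_summary:  # the optional override field
        return author_summary.strip()
    first_paragraph = post_body.strip().split("\n\n")[0]
    if len(first_paragraph) <= limit:
        return first_paragraph
    return first_paragraph[:limit].rsplit(" ", 1)[0] + "..."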
I might try using Mechanical Turk or any number of other crowdsourcing options.
Another item to check out: the AutoSummary Semantic Analysis Engine, a SourceForge project.
Not a trivial task... You should look for articles or books on "extractive summarization"
A few starters could be:
Books:
Natural Language Processing with Python
Foundations of Statistical Natural Language Processing
Articles:
Language independent extractive summarization
Extractive summarization: how to identify the gist of a text
Extractive Summarization using Inter- and Intra- Event Relevance
Yahoo has a free API for this:
http://developer.yahoo.com/search/content/V1/termExtraction.html
Apple's patent 6424362 - Auto-summary of document content contains sample code which might be useful...
This borders on artificial intelligence so there's not going to be an "easy" solution out there, but there are products that target this problem.
Check out Copernic Summarizer, for one.
Noun phrases typically tend to be important elements of a sentence, so picking the sentence(s) with a high density of noun phrases could yield a good summary. You could get noun phrases using a POS tagger; a sketch follows below.
For a good summary, it is desirable that it be a meaningful sentence; reading a broken sentence is jarring.
Alternatively, when the author posts the article, the author can highlight the keywords to be used in the description, which can then be put automatically into the meta description tag.
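A minimal sketch of the noun-density idea using NLTK's POS tagger (tags beginning with "NN" mark nouns; requires the punkt and averaged_perceptron_tagger downloads):

# a minimal sketch: pick the sentence with the highest share of nouns
import nltk

def noun_density(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    nouns = [word for word, tag in tagged if tag.startswith("NN")]
    return len(nouns) / max(len(tagged), 1)

def best_sentence(text):
    return max(nltk.sent_tokenize(text), key=noun_density)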

How to create a smart chat-bot? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I know that it's still an open problem, so I don't expect to see complete answers here. I just want to find some approaches to the following problem:
I have a model (assume that it is the bot's memory), and different words are associated with different objects in the model. Speaking with the bot is like executing SQL queries against a DB. Language is a protocol that is very hard to formalize, and we can't just write a million lines of code to implement some real language. But I believe that it's absolutely possible to implement some self-learning mechanism. How can it be implemented? Is it possible to implement learning "from scratch" or "from a few basic words"? I just want to hear your ideas.
Actually, English is a very strict language and one of the easiest to experiment with for AI. Many other languages allow you to change the order of words (for example), and in some cases a changed order can change the whole meaning or just add some intonation. I really don't have any idea how to teach a bot these things.
The first step in taking this game to the next level is...
...to have a very clear view of prior art!
(and pardon me for saying so, but the question doesn't suggest that you have such an extensive insight into the matter [and you're not alone, count me in ;-)])
Even, and maybe in particular, if your intention is to apply completely novel techniques and models, it seems important to review the literature on current and past practices. Aside from possibly identifying elements that may be adapted or reused in a new implementation, a survey of the domain will provide a keen understanding of the nature of the problem[s].
I've personally tried, on various and multiple occasions, either the naive or the sophomoric approach to tackling broadly-defined problems. With the naive approach, one has but a very slight idea of the true nature and scope of the problem. The sophomoric approach sees us better equipped with domain knowledge and related tools, but this can also be misleading, because without a deeper understanding we tend to misread/misunderstand new material offered to us and to misuse some of the tools (a bit like the fellow who's "good with a hammer", for whom many things look like a nail...)
It is particularly easy to make these mistakes in the field of NLP, because common sense seems to be all that is required; after all, a child whose native tongue is English understands subtleties like
"He's not really an expert"
"He's really not an expert" (a small wink at the OP's reference to the ordering of words in the English language)
We live in such exciting times, technology- and knowledge-wise: processing power, programming languages and tools, mathematical techniques, the availability of affordable corpora... to name a few of the things that make this moment in time so special.
Far from wanting to discourage you in your chat-bot endeavor, I just hope that this long and generic exposé will encourage you to look before you leap, as this will truly save you time in the long run, I think, in two ways:
provide you some frames of references (again, even if your intention is to "think outside these boxes")
maybe entice you to redefine the problem, for example by limiting it to particular domains of conversation (sports, or health, or life at a particular university campus...) or by focusing on a particular aspect of the problem (semantic awareness, smooth, natural sounding grammar, use of colloquial forms...)
Good luck ;-)
Check out MegaHAL's implementation for some ideas. We've used a variant of this bot for ages in an IRC channel of ours, and he does on occasion appear to be the intelligent mixture of many of our dominant personalities.
You "train" the bot -
each time the bot answer, you rank (or the tester) the answer - if the answer is good/logical - give high rank, if the answer is bad... low/negative rank.
use the ranking in the future to choose the answer, and this is how the bot learns...
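A toy sketch of this rank-and-reuse loop; the candidate answers are assumed to come from elsewhere (templates, retrieval, etc.):

# a toy sketch: feedback adjusts per-(prompt, answer) scores, and the
# best-scored candidate wins next time; ties are broken randomly
import random

class RankedBot:
    def __init__(self):
        self.scores = {}  # (prompt, answer) -> cumulative rank
        self.last = None

    def answer(self, prompt, candidates):
        best = max(candidates, key=lambda a: (self.scores.get((prompt, a), 0),
                                              random.random()))
        self.last = (prompt, best)
        return best

    def feedback(self, rank):
        # rank > 0 for a good/logical answer, rank < 0 for a bad one
        self.scores[self.last] = self.scores.get(self.last, 0) + rank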
There's a great description of Eliza in Paradigms of AI Programming. You should be able to implement a simple Eliza bot in a few days of work.
This isn't a learning algorithm, but it's surprising how realistic the answers from something so simple can be; a toy version is sketched below.
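A toy Eliza-style sketch in Python (regex rules with canned templates; a real Eliza would also reflect pronouns such as "my" -> "your", which is omitted here):

# a toy Eliza-style responder: first matching pattern wins
import re

RULES = [
    (r"i need (.*)", "Why do you need {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
    (r"(.*) mother(.*)", "Tell me more about your family."),
]

def eliza(utterance):
    for pattern, template in RULES:
        match = re.match(pattern, utterance.lower())
        if match:
            return template.format(*match.groups())
    return "Please tell me more."

print(eliza("I am worried about my code"))
# -> "How long have you been worried about my code?"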
You can create your own chat bot on BOT libre, http://www.botlibre.com.
The bots learn, can be trained, can be scripted, and you can program them, or let them program themselves.
The site supports embedding your bot on your own site and has REST API access, plus Android, IRC, and Twitter integration. Free hosting, even for commercial bots.
AIML from the AliceBot project may help you out. It's a whole XML schema (if that doesn't put you off) for the branch of AI it's concerned with.
An example from Wikipedia:
<category>
<pattern>WHAT IS YOUR NAME</pattern>
<template>My name is <bot name="name"/>.</template>
</category>
RebbeccaAIML is one quite well-documented implementation.
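On the Python side, a hedged sketch of loading AIML with the python-aiml package (pip install python-aiml); my_bot.aiml is a hypothetical file containing categories like the one above:

# a hedged sketch assuming the python-aiml package;
# my_bot.aiml is a hypothetical file of <category> rules
import aiml

kernel = aiml.Kernel()
kernel.learn("my_bot.aiml")
kernel.setBotPredicate("name", "Rebecca")  # fills <bot name="name"/>
print(kernel.respond("WHAT IS YOUR NAME"))  # -> "My name is Rebecca."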

Internationalizing Desktop App within a couple years... What should we do now?

So we are sure that we will be taking our product internationally and will eventually need to internationalize it. How much internationalizing would you recommend we do as we go along?
I guess in other words, is there any internationalization that is easy now but can be much worse if we let the code base mature and that won't slow us down very much if we choose to start doing it now?
Tech used: C#, WPF, WinForms
Prepare it now, before you write all the strings in the codebase itself.
Everything after now will be too late. It's now or never!
It's true that it is a bit of extra effort to prepare well now, but not doing it will end up being a lot more expensive.
If you won't follow all the guidelines in the links below, at least heed points 1, 2, and 7 of the summary, which are very cheap to do now and which cause the most pain afterwards in my experience.
Check these guidelines and see for yourself why it's better to start now and get everything prepared.
Developing world ready applications
Best practices for developing world ready applications
Little extract:
Move all localizable resources to separate resource-only DLLs. Localizable resources include user interface elements such as strings, error messages, dialog boxes, menus, and embedded object resources. (Moving the resources to a DLL afterwards will be a pain)
Do not hardcode strings or user interface resources. (If you don't prepare, you know you will hardcode strings)
Do not put nonlocalizable resources into the resource-only DLLs. This causes confusion for translators.
Do not use composite strings that are built at run time from concatenated phrases. Composite strings are difficult to localize because they often assume an English grammatical order that does not apply to all languages. (After the interface design, changing phrases gets harder; see the illustration after this list.)
Avoid ambiguous constructs such as "Empty Folder" where the strings can be translated differently depending on the grammatical roles of the strings' components. For example, "empty" can be either a verb or an adjective, and this can lead to different translations in languages such as Italian or French. (Same issue)
Avoid using images and icons that contain text in your application. They are expensive to localize. (Use text rendered over the image)
Allow plenty of room for the length of strings to expand in the user interface. In some languages, phrases can require 50-75 percent more space. (Same issue, if you don't plan for it now, redesign is more expensive)
Use the System.Resources.ResourceManager class to retrieve resources based on culture.
Use Microsoft Visual Studio .NET to create Windows Forms dialog boxes, so they can be localized using the Windows Forms Resource Editor (Winres.exe). Do not code Windows Forms dialog boxes by hand.
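To illustrate the composite-string point, here is a language-agnostic sketch (Python for brevity): concatenation bakes English word order into code, whereas a single template string with named placeholders stays translatable as one unit:

# the pitfall: pieces concatenated in English order are untranslatable
name, count = "report.pdf", 3
bad = "Deleted " + str(count) + " copies of " + name

# better: one template the translator can reorder as a whole;
# an Italian translation might move the placeholders around
good = "Deleted {count} copies of {name}".format(count=count, name=name)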
IMHO, claiming something is going to happen "in a few years" literally translates to "we hope one day", which really means "never". Still, I would skim various tutorials to make sure you don't make any horrendous mistakes. Doing correct internationalization support now will mean less work in the future, and once you get used to it, it won't have any real effect on today's productivity. But if you measure the goal in years, maybe it's not worth doing at all right now.
I have worked on two projects that did internationalization: a C# ASP.NET (existed before I joined the project) app and a PHP app (homebrewed my own method using a free Internationalization control and my own management app).
You should store all the text (labels, button text, etc.) as data inside a database. Reference these with keys (I prefer to use the first 4 words, made uppercase, spaces converted to underscores, and non-alphanumerics stripped out), and when you have a duplicate, append a number to the end. The benefit of this key method is that the programmer has a pretty strong understanding of the content of the text just by looking at the key. A sketch of such a key generator follows below.
Write a utility to extract the data and build .NET resource files that you add into your project for compile. Create a separate resource file for each language. In your code, use the key to point to the proper entry.
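A sketch of that key scheme (in Python for brevity; the real utility can be in any language): first four words, uppercased, spaces to underscores, non-alphanumerics stripped, with a numeric suffix for duplicates:

# a minimal sketch of the described key scheme
import re

def make_key(text, existing_keys):
    words = text.split()[:4]
    base = re.sub(r"[^A-Z0-9_]", "", "_".join(words).upper())
    key, n = base, 2
    while key in existing_keys:
        key = "{}_{}".format(base, n)  # duplicates get a numeric suffix
        n += 1
    return key

make_key("Please enter your name", set())  # -> "PLEASE_ENTER_YOUR_NAME"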
I would skim over the MS documents on the subject:
http://www.microsoft.com/globaldev/getwr/dotneti18n.mspx
Some basic things to avoid:
never ever use translation software; hire a pro, or an intern taking that language at a local college
never try to create text by appending two existing entries, because grammar differs greatly in each language and this will never work. So if you have a string that says "Click" and want one that says "Click Now", do not try to create a setup that merges two entries, or, during translation, copy the word for "click" and translate the word "now". Treat every string as a totally new translation from scratch
I will add: store and manipulate string data as Unicode (NVARCHAR in MS SQL).
Some questions to think about…
How much can you afford to delay the shipment of the English version of your application to save a bit of the cost of internationalizing later?
Will you still be trading if you don’t get the cash flow from shipping the English version quickly?
How will you get the UI right, if you don’t get feedback quickly from some customers about it?
How often will you rewrite the UI before you have to internationalize it?
Do your English customers wish to be able to customize strings in the UI? E.g. not everyone calls a "shipping note" the same thing.
As a large part of the pain of internationalizing is making sure you don't break the English version, is automated system testing of the UI a better investment?
The only thing I think I will always do is: "Do not use composite strings that are built at run time from concatenated phrases", and if you do so, don't spread the code that builds up a single string over lots of methods.
Having your UI automatically resize (and lay out) to cope with the length of labels etc. will save you lots of time over the years if you can do it cheaply. There are lots of third-party control sets for Windows Forms that let you label text boxes etc. without having to put the labels on as separate controls.
I am just starting to internationalize a WinForms application; we hope to mostly be able to use the "name" of each control as the lookup key, without having to move lots into resource files etc. It is not always as hard as you think at first...
You could use NGettext.Wpf (it can be installed from NuGet, and yes I am the author, but I made it out of the frustrations listed in the other answers).
It is hosted in this GitHub repository, and here is the getting-started section at the time of writing:
NGettext.Wpf is intended to work with dependency injection. You need to call the following at the entry point of your application:
NGettext.Wpf.CompositionRoot.Compose("ExampleDomainName");
The "ExampleDomainName" string is the domain name. This means that when the current culture is set to "da-DK" translations will be loaded from "Locale\da-DK\LC_MESSAGES\ExampleDomainName.mo" relative to where your WPF app is running (You must include the .mo files in your application and make sure they are copied to the output directory).
Now you can do something like this in XAML:
<Button CommandParameter="en-US"
Command="{StaticResource ChangeCultureCommand}"
Content="{wpf:Gettext English}" />
Which demonstrates two features of this library. The most important is the Gettext markup extension which will make sure the Content is set to the translation of "English" with respect to the current culture, and update it when the current culture is changed. The other feature it demonstrates is the ChangeCultureCommand which changes the current culture to the given culture, in this case "en-US".
I also highly recommend reading Preparing Strings from the gettext utilities manual.
Internationalization will let your product be usable in other countries; it's easy and should be done from the start (this way, English-speaking people all over the world can use your software). These three rules will get you most of the way there:
Support international characters - use only Unicode data types in files and databases.
Support international date, time and number formats - use CultureInfo.InvariantCulture when storing data to file or computer readable storage, use CultureInfo.CurrentCulture when displaying data or parsing user input, never do your own parsing, never use any other culture objects.
textual data entered by the user should be considered a black box; don't try to break it up into words or letters, especially when displaying it to the user. Different languages have different rules, and the OS knows how to display the text (including right-to-left scripts); you don't.
Localization is translating the software into different languages, this is difficult and expensive, a good start is to never hard code strings and never build sentences out of smaller strings.
If you use test data, use non-English (e.g.: Russian, Polish, Norwegian etc) strings.
Encoding rears its ugly little head around every corner, if not in your own libraries then in external ones.
I personally favor Russian because, although I don't speak a word of Russian (despite my name's origin), it has foreign characters in it and takes way more space than English, and therefore tests your spacing too.
I don't know whether that is something language-specific, or just because our Russian translator likes verbose strings.
