Is it possible to adapt an existing English NLP tool to Swedish, and what's the best approach?

What's the best approach to using existing English NLP tools with another language, e.g. Spanish?

That's an awfully broad question, and you'd need to provide some more pointers. However, if you're interested in general research on the topic, you can try Hana, Feldman, Brew (2004) "Tagging Russian using Czech morphology" and Resnik's 2004 "Using bilingual text for monolingual annotation" and start from there.
In general, you'd want to have a bicorpus (a parallel corpus, say English/Swedish). Then establish word-level mappings between the two sides using alignment (a common topic in machine translation with many established results).
You can then tag the English side and use the mappings to "translate" those tags onto the Swedish side. Then you can train the same tool that produced the tags on the English side using the newly annotated Swedish corpus.
It goes without saying that you'll lose quite a bit of quality and that this technique only works for supervised methods. You should probably try to find properly annotated Swedish corpora and tools. There are a few out there.
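The projection step described above can be sketched roughly as follows. This is only an illustration under assumptions: the alignment is given as a hand-made list of (English index, Swedish index) pairs, the tag names are made up, and `project_tags` is a hypothetical helper, not part of any particular toolkit. A real pipeline would get the alignment from a word aligner and would need to handle noisy links.

```python
from collections import Counter, defaultdict


def project_tags(english_tags, alignment, swedish_length, default="X"):
    """Copy tags from English tokens onto the Swedish tokens aligned to them.

    alignment is a list of (english_index, swedish_index) pairs.
    A Swedish token linked to several English tokens takes the most
    frequent tag among them; unaligned tokens get `default`.
    """
    votes = defaultdict(Counter)
    for en_i, sv_i in alignment:
        votes[sv_i][english_tags[en_i]] += 1
    return [
        votes[i].most_common(1)[0][0] if i in votes else default
        for i in range(swedish_length)
    ]


# Toy, hand-made example (hypothetical data).
english_tags = ["DET", "NOUN", "VERB"]   # tags for "the dog sleeps"
swedish = ["hunden", "sover"]
alignment = [(1, 0), (2, 1)]             # "dog" -> "hunden", "sleeps" -> "sover"

projected = project_tags(english_tags, alignment, len(swedish))
print(list(zip(swedish, projected)))
```

The projected tags then serve as (noisy) training data for the Swedish tagger.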

Related

Pulling out Popular terms from a Solr core

I have an Apache Solr core from which I need to pull out the popular terms. I am already aware of Luke, facets, and Solr stopwords, but I am not getting what I want. For example, when I use Luke to get the popular terms and then apply the stopwords to the result set, I get a bunch of words like:
http, img, que ...etc
What I really want is:
Obama, Metallica, Samsung ...etc
Is there a better way to implement this in Solr? Am I missing something that should be used for this?
Thank You
Finding relevant words in a text is not easy. The first thing I would take a deeper look at is Natural Language Processing (NLP) with Solr. The article in Solr's wiki is a starting point for this. Reading the page, you will come across the Full Example, which extracts nouns and verbs; that alone may already help you.
To get this running you will need to install additional software (Apache's OpenNLP project), so after reading Solr's wiki, that project's home page may be the next step.
To get a feeling for what is possible with it, have a look at the demonstration by the searchbox guy. There you can paste a sample text and get relevant words and terms extracted from it.
There are several tutorials out there you may have a look at for further reading.
If you went down the path and the results are not as expected or not as good as required, you may go down that road even further and start thinking about text mining with Apache Mahout. There are again several tutorials out there to cross it with Solr.
In any case, you should then search Stack Overflow or the web for the tutorials and how-tos you will certainly need.
Update about Arabic
If you are going to use OpenNLP for a language it does not support out of the box, which unfortunately includes Arabic as of version 1.5, you will need to train OpenNLP for that language. The reference for this is in OpenNLP's developer docs. There is probably already something out there from the Arabic community, but my Arabic google-fu is not that good.
Should you decide to do the work and train it for Arabic, why not share your training with the project?
Update about integration in Solr/Lucene
There is work going on to integrate it as a module. In my humble opinion this is as far as it will and should get. Compared to this problem field, stemming appears to be rather easy. But even stemming got complex when supporting different languages. Analysing a language to the level where you can extract nouns, verbs and so forth is so complex that a whole project evolved around it.
Having a module/contrib at hand, which you could simply copy to solr_home/lib, would already be very handy. Then there would be no need to run a different installer and so forth.
Well, this is a bit open-ended.
First you will need to facet and find the "popular terms" in your index, then add all the non-useful items such as http, img, time, what, when, etc. to your stopword list and re-index to get the cream of the data you care about. I do not think there is an easier way of finding popular names, unless you can bounce your data against a custom dictionary of nouns during indexing (that is an option, by the way). You can choose to index only names by writing a custom token filter (look at how the stopword filter works) with your own nouns.txt file to go with it. In that case only words in your dictionary are allowed into the index, so this approach is only possible if you have a finite, known list of nouns.
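The filtering logic described above (drop stopwords, optionally keep only dictionary nouns) can be illustrated outside Solr. This is a minimal sketch, assuming the facet counts have already been fetched into a plain dict; `popular_terms` and the sample counts are hypothetical, and inside Solr the same idea would live in a token filter rather than in client code.

```python
def popular_terms(term_counts, stopwords, known_nouns=None, top_n=10):
    """Rank terms by count, dropping stopwords.

    If known_nouns is given, keep only terms found in that dictionary.
    Comparison is case-insensitive.
    """
    stop = {word.lower() for word in stopwords}
    nouns = None
    if known_nouns is not None:
        nouns = {word.lower() for word in known_nouns}

    kept = [
        (term, count)
        for term, count in term_counts.items()
        if term.lower() not in stop
        and (nouns is None or term.lower() in nouns)
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in kept[:top_n]]


# Hypothetical facet counts, standing in for what the index would return.
counts = {"http": 950, "img": 700, "que": 640, "time": 400,
          "Obama": 310, "Samsung": 220, "Metallica": 180}
stopwords = ["http", "img", "que"]
nouns = ["obama", "metallica", "samsung"]   # e.g. loaded from nouns.txt

print(popular_terms(counts, stopwords))          # stopwords only
print(popular_terms(counts, stopwords, nouns))   # stopwords + dictionary
```

With stopwords alone, a generic word like "time" still ranks first, which is why the dictionary variant helps when the list of names is known in advance.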

Cooperation tool group: coding in C

Is there a tool, similar to codepad, for writing C code that lets me share my code with a group, so that the group can view and edit it simultaneously in real time?
I can't tell you enough that this is going to make your work more difficult if you're planning on using it for anything other than something like a code review. However, what you're looking for is called a real-time collaborative editor. There are a ton of them. I used one on Linux a while back whose name I can't remember, but in the meantime, let Wikipedia start you off...
http://en.wikipedia.org/wiki/Collaborative_real-time_editor
Edit:
The tool I used on Linux that worked well was called Gobby.
There are a bunch of others in this question on SO: "Real time tool for collaborative coding".
Sorry for resurrecting an old question but I thought I should share this.
I usually use Collab.Center (http://collab.center). Some features I like about it better than others are:
Online, real-time collaborative coding
Support for a lot of languages (40+, I think) (EX: C, C++, Java, HTML/CSS/JS, PHP, etc)
Text and Video (Webcam) chat (Requires Sign-In)
Syntax highlighting, auto-closing brackets, matching brackets, etc.
Ability to manage all your documents (Requires Sign-In)
Private documents (Requires Sign-In)
I think it would be great for you and your group, if you haven't already found an alternative.

Data source of English speech phrases

I am doing research on developing a simulated environment in which students who use English as a second language can practice speaking English.
For one part of my development, I need a data source containing commonly used English speech phrases, tagged by the real-life situation they apply to. For example,
“Ways to Apologise
Sorry.
I’m sorry.
I’m so sorry!
Sorry for your loss.”
I could find several sites that provide this as a service, such as http://edition.englishclub.com, but not as a data source.
Has anybody used such a data source, one that can be used like WordNet? If so, please help me take this forward. Otherwise I will have to develop such a data source myself, which feels like reinventing the wheel.

Simplest way to create a tiny database app in linux

I'm looking to create a very small cataloguing app for personal use (although I'd open source it if I thought anyone else would use it). I don't want a web app as it seems like overkill to have an application server just for this - plus I like the idea of it being standalone and sticking it on a USB stick.
My criteria:
Interface must be simple to program. It can be curses-style if that makes it easier to code. My experience with ncurses would suggest otherwise, but I'd actually quite like a command-line UI.
Language doesn't really matter. My rough order of preference (highest first):
Python
C
C++
Java
I'll consider anything linux-friendly
I'm thinking sqlite for storage, but other (embeddable) suggestions welcome.
Has anyone done this sort of thing in the past? Any suggestions? Pitfalls to avoid?
EDIT:
Ok, it looks like python+sqlite is the early winner. That just leaves the question of which ui library. I know you get tkinter for free in python - but it's just so ugly (I'd rather have a curses interface). I've done some GTK in C, but it looks fairly un-natural in python. I had a very brief dabble with wxwidgets but the documentation's pretty atrocious IIRC (They renamed the module at some point I think, and it's all a bit confused).
So that leaves me with pyqt4, or some sort of console library. Or maybe GTK. Thoughts? Or have I been too hasty in writing off one of the above?
I would definitely recommend (or second, if you're already thinking it) - python with sqlite3. It's simple, portable and no big db drivers. I wrote a similar app for my own cataloguing purposes and it's doing just fine.
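To give an idea of how little code this takes, here is a minimal sketch using Python's standard sqlite3 module. The `items` table, its columns, and the helper names are assumptions made up for the example, not the schema of the app mentioned above.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) the catalog database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " category TEXT)"
    )
    return conn


def add_item(conn, title, category=None):
    """Insert one item and return its row id."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO items (title, category) VALUES (?, ?)",
            (title, category),
        )
    return cursor.lastrowid


def find_items(conn, text):
    """Return (id, title, category) rows whose title contains `text`."""
    cursor = conn.execute(
        "SELECT id, title, category FROM items"
        " WHERE title LIKE ? ORDER BY title",
        (f"%{text}%",),
    )
    return cursor.fetchall()


conn = open_catalog()            # pass a file path to keep data on the USB stick
add_item(conn, "Dune", "book")
add_item(conn, "Blade Runner", "film")
print(find_items(conn, "un"))
```

The whole catalog lives in a single file, which is what makes the USB-stick use case straightforward.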
I vote for pyqt or wx for the GUI. (And second the Python+sqlite votes to answer the original question.)
I second (or third) python and sqlite.
As far as suggestions are concerned:
If you're feeling minimally ambitious, I'd suggest building a very simple web service to synchronize your catalog to a server. I've done this (ashamedly, a few times) for similar purposes in the past.
With sqlite, backups can literally be as simple as uploading or downloading the latest database file, depending on the file's timestamp.
Then, if you lose or break your flash drive (smashed to pieces, in my case), your catalog isn't lost. You gain more portability, at least 1 backup, and some peace of mind.
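The timestamp rule can be sketched as below. To keep the example self-contained it copies between two local paths instead of talking to a web service, so `sync_newest` and both paths are stand-ins; with a real server the copy calls would become an upload and a download.

```python
import os
import shutil


def sync_newest(local_path, backup_path):
    """Copy whichever database file is newer over the other one.

    Returns "uploaded", "downloaded" or "in sync". A missing file
    counts as older than any existing file.
    """
    local_time = os.path.getmtime(local_path) if os.path.exists(local_path) else -1
    backup_time = os.path.getmtime(backup_path) if os.path.exists(backup_path) else -1

    if local_time > backup_time:
        shutil.copy2(local_path, backup_path)   # copy2 also copies the timestamp
        return "uploaded"
    if backup_time > local_time:
        shutil.copy2(backup_path, local_path)
        return "downloaded"
    return "in sync"
```

Note that this only works while changes are made in one place at a time; if both copies are edited independently, the older set of edits is overwritten.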

What is a DSL and where should I use it?

I'm hearing more and more about domain specific languages being thrown about and how they change the way you treat business logic, and I've seen Ayende's blog posts and things, but I've never really gotten exactly why I would take my business logic away from the methods and situations I'm using in my provider.
If you've got some background using these things, any chance you could put it in real layman's terms:
What exactly does building a DSL mean?
What languages are you using?
Where does using a DSL make sense?
What is the benefit of using DSLs?
DSLs are good in situations where you need to hand control of some aspect of the system over to someone else. I've used them in rules engines, where you create a simple language that is easier for less-technical folks to use to express themselves, particularly in workflows.
In other words, instead of making them learn Java:
DocumentDAO myDocumentDAO = ServiceLocator.getDocumentDAO();
for (int id : documentIDS) {
    // Load each document and send a reminder for the unread ones.
    Document myDoc = myDocumentDAO.loadDoc(id);
    if (myDoc.getDocumentStatus().equals(DocumentStatus.UNREAD)) {
        ReminderService.sendUnreadReminder(myDoc);
    }
}
I can write a DSL that lets me say:
for (document : documents) {
    if (document is unread) {
        document.sendReminder
    }
}
There are other situations, but basically, anywhere you might want to use a macro language, script a workflow, or allow after-market customization: these are all candidates for DSLs.
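As a rough illustration of the rules-engine idea, here is a tiny text-based rule language in Python. Everything in it is an assumption made for the example: the `when document is <status> then <action>` syntax, the `compile_rule` helper, and documents represented as dicts. A real DSL would usually need a proper grammar and parser.

```python
import re

# One rule per line, e.g. "when document is unread then remind".
RULE = re.compile(r"^when document is (\w+) then (\w+)$")


def compile_rule(text, actions):
    """Turn one rule line into a function that applies it to documents."""
    match = RULE.match(text.strip().lower())
    if match is None:
        raise ValueError(f"cannot parse rule: {text!r}")
    status, action_name = match.groups()
    action = actions[action_name]   # raises KeyError for an unknown action

    def run(documents):
        matched = [doc for doc in documents if doc["status"] == status]
        for doc in matched:
            action(doc)
        return len(matched)

    return run


# The host application decides which actions rule authors may use.
sent = []
actions = {"remind": lambda doc: sent.append(doc["id"])}

rule = compile_rule("when document is unread then remind", actions)
docs = [
    {"id": 1, "status": "unread"},
    {"id": 2, "status": "read"},
    {"id": 3, "status": "unread"},
]
count = rule(docs)
print(count, sent)
```

The rule author only ever writes the one-line rule; the Python stays with the developers.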
DSL stands for Domain Specific Language, i.e. a language designed specifically for solving problems in a given area.
For example, Markdown (markup language used to edit posts on SO) can be considered as a DSL.
Personally, I find a place for a DSL in almost every large project I work on. Most often I need some kind of SQL-like query language. Another common use is rule-based systems, where you need some kind of language to specify rules/conditions.
A DSL makes sense in contexts where it's difficult to describe or solve the problem by traditional means.
If you use Microsoft Visual Studio, you are already using multiple DSLs -- the design surface for web forms, winforms, etc. is a DSL. The Class Designer is another example.
A DSL is just a set of tools that (at least in theory) make development in a specific "domain" (i.e. visual layout) easier, more intuitive, and more productive.
As far as building a DSL, some of the stuff people like Ayende have written about is related to "text parsing" dsls, letting developers (or end users) enter "natural text" into an application, which parses the text and generates some sort of code or output based on it.
You could use any language to build your own DSL. Microsoft Visual Studio has a lot of extensibility points, and the patterns & practices "Guidance Automation Toolkit" and Visual Studio SDK can assist you in adding DSL functionality to Visual Studio.
DSLs are basically small compilers for custom languages. A good free and open tool for developing them is ANTLR. Recently, I've been looking at this DSL for a state machine language for use on a new project. I agree with Tim Howland above that they can be a good way to let someone else customize your application.
FYI, a book on DSLs is in the pipeline as part of Martin Fowler's signature series.
If it's of the same standard as the other books in the series, it should be a good read.
More information here
DSL is just a fancy name and can mean different things:
Rails (the Ruby thing) is sometimes called a DSL because it adds special methods (and overwrites some built-in ones too) for talking about web applications
ANT, Makefile syntax etc. are also DSLs, but have their own syntax. This is what I would call a DSL.
One important aspect of this hype: It does make sense to think of your application in terms of a language. What do you want to talk about in your app? These should then be your classes and methods:
Define a "language" (either a real syntax as proposed by others on this page or a class hierarchy for your favorite language) that is capable of expressing your problem.
Solve your problem in terms of that language.
A DSL is basically your own small sublanguage, created to solve a problem in a specific domain. This is done using method chaining. Languages where dots and parentheses are optional help make these expressions seem more natural. It can also be similar to a builder pattern.
DSLs aren't languages themselves, but rather a pattern that you apply to your API to make the calls more self-explanatory.
One example is Guice. The Guice User's Guide (http://docs.google.com/View?docid=dd2fhx4z_5df5hw8) describes, further down, how interfaces are bound to implementations, and in which contexts.
Another common example is for query languages. For example:
NewsDAO.writtenBy("someUser").before("someDate").updateStatus("Deleted")
In the implementation, imagine each method returning either a new Query object or just this, having updated itself internally. At any point you can terminate the chain, for example by calling rows() to get all the rows, or an update call such as updateStatus, as I have done above. Both will return a result object.
I would recommend taking a look at the Guice example above as well, as each call there returns a new type with new options on them. A good IDE will allow you to complete, making it clear which options you have at each point.
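A minimal Python sketch of such a chained query might look like this. The `NewsQuery` class, its in-memory list of dicts, and the snake_case method names are assumptions for illustration; a real DAO would build and run a database query instead. This variant returns a new object from each filter call rather than mutating itself.

```python
class NewsQuery:
    """Fluent query over news rows; each filter returns a new NewsQuery."""

    def __init__(self, rows, filters=()):
        self._rows = rows
        self._filters = tuple(filters)

    def _with(self, predicate):
        return NewsQuery(self._rows, self._filters + (predicate,))

    def written_by(self, user):
        return self._with(lambda row: row["author"] == user)

    def before(self, date):
        # ISO dates ("2009-01-31") compare correctly as plain strings.
        return self._with(lambda row: row["date"] < date)

    def rows(self):
        """Terminal call: return the matching rows."""
        return [row for row in self._rows
                if all(check(row) for check in self._filters)]

    def update_status(self, status):
        """Terminal call: update matching rows, return how many changed."""
        matched = self.rows()
        for row in matched:
            row["status"] = status
        return len(matched)


news = [
    {"author": "someUser", "date": "2009-01-10", "status": "Published"},
    {"author": "someUser", "date": "2009-03-05", "status": "Published"},
    {"author": "other", "date": "2009-01-02", "status": "Published"},
]

changed = NewsQuery(news).written_by("someUser").before("2009-02-01").update_status("Deleted")
print(changed, [row["status"] for row in news])
```

Because every filter returns a fresh object, a partially built query can be kept and reused as the starting point for several different chains.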
Edit: it seems many consider DSLs to be new, simple, single-purpose languages with their own parsers. I have always associated DSLs with using method chaining as a convention to express operations.

Resources