NLP libraries for simple POS tagging - c

I'm a student who's working on a summer project in NLP. I'm fairly new to the field, so I apologize if there's a really obvious solution. The project is in C, both because of my familiarity with it and because of the computationally intensive nature of the project (my corpus is a plaintext dump of Wikipedia).
I'm working on an approach to relationship extraction, exploiting the consistency principle to try to learn (to within some error threshold) a set of rules dictating which clusters of grammar objects imply a connection between those objects.
One of the first steps in the algorithm involves finding the set of all possible grammar objects a given word can refer to (POS disambiguation is done implicitly by the algorithm at a later step). I've looked at several parsers, but they all seem to do the disambiguation step themselves, which (from my end) is counterproductive. I'm looking for something off the shelf that (ideally) gives me a one-command way to turn up this information.
Does such a thing exist? If not, is there an existing dictionary containing this information that's trivially machine-parseable?
Thank you for your help.

Look at CMU Sphinx, an open source NLP project. I think it's in C++, but you can integrate it or at least get an idea of how to go about things.

What about calling an external POS tagger as a shell script, or wrapping it in an HTTP service if you feel frisky?
Java and Python have the vast majority of NLP libraries, so it makes sense to take advantage of that. If you can use NLTK in a script to do the tagging and call that script from C, that makes things much easier.
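A minimal sketch of the shell-out approach in C, assuming a POSIX system (for popen). The helper name run_and_capture is made up for illustration, and the command you pass is whatever tagging script you end up writing; nothing here is specific to NLTK.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes popen/pclose under strict -std=c99 */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: run `cmd` through the shell and copy its standard
 * output into `out` (always NUL-terminated, truncated if it does not fit).
 * Returns 0 if the command ran and exited with status 0, -1 otherwise.
 */
int run_and_capture(const char *cmd, char *out, size_t outsz)
{
    assert(cmd != NULL && out != NULL && outsz > 0);
    out[0] = '\0';

    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL)
        return -1;

    size_t used = 0;
    char chunk[256];
    while (fgets(chunk, sizeof chunk, pipe) != NULL) {
        size_t n = strlen(chunk);
        size_t room = outsz - 1 - used;   /* bytes left before the NUL */
        if (n > room)
            n = room;                     /* truncate, but keep draining the pipe */
        memcpy(out + used, chunk, n);
        used += n;
        out[used] = '\0';
    }

    /* pclose returns the child's wait status; 0 means a clean exit. */
    return pclose(pipe) == 0 ? 0 : -1;
}
```

You would call it with something like run_and_capture("python3 tag.py < sentence.txt", buf, sizeof buf), where tag.py is a hypothetical script that prints one word and its tags per line. Starting a process per sentence is slow on a Wikipedia-sized corpus, so batching many sentences per call is probably worth it.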

Related

Where should a programmer with NO static/compiled language exp start learning Go?

I'm an experienced software developer, but I've only worked in dynamic languages (primarily Python, PHP in the past, JavaScript, and a little Ruby). Last night, I found myself reading through the tour on the Go website when I realized that the language (syntax, libraries, etc.) would probably be fairly easy to learn, but my lack of knowledge about static/compiled languages would bar me from easy entry. It's not that I don't understand the core concepts of a static language, namely that function argument/variable/return types are static and that a program must be compiled before use. It's more that I don't know where to begin after writing a program. For instance, if I wrote a web application using the Revel framework, it would handle these steps for me (according to the website). Is that pretty typical of frameworks for static languages? Am I worrying too much about a small part of the process that will be quick to learn, or are the (as I call them) formalities of using a static language pretty cumbersome?
As others suggested, any tutorial on Go would work, and you are probably worrying too much about the dynamic -> static switch. Statically typed languages can be a bit cumbersome sometimes if you come from the dynamic typing world, but you'll quickly get used to your compiler yelling at you when types are not correct, and quickly fix it. Eventually, you'll start second-guessing it and write (mostly) type-correct code.
Rob Pike noticed that people coming to Go were coming mainly from dynamic languages, which means the switch cannot be all that hard to make.
There are a lot of tutorials all over the internet titled "Go for ...", such as "Go for Rubyists" and "Go for Pythonistas", which can help you map your existing knowledge to Go concepts. But as others pointed out, the best (only?) way of properly learning Go is to take a tutorial and dive in! As for reading material, the standard Effective Go and the very good Programming in Go are both worth reading, no matter your background.
Well, obviously practice makes perfect, as does reading through the extensive documentation. I also find the Go-lang book really nice; it has some exercises at the end of the chapters.
Just get a basic tutorial for the language you want and follow it. You will soon pick up how to structure the program. You can then apply your current knowledge of programming to make it do what you want.

Interesting examples of Domain Specific Languages

I'm considering doing something with Domain Specific Languages for my undergraduate project. My one problem is I can't really find any interesting examples that I can root around in. Does anyone have any good examples of DSELs (preferably open source)?
Also, one area I would love to look at is solving/addressing concurrency problems (coroutines etc.) with DSELs. Are there any good examples of this in DSELs that anyone uses? If this is a stupid application of DSELs, please explain why...
Another potential area to explore would be database programming. Again, is this a stupid area to explore with DSELs? For example, would adding some crazy database manipulation syntax to, say, C# be a good project to undertake?
EDIT: General languages I would be looking at implementing in would be Java, Python, Scala, C# etc. Probably not C++ or C.
Linda implementations can be considered as eDSLs. STM implementations like CL-STM are certainly eDSLs.
Unrelated to concurrency, but extremely useful, are embedded Prolog implementations; there are plenty of them for Scheme, Lisp and Clojure. Parsing eDSLs have been mentioned already, and their patriarch, Parsec, is definitely worth digging into.
EDIT: with your list of implementation languages you're missing the most interesting eDSL opportunities. The most powerful and flexible eDSLs are made with metaprogramming. Scala-style (or even Haskell-style) eDSLs are based on higher-order functions, i.e., on mini-interpreters. They're more complicated in design, much less flexible, and limited to the syntax of your host language.
boost::spirit, if you're after C++, is an interesting example. Quote:
"Spirit is a set of C++ libraries for parsing and output generation implemented as Domain Specific Embedded Languages (DSEL)..."
(I have no idea what you mean by "solving concurrency" though. I don't see how you can solve "concurrency problems" in general, or how a DSEL could help.)

How to generate sequence diagram for my Native (C, C++) code?

I would like to know how to generate a sequence diagram for my native (C, C++) code. I have written my C code using the vim editor.
Thanks,
Sen
First of all, a sequence diagram is an object oriented concept. It is meant to convey, at a glance, message passing between objects in an object oriented program in a sequential fashion, which is supposed to help understand the time-ordered interaction between the objects. As such, it does not make sense to talk about sequence diagrams in the context of a procedural language like C.
When it comes to C++, sequence diagrams are defined in the general sense by the UML specification, which is the same for all object oriented languages. UML is considered a higher-level concept than source code and looks the same for all languages, and the process of converting source code to UML is called code reverse engineering. There are tools that allow you to convert source code of Java, C++ and other languages into UML diagrams that show relationships between classes, like Enterprise Architect, Visual Paradigm and IBM Rational Software Architect.
A sequence diagram, however, is a special kind of UML diagram, and it turns out that reverse engineering a sequence diagram is quite challenging. First, if you wanted to generate a sequence diagram through static analysis, one of the first questions you must answer is whether, given two objects and a message passed between them, a result is ever returned. This means that, given a method, you would have to analyze its algorithm and figure out whether it loops forever or returns. This is known as the halting problem and has been proven to be undecidable. This means that in order to produce a sequence diagram through static analysis, you would have to sacrifice accuracy. Dynamic analysis, by contrast, works by actually running the code and mapping the interactions between the objects at run time. This presents its own challenges. First, you would have to instrument the code. Then, filtering out the interactions you are interested in from library and system calls and other fluff present in the code would not be doable without user intervention.
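To make the "instrument the code" step concrete, here is a rough sketch of manual instrumentation in C that records calls as PlantUML sequence-diagram arrows. The TraceLog type and the trace_* functions are invented for this illustration, not part of any tool mentioned here, and the sketch assumes participant names are plain identifiers that need no quoting.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical trace log: one PlantUML arrow per recorded interaction. */
typedef struct {
    char   text[1024];
    size_t len;
} TraceLog;

void trace_init(TraceLog *log)
{
    assert(log != NULL);
    log->len = 0;
    log->text[0] = '\0';
}

/* Append "from ARROW to : label"; a line that does not fit is dropped whole. */
static void trace_arrow(TraceLog *log, const char *from, const char *arrow,
                        const char *to, const char *label)
{
    assert(log != NULL && from != NULL && to != NULL && label != NULL);
    size_t room = sizeof log->text - log->len;
    int n = snprintf(log->text + log->len, room, "%s %s %s : %s\n",
                     from, arrow, to, label);
    if (n < 0 || (size_t)n >= room) {
        log->text[log->len] = '\0';   /* undo the partial write */
        return;
    }
    log->len += (size_t)n;
}

/* Solid arrow: caller sends a message to callee. */
void trace_call(TraceLog *log, const char *caller, const char *callee,
                const char *message)
{
    trace_arrow(log, caller, "->", callee, message);
}

/* Dashed arrow: callee returns a value to caller. */
void trace_return(TraceLog *log, const char *callee, const char *caller,
                  const char *value)
{
    trace_arrow(log, callee, "-->", caller, value);
}

/* Write the log as a complete PlantUML document. */
void trace_dump(const TraceLog *log, FILE *fp)
{
    assert(log != NULL && fp != NULL);
    fprintf(fp, "@startuml\n%s@enduml\n", log->text);
}
```

You still decide by hand which calls are worth a trace_call, which is exactly the filtering problem described above; the sketch only removes the need to draw the arrows yourself.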
This is not to say that creating a tool that would produce usable sequence diagrams is not possible, but the market interest has apparently not been strong enough to justify the effort, and apart from a few research papers on the subject, like CPP2XMI, I'm not aware of any commercially available tools to reverse engineer C++ into sequence diagrams.
Compounding the problem is the fact that C++ is one of the most complex object oriented languages around, so even if somebody devised a good way of reverse engineering sequence diagrams, C++ would be the last language to receive the treatment. Case in point: Visual Paradigm offers rudimentary support for reversing Java code into sequence diagrams, but not for C++.
Even if such a tool existed for C++, the sad truth is that if your C++ code is complex enough that you would rather use a tool to make a sequence diagram for it instead of doing it manually, then it is most likely too complex for the tool to give you anything useful and you would have to fix it up yourself anyway.
You can try CppDepend, which provides a dependency graph and a dependency matrix to explore the dependencies between directories, files and functions.
Have you tried PlantUML? It works really well with Doxygen. I use it at work with the company template, and the syntax is really easy, although you have to write the call sequence yourself. There are plenty of examples on the page. If you are working on Linux, you can use your native packaging tool to install it, and the same applies to Doxygen (e.g. sudo apt-get install plantuml). Otherwise, if you are using Windows, you can use the installers from the official pages.
You'll have to do some configuration, but it's pretty straightforward. I'll leave you the links to each tool.
Download pages:
http://plantuml.com/download
http://www.doxygen.nl/download.html
Plantuml examples:
http://plantuml.com/sequence-diagram
You can find the documentation on each page. PlantUML is distributed as a Java executable (.jar), so you don't have to install anything; you just need to configure Doxygen to find the executable. You can find out how in the Doxygen documentation:
http://www.doxygen.nl/manual/index.html
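As a rough illustration of what the result looks like in source, here is a hand-written sequence diagram embedded in a Doxygen comment with the \startuml and \enduml commands. It assumes PLANTUML_JAR_PATH is set in your Doxyfile; the function sum_values and the diagram contents are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/**
 * \brief Sum the first \p n elements of \p values.
 *
 * The diagram is written by hand; Doxygen only passes it to PlantUML
 * for rendering. Participant names here are illustrative.
 *
 * \startuml
 * main -> sum_values : values, n
 * sum_values --> main : total
 * \enduml
 */
long sum_values(const int *values, size_t n)
{
    assert(values != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += values[i];
    return total;
}
```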
If you want to configure it without reading the documentation you could also watch this video:
https://www.youtube.com/watch?v=LZ5E4vEhsKs
I hope this helps, cheers.
You could explore trace2uml, which works with Doxygen.

Lemon power or not?

For grammar parsing, I used to "play" with Bison, which has its pros and cons.
Last week, I noticed on the SQLite site that the engine is built with another parser generator: Lemon.
Sounds great after reading the thin documentation.
Do you have some feedback about this parser?
I cannot really find pertinent information on Google and Wikipedia (just a few examples and the same tutorials). It doesn't seem very popular. (There is no lemon tag on Stack Overflow [ed: there is now :P])
The reasons we are using Lemon in our firmware project are:
Small size of generated code and small memory footprint. It produces the smallest parser I found (I compared parsers of similar complexity generated by flex, bison, ANTLR, and Lemon);
Excellent support for embedded systems: Lemon doesn't depend on the standard library, you can specify external memory management functions, and debug logging is removable.
Public domain license. There is a separate fork of Lemon licensed under GPLv2 that is not suitable for our needs because of its viral license. So we get the latest SQLite sources and compile Lemon out of them (it consists of only two files);
Pull-parsing. It makes code more straightforward to understand and maintain than Flex/Bison parsing code. Thread safety is an additional bonus that I admire.
Simple integration with tokenizers. The nature of our project requires tokenizing a binary stream with variable token sizes. It was quite easy to implement the tokenizer and integrate it with the parser API, which has only 3 functions and one feedback context variable. We investigated ways of integrating Lemon with re2c and Ragel and found them also quite easy to implement.
Very simple syntax that is fast to learn.
Lemon explicitly separates development of the tokenizer (lexical analyzer) from development of the parser. My development flow starts with designing the parser grammar. I'm able to check complex rules with an explicit token sequence by means of several Parser(...) calls at this first stage. The tokenizer is implemented afterwards.
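The calling pattern described in the list above (allocate a parser, feed it one token per call, read the result from a context variable, free it) can be sketched with a hand-written stand-in. This is not Lemon-generated code: the Demo* names, token codes and toy grammar (integers joined by plus signs) are invented for illustration, and only the three-function shape and the use of token 0 for end of input are modelled on Lemon's ParseAlloc, Parse and ParseFree.

```c
#include <assert.h>
#include <stdlib.h>

/* Token codes; 0 marks end of input, following Lemon's convention. */
enum { TK_END = 0, TK_INT = 1, TK_PLUS = 2 };

/* Caller-owned feedback context that the parser reports into. */
typedef struct {
    long result;   /* meaningful once done is set and error is clear */
    int  done;
    int  error;
} DemoContext;

/* All parser state lives in this heap object; there are no globals. */
typedef struct {
    int  expect_operand;   /* 1: the next token must be TK_INT */
    long sum;
} DemoParser;

/* The caller chooses the allocator, e.g. malloc or a pool allocator. */
void *DemoParseAlloc(void *(*alloc_fn)(size_t))
{
    DemoParser *p = alloc_fn(sizeof *p);
    if (p != NULL) {
        p->expect_operand = 1;
        p->sum = 0;
    }
    return p;
}

/* Feed exactly one token; the tokenizer drives the parser, not vice versa. */
void DemoParse(void *parser, int token, long value, DemoContext *ctx)
{
    DemoParser *p = parser;
    assert(p != NULL && ctx != NULL);
    if (ctx->error || ctx->done)
        return;                        /* ignore input after finish or failure */

    if (p->expect_operand) {
        if (token == TK_INT) {
            p->sum += value;
            p->expect_operand = 0;
        } else {
            ctx->error = 1;            /* operator or end where a number belongs */
        }
    } else if (token == TK_PLUS) {
        p->expect_operand = 1;
    } else if (token == TK_END) {
        ctx->result = p->sum;
        ctx->done = 1;
    } else {
        ctx->error = 1;                /* two numbers in a row */
    }
}

/* The caller also chooses the matching deallocator. */
void DemoParseFree(void *parser, void (*free_fn)(void *))
{
    free_fn(parser);
}
```

Because tokens are just function arguments, grammar rules can be exercised with hard-coded calls such as DemoParse(p, TK_INT, 2, &ctx) before any tokenizer exists, which is the workflow described in the last point.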
Surely Lemon is not a silver bullet; it has a limited area of application. Among its disadvantages:
Lemon requires writing more rules than Bison because of its simplified syntax: no repetitions or optionals, one action per rule, etc.
Complete set of LALR(1) parser limitations.
Only the C language.
Weigh the pros and cons before making your choice. I've done mine ;-)
Interesting find! I haven't actually used it, so the commentary is based on reading the documentation.
The redesign so that the lexical analysis is done separately from the parsing immediately seems to have merit. In particular, it has the potential to simplify operations such as handling multiple or nested source files. The Lex-based yywrap() mechanism is less than ideal. That it avoids all global variables and has careful memory allocation and deallocation control should count in its favour (that it allows the choice of allocator and deallocator greatly helps too - at least for the environments where I work, where memory allocation is always an issue).
The rethinking on how the rules are organized and how the terminals are identified is a good idea.
All in all, it looks like a well thought out redesign of Bison.
It is in the public domain according to the referenced web pages.

Which has a better code base to learn from: nginx or lighttpd?

Primary goal is to learn from a popular web server codebase (implemented in C) with priority given to structure/design instead of neat tricks throughout the code.
I didn't include Apache since its code base is an order of magnitude larger than the two mentioned.
Nginx might just be the best straight C codebase I have encountered. I have read large chunks of Apache, and I always came out feeling unclean; it is a monolithic mess.
You will not just learn about web servers by exploring Nginx, but pretty much the best practices for writing networked software under Unix in straight C, from code architecture to meta-programming techniques.
I have heard nothing but good things about lighttpd; however, it is limited in scope compared to Nginx. Therefore I would invest time in nginx if I were you, although lighttpd's limited scope might be beneficial to you as a first target to study.
Neat tricks always happen in any codebase worth its salt, to be honest. Nevertheless, the answer you probably don't want to hear is that it would probably be good to study both so you can learn through the intersection. The alternative might really leave you stuck in a box of the "lighttpd" way or the "nginx" way, etc.
"I didn't include Apache since its code base is an order of magnitude larger than the two mentioned."
Actually, Apache's code is quite readable. It has a large code base because it does lots of things, but it is well structured and quite easy to understand. You can also check out the APR library (Apache Portable Runtime), which has a plethora of small things to learn from.
IMO if you want to learn programming, you should start with lower profile projects: not an HTTPd, but something simpler.
Both nginx and lighttpd (just like Apache) are production quality software, meaning a very steep learning curve. And the learning unfortunately often means digging through archives to see why something is the way it is; that comes with age to any mature project.
If you are simply into C and learning design, you might want to check out FreeBSD or its derivatives. In my experience it is a better place to start: there are lots of tools and libraries of all calibers there. And their TODO lists are never empty, which serves well as a guide to where to start.
