I'm looking for a good open source C/C++ regular expression library that has full Unicode support.
I'm using this in an environment where the library might get ASCII, UTF-8, or UTF-16. If it gets UTF-16, it might or might not have a byte order mark (FF FE or FE FF).
I've looked around and there don't seem to be any options other than PCRE.
My second problem is that I'm currently using flex to build some HUGE regular expressions. Ideally I would have a flex-like lexical analyzer generator that also handles Unicode.
Any suggestions?
Have you considered ICU?
It has mature regular expression support.
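If it helps, here is a rough sketch of what ICU's C++ regex API looks like; the pattern and input strings below are made up, and the UTF-8 input is converted to ICU's internal UTF-16 form up front:
#include <unicode/regex.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // "résumé resume" written as UTF-8 bytes so the source file stays ASCII
    icu::UnicodeString input   = icu::UnicodeString::fromUTF8("r\xC3\xA9sum\xC3\xA9 resume");
    icu::UnicodeString pattern = icu::UnicodeString::fromUTF8("r\\w+");  // \w is Unicode-aware here

    icu::RegexMatcher matcher(pattern, 0, status);
    if (U_FAILURE(status)) return 1;

    matcher.reset(input);
    while (matcher.find()) {                         // finds both "résumé" and "resume"
        std::string match;
        matcher.group(status).toUTF8String(match);   // convert the match back to UTF-8
        std::cout << match << '\n';
    }
    return 0;
}
You typically link against ICU's common and i18n libraries (-licuuc -licui18n). UTF-16 input, with or without a BOM, can be fed in the same way through UnicodeString's UTF-16 constructors or ICU's converters.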
I believe Boost Spirit and Boost Regex both have at least some degree of Unicode support.
I just read some glibc 2.22 source code (the source file at /sysdeps/posix/readdir.c) and came across this comment:
/* The only version of `struct dirent*' that lacks `d_reclen' is fixed-size. */
(Newline removed.)
The weird emphasis of the type and identifier bugs me. Why not just use single quotes, or grave accents on both sides? Is there some specific reason behind this? Might it be a character set conversion mistake?
I also searched the glibc style guide but didn't find anything concerning code formatting in comments.
Convention.
As you no doubt know, comments are ignored by the C compiler. They make no difference, but the developer who wrote that comment probably preferred their appearance to plain single quotes.
ASCII
Using non-ASCII (Unicode) characters in source code is a relatively new practice (more so where English-authored source code is concerned), and there are still compatibility issues remaining in many programming language implementations. Unicode in program input/output is a different thing entirely (and that isn't perfect either). In program source code, Unicode characters are still quite uncommon, and I doubt we'll see them make much of an appearance in older code like the POSIX header files for some time yet.
Source code filters
There are some source code filters, such as document generation packages like the well-known Javadoc, that look for specific comment strings, such as /** to open a comment. Some of these programs may treat your `quoted strings' specially, but that quoting convention is older than most (all?) of the source code filters that might give them special treatment, so that's probably not it.
Backticks for command substitution
There is a strong convention in many scripting languages (as well as StackExchange markdown!) to use backticks (``) to execute commands and include the output, such as in shell scripts:
echo "The current directory is `pwd`"
Which would output something like:
The current directory is /home/type_outcast
This may be part of the reason behind the convention; however, I believe Cristoph also has a point about the quotes being balanced, similar to properly typeset opening and closing quotation marks.
So, again, and in a word: `convention'.
This goes back to early computer fonts, where backtick and apostrophe were displayed as mirror images. In fact, early versions of the ASCII standard blessed this usage.
Paraphrased from RFC 20, which is easier to get at than the actual standards like USAS X3.4-1968:
Column/Row  Symbol  Name
2/7         '       Apostrophe (Closing Single Quotation Mark; Acute Accent)
6/0         `       Grave Accent (Opening Single Quotation Mark)
This heritage can also be seen in tools like troff, m4 and TeX, which also used this quoting style originally.
Note that syntactically, there is a benefit to having different opening and closing marks: they can be nested properly.
I've written a lexer in C; it currently lexes ASCII files successfully, but I'm confused about how I would lex Unicode. Which encodings would I need to handle, for instance should I support UTF-8, UTF-16, etc.? What do languages like Rust or Go support?
Are there any libraries that can help me out? I would prefer to try to do it myself so I can learn, though; even then, a small library that I could read to learn from would be great.
There are already versions of lex (and other lexer tools) that support Unicode; they are tabulated on the Wikipedia page List of lexer generators. There is also a list of lexer tools on the Wikipedia parser page. In summary, the following tools handle Unicode:
JavaCC - JavaCC generates lexical analyzers written in Java.
JFLex - A lexical analyzer generator for Java.
Quex - A fast universal lexical analyzer generator for C and C++.
FsLex - A lexer generator for byte and Unicode character input for F#
And, of course, there are the techniques used by W3.org and cited by jim mcnamara at http://www.w3.org/2005/03/23-lex-U.
You say you have written your own lexer in C, but you have used the tag lex for the tool called lex; perhaps that was an oversight?
In the comments you say you have not used regular expressions, but also want to learn. Learning something about the theory of language recognition is key to writing an efficient and working lexer. The symbols being recognised are classified as a Chomsky Type 3 Language, or a Regular Language, which can be described by Regular Expressions. Regular Expressions can be implemented by code that implements a Finite State Automaton (or Finite State Machine). The standard implementation of a finite state machine is a loop containing a switch. Most experienced coders should know, and be able to recognise and exploit, this form:
while ( not <<EOF>> ) {
    switch ( input_symbol ) {
    case ( state_symbol[0] ) :
        ...
    case ( state_symbol[1] ) :
        ...
    default:
        ...
    }
}
If you had coded in this style, the same coding could simply work whether the symbols being handled were 8 bit or 16 bit, as the algorithmic coding pattern remains the same.
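To make that concrete for UTF-8, here is a minimal sketch (not from any particular lexer) of the same loop-and-switch shape decoding one code point per iteration; validation of overlong, surrogate and truncated sequences is left out, and the decoded code point is where your real tokenizer states would take over:
#include <stdio.h>

int main(void)
{
    int c;
    while ((c = getchar()) != EOF) {
        unsigned cp = 0;   /* decoded code point */
        int trail = 0;     /* continuation bytes still expected */

        switch (c >> 4) {  /* dispatch on the high bits of the lead byte */
        case 0x0: case 0x1: case 0x2: case 0x3:
        case 0x4: case 0x5: case 0x6: case 0x7:
            cp = c;        trail = 0; break;   /* 0xxxxxxx: plain ASCII */
        case 0xC: case 0xD:
            cp = c & 0x1F; trail = 1; break;   /* 110xxxxx: 2-byte sequence */
        case 0xE:
            cp = c & 0x0F; trail = 2; break;   /* 1110xxxx: 3-byte sequence */
        case 0xF:
            cp = c & 0x07; trail = 3; break;   /* 11110xxx: 4-byte sequence */
        default:
            continue;                          /* stray continuation byte: skip it */
        }

        while (trail-- > 0 && (c = getchar()) != EOF)
            cp = (cp << 6) | (c & 0x3F);       /* fold in each 10xxxxxx byte */

        printf("U+%04X\n", cp);   /* hand the code point to the real state machine here */
    }
    return 0;
}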
Ad-Hoc coding of a lexical analyser without an understanding of the underlying theory and practice will eventually have its limits. I think you will find it beneficial to read a little more into this area.
I am trying to use regular expressions in C/C++ using regex.h.
I am trying to use lookahead options, for example:
(?=>#).*
in order to extract strings after a '#'
For some reason it fails to find any matches.
Does regex.h support lookahead/lookbehind? Is there another library I can use?
I am using regex.h, on linux.
I'm pretty sure NSRegularExpression is just a wrapper for libicu, which does support lookaheads. You have a typo in your example; the right syntax is (?=#).* according to the link.
It doesn't really seem needed in this case though, why not just #.*?
I suspect it's really lookbehind you're talking about, not lookahead. That would be (?<=#).*, but why make it so complicated? Why not just use #(.*), as some of the other responders suggested, and pull the desired text out of the capturing group?
Also, are you really using NSRegularExpression? That seems unlikely, considering it's an Objective-C class in Apple's iOS/MacOS developer framework.
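For what it's worth, since the question mentions regex.h on Linux: POSIX regcomp/regexec implement BRE/ERE, which have no lookahead or lookbehind at all, so (?=...) and (?<=...) will never compile there. The capturing-group approach the other responders suggest works fine; here is a small sketch (the input line is made up):
#include <regex.h>
#include <stdio.h>

int main(void)
{
    const char *line = "some text # the part we want";
    regex_t re;
    regmatch_t m[2];                    /* m[0] = whole match, m[1] = group 1 */

    if (regcomp(&re, "#(.*)", REG_EXTENDED) != 0)
        return 1;

    if (regexec(&re, line, 2, m, 0) == 0)
        printf("%.*s\n", (int)(m[1].rm_eo - m[1].rm_so), line + m[1].rm_so);

    regfree(&re);
    return 0;
}
If you really do need lookaround, PCRE/PCRE2 or ICU are the usual choices on Linux.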
I would like to add Unicode support to a C library I am maintaining. Currently it expects all strings to be passed in UTF-8 encoded. Based on feedback, it seems Windows usually provides three function versions:
fooA() ANSI encoded strings
fooW() Unicode encoded strings
foo() string encoding depends on the UNICODE define
Is there an easy way to add this support without writing a lot of wrapper functions myself? Some of the functions are callable from the library and by the user and this complicates the situation a little.
I would like to keep support for UTF-8 strings, as the library is usable on multiple operating systems.
The foo functions without the suffix are in fact macros. The fooA functions are obsolete and are simple wrappers around the fooW functions, which are the only ones that actually perform work. Windows uses UTF-16 strings for everything, so if you want to continue using UTF-8 strings, you must convert them for every API call (e.g. with MultiByteToWideChar).
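As a rough sketch of what that conversion boilerplate looks like (the helper name is made up, and MessageBoxW just stands in for whatever W function you need to call):
#include <windows.h>
#include <string>

/* Hypothetical helper: UTF-8 in, UTF-16 (wchar_t) out, for passing to ...W APIs. */
std::wstring utf8_to_wide(const std::string &utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), (int)utf8.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), (int)utf8.size(), &wide[0], len);
    return wide;
}

/* Usage: keep UTF-8 at the library boundary and convert at each Win32 call. */
void show_message(const std::string &utf8_text)
{
    MessageBoxW(NULL, utf8_to_wide(utf8_text).c_str(), L"Demo", MB_OK);
}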
For the public interface of your library, stick to exactly one encoding: UTF-16, UTF-32 or UTF-8. Everything else (locale-dependent or OS-dependent encodings) is too complex for the callers. You don't need UTF-8 to be compatible with other OSes: many platform-independent libraries such as ICU, Qt or the Java standard libraries use UTF-16 on all systems. I think the choice between the three Unicode encodings depends on which OS you expect the library to be used on most:
If it will mostly be used on Windows, stick to UTF-16 so that you can avoid all string conversions.
On Linux, UTF-8 is a common choice as a filesystem or terminal encoding (because it is the only Unicode encoding with an 8-bit-wide code unit), but see the note above regarding libraries.
OS X uses UTF-8 for its POSIX interface and UTF-16 for everything else (Carbon, Cocoa).
Some notes on terminology: the words "ANSI" and "Unicode" as used in the Microsoft documentation are not in accordance with what the international standards say. When Microsoft speaks of "Unicode" or "wide characters", they mean "UTF-16" or (historically) its BMP subset (with one code unit per code point). "ANSI" in Microsoft parlance means some locale-dependent legacy encoding which is completely obsolete in all modern versions of Windows.
If you want a definitive recommendation, go for UTF-16 and the ICU library.
Since your library already requires UTF-8 encoded strings, it is already fully Unicode enabled, as UTF-8 is a lossless Unicode encoding. If you want to use your library in an environment that normally uses UTF-16 or even UTF-32 strings, that environment can simply encode to, and decode from, UTF-8 when talking to your library. Otherwise, your library would have to expose extra UTF-16/32 functions that do those encoding/decoding operations internally.
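To illustrate that last point, here is a rough sketch of a UTF-16 caller wrapping a UTF-8-only library function; the function names are invented for the example, and ICU's u_strToUTF8 does the conversion (any UTF-16-to-UTF-8 converter would do):
#include <unicode/ustring.h>
#include <string>

extern "C" void lib_do_something_utf8(const char *utf8);   /* the existing UTF-8 API */

void do_something_utf16(const UChar *text, int32_t length)
{
    UErrorCode status = U_ZERO_ERROR;
    int32_t needed = 0;
    u_strToUTF8(NULL, 0, &needed, text, length, &status);   /* preflight: measure only */
    status = U_ZERO_ERROR;                                   /* preflighting reports a buffer-overflow status */

    std::string utf8(needed, '\0');
    u_strToUTF8(&utf8[0], needed, NULL, text, length, &status);
    if (U_SUCCESS(status))
        lib_do_something_utf8(utf8.c_str());
}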
I'm building my own language using Flex, but I want to know some things:
Why should I use lexical analyzers?
Are they going to help me in something?
Are they obligatory?
Lexical analysis helps simplify parsing because the lexemes can be treated as abstract entities rather than concrete character sequences.
You'll need more than flex to build your language, though: Lexical analysis is just the first step.
Any time you are converting an input string into space-separated strings and/or numeric values, you are performing lexical analysis. Writing a cascading series of else if (strcmp (..)==0) ... statements counts as lexical analysis. Even such nasty tools as sscanf and strtok are lexical analysis tools.
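For instance, this sort of hand-rolled classifier (the token names here are made up) is already a tiny lexical analyzer:
#include <string.h>

enum token { TOK_IF, TOK_WHILE, TOK_IDENT };

/* Classify one whitespace-delimited word: a cascading strcmp "lexer". */
enum token classify(const char *word)
{
    if (strcmp(word, "if") == 0)
        return TOK_IF;
    else if (strcmp(word, "while") == 0)
        return TOK_WHILE;
    else
        return TOK_IDENT;   /* anything else is treated as an identifier */
}
flex generates the table-driven equivalent of this, plus the buffering and longest-match logic you would otherwise write by hand.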
You'd want to use a tool like flex instead of one of the above for one of several reasons:
The error handling can be made much better.
You can be much more flexible in what different things you recognize with flex. For instance, it is tough to parse a C-format hexadecimal value properly with scanf routines. scanf pretty much has to know the hex value is coming. Lex can figure it out for you.
Lex scanners are faster. If you are parsing a lot of files, and/or large ones, this could become important.
You would consider using a lexical analyzer because you could use BNF (or EBNF) to describe your language (the grammar) declaratively, and then just use a parser to parse a program written in your language, get it into a structure in memory, and then manipulate it freely.
It's not obligatory and you can of course write your own, but that depends on how complex the language is and how much time you have to reinvent the wheel.
Also, the fact that you can use a language (BNF) to describe your language without changing the lexical analyzer itself enables you to run many experiments and change the grammar of your language until you have exactly what works for you.