MalformedInputException when trying to read entire file

I have a 132 KB file (you can't really say it's big) and I'm trying to read it from the Scala REPL, but I can't read past 2048 characters because it gives me a java.nio.charset.MalformedInputException exception
These are the steps I take:
val it = scala.io.Source.fromFile("docs/categorizer/usig_calles.json") // this is ok
it.take(2048).mkString // this is ok too
it.take(1).mkString // BANG!
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
Any idea what could be wrong?
--
Apparently the problem was that the file was not UTF-8 encoded
I saved it as UTF-8 and everything works; I just issue mkString on the iterator and it retrieves the whole contents of the file
The strange thing is that the error only arose past the first 2048 chars...

Cannot be certain without the file, but the documentation on the exception indicates it is thrown "when an input byte sequence is not legal for given charset, or an input character sequence is not a legal sixteen-bit Unicode sequence." (MalformedInputException javadoc)
I suspect that position 2049 holds the first character that is not valid in whatever the default JVM character encoding is in your environment. Consider explicitly stating the character encoding of the file using one of the overloads of fromFile.
If the application will be cross-platform, you should know that the default character encoding of the JVM varies by platform, so if you operate with a specific encoding you either want to set it explicitly as a command-line parameter when launching your application, or specify it at each call using the appropriate overload.

Any time you call take twice on the same iterator, all bets are off. Iterators are inherently imperative, and mixing them with functional idioms is dicey at best. Most of the iterators you come across in the standard library tend to be fairly well-behaved in this respect, but once you've used take, or drop, or filter, etc., you're in undefined-behavior land, and in principle anything could happen.
From the docs:
It is of particular importance to note that, unless stated otherwise,
one should never use an iterator after calling a method on it. The two
most important exceptions are also the sole abstract methods: next
and hasNext ...
def take(n: Int): Iterator[A] ...
Reuse: After calling this method, one should discard the iterator it
was called on, and use only the iterator that was returned. Using the
old iterator is undefined, subject to change, and may result in
changes to the new iterator as well.
So it's probably not worth trying to track down exactly what went wrong here.

If you just wish to treat the bytes as plain Latin-1 data:
// File:
io.Source.fromFile(file)(io.Codec.ISO8859).mkString
// InputStream:
io.Source.fromInputStream(System.in)(io.Codec.ISO8859).mkString

Extracting the domain extension of a URL stored in a string using scanf()

I am writing code that takes a URL address as input, then looks the domain extension of the URL up in an array and returns the index if it finds a match, or -1 if it does not.
For example, an input would be www.stackoverflow.com, in this case, I'd need to extract only the com part. In case of www.google.com.tr, I'd need only com again, ignoring the .tr part.
I can think of basically writing a function that'll do that just fine but I'm wondering if it is possible to do it using scanf() itself?
Using scanf here is really more trouble than it's worth, but you can do something similar with:
char a[MAXLEN], b[MAXLEN], c[MAXLEN];
scanf("%[^.].%[^.].%[^. \n]", a, b, c);
printf("Desired part is = %s\n", c);
To be sure that the format is correct, you can check whether the scanf call succeeded. For example:
if (3 != scanf("%[^.].%[^.].%[^. \n]", a, b, c)) {
    fprintf(stderr, "Format must be at least sth.something.sth\n");
    exit(EXIT_FAILURE);
}
The other way of achieving the same thing is to use fgets to read the whole line and then parse it with strtok using "." as the delimiter. This way you get the parts one at a time. With fgets you can easily support different kinds of rules; instead of building everything into scanf (which gets difficult in the error cases), fgets and strtok can do the same job more flexibly.
The scanf solution above only considers the first three parts of the URL; the rest is not parsed. But that is hardly the practical situation: most of the time you have to process the whole thing, all the parts of the URL (and you don't know in advance how many parts there will be). In that case you are better off using fgets/strtok as mentioned above.

Tcl string function that does "shimmering" ruined my customized Tcl type defined in C

I have defined a customized Tcl type using the Tcl library in C/C++. I basically make Tcl_Obj.internalRep.otherValuePtr point to my own data structure. The problem happens when calling [string length myVar] or other similar string functions that exhibit the so-called shimmering behaviour, which replaces my internalRep with the string type's own structure. After such a string function, myVar cannot be converted back, because my complicated data structure cannot be reconstructed from the Tcl_Obj.bytes representation, and the type is no longer my customized type. How can I avoid that?
The string length command converts the internal representation of the values it is given to the special string type, which records information to allow many string operations to be performed rapidly. Apart from most of the string command's various subcommands, the regexp and regsub commands are the main ones that do this (for their string-to-match-the-RE-against argument). If you have a precious internal representation of your own and do not wish to lose it, you should avoid those commands; there are some operations that avoid the trouble. (Tcl mostly assumes that internal representations are not fragile, and therefore that they can be regenerated on demand. Beware when using fragile ones!)
The key operations that are mostly safe (as in they generate the bytes/length rep through calling the updateStringProc if needed, but don't clear the internal rep) are:
substitution into a string; the substituted value won't have the internal rep, but it will still be in the original object.
comparison with the eq and ne expression operators. This is particularly relevant for checks to see if the value is the empty string.
Be aware that there are many other operations that spoil the internal representation in other ways, but most don't catch people out so much.
[EDIT — far too long for a comment]: There are a number of relatively well-known extensions that work this way (e.g., TCOM and Tcl/Java both do this). The only thing you can really do is “be careful” as the values really are fragile. For example, put them in an array and then pass the indexes into the array around instead, as those need not be fragile. Or keep things as elements in a list (probably in a global variable) and pass around the list indices; those are just plain old numbers.
The traditional, robust approach is to put a map (e.g., a Tcl_HashTable or std::map) in your C or C++ code and have the indices into that be short strings with not too much meaning (I like to use the name of the type of value followed by either a sequence number or a serialisation of the pointer, such as you might get with the %p conversion in sprintf(); the printed pointer reveals more of the implementation details, is a little more helpful if you're debugging, and generally doesn't actually make that much difference in practice). You then have the removal of things from the map be an explicit deletion operation, and it is also easy to provide operations like listing all the known current values. This is safe, but prone to “leaking” (though it's not formally a memory leak if you provide the listing operation). It can be accelerated by caching the lookup in a Tcl_Obj*'s internal representation (a cheap way to handle deletion is to use a sequence number that you increment when you delete something, and only bypass the map lookup if the sequence number that you cache in the intrep is equal to the main sequence number) but it's not usually a big deal; only go to that sort of thing if you've measured a bottleneck in the lookups.
But I'd probably just live with fragility in my own code, and would just take care to ensure that I never bust the assumptions. The problem is really that you're being incautious about how you use the values; the Tcl code should just pass them around and nothing else really. Also, I've experimented a fair bit with wrapping such things up inside a TclOO object; it's far too heavyweight (by the design of TclOO) for values that you're making a lot of, but if you've only got a few of them and you're wanting to treat them as objects with methods, this can work very well indeed (and gives many more options for automatic cleanup).

Good way to create an assembler?

I have homework to do for my school. The goal is to create a really basic virtual machine as well as a simple assembler. I had no problem creating the virtual machine, but I can't think of a 'nice' way to create the assembler.
The grammar of this assembler is really basic: an optional label followed by a colon, then a mnemonic followed by 1, 2 or 3 operands. If there is more than one operand they shall be separated by commas. Also, whitespaces are ignored as long as they don't occur in the middle of a word.
I'm sure I can do this with strtok() and some black magic, but I'd prefer to do it in a 'clean' way. I've heard about Parse Trees/AST, but I don't know how to translate my assembly code into these kinds of structures.
I wrote an assembler like this when I was a teenager. You don't need a complicated parser at all.
All you need to do is five steps for each line:
Tokenize (i.e. split the line into tokens). This will give you an array of tokens and then you don't need to worry about the whitespace, because you will have removed it during tokenization.
Initialize some variables representing parts of the line to NULL.
A sequence of if statements to walk over the token array and check which parts of the line are present. If they are present put the token (or a processed version of it) in the corresponding variable, otherwise leave that variable as NULL (i.e. do nothing).
Report any syntax errors (i.e. combinations of types of tokens that are not allowed).
Code generation - I guess you know how to do this part!
What you're looking for is actually lexical analysis, parsing, and finally the generation of the compiled code. There are a lot of frameworks out there which help with creating/generating a parser, like Gold Parser or ANTLR. Creating a language definition (and learning how to, depending on the framework you use) is most often quite a lot of work.
I think you're best off implementing the shunting-yard algorithm, which converts your source into a representation computers understand and which is easy for your virtual machine to process.
I also want to say that diving into parsers, abstract syntax trees, all the tools available on the web and reading a lot of papers about this subject is a really good learning experience!
You can take a look at some already-made assemblers, like PASMO, an assembler for the Z80 CPU, and get ideas from it. Here it is:
http://pasmo.speccy.org/
I've written a couple of very simple assemblers, both of them using string manipulation with strtok() and the like. For a grammar as simple as assembly language's, that's enough. Key pieces of my assemblers are:
A symbol table: just an array of structs, with the name of a symbol and its value.
typedef struct
{
    char nombre[256];   /* symbol name */
    u8 valor;           /* symbol value */
} TSymbol;

TSymbol tablasim[MAXTABLA];
int maxsim = 0;
A symbol is just a name that has a value associated with it. This value can be the current position (the address where the next instruction will be assembled), or it can be an explicit value assigned by the EQU pseudo-instruction.
Symbol names in this implementation are limited to 255 characters each, and one source file is limited to MAXTABLA symbols.
I perform two passes to the source code:
The first one is to identify symbols and store them in the symbol table, detecting whether they are followed by an EQU instruction or not. If there is one, the value next to EQU is parsed and assigned to the symbol. Otherwise, the value of the current position is assigned. To update the current position I have to detect whether there is a valid instruction (although I do not assemble it yet) and update it accordingly (this is easy for me because my CPU has a fixed instruction size).
Here you have a sample of my code that is in charge of updating the symbol table with a value from EQU or the current position, and advancing the current position if needed.
case 1:
    if (es_equ (token))
    {
        token = strtok (NULL, "\n");
        tablasim[maxsim].valor = parse_numero (token, &err);
        if (err)
        {
            if (err == 1)
                fprintf (stderr, "Syntax error on line %d\n", nlinea);
            else if (err == 2)
                fprintf (stderr, "Symbol [%s] not found on line %d\n", token, nlinea);
            estado = 2;
        }
        else
        {
            maxsim++;
            token = NULL;
            estado = 0;
        }
    }
    else
    {
        tablasim[maxsim].valor = pcounter;
        maxsim++;
        if (es_instruccion (token))
            pcounter++;
        token = NULL;
        estado = 0;
    }
    break;
The second pass is where I actually assemble instructions, replacing a symbol with its value when I find one. It's rather simple, using strtok() to split a line into its components, and strncasecmp() to compare what I find with the instruction mnemonics.
If the operands can be expressions, like "1 << (x + 5)", you will need to write a parser. If not, the parser is so simple that you do not need to think in those terms. For each line, get the first string (skipping whitespace). Does the string end with a colon? Then it is a label; else it is the mnemonic. And so on.
For an assembler there's little need to build an explicit parse tree. Some assemblers do have fancy linkers capable of resolving complicated expressions at link time, but for a basic assembler an ad-hoc lexer and parser should do fine.
In essence you write a little lexer which consumes the input file character-by-character and classifies everything into simple tokens, e.g. numbers, labels, opcodes and special characters.
I'd suggest writing a BNF grammar even if you're not using a code generator. This specification can then be translated into a recursive-descent parser almost by rote. The parser simply walks through the whole code and emits assembled binary code along the way.
A symbol table registering every label and its value is also needed, traditionally implemented as a hash table. Initially when encountering an unknown label (say for a forward branch) you may not yet know the value however. So it is simply filed away for future reference.
The trick is then to spit out dummy values for labels and expressions the first time around but compute the label addresses as the program counter is incremented, then take a second pass through the entire file to fill in the real values.
For a simple assembler, e.g. no linker or macro facilities and a simple instruction set, you can get by with perhaps a thousand or so lines of code. Much of it is brainless, thought-free hand translation from syntax descriptions and opcode tables.
Oh, and I strongly recommend that you check out the dragon book from your local university library as soon as possible.
At least in my experience, normal lexer/parser generators (e.g., flex, bison/byacc) are all but useless for this task.
When I've done it, nearly the entire thing has been heavily table driven -- typically one table of mnemonics, and for each of those a set of indices into a table of instruction formats, specifying which formats are possible for that instruction. Depending on the situation, it can make sense to do that on a per-operand rather than a per-instruction basis (e.g., for mov instructions that have a fairly large set of possible formats for both the source and the destination).
In a typical case, you'll have to look at the format(s) of the operand(s) to determine the instruction format for a particular instruction. For a fairly typical example, a format of #x might indicate an immediate value, x a direct address, and @x an indirect address. Another common form for an indirect address is (x) or [x], but for your first assembler I'd try to stick to a format that specifies the instruction format/addressing mode based only on the first character of the operand, if possible.
Parsing labels is simpler, and (mostly) separate. Basically, each label is just a name with an address.
As an aside, if possible I'd probably follow the typical format of a label ending with a colon (":") instead of a semicolon (";"). Much more often, a semicolon will mark the beginning of a comment.

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$', some number of digits, ';') and the data looked like this: "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick !brown ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double-buffer approach work, and if so, how does one manage the current locations, etc.?
If I've followed what you are asking about, it is called lexical analysis, or tokenization, or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize the input and perform different actions on it.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then, knowing that you are in a different state because you saw a $, you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well-formed number followed by a ; wasn't entirely clear in your question -- what to do if there are a million digits before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar-sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non-$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

UTF-8 tuple storage using lowest common technological denominator, append-only

EDIT: Note that due to the way hard drives actually write data, none of the schemes in this list work reliably. Do not use them. Just use a database. SQLite is a good simple one.
What's the most low-tech but reliable way of storing tuples of UTF-8 strings on disk? Storage should be append-only for reliability.
As part of a document storage system I'm experimenting with I have to store UTF-8 tuple data on disk. Obviously, for a full-blown implementation, I want to use something like Amazon S3, Project Voldemort, or CouchDB.
However, at the moment, I'm experimenting and haven't even firmly settled on a programming language yet. I have been using CSV, but CSV tends to become brittle when you try to store outlandish Unicode and unexpected whitespace (e.g. vertical tabs).
I could use XML or JSON for storage, but they don't play nice with append-only files. My best guess so far is a rather idiosyncratic format where each string is preceded by a 4-byte signed integer indicating the number of bytes it contains, and an integer value of -1 indicates that this tuple is complete - the equivalent of a CSV newline. The main source of headaches there is having to decide on the endianness of the integer on disk.
Edit: actually, this won't work. If the program exits while writing a string, the data becomes irrevocably misaligned. Some sort of out-of-band signalling is needed to ensure alignment can be regained after an aborted tuple.
Edit 2: Turns out that guaranteeing atomicity when appending to text files is possible, but the parser is quite non-trivial. Writing said parser now.
Edit 3: You can view the end result at http://github.com/MetalBeetle/Fruitbat/tree/master/src/com/metalbeetle/fruitbat/atrio/ .
I would recommend tab delimiting each field and carriage-return delimiting each record.
Within each string, replace all characters that would affect the field and record interpretation and rendering. This would include control characters (U+0000–U+001F, U+007F–U+009F), non-graphical line and paragraph separators (U+2028, U+2029), directional control characters (U+202A–U+202E), and the byte order mark (U+FEFF).
They should be replaced with escape sequences of constant length. The escape sequences should begin with a rare (for your application) character. The escape character itself should also be escaped.
This would allow you to append new records easily. It has the additional advantage of being able to load the file for visual inspection and modification into any spreadsheet or word processing program, which could be useful for debugging purposes.
This would also be easy to code, since the file will be a valid UTF-8 document, so standard text reading and writing routines may be used. This also allows you to convert easily to UTF-16BE or UTF-16LE if desired, without complications.
Example:
U+0009 CHARACTER TABULATION becomes ~TB
U+000A LINE FEED becomes ~LF
U+000D CARRIAGE RETURN becomes ~CR
U+007E TILDE becomes ~~~
etc.
There are a couple of reasons why tabs would be better than commas as field delimiters. Commas appear more commonly within normal text strings (such as English text), and would have to be replaced more frequently. And spreadsheet programs (such as Microsoft Excel) tend to handle tab-delimited files much more naturally.
Mostly thinking out loud here...
Really low tech would be to use (for example) null bytes as separators, and just "quote" all null bytes appearing in the output with an additional null.
Perhaps one could use SCSU along with that.
Or it might be worth to look at the gzip format, and maybe ape it, if not using it:
A gzip file consists of a series of "members" (compressed data sets).
[...]
The members simply appear one after another in the file, with no additional information before, between, or after them.
Each of these members can have an optional "filename", comment, or the like, and I believe you can just keep appending members.
Or you could use bencode, used in torrent-files. Or BSON.
See also Wikipedia's Comparison of data serialization formats.
Otherwise I think your idea of preceding each string with its length is probably the simplest one.