Understanding this character database format

For a project on character recognition, I found a database I could use as a training set. However, I cannot understand its format, even with the instructions below that came with it. I could find no further help on how to figure this format out.
Fields 1-6 are separated by commas.
ID number of source article
2-byte symbol code (written in hexadecimal, using 4 bytes)
Character height of bitmap
Character width of bitmap
Bitmap image, where each 8-bit unit is written as a decimal from 0 to 255
Line feed
The link to the database file (on Google Drive) is below.
https://drive.google.com/file/d/0B-WsCQkhd_1iUUtJdHg0R1hfTHM/view?usp=sharing
It would be of great help if someone could figure out the way this format is presented. It is literally puzzling me.

Well, as far as I can understand this format, every character description takes one line (up to the line feed).
ID number of source article
2-byte symbol code (written in hexadecimal, using 4 bytes)
Character height of bitmap
Character width of bitmap
Bitmap image, where each 8-bit unit is written as a decimal from 0 to 255. This is where the magic starts: the bitmap image is not a single comma-separated value but every value up to the line feed. So you get a long run of comma-separated values that you can divide into rows using the bitmap height and width.
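A minimal sketch in C of reading one such line, under some assumptions: the names CharRecord, parse_record, pixels_per_unit and the MAX_UNITS limit are made up for illustration, and the ID is assumed to be a plain decimal number. The instructions do not say whether each value is one pixel or eight packed pixels, so the helper only compares the number of values with the two obvious layouts (packed rows are assumed to be padded to a whole byte).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_UNITS 4096  /* assumed upper bound on bitmap values per line */

typedef struct {
    long article_id;                /* field 1 */
    long symbol_code;               /* field 2, parsed from hexadecimal */
    long height, width;             /* fields 3 and 4 */
    size_t n_units;                 /* how many bitmap values followed */
    unsigned char units[MAX_UNITS]; /* field 5, the raw values */
} CharRecord;

/* Reads one integer field in the given base and requires a comma after it. */
static int read_field(const char **p, int base, long *out)
{
    char *end;
    long v = strtol(*p, &end, base);
    if (end == *p || *end != ',')
        return -1;
    *out = v;
    *p = end + 1;
    return 0;
}

/* Parses one line of the database; returns 0 on success, -1 if malformed. */
int parse_record(const char *line, CharRecord *rec)
{
    assert(line != NULL && rec != NULL);
    const char *p = line;
    if (read_field(&p, 10, &rec->article_id) ||
        read_field(&p, 16, &rec->symbol_code) ||
        read_field(&p, 10, &rec->height) ||
        read_field(&p, 10, &rec->width))
        return -1;
    if (rec->height <= 0 || rec->width <= 0)
        return -1;

    /* Everything left on the line belongs to the bitmap. */
    rec->n_units = 0;
    for (;;) {
        char *end;
        long v = strtol(p, &end, 10);
        if (end == p)
            break;                  /* nothing left but the line feed */
        if (v < 0 || v > 255 || rec->n_units == MAX_UNITS)
            return -1;
        rec->units[rec->n_units++] = (unsigned char)v;
        if (*end != ',')
            break;
        p = end + 1;
    }
    return 0;
}

/* Returns 1 if there is one value per pixel, 8 if each value seems to
   pack eight pixels of a row, or 0 if neither count matches. */
int pixels_per_unit(const CharRecord *rec)
{
    size_t h = (size_t)rec->height, w = (size_t)rec->width;
    if (rec->n_units == h * w)
        return 1;
    if (rec->n_units == h * ((w + 7) / 8))
        return 8;
    return 0;
}
```

Reading the file line by line with fgets and passing each line to parse_record should be enough to check which layout the database really uses.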
If you open this file in, for example, Notepad++ instead of the standard Windows Notepad, you will get a somewhat better view (turn on "Show all characters" to see the line feeds).
Hope it will help you.

Related

About the wav data sub-chunk

I am working on a project in which I have to merge two 8-bit .wav files using C, and I still have no clue how to do it.
I have read about wav files and I want to start by reading one of the files.
There's one thing I didn't understand:
Let's say I have an 8-bit WAV audio file, and I am able to read (even though I am still working on it) the data that starts after the 44th byte. I will get numbers between 0 and 255, logically.
My question is:
What do those numbers mean?
If I get 255 or 0 what do they mean?
Are they samples from the wave?
Can anyone please explain?
Thanks in advance

Assuming we're not dealing with file format issues, getting values between 0 and 255 means that the audio samples are of unsigned eight-bit format, as you have put it.
One way of merging the data is to read both files into buffers, say arrays a and b, and sum them value by value: c[i] = a[i] + b[i]. In doing so, you have to take care of the following:
the lengths of the files may not be equal
summing unsigned 8-bit buffers such as yours will almost certainly overflow
The summing is usually done in a for loop. You first get the sizes of the data chunks. The loop has to be written so that it neither reads past the end of either array nor ignores data that can still be read. To prevent overflow you can either:
divide values by two on reading
or
read (convert) into a format which wouldn't overflow, then normalize and convert the merged data back into the original format or whichever format desired.
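As a rough sketch of the second option, the function below (mix_u8 is a made-up name) converts each unsigned 8-bit sample to a signed value, averages, and converts back. It relies on the fact that in unsigned 8-bit PCM the value 128 is silence, 0 is the most negative excursion and 255 the most positive; the shorter input is treated as silent once it runs out. Averaging halves the volume of each source, which is one of several possible normalizations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mixes two unsigned 8-bit sample buffers into out, which must have room
   for max(na, nb) samples. Returns the number of samples written. */
size_t mix_u8(const uint8_t *a, size_t na,
              const uint8_t *b, size_t nb, uint8_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t n = na > nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        /* Re-centre around zero; past the end of a buffer use silence. */
        int sa = (i < na ? a[i] : 128) - 128;
        int sb = (i < nb ? b[i] : 128) - 128;
        int mixed = (sa + sb) / 2;   /* int is wide enough, no overflow */
        out[i] = (uint8_t)(mixed + 128);
    }
    return n;
}
```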
For the particulars of reading from and writing to a .wav file, you can use one of the existing audio file libraries or write your own routine. Dealing with an audio file format is not trivial, though. Here's a reference on the .wav format.
Here are a few audio file APIs worth looking at:
libsndfile
sndlib
Hope this can help.

See any good guide to WAVE for information on the format of samples in the data chunk, such as this one I found: http://www.neurophys.wisc.edu/auditory/riff-format.txt
Relevant excerpts:
In a single-channel WAVE file, samples are stored consecutively. For
stereo WAVE files, channel 0 represents the left channel, and channel
1 represents the right channel. The speaker position mapping for more
than two channels is currently undefined. In multiple-channel WAVE
files, samples are interleaved.
Data Format of the Samples
Each sample is contained in an integer i. The size of i is the
smallest number of bytes required to contain the specified sample
size. The least significant byte is stored first. The bits that
represent the sample amplitude are stored in the most significant bits
of i, and the remaining bits are set to zero.
For example, if the sample size (recorded in nBitsPerSample) is 12
bits, then each sample is stored in a two-byte integer. The least
significant four bits of the first (least significant) byte is set to
zero.
The data format and maximum and minimums values for PCM waveform
samples of various sizes are as follows:
Sample Size         Data Format        Maximum Value                 Minimum Value
One to eight bits   Unsigned integer   255 (0xFF)                    0
Nine or more bits   Signed integer i   Largest positive value of i   Most negative value of i
N.B.: Even if the file has >8 bits of audio resolution, you should read the file as an array of unsigned char and reconstitute the larger samples manually as per the above spec. Don't try to do anything like reading the samples directly over an array of native C ints, as their layout and size is platform-dependent and therefore should not be relied upon in any code.
Note also that the header is not guaranteed to be 44 bytes long (see "How can I detect whether a WAV file has a 44 or 46-byte header?"). You need to read the lengths stored in the file and process the header based on them, not on any assumption.
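One way to avoid the 44-byte assumption is to walk the RIFF chunks until the "data" chunk turns up. The sketch below works on a file that has already been read into memory; find_wav_data and read_u32_le are hypothetical helper names, and only the outer chunk structure is checked, not the contents of the "fmt " chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit little-endian value byte by byte. */
static uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the offset of the first sample byte and stores the size of the
   data chunk in *data_size, or returns 0 if no data chunk is found. */
size_t find_wav_data(const unsigned char *buf, size_t len, uint32_t *data_size)
{
    assert(buf != NULL && data_size != NULL);
    if (len < 12 || memcmp(buf, "RIFF", 4) != 0 ||
        memcmp(buf + 8, "WAVE", 4) != 0)
        return 0;

    size_t pos = 12;                        /* first chunk after "WAVE" */
    while (pos + 8 <= len) {
        uint32_t size = read_u32_le(buf + pos + 4);
        if (memcmp(buf + pos, "data", 4) == 0) {
            *data_size = size;
            return pos + 8;
        }
        if (size > len - pos - 8)
            return 0;                       /* chunk claims more than we have */
        pos += 8 + (size_t)size + (size & 1);  /* chunks are padded to even size */
    }
    return 0;
}
```

With a 16-byte "fmt " chunk this yields offset 44, with an 18-byte one it yields 46, and extra chunks before the data are skipped the same way.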

Unicode with MAX7219

I'm trying to implement Asian symbols with my max7219 and using the 8x8 led displays.
I've had a look online and I've found libraries for the max7219 but it is only in ASCII. I was wondering if there was an easy way of implementing using a UNICODE library - assuming there is one.
I'd like to easily copy and paste say " な " this character into my code and print it onto the LED displays. So far, all attempts have not been working. My other option is to use binary/hex to manually draw up the symbols but I would really prefer to make it easy for the user to copy and paste any character and it prints onto the LEDs. Or will I have to create my own Arduino Library?
Any help is greatly appreciated!
Many thanks.

The problem with Unicode is that it's just so damn big (the first kana is U+3041), and most Arduinos have not nearly enough flash to store all the characters required.
My recommendation is to use an 8-bit encoding that maps to all the characters you need. I suggest starting with the character set used by the HD44780UA00 and replacing the characters where they make sense. Since some other libraries already use this set it won't be a huge leap to use them with your display.

You can't copy and paste a character into an 8x8 matrix.
You have to find an 8x8 matrix of your font (katakana, kanji, etc.), and build an array that contains all the characters in a bit-by-bit format.
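Such an array might look like the sketch below. The type Glyph8x8 and the functions find_glyph and glyph_pixel are made-up names, and the bit patterns are placeholders: the entry for U+306A is a plain box, not a real rendering of the kana, and would have to be replaced with rows taken from an actual 8x8 font.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t codepoint;   /* Unicode code point this entry stands for */
    uint8_t rows[8];      /* one byte per row, bit 7 = leftmost column */
} Glyph8x8;

/* Only the characters the project needs are stored, to save flash. */
static const Glyph8x8 font[] = {
    /* Placeholder box, NOT the real shape of U+306A. */
    { 0x306A, { 0xFF, 0x81, 0x81, 0x81, 0x81, 0x81, 0x81, 0xFF } },
    /* Rough hand-drawn 'A', for illustration only. */
    { 0x0041, { 0x18, 0x24, 0x42, 0x42, 0x7E, 0x42, 0x42, 0x00 } },
};

/* Linear search; returns NULL when the character is not in the table. */
const Glyph8x8 *find_glyph(uint32_t codepoint)
{
    for (size_t i = 0; i < sizeof font / sizeof font[0]; i++)
        if (font[i].codepoint == codepoint)
            return &font[i];
    return NULL;
}

/* Returns 1 if the LED at (row, col) should be lit, 0 otherwise. */
int glyph_pixel(const Glyph8x8 *g, int row, int col)
{
    assert(g != NULL && row >= 0 && row < 8 && col >= 0 && col < 8);
    return (g->rows[row] >> (7 - col)) & 1;
}
```

Each of the eight row bytes can then be sent to the display driver by whatever MAX7219 library is in use.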
To all who have commented that MAX7219 is a 7-segment display, I want to say that MAX7219 is an IC which contains a buffer of 8x8 bit. You can use it to drive an 8x8 LED matrix, or an 8-in-line 7-segments (plus a decimal point) display, or anything else.

ASCII characters in RGB565 format

I want to show some text in a 640*480 screen. Where can I get the codes for ASCII characters in RGB565 format for a C program, such that I can have a natural look-and-feel as a command-line terminal for such a screen.
1- What would be the best width-height for a character?
2- Where can I get the 16-bit hex code (known as Bitmap Font or Raster Font) for each character?
e.g. const unsigned short myChar[] = {0x0001, 0x0002, 0x0003, 0x0004 ...}

"... the 16-bit hex code ..." is a misconception. You must have meant 16 bytes – one byte (8 pixels) per character line. A 640*480 screen resolution with 'natural' sized text needs 8x16 bitmaps. That will show as 30 lines of 80 columns (the original MCGA screens actually showed only 25 lines, but that was with the equivalent of 640*400 – stretched a bit).
Basic Google-fu turns up this page: https://fossies.org/dox/X11Basic-1.23/8x16_8c_source.html, and the character set comes pretty close to how I remember it from ye olde monochrome monitors: [a]
................................................................
................................................................
................................................................
................................................................
...XXXX.........................................................
....XX..........................................................
....XX..........................................................
....XX...XXXXX..XX.XXX...XXX.XX.XX...XX..XXXX...XX.XXX...XXXXX..
....XX..XX...XX..XX..XX.XX..XX..XX...XX.....XX...XXX.XX.XX...XX.
....XX..XX...XX..XX..XX.XX..XX..XX.X.XX..XXXXX...XX..XX.XXXXXXX.
XX..XX..XX...XX..XX..XX.XX..XX..XX.X.XX.XX..XX...XX.....XX......
XX..XX..XX...XX..XX..XX..XXXXX..XXXXXXX.XX..XX...XX.....XX...XX.
.XXXX....XXXXX...XX..XX.....XX...XX.XX...XXX.XX.XXXX.....XXXXX..
........................XX..XX..................................
.........................XXXX...................................
................................................................
Since this is a simple monochrome bitmap pattern, you don't need "RGB565 format for a C program" (another misconception). It is way easier to loop over each bitmap and use your local equivalent of PutPixel to draw each character in any color you want. You can choose between not drawing the background (the 0 pixels) at all, or having a "background color". The space at the bottom of the bitmap is large enough to put in an underline.
That said: I've used such bitmaps for years but I recently switched to a fully antialiased gray shade format. The bitmaps are thus larger (a byte per pixel instead of a single bit) but you don't have to loop over individual bits anymore, which is a huge plus. Another is, I now can use the shades of gray as they are (thus drawing 'opaque') or treat them as alpha, and get nicely antialiased text in any color and over any background.
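Treating a gray shade as alpha over an RGB565 pixel could be sketched as follows. This is a generic linear blend, not necessarily the author's code; blend_channel, blend565 and blend_row are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blends one 5- or 6-bit channel: alpha 0 keeps bg, alpha 255 gives fg. */
static unsigned blend_channel(unsigned bg, unsigned fg, unsigned alpha)
{
    return (fg * alpha + bg * (255 - alpha) + 127) / 255;
}

/* Blends fg over bg, both RGB565, using an 8-bit gray shade as alpha. */
uint16_t blend565(uint16_t bg, uint16_t fg, uint8_t alpha)
{
    unsigned r = blend_channel(bg >> 11, fg >> 11, alpha);
    unsigned g = blend_channel((bg >> 5) & 0x3F, (fg >> 5) & 0x3F, alpha);
    unsigned b = blend_channel(bg & 0x1F, fg & 0x1F, alpha);
    return (uint16_t)((r << 11) | (g << 5) | b);
}

/* Draws one row of an antialiased glyph: shades[i] is the coverage of
   pixel i, dst holds the pixels already on screen. */
void blend_row(uint16_t *dst, const uint8_t *shades, size_t n, uint16_t fg)
{
    assert(dst != NULL && shades != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = blend565(dst[i], fg, shades[i]);
}
```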
That looks like this: [image not preserved]
I did not draw this font; I liked the way it looked on my terminal, so I wrote a C program to dump a basic character set and grabbed a copy of the screen. Then I converted the image to pure grayscale and wrote a quick-and-dirty program to convert the raw data into a proper C structure.
[a] Not entirely true. The font blitter in the MCGA video card added another column at the right of each character, so effectively the text was 9x16 pixels. For the small set of border graphics (╔╦╤╕╩ and so on), the extra column got copied from the rightmost one.

Not the most elegant solution, but I created an empty BMP image and filled it with characters.
Then I used this tool to convert the BMP file to a C bitmap array.
You should then be able to distinguish the characters in your array.

If you can access some type of 16-bit DOS mode, you might be able to get the fonts from a BIOS INT 10 (hex 10) call. In this example, the address of the font table is returned in es:bp (es is usually 0xc000). This works for 16-bit programs in Windows DOS console mode on 32-bit versions of Windows. For 64-bit versions of Windows, DOSBox may work, or using a virtual PC should also work. If this doesn't work, do a web search for "8 by 16 font", which should get you some example fonts.
INT 10 - VIDEO - GET FONT INFORMATION (EGA, MCGA, VGA)
AX = 1130h
BH = pointer specifier
00h INT 1Fh pointer
01h INT 43h pointer
02h ROM 8x14 character font pointer
03h ROM 8x8 double dot font pointer
04h ROM 8x8 double dot font (high 128 characters)
05h ROM alpha alternate (9 by 14) pointer (EGA,VGA)
06h ROM 8x16 font (MCGA, VGA)
07h ROM alternate 9x16 font (VGA only) (see #0020)
11h (UltraVision v2+) 8x20 font (VGA) or 8x19 font (autosync EGA)
12h (UltraVision v2+) 8x10 font (VGA) or 8x11 font (autosync EGA)
Return: ES:BP = specified pointer
CX = bytes/character of on-screen font (not the requested font!)
DL = highest character row on screen
Note: for UltraVision v2+, the 9xN alternate fonts follow the corresponding
8xN font at ES:BP+256N
BUG: the IBM EGA and some other EGA cards return in DL the number of rows on
screen rather than the highest row number (which is one less).
SeeAlso: AX=1100h,AX=1103h,AX=1120h,INT 1F"SYSTEM DATA",INT 43"VIDEO DATA"

parsing const char * returns a little up looking triangle. which character is it?

Hi, I am writing a function that parses the elements of a const char * variable one by one and
then writes each of the characters to a txt file. It recognizes all of them,
but I have some questions. When the output moves to a new line and does not show the recognized character, does this mean that the character is '\n'?
And for the last character in the const char * variable it shows me a little triangle, pointing up.
It has no colour, only 3 lines forming a little triangle. I am using UTF-8 encoding.
Which character does the triangle represent?
Any advice welcome.

(I am not sure I understand where the little triangle is from. But this seems irrelevant.) This link will make you happy: To lookup unknown Unicode glyphs by shape, see shapecatcher.com. Just draw the character, and shapecatcher looks up glyphs that look similar.
Internally uses enslaved ape brains.
EDIT: Based on your comments, you are actually outputting a '\0' character to the file. How this gets displayed as small triangle is not clear - maybe you are looking at the file via an editor that displays '\0' this way.
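If that is the cause, stopping the loop at the terminator instead of writing it should make the triangle disappear. A small sketch (write_chars is a hypothetical name, since the original function was not shown):

```c
#include <assert.h>
#include <stdio.h>

/* Writes the characters of s one by one, stopping BEFORE the terminating
   '\0' so that it never ends up in the file. Returns the count written. */
size_t write_chars(const char *s, FILE *out)
{
    assert(s != NULL && out != NULL);
    size_t n = 0;
    for (const char *p = s; *p != '\0'; p++) {
        if (fputc((unsigned char)*p, out) == EOF)
            break;                      /* stop on a write error */
        n++;
    }
    return n;
}
```

A loop that runs up to and including index strlen(s), or up to sizeof of a char array, writes one byte too many, and that byte is the '\0'.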

You could 'od -x' the resulting file to see the hex representation of the character.

Creating a personal image format in plain C

I am not a software engineer. Excuse me if you find the question awkward.
I'd like to have an image format which is not supposed to be memory efficient but easy to manipulate in plain C. To be more specific, I desire to store every pixel in an array of the form:
pixel[row#][column#][Color]
where the indexes row# and column# (255 at max) are coordinates, and the index Color (2 at max) contains the RGB values of the pixel specified by the position ( i.e. pixel[255][255][1] is used to check or manipulate the Green amount inside the pixel on the bottom right corner ).
I aim to use this form in robotic applications to be able to find the coordinates of the first red/blue/green pixel easily by scanning the image starting from the top left corner using nested for loops (yes, not a creative solution). Here, you might say, if there is a white area in the image, the code will return the wrong coordinates. I am aware of this, but the images will not have a complex pattern, and (if necessary) I can store the irrelevant colors as if they were black. I do not care about brightness, gamma, alpha or the like, either.
So is it possible to write C (or C++ if mandatory) code to take snapshots from the webcam, say every 0.5 seconds, and convert the raw image from the webcam to the form specified above? If C cannot access the camera directly, is it possible to write code that calls a program which can access the camera, take a snapshot and store the raw data in a file? If yes, how can I read this raw data file from C so that I can at least try a conversion? I am using Windows Vista on my laptop.
Sorry for keeping the question long, but I don't want to leave any points unclear.

Yes, such a file format would be possible. Only sanity prevents it from being implemented/used.
Some formats in fairly wide use would be almost as simple to scan in the way you're considering though. A 32-bit BMP, for one example, has a small header on the file to tell a few things like the size of the picture (x,y pixel dimensions) followed by the raw pixel values, so it's basically just ColorColorColor... for the number of pixels in the image.
Code to do the scanning you're talking about with a 32-bit BMP will be pretty trivial -- the code to open the file, allocate space for a buffer to read data into, etc., could easily be longer than the scanning code itself.

Adopting a 'standard' image format also means you have tools to generate test data and independently view your results.
The easiest to code in pure C (even if it's not very efficient) is portable pixmap (ppm).
It's just a plain text file
P3 <cr> # P3 means an ascii color file R,G,B
640 480 <cr> # 640 pixels wide, 480 rows deep
255 <cr> # maximum value is 255
# then just a row of RGB values separated by space with a CR
255 0 0 0 255 0 0 0 255 #....... 640 triplets <cr>
255 255 0 255 255 255 0 0 0 # ....... next row etc
There is also a more efficient binary version where the data is pure RGB bytes, so it is very easy to read into a C array in a single operation.
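As a sketch of how little code the plain-text variant needs, the functions below read a P3 file into the pixel[row][column][color] layout from the question and scan it for the first red pixel. The names, the 256x256 size limit and the "red" thresholds are all assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_DIM 256

/* pixel[row][column][color], color index 0 = R, 1 = G, 2 = B */
typedef unsigned char Image[MAX_DIM][MAX_DIM][3];

/* Skips whitespace and '#' comments, then reads one non-negative
   integer. Returns -1 if no number is found or it is too large. */
static int read_ppm_int(FILE *f)
{
    int c = fgetc(f);
    while (c == '#' || c == ' ' || c == '\t' || c == '\n' || c == '\r') {
        if (c == '#')
            while (c != '\n' && c != EOF)
                c = fgetc(f);
        else
            c = fgetc(f);
    }
    if (c < '0' || c > '9')
        return -1;
    int value = 0;
    while (c >= '0' && c <= '9') {
        value = value * 10 + (c - '0');
        if (value > 65535)
            return -1;
        c = fgetc(f);
    }
    return value;
}

/* Reads an ASCII (P3) PPM with maximum value 255. Returns 0 on success. */
int read_ppm_p3(FILE *f, Image img, int *rows, int *cols)
{
    assert(f != NULL && rows != NULL && cols != NULL);
    if (fgetc(f) != 'P' || fgetc(f) != '3')
        return -1;
    int w = read_ppm_int(f);
    int h = read_ppm_int(f);
    int maxval = read_ppm_int(f);
    if (w < 1 || w > MAX_DIM || h < 1 || h > MAX_DIM || maxval != 255)
        return -1;
    for (int r = 0; r < h; r++)
        for (int c = 0; c < w; c++)
            for (int k = 0; k < 3; k++) {
                int v = read_ppm_int(f);
                if (v < 0 || v > 255)
                    return -1;
                img[r][c][k] = (unsigned char)v;
            }
    *rows = h;
    *cols = w;
    return 0;
}

/* Scans from the top left; "red" here means R high, G and B low
   (assumed thresholds), so white areas are not mistaken for red. */
int find_first_red(Image img, int rows, int cols, int *row, int *col)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            if (img[r][c][0] > 200 && img[r][c][1] < 50 && img[r][c][2] < 50) {
                *row = r;
                *col = c;
                return 1;
            }
    return 0;
}
```

The image array is large, so it is better declared static or allocated on the heap than placed on the stack.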