I was writing a small program which was supposed to display a ☻ character to the screen. The program is listed below:
#include <stdio.h>
main()
{
printf("☻\n");
}
However, when I run this program, I get the output of
Γÿ║
Why am I getting this output, and what should I do to get the output I want?
You're getting that because whatever terminal program you're using isn't that compatible with some Unicode encodings.
For example, my Debian box compiles that fine and it actually prints out the smiley face, because gnome-terminal is a damn fine piece of software :-)
The fact that you're seeing three characters instead of one is a fairly good indication that it's outputting UTF-8. In fact, if I run that program on my Debian box and capture the binary output with od -xcb, I see:
0000000 98e2 0abb
342 230 273 \n
342 230 273 012
0000004
showing that it is coming out in UTF-8, it's just that gnome-terminal is smart enough to turn that back into the correct glyph.
Those bytes translate to binary as follows:
e2 98 bb
1110 0010 : 1001 1000 : 1011 1011
And, using this excellent answer here, stating that bit patterns starting with 10 are continuation bytes, we can decode it as follows:
Range              Encoding  Binary value
U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

   e2          98          bb
1110 0010 : 1001 1000 : 1011 1011
     yyyy :   yy yyxx :   xx xxxx
Hence the code point is 0010 0110 : 0011 1011 which equates to 263b which, in a total lack of coincidence, is the black smiling face character.
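The same decoding can be written as a few lines of C. This is only a sketch of the three-byte case worked through above; the function name decode_utf8_3 is made up for illustration, and it does not reject overlong or surrogate encodings the way a full decoder would.

```c
#include <stdint.h>

/* Decode one 3-byte UTF-8 sequence of the form
   1110yyyy 10yyyyxx 10xxxxxx into its code point.
   Returns 0xFFFFFFFF if the bytes do not match that pattern.
   Illustrative helper only, not part of the original program. */
uint32_t decode_utf8_3(uint8_t b0, uint8_t b1, uint8_t b2)
{
    if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80)
        return 0xFFFFFFFFu;

    return ((uint32_t)(b0 & 0x0F) << 12) |   /* top 4 bits    */
           ((uint32_t)(b1 & 0x3F) << 6)  |   /* middle 6 bits */
            (uint32_t)(b2 & 0x3F);           /* low 6 bits    */
}
```

Calling decode_utf8_3(0xE2, 0x98, 0xBB) gives 0x263B, matching the hand calculation.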
In terms of fixing the problem of Windows not displaying Unicode correctly, as indicated by your comment:
I am on Windows Command Prompt. How should I make cmd.exe work with unicode?
You may want to look at this question, particularly the answer about using chcp to change the code page to 65001 (UTF-8). Note I haven't tested this, I provide it only as a pointer for you.
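Alternatively, with the Microsoft C runtime you can switch stdout to UTF-16 mode and print a wide string (the last two lines go inside main):
#include <fcntl.h>   /* _O_U16TEXT */
#include <io.h>      /* _setmode */
#include <stdio.h>   /* _fileno, wprintf */
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"☻\n");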
I need to read data from stdin and check whether 2 bytes are equal to 0x0001 or 0x0002, then execute one of two pieces of code depending on the value. Here is what I tried:
uint16_t type;
uint16_t type1 = 0x0001;
while( ( read(0, &type, sizeof(type)) ) > 0 ) {
if (type == type1) {
//do smth
}
}
I'm not sure what numeric system read uses, since printing the value of type as either decimal or hex gives something completely different, even when I write 0001 on stdin.
The characters 0x0001 and 0x0002 are control characters in ASCII and UTF-8, so you won't be able to type them manually. read returns the raw bytes, so typing 0001 delivers the character codes 0x30 0x30 0x30 0x31, not the value 1.
If you are trying to compare character input, use fgets (rather than gets, which is unsafe and was removed in C11) and atoi to convert the text from stdin into an integer. If you want to keep using read, compare against the characters '0' '0' '0' '1' or '2' instead of the numeric values.
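Here is a rough sketch of the text-based approach. I'm assuming the typed digits are meant as hexadecimal (so "0001" means 0x0001), and I've used strtoul instead of atoi because it reports when nothing could be parsed. The names parse_type and read_type are just for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse one line of text such as "0001\n" as a hexadecimal number.
   Returns 0 and stores the value in *out on success; returns -1 if the
   line has no hex digits or the value does not fit in 16 bits. */
int parse_type(const char *line, unsigned *out)
{
    char *end;
    unsigned long v = strtoul(line, &end, 16);

    if (end == line || v > 0xFFFF)
        return -1;
    *out = (unsigned)v;
    return 0;
}

/* Read one line from the stream and parse it; -1 on EOF or bad input. */
int read_type(FILE *in, unsigned *out)
{
    char buf[64];

    if (fgets(buf, sizeof buf, in) == NULL)
        return -1;
    return parse_type(buf, out);
}
```

In the loop from the question you would then call read_type(stdin, &type) and compare type against 0x0001 or 0x0002 as before.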
You didn't say what kind of OS you're using. If you're using Unix or Linux or MacOS, at the command line, it can be pretty easy to rig up simple ways of testing this sort of program. If you're using Windows, though, I'm sorry, my answer here is not going to be very useful.
(I'm also going to be making heavy use of the Unix/Linux "pipe" concept, that is, the vertical bar | which sends the output of one program to the input of the next. If you're not familiar with this concept, my answer may not make much sense at first.)
If it were me, after compiling your program to a.out, I would just run
echo 00 00 00 01 00 02 | unhex | a.out
In fact, I did compile your program to a.out, and tried this, and... it didn't work. It didn't work because my machine (like yours) uses little-endian byte order, so what I should have done was
echo 00 00 01 00 02 00 | unhex | a.out
Now it works perfectly.
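To make the byte-order point concrete, here is a small sketch (the helper names are mine) that combines two bytes both ways. On a little-endian machine, read() into a uint16_t behaves like the first function.

```c
#include <stdint.h>

/* Interpret two bytes as a 16-bit value, least significant byte first. */
uint16_t u16_from_le(const uint8_t b[2])
{
    return (uint16_t)(b[0] | (b[1] << 8));
}

/* Interpret two bytes as a 16-bit value, most significant byte first. */
uint16_t u16_from_be(const uint8_t b[2])
{
    return (uint16_t)((b[0] << 8) | b[1]);
}
```

The bytes 01 00 come out as 0x0001 little-endian but 0x0100 big-endian, which is why the first test input didn't match.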
The only problem, as you've discovered if you tried it, is that unhex is not a standard program. It's one of my own little utilities in my personal bin directory. It's massively handy, and it'd be super useful if something like it were standard, but it's not. (There's probably a semistandard Linux utility for doing this sort of thing, but I don't know what it is.)
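In case it helps, the core of such a tool is tiny. This is not my actual unhex, just a minimal sketch of the idea with an assumed interface: it turns text like "00 00 01 00" into raw bytes in a buffer.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Convert text such as "00 00 01 00" into raw bytes.
   Whitespace between byte pairs is skipped. Returns the number of bytes
   written, or -1 on a non-hex character, a lone trailing digit, or a
   full output buffer. */
long unhex(const char *text, unsigned char *out, size_t cap)
{
    size_t n = 0;

    while (*text) {
        if (isspace((unsigned char)*text)) {
            text++;
            continue;
        }
        if (!isxdigit((unsigned char)text[0]) ||
            !isxdigit((unsigned char)text[1]) || n == cap)
            return -1;

        char pair[3] = { text[0], text[1], '\0' };
        out[n++] = (unsigned char)strtoul(pair, NULL, 16);
        text += 2;
    }
    return (long)n;
}
```

Wrapping this in a main that reads stdin line by line and fwrite()s the result to stdout gives you something you can put in the middle of a pipe.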
What is pretty standard, and you probably can use it, is a base-64 decoder. The problem there is that there's no good way to hand-generate the input you need. But I can generate it for you (using my unhexer). If you can run
echo AAABAAIA | base64 -d | a.out
that will perform an equivalent test of your program. AAABAAIA is the base-64 encoding of the three 16-bit little-endian words 0000 0001 and 0002, as you can see by running
echo AAABAAIA | base64 -d | od -h
On some systems, the base64 command uses a -D option, rather than -d, to request decoding. That is, if you get an error message like "invalid option -- d", you can try
echo AAABAAIA | base64 -D | a.out
I have been coding "duel" from the book "Sixty Programmes for the Commodore 64" (by R. Erskine et al.), into my C64 mini in basic. I keep getting the following error: "? Out of data error in 60". I've checked the code for typos and can't find any. Has anyone else had this problem and do you have a fix? Thank you
Lines 5-60:
5 REM *** D U E L *** # MICHAEL BEWS
*** TRANSLATED BY IAN YATES
10 V-53248:X=RND(-TI):POKEV+32,4:POKEV+33,5:POKEV+24,23:POKE650,255:M20
20 Y$="String of C64 Characters":X$="String of C64 Characters
30 PRINT"String of C64 CharactersPLEASE WAIT WHILE USER-DEFINED",,"CHARACTERS ARE SET UP."
40 POKE52,48:POKE56,48:POKE56334,PEEK(56334)AND254:POKE1,PEEK(1)AND251
50 FORX=14336TO15143:POKEX,PEEK(X+40960):NEXT:FORX=1TO30:READA:NEXT
60 FORX=15144To15247:READA:POKEX,A:NEXT:M$="String of C64 Characters":N$="String of C64 Characters"
DATA is a way of feeding a sequence of values into a BASIC program. The number of values in the DATA statements must be greater than or equal to the number of times READ is called. If READ runs out of DATA values, it raises an "Out of Data" error.
In this case, there should be 134 values (30 read in line 50 plus 104 read in line 60), separated by commas or spread across DATA statements. However, the end of line 50 is somewhat odd: it reads 30 values into A without doing anything with them, so that part looks pointless.
Check your source for the code to see if there are any misprints or missing lines. If not, try commenting out the second FOR statement in line 50.
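The count is easy to check: FOR X=1 TO 30 runs 30 times and FOR X=15144 TO 15247 runs 104 times. Here it is sketched in C rather than BASIC, purely to show the arithmetic; the function names are made up, and it assumes a step of 1 with first <= last.

```c
/* Number of iterations of a BASIC loop FOR X = first TO last (STEP 1),
   assuming first <= last. */
long for_count(long first, long last)
{
    return last - first + 1;
}

/* Total READ calls made by lines 50 and 60 of the listing. */
long reads_needed(void)
{
    return for_count(1, 30) + for_count(15144, 15247);
}
```

reads_needed() comes to 134, so count the values in your DATA lines; if you have fewer than that, one or more DATA values (or a whole DATA line) went missing when typing the listing in.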
After many years away from programming, I've decided to take it up again for fun, and finding myself enjoying it quite a lot. In the process of finding things to code, I've found this data which is openly available from Network Rail in the United Kingdom.
Among other things, you can obtain schedule data, which is a list of all train, bus and ferry journeys.
A schedule record for a train journey might look like this:
BSNY819581902281902280001000 PEE5A99 122112002 EMU390 125
BX VTY
LOMNCRPIC 2131 00008 FL TB
LIARDWCKJ 2133 00000000
LISLDLJN 2134H00000000 SL H
LIHTNOJN 2138 00000000 FL 1H
LISTKP 2140H000000002 SL
LISTKPE1 2141H00000000
LIADSWDRD 2142H00000000
LICHDH 2143 000000002
LIWLMSL 2146 000000004 1
LIALDEDGE 2148 00000000 1 3
LISBCH 2200 000000001 FL
LICREWSBG 2203 00000000
LICREWUML 2204 00000000
LICREWE 2206 000000001 FL H
LICREWBHJ 2207H00000000 1
LIMADELEY 2212 00000000 FL FL 5H
LINTNB 2222H00000000 FL FL
LISTAFFDJ 2226H00000000 SL
LISTAFFRD 2228H2231H 000000004 SL SL A C
LISTAFTVJ 2233 00000000
LIPNKRDG 2236H00000000 1 1
LIBSBYJN 2244 00000000
LIPBLJWM 2248 00000000
LIDRLSTNJ 2251H00000000
LIBSCTSTA 2252H00000000 1
LIPRYBRNJ 2257 00000000 7
LIASTON 2306H000000002
LISTECHFD 2311H00000000
LIBHAMINT 2315 000000004
LIBKSWELL 2318H00000000 1H
LICOVNTRY 2324 000000001 2 3
LIRUGBTVJ 2336 00000000 UNL 3
LIRUGBY 2340 000000005 UNLUNL 1H
LIHMTNJ 2343 00000000 1H
LIDVNTYNJ 2346H00000000
LILNGBKBY 2351 00000000 1 1
LINMPTN 0001H000000001 6
LIHANSLPJ 0016 00000000 SL
LIMKNSCEN 0021 000000001 SL SL
LIBLTCHLY 0023 000000004 SL SL 5 1H
LILEDBRNJ 0036 00000000 SL SL 2
LITRING 0042 000000002 SL SL 2H
LIBONENDJ 0048H00000000 SL SL 1H
LIWATFDJ 0056 000000009 SL SL 1H
LIHROW 0101H000000006 SL SL 3
LIWMBY 0107H000000006 CL
LTWMBYICD 0117H0000 TF
tl;dr the first two lines describe what type of train is running, when it runs, how fast, etc. The other lines describe the points the train will pass, and what time they are expected to do so. The main takeaway is that each record has a different length depending on the journey.
When I saw this, I thought "This would be a great thing to try and mess around with in COBOL." I went to polytech and learned PASCAL and COBOL, but only had to deal with files with a consistent length and consistent data, not something like this.
I spent a couple of hours trying to find some sort of answer to this on Google, but nothing really showed, hence my asking.
Just for reference, I have managed to do this in GW-BASIC, and could do it in elementary Python if needed, but COBOL, being what it is, is a whole different kettle of fish.
Is it possible to read something like this in to COBOL without having to resort to witchcraft, or is it just in the "too hard" basket? I'm only doing this for fun, so it's really no big deal.
Any responses or feedback would be most welcome.
Many thanks,
Joseph.
Yes, it is possible. For the file, use LINE SEQUENTIAL organization.
The file definition:
select lineseq assign to "lineseq.dat"
organization is line sequential.
To split the lines up, use UNSTRING, e.g.
UNSTRING in-line
DELIMITED BY SPACES
into item-1, item-2, item-3
END-UNSTRING
It is probably easier to do in languages like Python.
Actually (after reformatting the question) I think COBOL is perfect for this job, as the data is fixed-length (it may have come out of COBOL, too...).
Define the file (line sequential may not even be needed if, as in your post, it even contains trailing spaces; but as this may change, line sequential would be fine).
OPEN INPUT the file, then READ until end of file.
Put the complete record read into a local record with defined sub-fields and just access the data from the sub-fields, after validating it (the file may be broken for many reasons).
Depending on the number of lines in that data, you may either process the records directly, move them to a table (have a look at OCCURS), or WRITE them to another file (likely an INDEXED one with multiple KEY definitions).
To expand on @Simon Sobisch's answer.
Looking at the data, and trying to work it out, I can see these things.
Top two lines as you say are the type of train and that.
Then you have a line starting LO which must be the start of the journey. The next 7 characters would be the station, with MNCRPIC being presumably "Manchester Piccadilly". Then there's a space, then four digits which would be a time.
Then you have a load of lines starting LI which are intermediate points. Some of those have a "H" after the time, others don't. This will be a problem if you're going to do UNSTRING DELIMITED BY SPACE. I'm going to presume that H means Halt.
LISTAFFRD 2228H2231H 000000004 SL SL A C
is an odd looking line.
At the end we have LT which is the end of the journey, arriving at WMBYICD at 0117.
01 TRAIN-SCHEDULE.
03 RECORD-TYPE PIC XX.
88 JOURNEY-START VALUE 'LO'.
88 JOURNEY-INTERMEDIATE VALUE 'LI'.
88 JOURNEY-TERMINATE VALUE 'LT'.
03 TRAIN-STATION PIC X(7).
03 FILLER PIC X(11).
03 TRAIN-TIME.
05 TRAIN-TIME-HH PIC 99.
05 TRAIN-TIME-MM PIC 99.
03 TRAIN-HALT-FLAG PIC X.
88 TRAIN-STOPS-HERE VALUE 'H'.
And so on.
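For comparison with the languages you already know, here is roughly the same fixed-column idea sketched in C. The offsets simply mirror the record layout above (2 + 7 + 11 filler + 4 + 1); I haven't checked them against Network Rail's specification, so treat them, and the names schedule_point and parse_point, as assumptions.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* One location line, following the assumed column layout:
   0-1 record type, 2-8 station, 9-19 filler, 20-23 HHMM, 24 flag. */
struct schedule_point {
    char type[3];     /* "LO", "LI" or "LT" */
    char station[8];  /* 7 characters plus terminating NUL */
    int  hh, mm;      /* time of day */
    char flag;        /* character after the time, e.g. 'H' or ' ' */
};

/* Returns 0 on success, -1 if the line is too short or the time
   columns do not hold four digits. */
int parse_point(const char *line, struct schedule_point *p)
{
    if (strlen(line) < 24)
        return -1;
    for (int i = 20; i < 24; i++)
        if (!isdigit((unsigned char)line[i]))
            return -1;

    memcpy(p->type, line, 2);
    p->type[2] = '\0';
    memcpy(p->station, line + 2, 7);
    p->station[7] = '\0';
    p->hh = (line[20] - '0') * 10 + (line[21] - '0');
    p->mm = (line[22] - '0') * 10 + (line[23] - '0');
    p->flag = line[24] ? line[24] : ' ';
    return 0;
}
```

The validation step is the important part: the 88-levels in the COBOL record play the same role as checking p.type against "LO", "LI" and "LT" here.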
COBOL input files can have VARIABLE or FIXED length records. If the records are variable, the first bytes of each record hold that record's length.
IBM's official documentation covers this:
Processing Files with Variable Length Records
I know there are multiple questions regarding this subject, but they did not help.
Whatever I try to compile, I keep getting the same error:
arm-none-eabi-gcc.exe: error: CreateProcess: No such file or directory
I guess it means that it can not find the compiler.
I have tried tweaking the path settings
C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\nxp\LPCXpresso_7.6.2
326\lpcxpresso\tools\bin;
Seems to be right?
I have tried using Sysinternals process monitor
I can see that a lot of arm-none-eabi-gcc.exe are getting a result of name not found but there are a lot of successful results too.
I have also tried reinstalling the compiler and the LPCXpresso, no luck.
If I type arm-none-eabi-gcc -v I get the version, so the compiler itself runs,
but when I try to compile in CMD like this: arm-none-eabi-gcc led.c
I get the same error as stated above:
arm-none-eabi-gcc.exe: error: CreateProcess: No such file or directory
I tried playing around more with PATH in the environment variables, no luck. I feel like something is stopping LPCXpresso from finding the compiler.
The only antivirus this computer has is Avira, and I disabled it. I also allowed the compiler and LPCXpresso through the firewall.
I have tried some more things; I will add them shortly, after trying to duplicate the test.
It seems your problem is a happy mess between Vista and GCC. Long story short, a CRT function, access, behaves differently on Windows and Linux. This difference is actually mentioned in Microsoft's documentation, but the GCC folks didn't notice. This leads to a bug on Vista, because that version of Windows is stricter on this point.
This bug is mentioned here : https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33281
I have no proof that your problem comes from here, but the chances are good.
The solutions:
1. Do not use Vista.
2. Recompile arm-none-eabi-gcc.exe with the flag -D__USE_MINGW_ACCESS.
3. Patch arm-none-eabi-gcc.exe.
The 3rd is the easiest, but it's a bit tricky. The goal is to hijack the access function and add an instruction to prevent undesired behavior. To patch your gcc, you have two solutions : you upload your .exe and I patch it for you, or I give you the instructions to patch it yourself. (I can also patch it for you, then give the instructions if it works). The patching isn't really hard and doesn't require advanced knowledge, but you must be rigorous.
As I said, I don't have this problem myself, so I don't know if my solution really works. The patch seems to be working for this problem.
EDIT2:
The exact problem is that the Linux access has a mode flag to check whether a file is executable. The Windows access cannot check for this. Most Windows versions simply ignore this flag and check whether the file exists instead, which usually gives the same result. The problem is that Vista doesn't ignore it: whenever access is used to check for executability, it returns an error. This leads GCC programs to think that some executables are missing. The patch enabled by -D__USE_MINGW_ACCESS, or done manually, removes the flag when access is called, thus checking for existence instead, just like other Windows versions.
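In C terms, the workaround amounts to something like the wrapper below. This is only a sketch of the idea, written against the POSIX access so it can be tried anywhere; the name access_without_x is mine, and the real fix lives inside GCC's own sources.

```c
#include <unistd.h>   /* access, F_OK, R_OK, W_OK, X_OK (POSIX) */

/* Call access() with the "executable" bit stripped from the mode.
   A request for X_OK alone therefore degrades to a plain existence
   check (mode 0, i.e. F_OK), which is what the patch aims for. */
int access_without_x(const char *path, int mode)
{
    return access(path, mode & ~X_OK);
}
```

As far as I can tell, the five patch bytes in the manual method do the same masking directly on the mode argument on the stack before jumping to the real _access.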
EDIT:
The patching is actually needed for every GCC program that invokes other executables, and not only gcc.exe. So far there is only gcc.exe and collect2.exe.
Here are the patching instructions:
1. Back up your arm-none-eabi-gcc.exe.
2. Download and install CFF Explorer (direct link here).
3. Open arm-none-eabi-gcc.exe with CFF Explorer.
4. On the left panel, click on Import Directory.
5. In the module list that appears, click on the msvcrt.dll line.
6. In the import list that appears, find _access. Be careful here: the list is long, and there are multiple _access entries. The last one (the very last entry for me) is probably the right one.
7. When you click on the _access line, an address should appear in the list header, in the 2nd column, 2nd row, just below FTs(IAT). Write down that address in Notepad (for me, it is 00180948; yours may be different). I will refer to this address as F.
8. On the left panel, click on Address Converter.
9. Three fields should appear. Enter address F in the File Offset field.
10. Write down in Notepad a 6-byte value: the first two bytes are FF 25, the last 4 are the address that appeared in the VA field, IN REVERSE. For example, if 00586548 appeared in the VA field, write down FF 25 48 65 58 00 (spaces added for legibility). I will refer to this value as J. This value J is the instruction that jumps to the _access function.
11. On the left panel, click on Section Headers.
12. In the section list that appears on the right, click on the .text line (the .text section is where the code resides).
13. In the editor panel that appears below, click on the magnifier and, in the Hex search bar, search for a run of eleven 90 bytes (9090909090..., 90 is NOP in assembly). This is to find a code cave (unused space) in which to insert the patch, which is 11 bytes long. Once you have found a code cave, write down the offset of the first 90. The exact offset is displayed at the very bottom as Pos : xxxxxxxx. I will refer to this offset as C.
14. Use the editor to change the run of eleven 90 bytes: the first 5 bytes become 80 64 E4 08 06. These 5 bytes are the instruction that prevents the wrong behavior. The last 6 bytes become the value J (e.g. FF 25 48 65 58 00), to jump on to the _access function.
15. Click on the arrow icon (Go To Offset) a bit below, and enter 0 to navigate to the beginning of the file.
16. Use the Hex search bar again to search for value J. If you find the bytes you just modified, skip that match. The J value you need is located among many values containing FF 25 and 90 90. That is the DLL jump table. Write down the offset of the value J you found (offset of the first byte, FF). I will refer to this offset as S. Note 1: If you can't find the value, maybe you picked the wrong _access in step 6, or did something wrong between steps 6 and 10. Note 2: The search bar doesn't wrap around when it hits the end; go to offset 0 manually to search again.
17. Use a hexadecimal 32-bit two's-complement calculator (like this one: calc.penjee.com) to calculate C - S - 5. If your offset C is 8C0 and your offset S is 6D810, you must obtain FF F9 30 AB (8C0 minus 6D810, minus 5).
18. Replace the value J you found in the file (at step 16) with 5 bytes: the first byte is E9, the last 4 are the result of the last operation, IN REVERSE. If you obtained FF F9 30 AB, you must replace the value J (e.g. FF 25 48 65 58 00) with E9 AB 30 F9 FF. The 6th byte of J can be left untouched. These 5 bytes are the jump to the patch.
19. File -> Save
Notes: You should have modified a total of 16 bytes. If the patched program crashes, you did something wrong. Even if it doesn't fix the problem, a correctly applied patch can't induce a crash.
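If you'd rather not do the byte-reversal and the two's-complement subtraction by hand, both are a couple of lines of C. This is only a convenience sketch; the function names are invented here, and it assumes (as in the steps above) that S and C are both offsets inside the .text section.

```c
#include <stdint.h>

/* Store a 32-bit value least significant byte first ("in reverse"). */
void put_le32(uint32_t v, uint8_t out[4])
{
    out[0] = (uint8_t)(v & 0xFF);
    out[1] = (uint8_t)((v >> 8) & 0xFF);
    out[2] = (uint8_t)((v >> 16) & 0xFF);
    out[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Value J: FF 25 followed by the VA in reverse byte order. */
void make_value_j(uint32_t va, uint8_t out[6])
{
    out[0] = 0xFF;
    out[1] = 0x25;
    put_le32(va, out + 2);
}

/* Jump to the patch: E9 followed by (C - S - 5) in reverse byte order.
   Unsigned arithmetic wraps modulo 2^32, which gives the 32-bit
   two's-complement result directly. */
void make_jump_to_patch(uint32_t s, uint32_t c, uint8_t out[5])
{
    out[0] = 0xE9;
    put_le32(c - s - 5u, out + 1);
}
```

With VA 00586548 this yields FF 25 48 65 58 00, and with S = 6D810, C = 8C0 it yields E9 AB 30 F9 FF, the same values as in the worked example.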
Let me know if you have difficulties somewhere.
I am working on a program that outputs PDF documents. Given a sequence of UTF-8 encoded characters and the name of a font that shall be used to render it, I would like to show the appropriate glyphs that make the actual content of the document. I would like to be able to display national characters such as č or ö. It would be great to support ligatures like ae or ffi.
The problem is, I do not know how the actual glyphs to be shown are specified (inside a content stream, for example).
If I, for example, want to display the string "Hello World", I need not worry about encoding; I simply write (Hello World)Tj. The PDF reader will then use the appropriate font to render this string.
But what if I wanted to show the string
It is difficult to read the PDF specification all day. Prostě dočista nemožné!
with the ligatures ffi, fi and ea and the Czech national symbols ě, č and é in a given font, how would I proceed?
I am trying to get through the PDF specification, but it is not easy.
How do I find out the "code of the glyph" that corresponds to a given character or ligature?
How is this code encoded within a PDF content stream?
Help is much appreciated.
Edit: I may have overestimated the problem. Counting the glyphs that are needed to display a "common European document", I cannot think of a way how this number could exceed 256. If my assumptions are correct, I can remap the encoding of the font completely. This should be sufficient to cover all common symbols of the latin alphabet, numbers, punctuation, common symbols like ( and [ and still I would have plenty of room for national symbols, ligatures and other elements of high-quality typography. (I can implement a priority queue to select the most used ligatures if the total number of glyphs shall exceed 256.)
That being said, I do not think I need to use the CID-keyed fonts.
Still, I wonder how to map UTF-8 encoded characters onto the glyphs of an arbitrary font. I have the AFM of the font available. For the DejaVu font, for example, the character information goes like this:
C 63 ; WX 536 ; N question ; B 67 -15 488 743 ;
C 64 ; WX 1000 ; N at ; B 65 -174 930 705 ;
C 65 ; WX 722 ; N A ; B -6 0 732 730 ;
But after the 256th character is mapped, the codes are -1:
C 255 ; WX 564 ; N ydieresis ; B -3 -223 563 767 ;
C -1 ; WX 722 ; N Amacron ; B -6 0 732 899 ;
C -1 ; WX 596 ; N amacron ; B 49 -15 568 746 ;
For example, if I had the sequence 11100010 10000010 10101100 (Euro sign) in my input, how would I know what glyph name it corresponds to so that I can map it in the /Encoding dictionary?
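To make the question concrete, this is roughly the lookup I have in mind, as a sketch only: decode the UTF-8 bytes to a code point, then search a table from code points to glyph names. The five table entries use names as they appear in the Adobe Glyph List; a real program would need the full list, and the function names are placeholders.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence of 1 to 3 bytes (enough for U+0000..U+FFFF)
   from a NUL-terminated string. Returns the number of bytes consumed,
   or 0 if the sequence is malformed or longer than 3 bytes. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) |
               (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    return 0;
}

/* A handful of code point to glyph name pairs, for illustration only. */
static const struct { uint32_t cp; const char *name; } glyph_names[] = {
    { 0x00E9, "eacute" },
    { 0x00F6, "odieresis" },
    { 0x010D, "ccaron" },
    { 0x011B, "ecaron" },
    { 0x20AC, "Euro" },
};

/* Returns the glyph name for a code point, or NULL if it is not listed. */
const char *glyph_name(uint32_t cp)
{
    for (size_t i = 0; i < sizeof glyph_names / sizeof glyph_names[0]; i++)
        if (glyph_names[i].cp == cp)
            return glyph_names[i].name;
    return NULL;
}
```

The three bytes above decode to U+20AC, for which the table gives Euro; that name is what would then be matched against the N field in the AFM.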
Encoding varies based on the font type. Typically, there is a font resource that is defined as the current font, and within that font dictionary is a reference to a base font and a means of describing the encoding (via the /Encoding key). If that key doesn't exist, the encoding will be "standard", but you can use other simple encodings such as /MacRomanEncoding and /WinAnsiEncoding for the value of the encoding, or you can specify a base encoding and an encoding delta (a /Differences array) to show the differences.
Easy so far - as long as you're working with 8-bit characters. For many early apps, they would create a couple different fonts, one with say Roman encoding and another that maps roman characters to unavailable characters. In order to do that, your encoding delta would include references to the ligatures and other typically non-encoded symbols. This works great for Type 1 fonts, but is specifically contraindicated by the spec in the section on TrueType Fonts:
A nonsymbolic font should specify MacRomanEncoding or WinAnsiEncoding as the value of its Encoding entry, with no Differences array
This is vastly different when you want to use, say, Unicode. In which case you would be using a CID font (a font based on character IDs). In that case there is a procedure referenced by the font which is used to map from a character encoding in your string to a character ID in your font (and vice versa). I would strongly recommend that you read and fully understand section 9.7 in the PDF specification on Composite Fonts, which describes everything you need in order to encode UTF16BE into strings to get them to render properly in PDF. It is decidedly non-trivial in that there are a lot of details that if missed will result in a blank rendered page in Acrobat.
As a software engineer who professionally writes code that produces and consumes PDF, let me state that when I get tasked with having to put in special cases in my code to deal with non-spec compliant PDF, a little piece of me dies inside. Please, please, don't even think of releasing any documents you produce into the wild until they pass Preflight at the least. This is not the same as "Acrobat renders it so it must be OK." Let me give you an example - I've seen a number of files in the wild that include fonts that are missing the key elements of the FontDescriptor dictionary, including /Ascent, /Descent, /CapHeight, etc. These render in Acrobat, but are in violation of the spec since each of those is required. I know how Acrobat handles that - it comes with an enormous database of font metrics and looks up the value if it can't find it in the file (heck, it might even ignore the metrics in the file). I don't have that luxury, so I have to do a number of (potentially expensive/invalid) stop gap measures.
You might want to consider using a library to do this work for you - maybe iText which has a decent enough licensing scheme for education because, I get it, you're a student. There are some C based libraries too. Maybe you can figure a way to make GhostScript do your bidding.
If you are unwilling or unable to follow my advice with regards to cleaving to the specification or to use a library which ostensibly does so, please do me the favor of at least filling out the /Creator and /Producer strings in the Document Information Dictionary referenced by the trailer (see sections 14.3.3 and section 7.5.5). That way, when I have to parse/consume/manipulate your documents, I will have a way to directly cast aspersions on your parentage.
Let's go top down and start with the page object - I'm using output from my own library and am stripping out what I think you don't need:
1 0 obj <<
/Type /Page
/Parent 18 0 R
/Resources <<
/Font <<
/U0 13 0 R
>>
/ProcSet [ /PDF /Text ]
>>
/MediaBox [ 0 0 612 792 ]
/Contents 19 0 R
/Dur -1
>>
endobj
U0 is a reference to a font that will be used for unicode text.
The content stream is intended to print the following text: Greek: Γειά σου κόσμος.
BT /U0 24 Tf 72 670 Td
(\000G\000r\000e\000e\000k\000:\000 \003\223\003\265\003\271\003\254\000 \003\303\003\277\003\305\000 \003\272\003\314\003\303\003\274\003\277\003\302)
Tj ET
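Producing a string like the one above from UTF-8 input is mechanical. Here is a sketch in C (my real code is not C, and the name pdf_utf16be_string is made up): each character becomes two bytes, high byte first, and any byte that isn't plain printable ASCII is written as a three-digit octal escape. It only handles characters up to U+FFFF; anything beyond that would need surrogate pairs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Append one byte of a PDF literal string body to out, escaping
   non-printable bytes and ( ) \ as \ooo. Returns 0 on success,
   -1 if the buffer is too small. */
static int emit(char *out, size_t cap, size_t *n, unsigned b)
{
    int w;

    if (b >= 0x20 && b <= 0x7E && b != '(' && b != ')' && b != '\\')
        w = snprintf(out + *n, cap - *n, "%c", (int)b);
    else
        w = snprintf(out + *n, cap - *n, "\\%03o", b);
    if (w < 0 || (size_t)w >= cap - *n)
        return -1;
    *n += (size_t)w;
    return 0;
}

/* Convert NUL-terminated UTF-8 (1 to 3 byte sequences only) to the body
   of a PDF string in UTF-16BE. Returns the length written, or -1 on
   malformed input or overflow. */
int pdf_utf16be_string(const char *utf8, char *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)utf8;
    size_t n = 0;

    if (cap == 0)
        return -1;
    out[0] = '\0';
    while (*s) {
        uint32_t cp;

        if (s[0] < 0x80) {
            cp = s[0];
            s += 1;
        } else if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
            cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
            s += 2;
        } else if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
                   (s[2] & 0xC0) == 0x80) {
            cp = ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) |
                 (s[2] & 0x3Fu);
            s += 3;
        } else {
            return -1;
        }
        if (emit(out, cap, &n, cp >> 8) || emit(out, cap, &n, cp & 0xFF))
            return -1;
    }
    return (int)n;
}
```

The result goes between the parentheses in front of Tj; with /Identity-H and a matching CIDToGIDMap, each two-byte code is then looked up as a CID.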
The font dictionary referenced looks like this:
13 0 obj <<
/BaseFont /DejaVuSansCondensed
/DescendantFonts [ 4 0 R ]
/ToUnicode 14 0 R
/Type /Font
/Subtype /Type0
/Encoding /Identity-H
>>
endobj
The /ToUnicode entry points to a stream containing the following PostScript code:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfrange
<0000> <FFFF> <0000>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
which is defined by the CID font specification.
The DescendantFonts array points to this object:
4 0 obj <<
/Subtype /CIDFontType2
/Type /Font
/BaseFont /DejaVuSansCondensed
/CIDSystemInfo 7 0 R
/FontDescriptor 8 0 R
/DW 1000
/W 9 0 R
/CIDToGIDMap 10 0 R
>>
The CIDToGIDMap is a compressed stream with the actual map. The CIDSystemInfo is <</Registry (Adobe) /Ordering (UCS) /Supplement 0>> (it's a reference because I share it among all Unicode fonts that I output). The FontDescriptor is straightforward boilerplate, and the W array is derived from the font metrics.
With all this detail, do you understand why I don't lightly say, "walk away before you pollute my environment any further"?
I'm really beginning to question the nature of this assignment. Writing a simple PDF is one thing, but writing code that can handle full Unicode in any arbitrary OpenType/TrueType font requires you to understand the CID spec and the TrueType spec (hint: I have a full TrueType parser that can extract all the metrics for any glyph in a font so that I can output the /W array).
If, however, you are required to output only Type 1 fonts, well my friend, your life got a whole lot easier: you take your entire UTF-8 stream, read it as Unicode, and for every unique character that comes in, you build a map from the Unicode character to a glyph name and an internal character number by using this table. The internal character number is essentially the unique index of the character that came in, mod 256. So for example, if you have fewer than 257 unique characters on the page, you will have exactly one font that is encoded to map to the characters in the order that they arrived. If you had "abcba" for input, the output string in PDF would be (\000\001\002\001\000) and would map to a font with an encoding dictionary with a differences array that would be [0/a/b/c]. If you have n unique characters where n > 256, you're going to need ceil(n / 256) fonts, each with its own encoding.
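The "order of arrival" mapping is simple enough to sketch in C. To keep it short, this version works on single-byte characters and a single font (at most 256 distinct values); a real implementation would key on Unicode code points and start a new font each time 256 codes are used up. The function name remap is just for illustration.

```c
#include <stddef.h>

/* Assign each distinct input byte a code in order of first appearance.
   codes[i] receives the code for text[i]; order[] receives the distinct
   bytes in code order, which is what a /Differences array would list
   (as glyph names). Returns the number of distinct bytes seen. */
size_t remap(const unsigned char *text, size_t len,
             unsigned char *codes, unsigned char order[256])
{
    int code_of[256];
    size_t used = 0;

    for (int i = 0; i < 256; i++)
        code_of[i] = -1;               /* -1 means "not seen yet" */

    for (size_t i = 0; i < len; i++) {
        if (code_of[text[i]] < 0) {
            code_of[text[i]] = (int)used;
            order[used++] = text[i];
        }
        codes[i] = (unsigned char)code_of[text[i]];
    }
    return used;
}
```

For "abcba" this gives the codes 0 1 2 1 0 and the order a, b, c, matching the (\000\001\002\001\000) string and the [0/a/b/c] differences array.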
If your teacher/professor wants anything but Type 1 fonts in a short period of time, s/he has unrealistic expectations for the students and/or low expectations for the quality of output. You should ask whether you are required to handle CID fonts, and if you are, then your professor is at the very least a sadist. It took me, a seasoned professional, about 4 days to write a TrueType parser for extracting widths. I had two advantages: (1) I was using a managed language (C#), which cut down on concerns that will be biting your ass in C, and I was also able to use reflection to automate parsing; and (2) when I don't have interruptions, I write solid code about 10-20 times faster than a typical student, so my 32 hours would translate into 320 student hours, more or less (then again, my code has different constraints than yours: it has to consume any crap font it gets gracefully), so let's call it 200 or less if you're allowed to steal something like stb. That's just for getting one particular element in the font descriptor.