What does the datatype specification '9(7)V9T' mean?

In some functional specs I'm reading, they talk about a numeric format with a 9(7)V9T presentation.
-How do I interpret this kind of format notation?
-How is this type physically stored in a flat file (e.g. numeric? signs? separators?)
Thank you for your wise answers!

A COBOL PICTURE string, such as 9(7)V9T, specifies the general characteristics and editing requirements of an elementary data item. A 9 represents a decimal digit, and the (7) is a repetition factor for the preceding character, in this case a 9. The V is an implied decimal point. This is all standard COBOL. So far we have an 8 digit decimal number with an implied decimal point between the 7th and 8th digits.
The T is a bit of a curve ball. I have never actually come across it before. However, I Googled up this reference. It states that a T in a PICTURE string "... indicates that a display numeric field should only insert the sign into the upper half of the last byte if the value is negative". Unfortunately, the author doesn't provide a reference, so I can't give you the source of this convention.
A COBOL picture of PIC S9(7)V9 USAGE DISPLAY on an IBM platform conforms to the 9(7)V9T description you have. This data item takes 8 bytes to represent. Each of the 8 digits is represented in the low 4 bits of its byte, with the sign recorded in the upper 4 bits of the low order byte. This just happens to be the way IBM chose to implement zoned decimal. Using the 9(7)V9T notation makes that representation explicit.
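To make that layout concrete, here is a minimal C sketch (an illustration of the convention described above, not IBM's implementation) that decodes an 8-byte zoned-decimal field matching 9(7)V9T, assuming the raw EBCDIC bytes are available:

#include <stdio.h>

/* Sketch: digit in the low nibble of each byte; a sign nibble of 0xD
   appears in the high half of the last byte only when the value is
   negative (the behavior the T is said to describe). The V means one
   implied decimal place, so divide by 10 at the end. */
double decode_zoned_97v9t(const unsigned char field[8])
{
    long digits = 0;
    for (int i = 0; i < 8; i++)
        digits = digits * 10 + (field[i] & 0x0F);  /* low nibble = digit */

    if ((field[7] >> 4) == 0xD)                    /* sign only if negative */
        digits = -digits;

    return digits / 10.0;                          /* apply the implied V */
}

int main(void)
{
    /* EBCDIC zoned digits 0001234 5 with a negative sign zone: -1234.5 */
    unsigned char raw[8] = {0xF0, 0xF0, 0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xD5};
    printf("%.1f\n", decode_zoned_97v9t(raw));
    return 0;
}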

An alternative to the other answers is that the T is a character to be displayed or printed after the numeric value to represent a specific state, similar to the use of CR for a credit value or a trailing '-' to indicate a negative value.

Related

How many bytes will be required to store number in binary and text files respectively

If I want to store a number, let's say 56789 in a file, how many bytes will be required to store it in binary and text files respectively? I want to know how bytes are allocated to data in binary and text files.
It depends on:
text encoding and number system (decimal, hexadecimal, many more...)
signed/not signed
single integer or multiple integers (which require separators)
data type
target architecture
use of compressed encodings
In ASCII a character takes 1 byte. In UTF-8 a character takes 1 to 4 bytes, but digits always take 1 byte. In UTF-16 it takes 2 or more bytes per character.
Non-ASCII formats may also require 2 to 3 extra bytes at the start of the file for a byte-order mark (BOM); whether one is written depends on the editor and/or settings used when the file was created.
But let's assume you store the data in a simple ASCII file, or the discussion becomes needlessly complex.
Let's also assume you use the decimal number system.
In hexadecimal you use digits 0-9 and letters a-f to represent numbers. A decimal (base-10) number like 34234324423 would be 7F88655C7 in hexadecimal (base-16). In the first system we have 11 digits, in the second just 9. The minimum base is 2 (digits 0 and 1) and the common maximum base is 64 (base-64). Technically, with ASCII you could go as high as base-96, maybe base-100, but that's very uncommon.
Each digit (0-9) will take one byte. If you have signed integers, an additional minus sign will lead the digits (so negative numbers cost 1 additional byte).
In some circumstances you may want to store several numerals. You will need a separator to tell the numerals apart. A comma (,), colon (:), semicolon (;), pipe (|) or newline (LF, CR or, on Windows, CRLF, which takes 2 bytes) have all been observed in the wild as legitimate separators of numerals.
What is a numeral? The concept or idea of the quantity 8 that is IN YOUR HEAD is the number. Any representation of that concept on stone, paper, magnetic tape, or pixels on a screen is just that: a REPRESENTATION. Representations are symbols which stand for what you understand in your brain; those are numerals. Please don't ever confuse numbers with numerals; this distinction is foundational to mathematics and computer science.
In these cases you want to count an additional character for the separator per numeral, or maybe per numeral minus one. It depends on whether you want to terminate each numeral with a marker or separate the numerals from each other:
Example (three digits and three newlines): 6 bytes
1<LF>
2<LF>
3<LF>
Example (three digits and two commas): 5 bytes
1,2,3
Example (four digits and one comma): 5 bytes
2134,
Example (sign and one digit): 2 bytes
-3
If you store the data in a binary format (not to be confused with the binary number system, which would still be a text format), the occupied memory depends on the integer type, or, better, the bit length of the integer.
An octet (0..255) will occupy 1 byte. No separators or leading signs required.
A 16-bit integer will occupy 2 bytes. For C and C++ the underlying data model must be taken into account: a common int on a 32-bit architecture takes 4 bytes, while a long compiled against a 64-bit architecture may take 8 bytes.
There are exceptions to those flat rules. As an example, Google's protobuf uses a zig-zag VarInt implementation that leverages variable length encoding.
Here is a VarInt implementation in C/C++.
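As a sketch of the idea (assumptions: 32-bit values and a caller-provided buffer of at least 5 bytes; this illustrates the technique and is not Google's actual code):

#include <stdint.h>
#include <stdio.h>

/* Zig-zag maps 0,-1,1,-2,2,... to 0,1,2,3,4,... so small negative
   numbers stay small and encode in few varint bytes. */
static uint32_t zigzag32(int32_t n)
{
    return ((uint32_t)n << 1) ^ (uint32_t)(n >> 31);  /* arithmetic shift */
}

/* Varint: 7 payload bits per byte, top bit set means "more follows". */
static size_t varint_encode(uint32_t value, uint8_t *out)
{
    size_t i = 0;
    while (value >= 0x80) {
        out[i++] = (uint8_t)(value | 0x80);
        value >>= 7;
    }
    out[i++] = (uint8_t)value;
    return i;                          /* 1..5 bytes written */
}

int main(void)
{
    uint8_t buf[5];
    size_t n = varint_encode(zigzag32(-3), buf);        /* -3 zig-zags to 5 */
    printf("%zu byte(s), first byte 0x%02X\n", n, buf[0]);  /* 1, 0x05 */
    return 0;
}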
EDIT: added Thomas Weller's suggestion
Beyond the actual file CONTENT, the system has to store metadata about the file (bookkeeping such as the first sector, the filename, access permissions and more). This metadata is not shown as part of the file's size, but it does occupy space on disk.
If you store each numeral in a separate file, such as the numeral 10 in the file result-10, the metadata entries will occupy more space than the numerals themselves.
If you store tens, hundreds, thousands or millions of numerals in one file, that overhead becomes increasingly irrelevant.
More about metadata here.
EDIT: to be clearer about file overhead
The overhead is relevant under some circumstances, as discussed above.
But it is not a differentiator between textual and binary formats. As doug65536 says, however you store the data, if the filesystem structure is the same, it does not matter.
A file is a file, regardless of whether it contains binary data or ASCII text.
Still, the above reasoning applies independently of the format you choose.
The number of digits needed to store a number n in a given number base is floor(log(n)/log(base)) + 1, which equals ceil(log(n)/log(base)) except when n is an exact power of the base.
Storing as decimal would be base 10, storing as hexadecimal text would be base 16. Storing as binary would be base 2.
You would usually need to round up to a multiple of eight or power of two when storing as binary, but it is possible to store a value with an unusual number of bits in a packed format.
Given your example number (ignoring negative numbers for a moment):
56789 in base 2 needs 15.793323887 bits (16)
56789 in base 10 needs 4.754264221 decimal digits (5)
56789 in base 16 needs 3.948330972 hex digits (4)
56789 in base 64 needs 2.632220648 characters (3)
Representing sign needs an additional character or bit.
To look at how binary compares to text, assume a byte is 8 bits, so each ASCII character is a byte in text encoding (8 bits). A byte has a range of 0 to 255; a decimal digit has a range of 0 to 9. Each text character (8 bits) can therefore encode only about 3.32 bits of a number per byte (log(10)/log(2)), while a binary encoding stores a full 8 bits of a number per byte. Encoding numbers as text takes about 2.4x more space. If you pad out your numbers so they line up in fields, text becomes a very poor storage encoding: with a typical width of 10 digits you'll be storing 80 bits that hold only about 33 bits of binary-encoded data.
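A small C check of the digit counts listed above, using the floor(...)+1 form of the formula:

#include <math.h>
#include <stdio.h>

/* Sketch: digits needed to write n in a given base, per the formula
   above (the floor(...)+1 form also handles exact powers of the base). */
static int digits_needed(double n, double base)
{
    return (int)floor(log(n) / log(base)) + 1;
}

int main(void)
{
    printf("base  2: %d\n", digits_needed(56789, 2));   /* 16 */
    printf("base 10: %d\n", digits_needed(56789, 10));  /*  5 */
    printf("base 16: %d\n", digits_needed(56789, 16));  /*  4 */
    printf("base 64: %d\n", digits_needed(56789, 64));  /*  3 */
    return 0;
}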
I am not too well versed in this subject; however, I believe it would not just be a case of the content, but also the METADATA attached. But if you were just talking about the number, you could store it in ASCII or in binary form.
In binary, 56789 could be converted to 1101110111010101; there is a 'simple' way to work this out on paper. But, http://www.binaryhexconverter.com/decimal-to-binary-converter is a website you can use to convert it.
1101110111010101 has 16 characters, therefore 16 bits which is two bytes.
An integer is usually around 4 bytes of storage. If you write that binary string out as text, though, each of its 16 characters takes 1 byte, so the string occupies 16 bytes. Storing the value itself as a native integer instead takes the size of the integer type: 4 bytes for a 32-bit integer, or 8 bytes if it is stored as a 64-bit integer.
Before you post any question, you should do your research.
Size of the file depends on many factors, but for the sake of simplicity: in text format, numbers will occupy 1 byte per character if you are using UTF-8 encoding. On the other hand, a binary value for a 32-bit integer type will take 4 bytes.

Convert COMP and COMP-3 Packed Decimal into readable value with C

I have an EBCDIC flat file to be processed from a mainframe into a C module. What can be a good process in converting the COMP and COMP-3 values into readable values? Do I have to convert the ebcdic characters to ascii then hex for COMP-3? What about for COMP? Thanks
Bill Woodger has given you some very good advice through his comments to your question; actually, he answered the question and should have posted his comments as an answer.
I would like to reiterate a few of his points and expand on a few others.
If you need to convert a file created from what is probably a COBOL application so it may be read by some other non-COBOL program, possibly on a machine with an architecture unlike the one where it was created, then you should demand that the file be created using only display formatted data (i.e. all character data). Mashing non-display (binary, packed, encoded) data outside of the operating environment where it was created is just a formula for long term pain. You will be subjected to the joys of sorting out various endianness issues between architectures and to code page conversions. These are the things that file transfer protocols are designed to manage; they do it well, so don't try to reinvent them. Short answer: use FTP or a similar file transport mechanism to move data between machines, and only transport display (character) based data.
Packed Decimal (COMP-3) data types occupy a varying number of bytes depending on their specific PICTURE layout. The position of the decimal point is implied, so it cannot be determined without reference to the PICTURE used to define the field. Packed Decimal fields may be either signed or unsigned. If signed, the sign is embedded in the low 4 bits of the least significant digit's byte. Each byte of a Packed Decimal data type contains two digits, except possibly the first and last bytes. The first byte contains only 1 digit if the field is signed and contains an even number of digits. The last byte contains 2 digits if unsigned but only 1 if signed. There are several other subtleties that you need to be aware of if you want to do your own Packed Decimal to character conversions. At this point I hope you can see that this is not going to be a trivial exercise.
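To give a flavor of what such a conversion involves, here is a minimal, illustrative C sketch of unpacking a signed COMP-3 field. It handles none of the subtleties just mentioned and assumes you already know the byte length and implied decimal position from the PICTURE:

#include <stdio.h>

/* Illustrative sketch only: unpack a signed COMP-3 field into a long.
   Invalid digit nibbles, unsigned fields, overflow and the other
   subtleties discussed above are deliberately not handled. */
long unpack_comp3(const unsigned char *field, int len)
{
    long value = 0;
    for (int i = 0; i < len - 1; i++) {            /* two digits per byte */
        value = value * 10 + (field[i] >> 4);
        value = value * 10 + (field[i] & 0x0F);
    }
    value = value * 10 + (field[len - 1] >> 4);    /* last byte: one digit */
    if ((field[len - 1] & 0x0F) == 0x0D)           /* ...plus the sign nibble */
        value = -value;
    return value;
}

int main(void)
{
    /* PIC S9(5) COMP-3 holding -12345 occupies 3 bytes: 0x12 0x34 0x5D */
    unsigned char raw[3] = {0x12, 0x34, 0x5D};
    printf("%ld\n", unpack_comp3(raw, 3));         /* prints -12345 */
    return 0;
}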
Binary (COMP) data types have a different but no less complex set of issues to resolve. Again, not a trivial exercise.
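For what it's worth, the most visible COMP issue, byte order, can be sketched too. This assumes a 4-byte two's-complement field arriving in big-endian (mainframe) order and decodes it independently of the local machine's endianness:

#include <stdint.h>
#include <stdio.h>

/* Sketch: rebuild a big-endian 4-byte COMP field byte by byte so the
   result is correct on both little- and big-endian hosts. */
int32_t read_comp4(const unsigned char b[4])
{
    uint32_t u = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
               | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return (int32_t)u;
}

int main(void)
{
    unsigned char raw[4] = {0xFF, 0xFF, 0xCF, 0xC7};   /* -12345 */
    printf("%d\n", (int)read_comp4(raw));
    return 0;
}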
So what should you be doing? Basically, do as Bill suggested. Have the program that generates this file use display formats for output (meaning you have to do nothing). Or, failing that, use a utility program such as DFSORT/SYNCSORT to do the conversions for you. Going the utility route still requires that you have the original COBOL file layout (and that you understand it) in order to do the conversion.
The last resort is simply writing a read-a-record-write-a-record COBOL program that takes in the unformatted data, MOVEs each COMP-whatever field to a corresponding DISPLAY field and writes it out again.
As Bill said, if the group that produced this file tells you that it is too difficult/expensive to produce a DISPLAY formatted output file, they are lying to you, or they are incompetent, or just too lazy to do the job they were hired to do. I can think of no other excuses.
Use XML to transport data.
That is, write a program that converts your file into characters (if on the mainframe, stay with EBCDIC, but numeric fields are unpacked, etc.) and then enclose each record and each field in XML tags.
This avoids formatting issues (what field is in column 1, what field is in column 2, are the delimiters spaces or commas or either, etc. ad nauseam).
Then transmit the XML file with your favorite utility that converts from EBCDIC to ASCII.
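A hypothetical C sketch of the wrapping step; the tag and field names here are invented for illustration, not taken from any real layout:

#include <stdio.h>

/* Sketch: after unpacking each field to characters, wrap the record
   and its fields in XML tags. Field names are made up. */
static void write_record_xml(FILE *out, const char *account, const char *amount)
{
    fprintf(out, "<record>\n");
    fprintf(out, "  <account>%s</account>\n", account);
    fprintf(out, "  <amount>%s</amount>\n", amount);
    fprintf(out, "</record>\n");
}

int main(void)
{
    /* fields already unpacked to characters, as described above */
    write_record_xml(stdout, "0012345", "-1234.50");
    return 0;
}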

LZW Compression with Entire unicode library

I am trying to do this problem:
Assume we have an initial alphabet of the entire Unicode character set,
instead of just all the possible byte values. Recall that unicode
characters are unsigned 2-byte values, so this means that each
2 bytes of uncompressed data will be treated as one symbol, and
we'll have an alphabet with over 60,000 symbols. (Treating symbols as
2-byte Unicodes, rather than a byte at a time, makes for better
compression in the case of internationalized text.) And, note, there's
nothing that limits the number of bits per code to at most 16. As you
generalize the LZW algorithm for this very large alphabet, don't worry
if you have some pretty long codes.
With this, give the compressed version of this four-symbol sequence,
using our project assumptions, including an EOD code, and grouping
into 4-byte ints. (These three symbols are Unicode values,
represented numerically.) Write your answer as 3 8-digit hex values,
space separated, using capital hex digits, not lowercase.
32767 32768 32767 32768
The problem I am having is that I don't know the entire range of the alphabet, so when doing LZW compression I don't know what values the new codes will have. Stemming from that problem, I also don't know what the EOD code will be.
Also, it seems to me that the compressed data will only take two integers.
The problem statement is ill-formed.
In Unicode as we know it today, code points (those numbers that represent characters, composable parts of characters and other useful but more sneaky things) cannot all be numbered from 0 to 65535 to fit into 16 bits. There are more than 100 thousand Chinese, Japanese and Korean characters in Unicode; clearly, you'd need 17+ bits just for those. So, Unicode clearly cannot be the correct option here.
OTOH, there exists a sort of "abridged" version of Unicode, the Universal Character Set, whose UCS-2 encoding uses 16-bit code points and can technically be used for at most 65536 characters and the like. Characters with codes greater than 65535 are, well, unlucky; you can't have them with UCS-2.
So, if it's really UCS-2, you can download its specification (ISO/IEC 10646, I believe) and figure out exactly which codes out of those 64K are used and thus should form your initial LZW alphabet.
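As a sketch of the encoder loop itself, under explicitly assumed conventions (initial alphabet is all 2-byte values 0..65535, EOD is code 65536, new dictionary codes start at 65537; check your project's actual spec, which may differ): for this input it emits the four codes 32767, 32768, 65537 and then EOD, which at 17 bits per code is 68 bits and so fills three 4-byte ints once packed. The bit-packing step is left out here.

#include <stdint.h>
#include <stdio.h>

/* LZW over 16-bit symbols. Linear-search dictionary for clarity,
   not speed; conventions above are assumptions, not the course spec. */
#define ALPHABET 65536u
#define EOD_CODE 65536u
#define MAX_DICT 4096u

typedef struct { uint32_t prefix; uint16_t symbol; } Entry;
static Entry dict[MAX_DICT];
static uint32_t dict_len = 0;

/* Code for the string (prefix + symbol), or UINT32_MAX if absent. */
static uint32_t find(uint32_t prefix, uint16_t symbol)
{
    for (uint32_t i = 0; i < dict_len; i++)
        if (dict[i].prefix == prefix && dict[i].symbol == symbol)
            return ALPHABET + 1 + i;               /* 65537, 65538, ... */
    return UINT32_MAX;
}

int main(void)
{
    uint16_t input[] = { 32767, 32768, 32767, 32768 };
    size_t n = sizeof input / sizeof input[0];

    uint32_t w = input[0];                         /* current match, as a code */
    for (size_t i = 1; i < n; i++) {
        uint32_t wc = find(w, input[i]);
        if (wc != UINT32_MAX) {
            w = wc;                                /* extend the match */
        } else {
            printf("emit %u\n", (unsigned)w);
            if (dict_len < MAX_DICT) {             /* add (w + symbol) */
                dict[dict_len].prefix = w;
                dict[dict_len].symbol = input[i];
                dict_len++;
            }
            w = input[i];
        }
    }
    printf("emit %u\n", (unsigned)w);
    printf("emit %u (EOD)\n", (unsigned)EOD_CODE);
    return 0;
}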

Convert the integer value to hex value

I have this function in xilinx for giving output to Seven segment.
int result;
XIo_Out32(XPAR_SSG_DECODER_0_BASEADDR, result);
The function gets the int result and puts the output to the seven segment display as a hex value. So basically, if I give result = 11, I would see A as a result on the seven segment. To see a decimal value on the sseg, one approach is to change the Verilog code behind this and change the whole concept of the sseg. Another approach is to write a function that changes the decimal value into a hex value. I've been searching for a good code block for this, but it seems that every one of them prints the values digit by digit with a loop. I need the whole value as a block. Unfortunately I cannot use the C++ libraries, so I have primitive C code. Is there any known algorithm for converting?
Apparently, you want to convert symbol codes from ASCII to the ones from the 7-segment display character set. If so, you may create a simple mapping, maybe an array of codes indexed by ASCII character id. Then, you'll be able to call your function like:
XIo_Out32(XPAR_SSG_DECODER_0_BASEADDR, 'A');
Be careful to implement the mapping table for the whole ASCII range.
EDIT
Sorry, I got your question wrong. You'll have to manually convert the number to an array of decimal digits. You may do it by dividing your number by increasing powers of 10 (10^0, 10^1, 10^2, etc.) and thus get an array of remainders, which is the decimal representation of your number. You may use snprintf as H2CO3 recommends, but I would recommend against it in some embedded applications where RAM is limited; you may even be unable to use sprintf-like functions at all.
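A minimal sketch of that divide-by-10 approach in plain C, avoiding sprintf entirely:

#include <stdio.h>

/* Sketch: produce digit values 0..9, most significant first; a hex
   sseg decoder displays 0-9 exactly as a decimal one would. */
static int to_decimal_digits(unsigned value, unsigned char digits[], int max)
{
    int n = 0;
    do {                                   /* peel digits off, LSD first */
        digits[n++] = (unsigned char)(value % 10);
        value /= 10;
    } while (value != 0 && n < max);

    for (int i = 0, j = n - 1; i < j; i++, j--) {  /* reverse to MSD first */
        unsigned char t = digits[i];
        digits[i] = digits[j];
        digits[j] = t;
    }
    return n;                              /* number of digits produced */
}

int main(void)
{
    unsigned char d[10];
    int n = to_decimal_digits(1234, d, 10);
    for (int i = 0; i < n; i++)
        printf("%d", d[i]);                /* prints 1234 */
    printf("\n");
    return 0;
}

If your decoder latches several nibbles at once, an alternative worth checking against your Verilog is to pack those digits as BCD (123 packed as 0x123), so that passing the packed value to XIo_Out32 makes the existing hex decoder show the decimal digits unchanged.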

Why are hexadecimal numbers prefixed with 0x?

Why are hexadecimal numbers prefixed as 0x?
I understand the usage of the prefix but I don't understand the significance of why 0x was chosen.
Short story: The 0 tells the parser it's dealing with a constant (and not an identifier/reserved word). Something is still needed to specify the number base: the x is an arbitrary choice.
Long story: In the 60's, the prevalent programming number systems were decimal and octal — mainframes had 12, 24 or 36 bits per word, which is nicely divisible by 3 = log2(8).
The BCPL language used the syntax 8 1234 for octal numbers. When Ken Thompson created B from BCPL, he used the 0 prefix instead. This is great because
an integer constant now always consists of a single token,
the parser can still tell right away it's got a constant,
the parser can immediately tell the base (0 is the same in both bases),
it's mathematically sane (00005 == 05), and
no precious special characters are needed (as in #123).
When C was created from B, the need for hexadecimal numbers arose (the PDP-11 had 16-bit words) and all of the points above were still valid. Since octals were still needed for other machines, 0x was arbitrarily chosen (00 was probably ruled out as awkward).
C# is a descendant of C, so it inherits the syntax.
Note: I don't know the correct answer, but the below is just my personal speculation!
As has been mentioned a 0 before a number means it's octal:
04524 // octal, leading 0
Imagine needing to come up with a system to denote hexadecimal numbers, and note we're working in a C style environment. How about ending with h like assembly? Unfortunately you can't - it would allow you to make tokens which are valid identifiers (eg. you could name a variable the same thing) which would make for some nasty ambiguities.
8000h // hex
FF00h // oops - valid identifier! Hex or a variable or type named FF00h?
You can't lead with a character for the same reason:
xFF00 // also valid identifier
Using a hash was probably thrown out because it conflicts with the preprocessor:
#define ...
#FF00 // invalid preprocessor token?
In the end, for whatever reason, they decided to put an x after a leading 0 to denote hexadecimal. It is unambiguous since it still starts with a number character so can't be a valid identifier, and is probably based off the octal convention of a leading 0.
0xFF00 // definitely not an identifier!
It's a prefix to indicate the number is in hexadecimal rather than in some other base. The programming language uses it to tell the compiler.
Example:
0x6400 translates to 6*16^3 + 4*16^2 + 0*16^1 +0*16^0 = 25600.
When the compiler reads 0x6400, it understands the number is hexadecimal with the help of the 0x prefix. In writing, we would usually indicate the base with a subscript, as in (6400)₁₆ or (6400)₈ or whatever.
For binary it would be:
0b00000001
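A quick C check (0x and leading-zero octal literals are standard C; 0b binary literals are a compiler extension until C23):

#include <stdio.h>

int main(void)
{
    int n = 0x6400;          /* hexadecimal literal */
    printf("%d\n", n);       /* 25600 in decimal */
    printf("%o\n", n);       /* 62000: the octal form, written 062000 */
    return 0;
}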
Good day!
The preceding 0 is used to indicate a number in base 2, 8, or 16.
In my opinion, 0x was chosen to indicate hex because 'x' sounds like hex.
Just my opinion, but I think it makes sense.
Good Day!
I don't know the historical reasons behind 0x as a prefix to denote hexadecimal numbers - as it certainly could have taken many forms. This particular prefix style is from the early days of computer science.
As we are used to decimal numbers there is usually no need to indicate the base/radix. However, for programming purposes we often need to distinguish the bases from binary (base-2), octal (base-8), decimal (base-10) and hexadecimal (base-16) - as the most commonly used number bases.
At this point in time it is a convention used to denote the base of a number. I've written the number 29 in all of the above bases with their prefixes:
0b11101: Binary
0o35: Octal, denoted by an o
0d29: Decimal, this is unusual because we assume numbers without a prefix are decimal
0x1D: Hexadecimal
Basically, an alphabet we most commonly associate with a base (e.g. b for binary) is combined with 0 to easily distinguish a number's base.
This is especially helpful because smaller numbers can confusingly appear the same in all the bases: 0b1, 0o1, 0d1, 0x1.
If you were using a rich text editor though, you could alternatively use subscript to denote bases: 1₂, 1₈, 1₁₀, 1₁₆
