How to check or change encoding in USLT (ID3)?

It's not a duplicate; I've read them all.
I have a Nokia N8-00. Its music player supports USLT (Unsynchronised Lyrics/Text). I use a tool called spotdl (https://github.com/Ritiek/Spotify-Downloader) that fetches song titles from Spotify, downloads the songs from other sources (generally YouTube), and merges the metadata as well.
The problem is that the music downloaded by that tool has lyrics on all my devices except the N8. Fortunately, I had one song with embedded lyrics that is supported on my phone too. I analyzed both files and found that their binary content differs only slightly in the USLT section (even though they are different songs). The differences are:
The one that supports :
55 53 4C 54 00 00 0A 56 00 00 03 58 58 58
The one that doesn't :
55 53 4C 54 00 00 07 38 00 00 01 58 58 58
(These sequences are for "USLT" declaration in the file)
I think it's an encoding difference. If I'm right, which encoding is used in which file? If it's not an encoding difference, what is it?
I know these sequences alone can't explain the whole situation, so here are the files I'm working with: https://github.com/gaurav712/music.
I don't need the supported USLT itself; I'm just curious, because I want to implement this in C (though I don't need language-specific help).

Here is what I got:
55 53 4C 54
Translates to:
USLT
So we got that right. Now, I believe we can merge that result with this answer:
Frame ID $xx xx xx xx (four characters)
Size $xx xx xx xx
Flags $xx xx
Encoding $xx
Text
(Taken from: ID3v2 Specification)
(or see this: https://web.archive.org/web/20161022105303/http://id3.org/id3v2-chapters-1.0)
Now, I couldn't get this from the source (because the site was down) but there is also this:
Encoding flag explanation:
• $00 ISO-8859-1 [ISO-8859-1]
• $01 UTF-16 [UTF-16]
• $02 UTF-16BE [UTF-16]
• $03 UTF-8 [UTF-8]
So, according to these findings (which I'm not too sure about), the one that is supported is UTF-8 encoded and the one not supported is UTF-16.
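Putting the frame layout and the encoding table together, here is a rough C sketch of how the check could be done (assuming the whole ID3v2 tag has already been read into a buffer; the helper names are illustrative, a real parser would walk the tag frame by frame rather than scanning for the ID, and note that v2.4 frame sizes are synchsafe while v2.3 sizes are plain big-endian):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decode a 4-byte frame size: synchsafe (7 bits per byte) for ID3v2.4,
   plain big-endian for ID3v2.3. */
static uint32_t frame_size(const uint8_t *p, int v2_major)
{
    if (v2_major >= 4)
        return ((uint32_t)(p[0] & 0x7F) << 21) | ((uint32_t)(p[1] & 0x7F) << 14) |
               ((uint32_t)(p[2] & 0x7F) << 7)  |  (uint32_t)(p[3] & 0x7F);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Return the text-encoding byte (00/01/02/03) of the first USLT frame
   found in buf, or -1 if there is none. */
int uslt_encoding(const uint8_t *buf, size_t len, int v2_major)
{
    for (size_t i = 0; i + 11 <= len; i++) {
        if (memcmp(buf + i, "USLT", 4) == 0) {
            uint32_t payload = frame_size(buf + i + 4, v2_major);
            (void)payload;              /* payload length, if you need the text */
            return buf[i + 10];         /* frame ID (4) + size (4) + flags (2) */
        }
    }
    return -1;
}

With the two files above, this would report 03 (UTF-8) for the supported one and 01 (UTF-16) for the unsupported one.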
EDIT
I've downloaded and viewed your mp3 files for further inspection. Here are my new findings:
First of all, we were correct about the encodings: the supported file uses UTF-8 (encoding byte 03) and the unsupported file uses UTF-16 (encoding byte 01).
Does this mean you can just turn '01' into '03' and it'll magically work? I doubt it. It depends on the player's implementation: what if it sees the '\x00' bytes and interprets one as the end of the string (i.e. the end of the USLT payload)? To test this, you can try manually converting the text encoding in the file (removing the extra bytes).
Secondly, running eyeD3 on Linux against both files, I found that:
supported.mp3 -> ID3 v2.4
unsupported.mp3 -> ID3 v2.3
Perhaps that's an issue?
Also, note that the USLT tag sits at a different offset in each file.
On Linux, there are further tools to give you extra information if need be: mp3info, id3info, id3tool, exiftool, eyeD3, and lltag are a few examples. However, I think the main problem is the text encoding. I was able to recover the lyrics just fine using the tools above, but some of them give different answers because the ID3 versions differ, and so on.

Related

Compare data in bytes to a hex number

I need to read data from stdin where I need to check whether 2 bytes are equal to 0x0001 or 0x0002 and then, depending on the value, execute one of two pieces of code. Here is what I tried:
uint16_t type;
uint16_t type1 = 0x0001;
while (read(0, &type, sizeof(type)) > 0) {
    if (type == type1) {
        // do smth
    }
}
I'm not sure what numeric system read uses, since printing the value of type both as decimal and as hex returns something completely different even when I write 0001 on stdin.
The characters 0x0001 and 0x0002 are control characters; you won't be able to type them manually.
If you are trying to compare character input, use gets (or better, fgets) and atoi to convert the character-based input from stdin into an integer; or, if you want to continue using read, compare against the characters '0' '0' '0' '1'/'2' instead of the binary values to get proper results.
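For example, a minimal sketch of the second suggestion (comparing the typed characters directly; the expected input format is an assumption here, namely four digits such as 0001 on stdin):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4];

    /* Read the four characters the user actually types, e.g. "0001". */
    if (read(STDIN_FILENO, buf, sizeof buf) == (ssize_t)sizeof buf) {
        if (memcmp(buf, "0001", 4) == 0)
            puts("got 0001");
        else if (memcmp(buf, "0002", 4) == 0)
            puts("got 0002");
        else
            printf("unexpected input: %.4s\n", buf);
    }
    return 0;
}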
You didn't say what kind of OS you're using. If you're using Unix or Linux or MacOS, at the command line, it can be pretty easy to rig up simple ways of testing this sort of program. If you're using Windows, though, I'm sorry, my answer here is not going to be very useful.
(I'm also going to be making heavy use of the Unix/Linux "pipe" concept, that is, the vertical bar | which sends the output of one program to the input of the next. If you're not familiar with this concept, my answer may not make much sense at first.)
If it were me, after compiling your program to a.out, I would just run
echo 00 00 00 01 00 02 | unhex | a.out
In fact, I did compile your program to a.out, and tried this, and... it didn't work. It didn't work because my machine (like yours) uses little-endian byte order, so what I should have done was
echo 00 00 01 00 02 00 | unhex | a.out
Now it works perfectly.
The only problem, as you've discovered if you tried it, is that unhex is not a standard program. It's one of my own little utilities in my personal bin directory. It's massively handy, and it'd be super useful if something like it were standard, but it's not. (There's probably a semistandard Linux utility for doing this sort of thing, but I don't know what it is.)
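(For reference, an unhex-style filter is only a few lines of C. This is a sketch, not the author's actual utility: it reads whitespace-separated hex byte values on stdin and writes the corresponding raw bytes to stdout.)

#include <stdio.h>

int main(void)
{
    unsigned int byte;

    /* "00 00 01 00 02 00" on stdin becomes six raw bytes on stdout. */
    while (scanf(" %2x", &byte) == 1)
        putchar((int)byte);
    return 0;
}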
What is pretty standard, and you probably can use it, is a base-64 decoder. The problem there is that there's no good way to hand-generate the input you need. But I can generate it for you (using my unhexer). If you can run
echo AAABAAIA | base64 -d | a.out
that will perform an equivalent test of your program. AAABAAIA is the base-64 encoding of the three 16-bit little-endian words 0000, 0001, and 0002, as you can see by running
echo AAABAAIA | base64 -d | od -h
On some systems, the base64 command uses a -D option, rather than -d, to request decoding. That is, if you get an error message like "invalid option -- d", you can try
echo AAABAAIA | base64 -D | a.out

freebcp: "Unicode data is odd byte size for column. Should be even byte size"

This file works fine (UTF-8):
$ cat ok.txt
291054 Ţawī Rifā
This file causes an error (UTF-8):
$ cat bad.txt
291054 Ţawī Rifā‘
Here's the message:
$ freebcp 'DB.dbo.table' in bad.txt ... -c
Starting copy...
Msg 20050, Level 4
Attempt to convert data stopped by syntax error in source field
Msg 4895, Level 16, State 2
Server '...', Line 1
Unicode data is odd byte size for column 2. Should be even byte size.
Msg 20018, Level 16
General SQL Server error: Check messages from the SQL Server
The only difference is the last character, which is Unicode U+2018 (left single quotation mark).
Any idea what is causing this error?
The SQL Server uses UTF-16LE (though TDS starts with UCS-2LE and switches over I believe)
The column in question is nvarchar(200)
Here's the packet sent right before the error:
packet.c:741:Sending packet
0000 07 01 00 56 00 00 01 00-81 02 00 00 00 00 00 08 |...V.... ........|
0010 00 38 09 67 00 65 00 6f-00 6e 00 61 00 6d 00 65 |.8.g.e.o .n.a.m.e|
0020 00 69 00 64 00 00 00 00-00 09 00 e7 90 01 09 04 |.i.d.... ...ç....|
0030 d0 00 34 04 6e 00 61 00-6d 00 65 00 d1 ee 70 04 |Ð.4.n.a. m.e.Ñîp.|
0040 00 13 00 62 01 61 00 77-00 2b 01 20 00 52 00 69 |...b.a.w .+. .R.i|
0050 00 66 00 01 01 18 - |.f....|
Update: This issue has apparently been fixed in FreeTDS v1.00.16, released 2016-11-04.
I can reproduce your issue using FreeTDS v1.00.15. It definitely looks like a bug in freebcp that causes it to fail when the last character of a text field has a Unicode code point of the form U+20xx. (Thanks to @srutzky for correcting my conclusion as to the cause.) As you noted, this works ...
291054 Ţawī Rifā
... and this fails ...
291054 Ţawī Rifā‘
... but I found that this also works:
291054 Ţawī Rifā‘x
So, an ugly workaround would be to run a script against your input file that would append a low-order non-space Unicode character to each text field (e.g., x which is U+0078, as in the last example above), use freebcp to upload the data, and then run an UPDATE statement against the imported rows to strip off the extra character.
Personally, I would be inclined to switch from FreeTDS to Microsoft's SQL Server ODBC Driver for Linux, which includes the bcp and sqlcmd utilities when installed using the instructions described here:
https://gallery.technet.microsoft.com/scriptcenter/SQLCMD-and-BCP-for-Ubuntu-c88a28cc
I just tested it under Xubuntu 16.04, and although I had to tweak the procedure a bit to use libssl.so.1.0.0 instead of libssl.so.0.9.8 (and the same for libcrypto), once I got it installed the bcp utility from Microsoft succeeded where freebcp failed.
If the SQL Server ODBC Driver for Linux will not work on a Mac then another alternative would be to use the Microsoft JDBC Driver 6.0 for SQL Server and a little bit of Java code, like this:
import java.sql.Connection;
import java.sql.DriverManager;

import com.microsoft.sqlserver.jdbc.SQLServerBulkCSVFileRecord;
import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;

String connectionUrl = "jdbc:sqlserver://servername:49242"
        + ";databaseName=myDb"
        + ";integratedSecurity=false";
String myUserid = "sa", myPassword = "whatever";
String dataFileSpec = "C:/Users/Gord/Desktop/bad.txt";
try (
        Connection conn = DriverManager.getConnection(connectionUrl, myUserid, myPassword);
        SQLServerBulkCSVFileRecord fileRecord =
                new SQLServerBulkCSVFileRecord(dataFileSpec, "UTF-8", "\t", false);
        SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(conn)) {
    fileRecord.addColumnMetadata(1, "col1", java.sql.Types.NVARCHAR, 50, 0);
    fileRecord.addColumnMetadata(2, "col2", java.sql.Types.NVARCHAR, 50, 0);
    bulkCopy.setDestinationTableName("dbo.freebcptest");
    bulkCopy.writeToServer(fileRecord);
} catch (Exception e) {
    e.printStackTrace(System.err);
}
This issue has nothing to do with UTF-8, given that the data being transmitted, as shown in the transmission packet (bottom of the question), is UTF-16 Little Endian (just as SQL Server would be expecting). And it is perfectly good UTF-16LE, all except for the missing final byte, just like the error message implies.
The problem is most likely a minor bug in freetds that incorrectly applies logic meant to strip trailing spaces from variable-length string fields. There are no trailing spaces, you say? Well, if the last character hadn't gotten chopped off, it would be a little clearer (but then there wouldn't be this error either). So, let's look at the packet to see if we can reconstruct it.
The error in the data is probably being overlooked because the packet contains an even number of bytes. But not all fields are double-byte, so it doesn't need to be an even number. If we know what the good data is (prior to the error), then we can find a starting point in the data and move forwards. It is best to start with Ţ, as it will hopefully be above the 255 / FF value and hence have two non-zero bytes. Anything below will have a 00, and many of the characters have that on both sides. While we should be able to assume Little Endian encoding, it is best to know for certain. To that end, we need at least one character that has two non-00 bytes, and bytes that are different (one of the characters, ā, is 01 for both bytes, which does not help determine ordering). The first character of this string field, Ţ, confirms this, as it is Code Point U+0162 yet shows up as 62 01 in the packet.
Below are the characters, in the same order as the packet, their UTF-16 LE values, and a link to their full details. The first character's byte sequence of 62 01 gives us our starting point, and so we can ignore the initial 00 13 00 of line 0040 (they have been removed in the copy below for readability). Please note that the "translation" shown to the right does not interpret Unicode, so the 2-byte sequence of 62 01 is displayed as 62 by itself (i.e. lower-case Latin "b") and 01 by itself (i.e. non-printable character; displayed as ".").
0040 xx xx xx 62 01 61 00 77-00 2b 01 20 00 52 00 69 |...b.a.w .+. .R.i|
0050 00 66 00 01 01 18 ?? - |.f....|
Ţ -- 62 01 -- http://unicode-table.com/en/0162/
a -- 61 00 -- http://unicode-table.com/en/0061/
w -- 77 00 -- http://unicode-table.com/en/0077/
ī -- 2B 01 -- http://unicode-table.com/en/012B/
(space) -- 20 00 -- http://unicode-table.com/en/0020/
R -- 52 00 -- http://unicode-table.com/en/0052/
i -- 69 00 -- http://unicode-table.com/en/0069/
f -- 66 00 -- http://unicode-table.com/en/0066/
ā -- 01 01 -- http://unicode-table.com/en/0101/
‘ -- 18 20 -- http://unicode-table.com/en/2018/
As you can see, the last character is really 18 20 (i.e. a byte-swapped 20 18 due to the Little Endian encoding), not 01 18 as it might appear if reading the packet starting at the end. Somehow, the final byte -- hex 20 -- is missing, hence the Unicode data is odd byte size error.
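(If you want to double-check that byte order yourself, here is a tiny C sketch; the code-point array is just the field transcribed from the table above, and it prints the expected UTF-16LE byte sequence ending in 01 01 18 20.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* "Ţawī Rifā‘" as Unicode code points (all within the BMP). */
    const uint16_t cp[] = { 0x0162, 0x0061, 0x0077, 0x012B, 0x0020,
                            0x0052, 0x0069, 0x0066, 0x0101, 0x2018 };

    for (size_t i = 0; i < sizeof cp / sizeof cp[0]; i++)
        printf("%02X %02X ", cp[i] & 0xFF, cp[i] >> 8);   /* low byte first */
    putchar('\n');   /* 62 01 61 00 77 00 2B 01 20 00 52 00 69 00 66 00 01 01 18 20 */
    return 0;
}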
Now, 20 by itself, or followed by 00, is a space. This would explain why @GordThompson was able to get it working by adding an additional character to the end (the final character was no longer trimmable). This could be further proven by ending with another character that is a U+20xx Code Point. For example, if I am correct about this, then ending with ⁄ -- Fraction Slash U+2044 -- would give the same error, while ending with ⅄ -- Turned Sans-Serif Capital Y U+2144 -- even with the ‘ just before it, should work just fine (@GordThompson was kind enough to prove that ending with ⅄ did work, and that ending with ⁄ resulted in the same error).
If the input file is null (i.e. 00) terminated, then it could simply be the 20 00 ending sequence that does it, in which case ending with a newline might fix it. This can also be proven by testing a file with two lines: line 1 is the existing row from bad.txt, and line 2 is a line that should work. For example:
291054 Ţawī Rifā‘
999999 test row, yo!
If the two-line file shown directly above works, that proves that it is the combination of a U+20xx Code Point and that Code Point being the last character (of the transmission more than of the file) that exposes the bug. BUT, if this two-line file also gets the error, then it proves that having a U+20xx Code Point as the last character of a string field is the issue (and it would be reasonable to assume that this error would happen even if the string field were not the final field of the row, since the null terminator for the transmission has already been ruled out in this case).
It seems like either this is a bug with freetds / freebcp, or perhaps there is a configuration option to not have it attempt trimming trailing spaces, or maybe a way to get it to see this field as being NCHAR instead of NVARCHAR.
UPDATE
Both @GordThompson and the O.P. (@NeilMcGuigan) have tested and confirmed that this issue exists regardless of where the string field is in the file: in the middle of a row, at the end of the row, on the last row, and not on the last row. Hence it is a general issue.
And in fact, I found the source code and it makes sense that the issue would happen since there is no consideration for multi-byte character sets. I will file an Issue on the GitHub repository. The source for the rtrim function is here:
https://github.com/FreeTDS/freetds/blob/master/src/dblib/bcp.c#L2267
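(For illustration only, and emphatically not the FreeTDS source: a right trim that respected UTF-16LE would have to step backwards one 2-byte code unit at a time and only drop real 20 00 space characters, never a lone trailing 0x20 byte that is actually half of a code point such as U+2018. A sketch:)

#include <stddef.h>
#include <stdint.h>

/* Return the new length of a UTF-16LE buffer with trailing U+0020
   spaces removed. len is a byte count and must be even. */
size_t rtrim_utf16le(const uint8_t *buf, size_t len)
{
    while (len >= 2 && buf[len - 2] == 0x20 && buf[len - 1] == 0x00)
        len -= 2;
    return len;
}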
Regarding this statement:
The SQL Server uses UTF-16LE (though TDS starts with UCS-2LE and switches over I believe)
From an encoding stand-point, there is really no difference between UCS-2 and UTF-16. The byte sequences are identical. The only difference is in the interpretation of Surrogate Pairs (i.e. Code Points above U+FFFF / 65535). UCS-2 has the Code Points used to construct Surrogate Pairs reserved, but there was no implementation at that time of any Surrogate Pairs. UTF-16 simply added the implementation of the Surrogate Pairs in order to create Supplementary Characters. Hence, SQL Server stores and retrieves UTF-16 LE data without a problem. The only issue is that the built-in functions don't know how to interpret Surrogate Pairs unless the Collation ends with _SC (for Supplementary Characters), and those Collations were introduced in SQL Server 2012.
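(To make the surrogate-pair point concrete, here is the standard UTF-16 encoding arithmetic for a supplementary character; the particular code point is just an example.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t cp = 0x1F600;                  /* any code point above U+FFFF */
    uint32_t v  = cp - 0x10000;
    uint16_t high = (uint16_t)(0xD800 + (v >> 10));    /* lead surrogate  */
    uint16_t low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* trail surrogate */

    /* UCS-2 simply has no representation for this character;
       UTF-16 encodes it as the pair D83D DE00. */
    printf("U+%04X -> %04X %04X\n", (unsigned)cp, high, low);
    return 0;
}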
This might be an encoding issue with the source file.
Since you are using non-ASCII characters, the source file itself probably needs to be Unicode. Other encodings use a varying number of bytes (one to three) to encode a single character; e.g. your Unicode U+2018 is 0xE2 0x80 0x98 in UTF-8.
Your packet ends with .R.i.f....| while your ā‘ should be there, and the error shows Server '...', Line 1.
Try to find out the encoding of your source file (consider big and little endian too) and try to convert your file to a known Unicode format.
This might solve it:
In your /etc/freetds/freetds.conf, add:
client charset = UTF-8
I also found this about the use utf-16 flag:
use utf-16: Instead of using UCS-2 for database-wide character encoding, use UTF-16. Newer Windows versions use this encoding instead of UCS-2. This could result in some issues if clients assume that a character is always 2 bytes.

LPCXpresso error CreateProcess: No such file or directory

I know there are multiple questions regarding this subject, but they did not help.
When trying to compile, whatever, I keep getting the same error:
arm-none-eabi-gcc.exe: error: CreateProcess: No such file or directory
I guess it means that it can not find the compiler.
I have tried tweaking the path settings
C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\nxp\LPCXpresso_7.6.2
326\lpcxpresso\tools\bin;
Seems to be right?
I have tried using Sysinternals Process Monitor.
I can see that a lot of arm-none-eabi-gcc.exe operations get a NAME NOT FOUND result, but there are a lot of successful results too.
I have also tried reinstalling the compiler and LPCXpresso, with no luck.
If I type arm-none-eabi-gcc -v I get the version, so the compiler itself works,
but when I try to compile from CMD like this: arm-none-eabi-gcc led.c
I get the same error as stated above:
arm-none-eabi-gcc.exe: error: CreateProcess: No such file or directory
I tried playing around more with PATH in the environment variables, no luck. I feel like something is stopping LPCXpresso from finding the compiler.
The only antivirus on this computer is Avira, and I disabled it. I also allowed the compiler and LPCXpresso through the firewall.
I have tried some more things; I will add them shortly, after trying to reproduce the test.
It seems your problem is a happy mess with Vista and GCC. Long story short, a CRT function, access, has a different behavior on Windows and Linux. This difference is actually mentioned on Microsoft documentation, but the GCC folks didn't notice. This leads to a bug on Vista because this version of Windows is more strict on this point.
This bug is mentioned here : https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33281
I have no proof that your problem comes from here, but the chances are good.
The solutions:
Do not use Vista
Recompile arm-none-eabi-gcc.exe with the flag -D__USE_MINGW_ACCESS
Patch arm-none-eabi-gcc.exe
The 3rd is the easiest, but it's a bit tricky. The goal is to hijack the access function and add an instruction to prevent the undesired behavior. To patch your gcc, you have two solutions: you upload your .exe and I patch it for you, or I give you the instructions to patch it yourself. (I can also patch it for you, then give the instructions if it works.) The patching isn't really hard and doesn't require advanced knowledge, but you must be rigorous.
As I said, I don't have this problem myself, so I don't know if my solution really works. The patch seems to be working for this problem.
EDIT2:
The exact problem is that the Linux access has a mode flag to check whether a file is executable. The Windows access cannot check for this. The behavior of most Windows versions is simply to ignore this flag and check whether the file exists instead, which usually gives the same result. The problem is that Vista doesn't ignore it: whenever access is used to check for executability, it returns an error. This leads GCC programs to think that some executables are not there. The patch induced by -D__USE_MINGW_ACCESS, or done manually, is to delete the flag when access is called, thus checking for existence instead, just like other Windows versions do.
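(For what it's worth, the workaround amounts to something like the following wrapper. This is a simplified sketch of what -D__USE_MINGW_ACCESS effectively does, not the exact MinGW header; X_OK and its value are the usual POSIX ones.)

#include <io.h>      /* _access() from the Windows CRT */

#ifndef X_OK
#define X_OK 1       /* "is it executable?" bit used by the GCC sources */
#endif

/* Mask out X_OK before calling the CRT, so Vista's stricter _access()
   never sees a mode it rejects; the check degrades to an existence test. */
static int mingw_style_access(const char *path, int mode)
{
    return _access(path, mode & ~X_OK);
}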
EDIT:
The patching is actually needed for every GCC program that invokes other executables, not only gcc.exe. So far that means gcc.exe and collect2.exe.
Here are the patching instructions:
Backup your arm-none-eabi-gcc.exe.
Download and install CFF Explorer (direct link here).
Open arm-none-eabi-gcc.exe with CFF Explorer.
On the left panel, click on Import Directory.
In the module list that appears, click on the msvcrt.dll line.
In the import list that appears, find _access. Be careful here: the list is long, and there are multiple _access entries. The last one (the very last entry for me) is probably the right one.
When you click on the _access line, an address should appear on the list header, in the 2nd column 2nd row, just below FTs(IAT). Write down that address on notepad (for me, it is 00180948, it may be different). I will refer to this address as F.
On the left panel, click on Address Converter.
Three fields should appear, enter address F in the File Offset field.
Write down on notepad a 6 bytes value : the first two bytes are FF 25, the last 4 are the address that appeared in the VA field, IN REVERSE. For example, if 00586548 appeared in the VA field, write down FF 25 48 65 58 00 (spaces added for legibility). I will refer to this value as J. This value J is the instruction that jumps to the _access function.
On the left panel, click on Section Headers.
In the section list that appeared on the right, click on the .text line (the .text section is where the code resides).
In the editor panel that appeared below, click on the magnifier and, in the Hex search bar, search for a run of eleven 90 bytes (9090909090...; 90 is NOP in assembly). This is to find a code cave (unused space) to insert the patch, which is 11 bytes long. Once you have found a code cave, write down the offset of the first 90. The exact offset is displayed at the very bottom as Pos : xxxxxxxx. I will refer to this offset as C.
Use the editor to change that run of eleven 90 bytes: the first 5 bytes are 80 64 E4 08 06. These 5 bytes are the instruction that prevents the wrong behavior. The last 6 bytes are the value J (edit the next 6 bytes to J, e.g. FF 25 48 65 58 00), to jump back to the _access function.
Click on the arrow icon (Go To Offset) a bit below, and enter 0, to navigate to the beginning of the file.
Use the Hex search bar again to search for value J. If you find the bytes you just modified, skip that occurrence. The J value you need is located among many values containing FF 25 and 90 90; that is the DLL jump table. Write down the offset of the value J you found (the offset of the first byte, FF). I will refer to this offset as S. Note 1: If you can't find the value, maybe you picked the wrong _access in step 6, or did something wrong between steps 6 and 10. Note 2: The search bar doesn't loop when it hits the end; go to offset 0 manually to search again.
Use a hexadecimal 32-bit two's complement calculator (like this one : calc.penjee.com) to calculate C - S - 5. If your offset C is 8C0 and your offset S is 6D810, you must obtain FF F9 30 AB (8C0 minus 6D810, minus 5). (The small sketch after these instructions reproduces the same arithmetic in C.)
Replace the value J you found in the file (at step 16) by 5 bytes : the first byte is E9, the last 4 are the result of the last operation, IN REVERSE. If you obtained FF F9 30 AB, you must replace the value J (ex: FF 25 48 65 58 00) by E9 AB 30 F9 FF. The 6th byte of J can be left untouched. These 5 bytes are the jump to the patch.
File -> Save
Notes: You should have modified a total of 16 bytes. If the patched program crashes, you did something wrong; done correctly, this patch can't induce a crash even if it doesn't fix the problem.
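(As referenced above, here is the jump arithmetic from steps 17-18 in C: the displacement of an E9 JMP rel32 is the target minus the address of the jump minus 5, taken as a 32-bit two's complement value and stored little-endian. The offsets below are just the example numbers from the instructions.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t C = 0x8C0;      /* offset of the patch (code cave)           */
    uint32_t S = 0x6D810;    /* offset of the original jump-table entry J */
    uint32_t rel = C - S - 5;                 /* wraps to FFF930AB        */

    printf("rel32 = %08X\n", rel);
    printf("patch = E9 %02X %02X %02X %02X\n",
           rel & 0xFF, (rel >> 8) & 0xFF,
           (rel >> 16) & 0xFF, (rel >> 24) & 0xFF);   /* E9 AB 30 F9 FF */
    return 0;
}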
Let me know if you have difficulties somewhere.

Creating applications on MIFARE DESFire EV1 cards

I have a project to create an attendance system using MIFARE DESFire EV1 cards.
The reader brand that I need to use for this project only supports ISO 7816-x, so I need to use the DESFire ISO 7816-4 APDU wrapping mode to send commands to the card reader.
I also have access to the NXP document resources.
Up to now I can run a few commands like Get Version, Get Application IDs, and Free Memory on the card.
All these commands can be run in plain, with no security required.
However, I couldn't create an application on this card yet.
I'm sure my command for creating an application is correct, but it is failing with code 0x7E (Length Error).
Here is my create application command, which is failing:
<- 91 7E
I'd like to know:
Am I running the command in the correct sequence?
Is it required to authenticate before creating applications on the card?
The last byte represents the number of keys you want to use in that application. On a DESFire card, at most 14 keys can be created per application, so the number of keys should be from 0x01 to 0x0E.
This command creates an application for me (with AES keys, hence the 0x80 bit in the num_keys byte).
(90) ca (00 00 05) 33 22 11 0b 84 (00)
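(As a rough sketch of how that wrapped APDU can be assembled in C, reproducing the answer's example bytes. The helper name and buffer handling are illustrative, and actually sending the APDU through your reader's API is out of scope.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Build the ISO 7816-4 wrapped CreateApplication command:
   CLA 90, INS CA, P1 P2 00 00, Lc 05, data = AID(3) + key settings +
   number of keys, Le 00. Returns the APDU length (11 bytes). */
size_t build_create_application(uint8_t *apdu, const uint8_t aid[3],
                                uint8_t key_settings, uint8_t num_keys)
{
    size_t n = 0;
    apdu[n++] = 0x90;            /* CLA: DESFire native command wrapping */
    apdu[n++] = 0xCA;            /* INS: CreateApplication               */
    apdu[n++] = 0x00;            /* P1                                   */
    apdu[n++] = 0x00;            /* P2                                   */
    apdu[n++] = 0x05;            /* Lc: 3-byte AID + settings + num keys */
    memcpy(apdu + n, aid, 3); n += 3;
    apdu[n++] = key_settings;
    apdu[n++] = num_keys;        /* 0x01..0x0E, OR with 0x80 for AES     */
    apdu[n++] = 0x00;            /* Le                                   */
    return n;
}

int main(void)
{
    uint8_t apdu[16];
    const uint8_t aid[3] = { 0x33, 0x22, 0x11 };
    size_t n = build_create_application(apdu, aid, 0x0B, 0x84);

    for (size_t i = 0; i < n; i++)
        printf("%02X ", apdu[i]);    /* 90 CA 00 00 05 33 22 11 0B 84 00 */
    putchar('\n');
    return 0;
}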

How can I tell what Database format a file (or set of files) was created with (in Delphi)?

I have a number of data files created by many different programs. Is there a way to determine the database, and the version of the database, that was used to create a given data file?
For example, I'd like to identify which files are created from Microsoft Access, dBASE, FileMaker, FoxPro, SQLite or others.
I really just want to somehow quickly scan the files, and display information about them, including source Database and Version.
For reference, I'm using Delphi 2009.
First of all, check the file extension. Take a look at the corresponding Wikipedia article, or other sites.
Then you can guess the file format from its so-called "signature": usually the first few bytes of the content, which are enough to identify the format.
You'll find an up-to-date list on Gary Kessler's very nice website.
For instance, here is how our framework identifies the MIME format from the file content, on the server side:
function GetMimeContentType(Content: Pointer; Len: integer;
  const FileName: TFileName=''): RawUTF8;
begin // see http://www.garykessler.net/library/file_sigs.html for magic numbers
  result := '';
  if (Content<>nil) and (Len>4) then
    case PCardinal(Content)^ of
    $04034B50: Result := 'application/zip'; // 50 4B 03 04
    $46445025: Result := 'application/pdf'; // 25 50 44 46 2D 31 2E
    $21726152: Result := 'application/x-rar-compressed'; // 52 61 72 21 1A 07 00
    $AFBC7A37: Result := 'application/x-7z-compressed'; // 37 7A BC AF 27 1C
    $75B22630: Result := 'audio/x-ms-wma'; // 30 26 B2 75 8E 66
    $9AC6CDD7: Result := 'video/x-ms-wmv'; // D7 CD C6 9A 00 00
    $474E5089: Result := 'image/png'; // 89 50 4E 47 0D 0A 1A 0A
    $38464947: Result := 'image/gif'; // 47 49 46 38
    $002A4949, $2A004D4D, $2B004D4D:
      Result := 'image/tiff'; // 49 49 2A 00 or 4D 4D 00 2A or 4D 4D 00 2B
    $E011CFD0: // Microsoft Office applications D0 CF 11 E0 = DOCFILE
      if Len>600 then
        case PWordArray(Content)^[256] of // at offset 512
        $A5EC: Result := 'application/msword'; // EC A5 C1 00
        $FFFD: // FD FF FF
          case PByteArray(Content)^[516] of
          $0E,$1C,$43: Result := 'application/vnd.ms-powerpoint';
          $10,$1F,$20,$22,$23,$28,$29: Result := 'application/vnd.ms-excel';
          end;
        end;
    else
      case PCardinal(Content)^ and $00ffffff of
      $685A42: Result := 'application/bzip2'; // 42 5A 68
      $088B1F: Result := 'application/gzip'; // 1F 8B 08
      $492049: Result := 'image/tiff'; // 49 20 49
      $FFD8FF: Result := 'image/jpeg'; // FF D8 FF DB/E0/E1/E2/E3/E8
      else
        case PWord(Content)^ of
        $4D42: Result := 'image/bmp'; // 42 4D
        end;
      end;
    end;
  if (Result='') and (FileName<>'') then begin
    case GetFileNameExtIndex(FileName,'png,gif,tiff,tif,jpg,jpeg,bmp,doc,docx') of
    0: Result := 'image/png';
    1: Result := 'image/gif';
    2,3: Result := 'image/tiff';
    4,5: Result := 'image/jpeg';
    6: Result := 'image/bmp';
    7,8: Result := 'application/msword';
    else begin
      Result := RawUTF8(ExtractFileExt(FileName));
      if Result<>'' then begin
        Result[1] := '/';
        Result := 'application'+LowerCase(Result);
      end;
    end;
    end;
  end;
  if Result='' then
    Result := 'application/octet-stream';
end;
You can write a similar function based on Gary Kessler's list.
There are lots of database engines with hundreds (if not thousands) of versions and formats (binary, CSV, XML...). Many of them are encrypted to protect the content. It is quite "impossible" to identify every database and every format, and it is subject to constant change.
So first of all you have to limit your task to a list of database engines you want to scan. That's what I would do...
First, I do not believe you could do more in a "quick scan" than provide a "possible format". Also, it's very difficult to imagine that any quick technique could be reliable.
dBASE files commonly use the extension .dbf. There are variants of the dBase file format used by FoxPro and Clipper; Wikipedia documents these as xBase. Any dBase library that can open dBase files will also probably be able to (a) show that this is in fact a true dBase file by opening it, and (b) allow you to see which supported variant of the xBase file format is in use.
Access files usually use the .mdb file extension, but can be encrypted with a password. You could probably write your own library that positively identifies the internal content as the "Jet database engine" format (the internal type of file used by Access) but cannot read the content, and I doubt that, short of cracking the password, you could do this reliably.
FileMaker files can have many file extensions, and their internal file formats are not well documented. According to Wikipedia, .fm, .fp3, .fp5 and .fp7 are common file extensions. You will have similar "password" problems with FileMaker databases as with Access. I am not aware of any way to read FileMaker files in Delphi except through ODBC, and even then, I don't think you could build an ODBC-powered "omni-reader" in Delphi, since ODBC requires careful setup, and knowledge of the originating file, to register it as an ODBC data source before it becomes readable through ODBC. Browse/discovery is not a phase that ODBC supports.
SQLite files can have any file extension at all. The easiest way to try to detect it would be to load/open the file using SQLite and see if it opens.
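(For example, a quick pre-check for the SQLite case doesn't even need to load the engine: the documented file format begins with the 16-byte header string "SQLite format 3" followed by a NUL. A small sketch in C; the same check translates directly to Delphi.)

#include <stdio.h>
#include <string.h>

/* Return 1 if the file starts with the documented SQLite 3 header. */
int looks_like_sqlite(const char *path)
{
    char hdr[16];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    size_t n = fread(hdr, 1, sizeof hdr, f);
    fclose(f);
    /* "SQLite format 3" is 15 characters, so the literal's trailing
       '\0' matches the 16th header byte. */
    return n == sizeof hdr && memcmp(hdr, "SQLite format 3", 16) == 0;
}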
The rest of the list is more or less infinite, and the technique would be the same. Just keep rolling more database engines and access-layer libraries into your Katamari Damacy database detector tool.
If you want to start with old database formats as you seem to be, I would investigate using BDE (ancient, but hey, you're talking about ancient stuff), plus ADO, to try to auto-detect and open files.
