I have an old piece of software that was created for DOS. All I have is an executable, which shows me the UI. What this software does is take the details of an order given to a door manufacturing company, store them somewhere, and send the data to a dot-matrix (needle) printer. The data stored includes things like the name and address of the customer, the door dimensions and so on.
The original creators of the software are no longer reachable and I have no idea what language was used to create it. My company wishes to get rid of this system, but right now the only way to access information about old orders is by entering the order number into the UI.
What I need to do is extract this data and convert it to some readable format. I have read research papers, searched this website and many others, but have come up empty. I know that when I enter a new order, the files that get modified have the following extensions:
^01, WRK, DBK, STA
There are other files in the directory with extensions like .ALT, .DBI, .ASC, .BAS, .DDF and .MA3, but those don't seem to have changed in the last 20 years.
Thank you very much in advance guys
File extensions aren't always the best way of finding things out. Extensions are fluid and there has never been much in the way of standardisation, or at least not back in the DOS days. If you look at FilExt, for example, you'll see a fair bit of overlap.
You'd be better off running the files through a tool like TrID/32 (File Identifier v2.10, (C) 2003-11 by M. Pontello), which does a good job of recognising files by their content rather than their file extension. It's not foolproof, but it can identify a few thousand different file types.
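If you want to do the same thing from a script, the python-magic bindings for libmagic take a comparable content-based approach. This is only a hedged sketch of that idea, not part of the original toolchain, and the directory path is a placeholder:

```python
# Sketch: identify files by content (libmagic) instead of by extension.
# Requires the python-magic package plus a libmagic build for your platform.
import glob
import magic

for path in glob.glob("C:/orders/*"):
    print(path, "->", magic.from_file(path))
```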
I used to do a lot of development on DOS back in the day. If you want to contact me off list, bruce dot axtens at gmail dot com, I can help identify the files and perhaps cook up a mechanism to extract the data.
A simple but, heh, still weird question. I hope this is the right section; I couldn't find a decent answer anywhere on the internet.
First of all, it looks strongly like COBOL (ACUCOBOL?), but I am not sure.
I have binary files with the extensions .AC, .vix and .SC, several megabytes each. Most of the files come in pairs, e.g. ADDRESSES.AC + ADDRESSES.vix or COMPANIES.SC + COMPANIES.vix.
In the middle of these files I can see fragments of records; however, they appear to be a set of binary files.
There are no human-readable indexes, maps, dialects, configuration files or headers of the kind I know exist in COBOL databases - nothing that can be parsed with normal text tools. There are no CPY, RDD or XFD files either. Just files with a lot of binary data and fragments of records/IDs (?) from time to time. So I can determine, e.g., that one file contains a set of addresses, the next apparently sales, the next client data, etc.
Questions are:
How to determine which version of COBOL database am I using? (Mostly to obtain a proper tool to extract the data.)
How to convert this database to something that can be parsed and moved to whatever else - even Excel?
I have no access to the computer that was working with this database, as it has been deep in the rubbish bin for many years; nothing else remains, just one folder with the database files.
Has anybody had the same problem?
Here is a sample:
How to determine which version of COBOL database am I using?
You aren't using a database but ISAM files, very likely ACUCOBOL-GT file format 5. For details about the format see the official documentation.
Mostly to obtain a proper tool to extract the data.
The proper tool would be vutil and the command vutil -u -t ADDRESSES.AC ADDRESSES.TXT, which will give you a text file that is very likely in fixed-length form (variable-length form is relatively uncommon) -> step 1.
As the data likely contains binary fields, you have to investigate the data to check the actual format/record layout --> step 2, and calculate the decimal values from the binary fields --> step 3.
But there are tools out there that help you with steps 2 and 3. I can recommend RecordEditor, where you'll see the data and can set field widths/types (defining a record layout, similar to an Excel import, but also allowing binary COBOL types) and convert the resulting file to CSV.
If you don't have access to vutil (or vutil32.exe on Windows) you may find someone who has access to this tool and can convert the data for you; or get an evaluation version (it would be an old download - the new product owner of ACUCOBOL-GT is Micro Focus, which only provides evaluation versions of its not-compatible "Visual COBOL" product).
Alternatively you can reverse-engineer the format (the record layout is in the .vix file; open it with a hex editor and dive in), but this is likely a bigger task...
Summary:
decide how to handle step 1 - vutil/vutil32.exe is the easiest way
step 1: convert the data to text format
step 2: investigate the files and work out the record layout (field widths, types)
step 3: load the file, convert the binary fields, export as CSV
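For step 3, one common kind of binary COBOL field is COMP-3 (packed decimal). As a hedged illustration only - the field offsets, lengths and decimal scale are assumptions you have to take from the record layout you worked out in step 2 - decoding such a field could look like this:

```python
# Sketch: decode a COBOL COMP-3 (packed decimal) field into a number.
def unpack_comp3(raw: bytes, scale: int = 0) -> float:
    digits = ""
    for byte in raw[:-1]:
        digits += f"{byte >> 4}{byte & 0x0F}"   # two digits per byte
    last = raw[-1]
    digits += str(last >> 4)                    # last byte: one digit + sign nibble
    sign = -1 if (last & 0x0F) == 0x0D else 1   # 0xD marks a negative value
    return sign * int(digits) / (10 ** scale)

# 0x12 0x34 0x5C encodes +12345; with an assumed scale of 2 that is 123.45
print(unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2))
```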
You definitely have Vision indexed data files, as you can see the matching .vix files; if you do not have a .vix file then it is a relative file with a set number of records.
If you have Acubench, under the Tools menu there is an option for the Vision File Utility; from there you can unload your Vision data to a text file which is tab-delimited.
From there you can import it into Excel as a tab-delimited file and then re-save it as a CSV file.
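If you'd rather skip the Excel round trip, a few lines of script can do the same tab-to-comma conversion. This is only a sketch and the file names are placeholders:

```python
# Sketch: convert the tab-delimited unload to CSV directly.
import csv

with open("ADDRESSES.TXT", newline="") as src, \
        open("ADDRESSES.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src, delimiter="\t"):
        writer.writerow(row)
```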
So, after all, I suppose this was an ISAM version.
To untangle this, the following tools were needed:
First of all, some migration tool. In my case it was the ISMIGRATE GUI Wizard:
This package comes with isCOBOL 2017 R1; you can find some free demos to download. Note that you don't need to install the whole package, just this migration tool.
Then you can use the ctree2 -> jisam conversion, or just try all the available options (not every one is available because some of the required libraries are paid).
After the conversion you'll end up with something like this:
In worse cases there will be some special ASCII control characters, but you can get rid of them using tools like Notepad++ or even Excel. I mean searching for them by hex code and replacing each one with a space (note that a space replaces exactly one character, which preserves the column ordering), as in the sketch below.
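A hedged sketch of the same clean-up without an editor - the file names are placeholders and the choice of which bytes to keep is an assumption:

```python
# Sketch: replace every non-printable byte with a space so the
# fixed-width columns stay aligned (one byte in, one byte out).
with open("converted.txt", "rb") as f:
    data = f.read()

cleaned = bytes(b if 32 <= b < 127 or b in (0x09, 0x0A, 0x0D) else 0x20
                for b in data)

with open("converted_clean.txt", "wb") as f:
    f.write(cleaned)
```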
Note that you can also use the special ASCII text import function in MS Access/MS Excel. It is really helpful.
To position everything correctly, cut up this file, make all the adjustments and export to e.g. CSV, you can use http://record-editor.sourceforge.net which is free. Note that after several trials I've noticed that other tools, even paid ones, won't really help you. The problem is in the first point: the conversion.
To be sure that everything works fine, you can even load the result into MS Access or similar to work out how to create the foreign keys and reverse-engineer the whole database. Having a working preview, it will be easy to do the same on a larger scale, e.g. in PostgreSQL/Oracle.
That's it. I hope it'll be useful for somebody.
What was UNSUCCESSFUL:
Setting up an Actian Vector server; it is a really great and free tool, but it won't help you significantly here
Trying some online tools (leaving aside the question of who knows where the data will be sent)
Any other ASCII editors, because in my case many of them crashed, I suppose due to the size of the files and some control characters (?)
The situation
I use LabVIEW 2012 on Windows 7
My test result data is written to text files. First, information about the test is written to the file (product type, test type, test conditions, etc.) and after that the logged data is written every second.
All data files are stored in folders, sorted by date, and the file names contain some info about the test
I have years' worth of data files and my search function currently only works on the file names (opening each file to look for search terms takes too much time)
The goal
To write metadata (additional properties, like Word files can have) with the text files so that I can implement a search function to quickly find the file I need
I found here a way to write/read metadata for images, but I need it for text files or something similar.
You would need to be writing to data files that support metadata to begin with (such as the LabVIEW TDMS or datalog file formats). In a similar situation, I would simply use a separate file with the same name but a different extension, for example. Then you can index those file names, and if you want the data you just swap the metadata file name extension and you are good to go.
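As a hedged sketch of that sidecar-file idea (not LabVIEW code, and the key=value format, extensions and paths are all assumptions): each data file foo.txt gets a foo.meta next to it, and the search only ever reads the small .meta files.

```python
# Sketch: index sidecar .meta files and map hits back to the data files.
import glob
import os

def load_meta(meta_path):
    with open(meta_path) as f:
        return dict(line.strip().split("=", 1) for line in f if "=" in line)

def search(root, key, value):
    hits = []
    for meta_path in glob.glob(os.path.join(root, "**", "*.meta"), recursive=True):
        if load_meta(meta_path).get(key) == value:
            # swap the extension back to reach the actual data file
            hits.append(os.path.splitext(meta_path)[0] + ".txt")
    return hits

print(search(r"D:\testdata", "product", "TypeA"))
```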
I would not bother with files and would use a database for results logging. It may not be what you are willing to do, but it is the ultimate solution to the search problem and it opens up a lot of data-analytics possibilities.
The metadata in Word files comes from a feature called "Alternate Data Streams", which is actually a function of NTFS. You can learn more about it here.
I can't say I've ever used this feature. I don't think there is a nice API for LabVIEW, but one could certainly be made. With some research you should be able to play around with this feature and see if it really makes finding files any easier. My understanding is that the data can be lost if transferred over the network or onto a non-NTFS thumb drive.
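For a quick experiment outside LabVIEW, an alternate data stream can be written and read like an ordinary file by appending ":streamname" to the path. This is only a sketch; it works only on NTFS volumes on Windows, and the file and stream names are made up:

```python
# Sketch: write and read an NTFS alternate data stream named "meta".
# As noted above, the stream is lost when the file leaves NTFS.
path = r"D:\testdata\run_2012_06_01.txt"

with open(path + ":meta", "w") as ads:
    ads.write("product=TypeA\ntest=thermal\nconditions=25C\n")

with open(path + ":meta") as ads:
    print(ads.read())
```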
So I'm thinking of writing a small piece of software to run ML experiments on a cluster or an arbitrary abstracted executor and then save them such that I can view them in real time efficiently. The executor software will have write access to the database and will push metrics live. Now, I have not worked much with databases, so I'm not sure what the correct approach for this is. Here is a description of what the system should store:
Each experiment will consist of a single piece of code/archive of code such that it can be executed on the remote machine. For now we will assume all dependencies etc. are installed there. The code will accept command-line arguments. The experiment will also include a YAML schema defining the command-line arguments. The code itself will specify what gets logged (e.g. I will provide a library in the language for registering channels). In terms of logging, you can log numerical values, arrays, text, etc., so quite a few types. Each channel will be allowed a single specification (e.g. 2 columns, the first an int iteration, the second a float error). The code will also provide a special copy of the parameters at the end of the experiment.
When one submits an experiment, one will need to provide its unique group name + parameters for execution. This will launch the experiment and log everything.
For me the easiest way to implement this is with a flat file system. Each project will have a unique name. Each new experiment gets a unique id and a folder inside the project, where I can store the code. Each channel gets a file, which for simplicity can be CSV-delimited, with a special schema file describing what type of values are stored there so I can load them back. The final parameters can also be copied into the folder.
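To make the flat-file idea concrete, here is a very rough sketch of such a layout: one folder per experiment, one CSV per channel plus a small schema file. All names, paths and the schema format are assumptions, not a finished design:

```python
# Sketch: flat-file experiment store with per-channel CSV + schema files.
import csv
import json
import os
import uuid

def create_experiment(project_root, group_name):
    exp_dir = os.path.join(project_root, group_name, uuid.uuid4().hex)
    os.makedirs(exp_dir)
    return exp_dir

def register_channel(exp_dir, name, columns):
    # columns: e.g. [("iteration", "int"), ("error", "float")]
    with open(os.path.join(exp_dir, name + ".schema.json"), "w") as f:
        json.dump(columns, f)
    open(os.path.join(exp_dir, name + ".csv"), "w").close()

def log(exp_dir, name, row):
    with open(os.path.join(exp_dir, name + ".csv"), "a", newline="") as f:
        csv.writer(f).writerow(row)

exp = create_experiment("projects", "mnist-baseline")
register_channel(exp, "training", [("iteration", "int"), ("error", "float")])
log(exp, "training", [1, 0.93])
```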
However, because of the variety of ways I could do this, and the fact that it might require a separate "table" for each experiment, I have no idea whether this is possible in any database system. Additionally, maybe I'm overlooking something very obvious, maybe not; if you have any experience with this, any suggestions/advice are most welcome. The main goal in the end is to be able to serve this to a web interface. Maybe NoSQL could accommodate this, maybe not (I don't know exactly how those work)?
The data for ML would primarily be unstructured. That kind of data will not fit naturally into an RDBMS. Essentially, a document database like MongoDB is far better suited for such cases.
My friend has a system to manage customers. The program per se is terrible, and my friend has lost contact with the developers.
The thing is, my friend has now lost access to the program (something the developers describe as "locked to machine"), so when it was moved to another PC, he lost access to both the program and the data.
I got the mission of trying to recover the database, migrating it to another database, and creating a nicer program for my friend.
Now I need to discover which database was used by the developers. I know that the program was made using Visual Basic, because MSVBVM60.DLL is required.
Is there some program that can read the metadata in the .dat files and discover which database was used?
You can try Determining File Format tools.
Unfortunately, it is possible that your .dat file is a "random access file", not a database.
In that case you cannot read the data if you don't know the structure of the file. The records are written in blocks, and you have to know the exact block size to be able to jump from one block to the next. Possibly some kind of encryption is used as well.
If the file is a random access file (the VB sense) then it shouldn't be too hard to reverse engineer the format.
The first step would be to determine the size of the records, which you should be able to do with little prior knowledge: it's just a matter of finding where strings begin and end and looking for repeats. For example, look for a string that looks like someone's first name and then scan forward until you find the next string that looks like a first name. The distance between them is your record size.
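A hedged sketch of that record-size guess: collect the offsets of printable-text runs and look at the most common gap between consecutive runs. The file name is a placeholder and this is only a heuristic, not a guaranteed method:

```python
# Sketch: guess a fixed record size from the spacing of text runs.
import re
from collections import Counter

with open("CUSTOMERS.DAT", "rb") as f:
    data = f.read()

# offsets of runs of 4+ printable ASCII characters (likely names/addresses)
offsets = [m.start() for m in re.finditer(rb"[ -~]{4,}", data)]
gaps = Counter(b - a for a, b in zip(offsets, offsets[1:]))
print("most common gaps (candidate record sizes):", gaps.most_common(5))
```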
The next step would involve working out the actual fields. This will require a little more work, but basically you'll want to look up a record in the original software and then try to find the corresponding record in the file (for example, look for the first name/last name, which should be relatively easy). Then it's just a matter of matching up fields in the UI with what's in the file: dates, integers and the like.
Of course, that's just a general overview, and that's assuming the file is in VB's native "random access" format. Good luck!
As well as trying to reverse-engineer the file itself, as suggested in other responses, you could also try reverse-engineering the application (DLL or EXE).
There are several decompilers available, for example for VB P-code/native-compiled executables, and a trial version is available. I have not tried this software, but it may give you enough to understand what is being stored in the data file, or help fill in the gaps where you can't figure out the meaning of data from the data file itself.
I have a simple, real-life problem I want to solve using an OO approach. My hard drive is a mess. I have 1,500,000 files: duplicates, completely duplicated folders, and so on...
The first step, of course, is parsing all the files into my database. No problems so far; now I have a lot of nice entries which are kind of "naturally grouped". Examples of this simple grouping can be obtained using simple queries like:
Give me all files bigger than 100MB
Show all files older than 3 days
Get me all files ending with docx
But now assume I want to find groups with a little more natural meaning. There are different strategies for this, depending on the "use case".
Assume I have a bad habit of putting all my downloaded files on the desktop first. Then I extract them to the appropriate folder, without always deleting the ZIP file. Then I move them into an "attic" folder. For the system to find this group of files, a time-oriented search approach, perhaps combined with a "check if the ZIP content is the same as folder X", would be suitable.
Assume another bad habit of duplicating files: having some folders where "the clean files" are located in a nice structure, and other, messy folders. Now my clean folder has 20 picture galleries; my messy folder has 5 duplicated galleries and 1 new one. A human user could easily identify this logic by seeing "Oh, those are all just duplicates, and that's a new one, so I put the new one in the clean folder and trash all the duplicates".
So, now to get to the point:
Which combination of strategies or patterns would you use to tackle such a situation? If I chain filters, the "hardest" one wins, and I have no idea how to let the system "test" for suitable combinations. And it seems to me it is more than just filtering. It's dynamic grouping by combining multiple criteria to find the "best" groups.
One very rough approach would be this:
In the beginning, all files are equal
The first, not so "good" group is the directory
If you are a big, clean directory, you earn points (evenly distributed names)
If all files have the same creation date, you may be "autocreated"
If you are a child of Program-Files, I don't care for you at all
If I move you, group A, into group C, would this improve the "entropy"
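As a hedged sketch of how those rough rules could be combined (a Strategy-style approach where every heuristic is a small scorer and a group's score is the sum), with all names, thresholds and weights made up:

```python
# Sketch: each heuristic scores a candidate group of file paths;
# the combined score decides which grouping is "best".
import os

def evenly_named(files):
    # crude "clean directory" signal: many files sharing a common name prefix
    prefixes = {os.path.basename(f)[:4] for f in files}
    return 2.0 if len(prefixes) <= max(1, len(files) // 5) else 0.0

def same_creation_day(files):
    days = {int(os.path.getctime(f)) // 86400 for f in files}
    return 1.0 if len(days) == 1 else 0.0   # possibly "autocreated"

def ignore_program_files(files):
    return -10.0 if any("Program Files" in f for f in files) else 0.0

SCORERS = [evenly_named, same_creation_day, ignore_program_files]

def score_group(files):
    return sum(scorer(files) for scorer in SCORERS)
```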
What are the best patterns fitting this situation? Strategy, Pipes and Filters, "Grouping"... Any comments welcome!
Edit in reaction to the answers:
The tagging approach:
Of course, tagging crossed my mind. But where do I draw the line? I could create different tag types, like InDirTag, CreatedOnDayXTag, TopicZTag, AuthorPTag. These tags could be structured in a hierarchy, but the question of how to group would remain. But I will give this some thought and add my insights here...
The procrastination comment:
Yes, it sounds like that. But the files are only the simplest example I could come up with (and the most relevant at the moment). It's actually part of the bigger picture of grouping related data in dynamic ways. Perhaps I should have kept it more abstract to stress this: I am NOT searching for a file-tagging tool or a search engine, but for an algorithm or pattern to approach this problem... (or better, ideas, like tagging)
Chris
You're procrastinating. Stop that, and clean up your mess. If it's really big, I recommend the following tactic:
Make a copy of all the stuff on your drive on an external disk (USB or whatever)
Do a clean install of your system
As soon as you find you need something, get it from your copy, and place it in a well defined location
After 6 months, throw away your external drive. Anything that's on there can't be that important.
You can also install Google Desktop, which does not clean your mess, but at least lets you search it efficiently.
If you want to prevent this from happening in the future, you have to change the way you're organizing things on your computer.
Hope this helps.
I don't have a solution (and would love to see one), but I might suggest extracting metadata from your files besides the obvious name, size and timestamps.
in-band metadata such as MP3 ID3 tags, version information for EXEs / DLLs, HTML title and keywords, Summary information for Office documents etc. Even image files can have interesting metadata. A hash of the entire contents helps if looking for duplicates (see the sketch after this list).
out-of-band metadata such as can be stored in NTFS alternate data streams - eg. what you can edit in the Summary tab for non-Office files
your browsers keep information on where you have downloaded files from (though Opera doesn't keep it for long), if you can read it.
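For the "hash of the entire contents" point above, a minimal sketch of duplicate detection - grouping by size first so only same-sized files get hashed; the root path is whatever you point it at:

```python
# Sketch: find duplicate files by size, then by SHA-256 of the contents.
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            with open(path, "rb") as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(path)
    return [group for group in by_hash.values() if len(group) > 1]

for group in find_duplicates(r"C:\Users\me\Desktop"):
    print(group)
```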
You've got a fever, and the only prescription is Tag Cloud! You're still going to have to clean things up, but with tools like TaggCloud or Tag2Find you can organize your files by meta data as opposed to location on the drive. Tag2Find will watch a share, and when anything is saved to the share a popup appears and asks you to tag the file.
You should also get Google Desktop too.