I have been assigned a project to create a new binary file format for storing coordinate information coming out of 3D CAD/CAM software. I would appreciate it if you could point out some articles or books about designing a new binary file format. Thanks for your time :)
I would start by taking a look at other, similar file formats on wotsit.org. The site covers a wide variety of file formats and contains links to their specifications.
By looking at other file formats you'll get ideas about how best to structure your format and how to present the information in your specification.
There's a universal, compact binary notation called ASN.1. It's widely used and there are books available about it. ASN.1 can be compared to XML, but it sits at a lower (more primitive yet more flexible) level than XML. XML itself, especially the binary XML formats mentioned in another answer, would be of great help to you too.
Also, if you have more than just one sequence of data to hold in your file, take a look at Solid File System as a container for several data streams in one file.
If I had the same assignment, I would inspect something already existing, like .OBJ, and then try to implement something similar, probably with minor changes.
Short answer: Don't. Use XML or another text format instead, for readability, extensibility and portability.
Longer answer: CAD/CAM has loads of 'legacy' formats around. I'd look at using one of those (possibly extending it if necessary). And if there's nothing suitable, and XML is considered too bloated and slow, look at binary XML formats instead.
I think what you really need is to figure out what data you have to save. Then you load it into memory and serialize that memory.
Here is a tutorial on serialization in C++. That page also addresses many of the issues involved in saving data.
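To make that idea concrete, here is a minimal sketch in C of the "fill a struct in memory, then write the memory out" approach, with a small header so the format can evolve. All names and the layout are purely illustrative; a real format would also need to pin down endianness and struct padding instead of writing raw structs as shown here.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative layout: a fixed header followed by x/y/z triples. */
    struct coord_header {
        char     magic[4];     /* e.g. "CAD1", identifies the format         */
        uint32_t version;      /* bump this when the layout changes          */
        uint32_t point_count;  /* number of coord_record entries that follow */
    };

    struct coord_record {
        double x, y, z;
    };

    int write_coords(const char *path, const struct coord_record *pts, uint32_t n)
    {
        struct coord_header hdr = { {'C', 'A', 'D', '1'}, 1, n };
        FILE *f = fopen(path, "wb");
        if (!f)
            return -1;
        if (fwrite(&hdr, sizeof hdr, 1, f) != 1 ||
            fwrite(pts, sizeof *pts, n, f) != n) {
            fclose(f);
            return -1;
        }
        return fclose(f);
    }

A matching reader would check the magic bytes and the version number before trusting the rest of the file, which is what lets the format grow later without breaking old readers.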
Related
I would like to store a computer program in a database instead of a number of text files. It should contain the structure and all objects, methods, dependencies etc. of the program. I do not want to store a specific language in the database but some kind of "meta" programming language. In a second step I would like to transform/export this structure in the database into either source code of a classic language (C#, Java, etc.) or compile it directly for CLR/JVM.
I think I am not the first person with this idea. I searched the internet and I think what I am looking for is called "source code in a database (SCID)" - unfortunately I could not find an implementation of this idea.
So my question is:
Is there any program that stores "meta" program code inside of a database and lets you generate traditional text source code from it that can be compiled/executed?
Short remarks:
- It can also be a noSQL database
- I currently don't care how the program is imported/entered into the database
It sounds like you're looking for some kind of common markup language that adequately describes the common semantics of each target language - e.g. objects, functions, inputs, return values, etc.
This is less about storing in a database, and more about having a single (I imagine, XML-like) structure that can subsequently be parsed and eval'd by the target language to produce native source/bytecode. If there was such a thing, storing it in a database would be trivial -- that's not the hard part. Even a key/value database could handle that.
The hard part will be finding something that can abstract away the nuances of multiple languages and attempt to describe them in a common format.
Similar questions have already been asked, without satisfying solutions.
It may be that you don't need the full source, but instead just a description of the runtime data-- formats like XML and JSON are intended exactly for this purpose and provide a simplified description of Objects that can be parsed and mapped to native equivalents, with your source code built around the dynamic parsing of that data.
It may be possible to go further in certain languages. For example, if your language of choice converts to bytecode first, you might technically be able to store the binary bytecode in a BLOB and then run it directly. Languages that offer reflection and dynamic evaluation can probably handle this -- then your DB is simply a wrapper for storing that data on compilation, and retrieving it prior to running it. That'd depend on your target language and how compilation is handled.
Of course, if you're only working with interpreted languages, you can simply store the full source and eval it (in whatever manner is preferred by the target language).
If you give more info on your intended use case, I'm sure you'll get some decent suggestions on how to handle it without having to invent a sourcecode Babelfish.
We have millions of simple txt documents containing various data structures that we extracted from PDFs. The text was printed line by line, so all formatting is lost (when we tried tools that preserve the formatting, they just messed it up). We need to extract the fields and their values from these text documents, but there is some variation in the structure of the files (a new line here and there, noise on some sheets so spellings are incorrect).
I was thinking we would create some sort of template structure with information about the coordinates (line and word number) of the keywords and values, and use that information to locate and collect the keyword values, using various algorithms to make up for the inconsistent formatting.
Is there any standard way of doing this? Any links that might help? Any other ideas?
The noise can be corrected or ignored by using fuzzy text matching tools like agrep: http://www.tgries.de/agrep/
However, the problem with extra new-lines will remain.
One technique I would suggest is to limit error propagation in a similar way to how compilers do it. For example, you try to match your template or a pattern and fail. Later on in the text there is a sure match, but it might be part of the current unmatched pattern.
In this case, the sure match should be accepted and the chunk of text that was not matched should be set aside for future processing. This will enable you to skip errors that are too hard to parse.
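If you end up needing the fuzzy matching inside your own code rather than through agrep, a plain edit-distance check on tokens is enough to illustrate the idea. This is only a rough, unoptimized sketch (agrep does the same thing far more efficiently), and the keyword and threshold are just examples:

    #include <stdio.h>
    #include <string.h>

    /* Classic Wagner-Fischer edit distance; fine for short keywords. */
    static int edit_distance(const char *a, const char *b)
    {
        size_t la = strlen(a), lb = strlen(b);
        static int d[128][128];              /* assumes strings shorter than 128 chars */
        for (size_t i = 0; i <= la; i++) d[i][0] = (int)i;
        for (size_t j = 0; j <= lb; j++) d[0][j] = (int)j;
        for (size_t i = 1; i <= la; i++) {
            for (size_t j = 1; j <= lb; j++) {
                int sub = d[i - 1][j - 1] + (a[i - 1] != b[j - 1]);
                int del = d[i - 1][j] + 1;
                int ins = d[i][j - 1] + 1;
                int best = sub < del ? sub : del;
                d[i][j] = best < ins ? best : ins;
            }
        }
        return d[la][lb];
    }

    /* Accept an OCR'd token as a keyword if it is within max_err edits of it. */
    static int keyword_matches(const char *token, const char *keyword, int max_err)
    {
        return edit_distance(token, keyword) <= max_err;
    }

    int main(void)
    {
        /* "Invoice" misread as "Invo1ce" still matches with one error allowed. */
        printf("%d\n", keyword_matches("Invo1ce", "Invoice", 1));
        return 0;
    }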
Larry Wall's Perl is your friend here. This is precisely the sort of problem domain at which it excels.
Sed is OK, but for this sort of thing, Perl is the bee's knees.
While I second the recommendations for the Unix command-line and for Perl, a higher-level tool that may help is Google Refine. It is meant to handle messy real-world data.
I would recommend using graph regular expressions here, with very weak rules and a final acceptance predicate. That way you can write fuzzy matching at the token level, then at the line level, and so on.
I suggest the Talend data integration tool. It is open source (i.e. FREE!). It is built on Java and you can customize your data integration project any way you like by modifying the underlying Java code.
I used it and found it very helpful on low-budget, highly complex data integration projects. Here's the link to their web site: Talend
Good luck.
What are the recommended XML parsers for parsing a TMX file (an XML-based tilemap) in C?
What are the pros and cons of each? I would like an efficient one, because it will be running on an embedded system.
We used libxml on an embedded product a while back. It may be right for you.
At a high level, I think you should be looking at an event-based parser rather than a DOM-based one. DOM-based parsers will take up a lot of memory building the XML tree.
Here's a suggestion from a similar question. The top suggestion in that case looks to be one of the earliest XML parsers: Expat.
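For what it's worth, a minimal Expat (event-based) skeleton in C looks something like the following. The element name and file name are just placeholders for whatever your TMX map actually uses; only the state you choose to keep survives in memory:

    #include <stdio.h>
    #include <string.h>
    #include <expat.h>

    /* Called for every opening tag; atts is a NULL-terminated name/value list. */
    static void XMLCALL on_start(void *userData, const XML_Char *name,
                                 const XML_Char **atts)
    {
        (void)userData;
        if (strcmp(name, "layer") == 0) {          /* placeholder TMX element */
            for (int i = 0; atts[i]; i += 2)
                printf("layer attribute %s=%s\n", atts[i], atts[i + 1]);
        }
    }

    static void XMLCALL on_end(void *userData, const XML_Char *name)
    {
        (void)userData;
        (void)name;
    }

    int main(void)
    {
        char buf[4096];
        int done = 0;
        FILE *f = fopen("map.tmx", "rb");          /* placeholder file name */
        XML_Parser p = XML_ParserCreate(NULL);
        if (!f || !p)
            return 1;

        XML_SetElementHandler(p, on_start, on_end);
        while (!done) {
            size_t n = fread(buf, 1, sizeof buf, f);
            done = n < sizeof buf;                 /* EOF (or read error) */
            if (XML_Parse(p, buf, (int)n, done) == XML_STATUS_ERROR) {
                fprintf(stderr, "parse error at line %lu\n",
                        (unsigned long)XML_GetCurrentLineNumber(p));
                break;
            }
        }
        XML_ParserFree(p);
        fclose(f);
        return 0;
    }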
I have written sxmlc as a lightweight, potentially embeddable XML parser in C. See my answer on a related question (also linked by #PPC-Coder).
You can use it as DOM or SAX (to save memory), to parse either files or memory buffers.
How can I convert a shapefile (GIS) to text?
Or, how can I extract the information in a shapefile?
If you are willing to write some code (and you will probably need to anyway, as there is quite a bit of information in a shapefile, not all of it of interest for any given application), check out shapelib. It has bindings for many scripting languages. It also builds dbfdump, an executable that dumps the dbf files, and shpdump, which dumps the shp files.
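As a rough idea of what the shapelib C API looks like when dumping the geometry to text (the file name is just an example, and error handling is minimal):

    #include <stdio.h>
    #include "shapefil.h"   /* ships with shapelib */

    int main(void)
    {
        int count, type, i, v;
        double minb[4], maxb[4];
        SHPHandle shp = SHPOpen("roads.shp", "rb");   /* example file name */
        if (!shp)
            return 1;

        SHPGetInfo(shp, &count, &type, minb, maxb);
        printf("%d shapes, type %d\n", count, type);

        for (i = 0; i < count; i++) {
            SHPObject *obj = SHPReadObject(shp, i);
            if (!obj)
                continue;
            printf("shape %d: %d vertices\n", i, obj->nVertices);
            for (v = 0; v < obj->nVertices; v++)
                printf("  %.6f %.6f\n", obj->padfX[v], obj->padfY[v]);
            SHPDestroyObject(obj);
        }
        SHPClose(shp);
        return 0;
    }

The attribute side lives in the same library's DBF functions (DBFOpen and friends), if I remember the API correctly.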
Also of interest if you program in R is the maptools package.
Mapwindow (http://mapwindow.org/) is free, open source, and has a convert shp to csv feature.
The csv files it produces are a little strange, but you should be able to manage.
Windows only.
Look into the GDAL/OGR bindings.
I wrote a small app that can convert your shapefile to KML. Not exactly text but human readable and shareable across map applications. http://www.reimers.dk/files/folders/google_maps/entry328.aspx
Try the MyGeodata GIS Data Formats and Coordinate Systems Converter (online). It uses the OGR library mentioned here and allows you to convert the most common GIS formats to any other GIS format - so in your case, for example, from ESRI Shapefile to GeoJSON, GML, CSV or other text-based formats.
There is a web page to view the contents:
http://webprocessresults.com/pgsActv/ShpDump.aspx
The same site offers a free desktop (windows) version:
http://webprocessresults.com/svcs/FreeStuff/FreeStuff.aspx
If you are using ArcGIS and the shapefile is composed of several files, you can try just opening the .dbf file in Excel and saving it in another format (e.g. csv). I've done this plenty of times and haven't had any ill effects, and it's a pretty quick and easy way of converting your shapefiles or doing any drastic edits: save as csv, then reimport back into GIS and save as a new shapefile. I will say this is an inelegant solution though ;)
You can find most of the shp (Shape File) format detailed here: http://en.wikipedia.org/wiki/Shapefile. The full specification is here: http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf.
The shp file format is very simple, but be careful: the length fields are counted in 16-bit words, not bytes. If you forget this you will spend some time debugging what is going wrong when trying to parse out the records.
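For example, a header reader along these lines (just a sketch; the file name is an example) has to convert the big-endian length from 16-bit words to bytes:

    #include <stdio.h>
    #include <stdint.h>

    static uint32_t be32(const unsigned char *p)
    {
        return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
               (uint32_t)p[2] << 8  | (uint32_t)p[3];
    }

    static uint32_t le32(const unsigned char *p)
    {
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    }

    int main(void)
    {
        unsigned char hdr[100];                 /* the main file header is 100 bytes */
        FILE *f = fopen("example.shp", "rb");   /* example file name */
        if (!f || fread(hdr, 1, sizeof hdr, f) != sizeof hdr)
            return 1;

        uint32_t file_code  = be32(hdr);        /* big-endian, should be 9994  */
        uint32_t length_w   = be32(hdr + 24);   /* big-endian, in 16-bit words */
        uint32_t shape_type = le32(hdr + 32);   /* little-endian               */

        printf("file code:  %u\n", file_code);
        printf("file bytes: %u\n", length_w * 2);   /* words -> bytes */
        printf("shape type: %u\n", shape_type);
        fclose(f);
        return 0;
    }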
The dbf generally contains information associated with each shape. You can also parse the dbf file, but you will have to roll your own reader. I have done it before, but the easiest approach may be to load the dbf into some spreadsheet application, save it as a csv file, and then load that. Also, if I remember correctly, you have to be careful, as some of the sites out there detailing the dbf format can be a little off. It had something to do with a different version where some fields are slightly different. So if you are rolling your own and you get stuck, be mindful that you may be reading it correctly but it differs from the specification you are using. I think the solution was to return to Google, search up some different docs, and finally find one that detailed the version I was reading.
The shp and dbf are linked by the record index. The first record in the shp is linked with the first record in the dbf and so forth.
You can fairly easily find format specifications for dbf such as here: http://www.clicketyclick.dk/databases/xbase/format/index.html. If you are willing to roll your own it will not be too much of a project.
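If you do roll your own dbf reader, the fixed part of the header is small. Something like the following sketch (field descriptors and record parsing omitted; the file name is an example) gets you the record count and sizes:

    #include <stdio.h>
    #include <stdint.h>

    /* All multi-byte integers in the dbf header are little-endian. */
    static uint16_t le16(const unsigned char *p)
    {
        return (uint16_t)(p[0] | (p[1] << 8));
    }

    static uint32_t le32(const unsigned char *p)
    {
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    }

    int main(void)
    {
        unsigned char hdr[32];
        FILE *f = fopen("example.dbf", "rb");   /* example file name */
        if (!f || fread(hdr, 1, sizeof hdr, f) != sizeof hdr)
            return 1;

        printf("records:      %u\n", le32(hdr + 4));   /* record count        */
        printf("header bytes: %u\n", le16(hdr + 8));   /* offset of record 0  */
        printf("record bytes: %u\n", le16(hdr + 10));  /* incl. deletion flag */
        fclose(f);
        return 0;
    }

After that, if I remember the spec correctly, the 32-byte field descriptors follow until a 0x0D terminator byte, and each record is fixed-width text preceded by a one-byte deletion flag.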
In any case, whether you choose to roll your own reader for the dbf or the shp, you will need to be mindful of the fields, as some use big-endian and others little-endian byte ordering. I think this only applies to the shp file.
You can open the dbf file and copy its content to another format, for example ODT or XLSX.
To open the dbf file, I recommend using LibreOffice Calc.
If you really only need the content in the shapefile, then looking at the .dbf file will give you a better view. You can open the ".dbf" file directly with any Excel-compatible viewer to look at the content of the shapefile.
We are evaluating technologies that we'll use to store data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20 MB per TU.
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide, I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column, and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding of this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are strictly hierarchical: a unique path describes each one's location (in filesystems without hard links, at least), whereas HDF5 groups form a directed graph, which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute (or anywhere else as a string; you could have lookup tables galore if you wanted).
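As a rough sketch with the HDF5 C API (the file, group and attribute names here are purely illustrative), creating two groups and recording a "parent" link as a path stored in a string attribute:

    #include <string.h>
    #include <hdf5.h>

    int main(void)
    {
        hid_t file   = H5Fcreate("scopes.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t parent = H5Gcreate2(file, "/scope_global", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hid_t child  = H5Gcreate2(file, "/scope_child",  H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Record the parent's path on the child as a fixed-size string attribute. */
        const char *parent_path = "/scope_global";
        hid_t str_type = H5Tcopy(H5T_C_S1);
        H5Tset_size(str_type, strlen(parent_path) + 1);
        hid_t space = H5Screate(H5S_SCALAR);
        hid_t attr  = H5Acreate2(child, "parent_scope", str_type, space,
                                 H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, str_type, parent_path);

        H5Aclose(attr);
        H5Sclose(space);
        H5Tclose(str_type);
        H5Gclose(child);
        H5Gclose(parent);
        H5Fclose(file);
        return 0;
    }

HDF5 also has object references (H5Rcreate and related calls) if you want something stronger than a stored path, though I haven't used them myself.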
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know of a project that used to write to both an Oracle database and HDF5. Eventually they removed the Oracle output, since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited to the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.