I have a lot of XSD files with different complex types in them. I want to import data into my Oracle database, but the amount of data is so huge that I can't use xsd2db or Altova XMLSpy; it's blowing my mind. I'm looking for a simple, useful ETL tool that can help me with this. Does anyone know of a GUI tool to generate DDL from an XSD?
This is a follow-up to my comment; I am not positioning this as an answer, but it should help you understand more about what it is you are after, and maybe what you can do about it. It should certainly make a good example for @a_horse_with_no_name...
I am not familiar with XMLSpy, but what I saw in xsd2db made me think of the .NET ability to infer a DataSet from an XML Schema. While the authoring style of the XSD itself may affect the way a DataSet is derived, that would be mostly insignificant for larger bodies of XSD. What's more, there is a good chance that the derivation might not work at all (there are actually quite a few limitations).
From my own experience, the derivation process in .NET gives you a very normalized structure. To illustrate, I am going to introduce a sample XML:
<ShippingManifest>
  <Date>2012-11-21</Date>
  <InvoiceNumber>123ABC</InvoiceNumber>
  <Customer>
    <FirstName>Sample</FirstName>
    <LastName>Customer</LastName>
  </Customer>
  <Address>
    <UnitNumber>2A</UnitNumber>
    <StreetNumber>123</StreetNumber>
    <StreetName>A Street</StreetName>
    <Municipality>Toronto</Municipality>
    <ProvinceCode>ON</ProvinceCode>
    <PostalCode>X9X 9X9</PostalCode>
  </Address>
  <PackingList>
    <LineItem>
      <ID>Box1</ID>
      <Load>1-233</Load>
      <Description>Package box</Description>
      <Items>22</Items>
      <Cartons>22</Cartons>
      <Weight>220</Weight>
      <Length>10</Length>
      <Width>10</Width>
      <Height>10</Height>
      <Volume>1000</Volume>
    </LineItem>
    <LineItem>
      <ID>Box2</ID>
      <Load>456-233</Load>
      <Description>Package box</Description>
      <Items>22</Items>
      <Cartons>22</Cartons>
      <Weight>220</Weight>
      <Length>10</Length>
      <Width>10</Width>
      <Height>10</Height>
      <Volume>1000</Volume>
    </LineItem>
  </PackingList>
</ShippingManifest>
Conceptually, its structure is very simple: a shipping manifest entity, a customer, a shipping address and a packing list.
Converting this to an ADO.NET DataSet is a straightforward exercise, with a very clean output.
It should be easy to imagine how the number of entities (tables in your database, if you wish) could mushroom for even slightly more complex XML...
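As a rough illustration only (the table and column names below are mine, not what the .NET derivation literally emits), the normalized relational shape of the sample above ends up looking something like this:

CREATE TABLE shipping_manifest (
    manifest_id    INT PRIMARY KEY,
    manifest_date  DATE,
    invoice_number VARCHAR(20)
);

CREATE TABLE customer (
    customer_id INT PRIMARY KEY,
    manifest_id INT REFERENCES shipping_manifest (manifest_id),
    first_name  VARCHAR(50),
    last_name   VARCHAR(50)
);

-- address follows the same pattern as customer

CREATE TABLE packing_list (
    packing_list_id INT PRIMARY KEY,
    manifest_id     INT REFERENCES shipping_manifest (manifest_id)
);

CREATE TABLE line_item (
    line_item_id    INT PRIMARY KEY,
    packing_list_id INT REFERENCES packing_list (packing_list_id),
    item_id         VARCHAR(20),   -- the <ID> element
    load_ref        VARCHAR(20),   -- the <Load> element
    description     VARCHAR(100),
    items           INT,
    cartons         INT,
    weight          INT,
    length          INT,
    width           INT,
    height          INT,
    volume          INT
);

Five tables (counting address) for a conceptually very simple document.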
As a sidebar, if one designs the XSD with a DataSet-based process in mind, then removing the PackingList element and letting the LineItem collection repeat directly under ShippingManifest gives a somewhat simplified layout: one without the PackingList entity.
Automatic tools to convert an XSD data model to a relational model, such as .NET's, are typically designed to generate a highly normalized structure. The denormalization, I guess, is left to the user for obvious reasons.
QTAssistant's XML Builder is different. Our requirement was to create an ER model that would work where .NET's XSD-to-DataSet conversion doesn't, with an output containing a smaller number of entities where possible. For the same sample, what QTAssistant does is merge all entities engaged in a one-to-one relationship. From a modeling perspective it's an obvious sin, but it does have its benefits, particularly for our users who are interested in a simple structure capable of capturing data (test data, to be more specific).
The generated mapping (XSD to ER) is bi-directional. It means that it can be used to generate valid XML from a database, or "shred" XML data into a database (the shredding is done by generating DML statements). The way this technology is used: test cases are stored in Excel spreadsheets, XML is generated, sent to a web service, and the results are stored back in Excel.
We also generate an XML file describing the structure, which can be converted to DDL through an XSLT. And this is where things can get messy, depending on your schema. It is rather common to see XSDs where simple types are unconstrained: strings without a maxLength facet, or with patterns but no maximum length; unconstrained decimals, etc. These are just some of the reasons why, in our case, we don't have an out-of-the-box, straightforward way to generate the DDL, but rather provide hooks for customizations.
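To make that concrete, here is a purely illustrative sketch (the table, column, and type choices are mine) of the kind of guesswork an XSD-to-DDL step is forced into when the simple types are unconstrained:

-- an xs:string with no maxLength gives the generator nothing to work with:
-- it has to pick an arbitrary width or fall back to a LOB
CREATE TABLE customer (
    first_name VARCHAR2(4000),  -- guessed width: no maxLength facet in the XSD
    last_name  CLOB,            -- or give up and use a LOB
    balance    NUMBER           -- unconstrained xs:decimal: no precision/scale to map
);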
So, to close my comment, I pretty much know what you want to do (I have to assume that other things, such as Oracle's XML capabilities, or XML databases and XQuery, etc. have been ruled out). Unfortunately, the XSD really matters here, so if you can share those as per my comment, I can take a look - it'll be up to you how much you want to share back here.
We have a table in our database that stores XML in one of the columns. The XML in each row is always in one of exactly 3 different formats, received via web service responses. We need to look up information in this table (and inside the XML field) very frequently. Is this a poor use of the XML datatype?
My suggestion is to create separate tables for each different XML structure, as we are only talking about 3, with a growth rate of maybe one new table a year.
I suppose ultimately this is a matter of preference, but here are some reasons I prefer not to store data like that in an XML field:
Writing queries against XML in TSQL is slow. Might not be too bad for a small amount of data, but you'll definitely notice it with a decent amount of data.
Sometimes there is special logic needed to work with an XML blob. If you store the XML directly in SQL, then you find yourself duplicating that logic all over. I've seen this before at a job where the guy that wrote the XML to a field was long gone and everyone was left wondering how exactly to work with it. Sometimes elements were there, sometimes not, etc.
Similar to (2), in my opinion it breaks the purity of the database. In the same way that a lot of people would advise against storing HTML in a field, I would advise against storing raw XML.
But despite these three points ... it can work and TSQL definitely supports queries against it.
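For what it's worth, here is a sketch of what those lookups tend to look like in T-SQL; the table, column, and element names are invented for illustration, and the column is assumed to be of the xml type. The queries work, they are just not cheap at volume:

SELECT r.ResponseId,
       r.Payload.value('(/Response/Customer/Name)[1]', 'varchar(100)') AS CustomerName
FROM dbo.ServiceResponse AS r
WHERE r.Payload.exist('/Response[Status = "OK"]') = 1;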
Are you reading the field more than you are writing it?
You want to do the conversion on whichever step you do least often or the step that doesn't involve the user.
Here is something I've wondered for quite some time, and have not seen a real (good) solution for yet. It's a problem I imagine many games having, and that I can't easily think of how to solve (well). Ideas are welcome, but since this is not a concrete problem, don't bother asking for more details - just make them up! (and explain what you made up).
Ok, so, many games have the concept of (inventory) items, and often there are hundreds of different kinds of items, frequently with very different data structures - some items are very simple ("a rock"), others can have insane complexity or data behind them ("a book", "a programmed computer chip", "a container with more items"), etc.
Now, programming something like that is easy - just have everything implement an interface, or maybe extend an abstract root item. Since objects in the programming world don't have to look the same on the inside as on the outside, there is really no issue with how much and what kind of private fields any type of item has.
But when it comes to database serialization (binary serialization is of course no problem), you face a dilemma: how would you represent that in, say, a typical SQL database?
Some attempts at a solution that I have seen, none of which I find satisfying:
Binary serialization of the items; the database just holds an ID and a blob.
Pros: takes like 10 seconds to implement.
Cons: basically sacrifices every database feature; hard to maintain; near impossible to refactor.
A table per item type.
Pros: clean, flexible.
Cons: with a wide variety of items come hundreds of tables, and every search for an item has to query them all, since SQL doesn't have the concept of a table/type 'reference'.
One table with a lot of fields that aren't used by every item.
Pros: takes like 10 seconds to implement; still searchable.
Cons: wastes space, hurts performance, and it's hard to tell from the database which fields are actually in use.
A few tables with a few 'base profiles' for storage, where similar items get thrown together and use the same fields for different data.
Pros: I've got nothing.
Cons: wastes space, hurts performance, and it's hard to tell from the database which fields are actually in use.
What ideas do you have? Have you seen another design that works better or worse?
It depends on whether you need to sort, filter, count, or analyze those attributes.
If you use EAV, then you will screw yourself nicely. Try doing reports on an EAV schema.
The best option is to use Table Inheritance:
-- column types here are placeholders; the point is one shared base table plus
-- one table per specialized product type
CREATE TABLE PRODUCT (
    id   INT PRIMARY KEY,
    type VARCHAR(20) NOT NULL,
    att1 VARCHAR(100)
);

CREATE TABLE PRODUCT_X (
    id   INT PRIMARY KEY REFERENCES PRODUCT (id),
    att2 VARCHAR(100),
    att3 VARCHAR(100)
);

CREATE TABLE PRODUCT_Y (
    id   INT PRIMARY KEY REFERENCES PRODUCT (id),
    att4 VARCHAR(100),
    att5 VARCHAR(100)
);
For attributes that you don't need to search/sort/analyze, use a blob or XML.
I have two alternatives for you:
One table for the base type and supplemental tables for each “class” of specialized types.
In this schema, properties common to all “objects” are stored in one table, so you have a unique record for every object in the game. For special types like books, containers, usable items, etc, you have another table for each unique set of properties or relationships those items need. Every special type will therefore be represented by two records: the base object record and the supplemental record in a particular special type table.
PROS: You can use column-based features of your database like custom domains, checks, and xml processing; you can have simpler triggers on certain types; your queries differ exactly at the point of diverging concerns.
CONS: You need two inserts for many objects.
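As an illustration of that con (the table and column names here are invented, just to show the shape): creating one specialized object means touching two tables, ideally inside a single transaction:

-- base row shared by every object, plus the supplemental "book" row
BEGIN;
INSERT INTO game_object (object_id, name, weight) VALUES (1001, 'Dusty Tome', 2);
INSERT INTO book (object_id, title, page_count) VALUES (1001, 'Dusty Tome', 320);
COMMIT;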
Use a “kind” enum field and a JSONB-like field for the special type data.
This is kind of like your #1 or #3, except with some database help. Postgres added JSONB, giving you an improvement over the old EAV pattern. Other databases have a similar complex field type. In this strategy you roll your own mini schema that you stash in the JSONB field. The kind field declares what you expect to find in that JSONB field.
PROS: You can extract special type data in your queries; you can add check constraints and have a simple schema to deal with; you can benefit from indexing even though your data is heterogeneous; your queries and inserts are simple.
CONS: Your data types within JSONB-like fields are pretty limited and you have to roll your own validation.
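A minimal Postgres sketch of this second alternative, with invented names and attributes:

CREATE TABLE item (
    id   SERIAL PRIMARY KEY,
    kind TEXT NOT NULL,              -- 'rock', 'book', 'container', ...
    data JSONB NOT NULL DEFAULT '{}' -- the per-kind "mini schema"
);

-- a GIN index keeps queries into the JSONB usable at scale
CREATE INDEX item_data_idx ON item USING gin (data);

INSERT INTO item (kind, data)
VALUES ('book', '{"title": "Spell Basics", "pages": 120}');

-- pull special-type data straight out in a query
SELECT id, data->>'title' AS title
FROM item
WHERE kind = 'book' AND (data->>'pages')::int > 100;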
Yes, it is a pain to design database schemas like this. I'm designing a notification system and ran into the same problem. My notification system is, however, less complex than yours - the data it holds is at most IDs and usernames. My current solution is a mix of 1 and 3 - I serialize the data that differs between notifications, and use columns for the 2 usernames (some notifications have 2, some have 1). I shy away from method 2 because I hate that design, but that's probably just me.
However, if you can afford it, I would suggest thinking outside the realm of the RDBMS - it sounds like a non-relational store (especially a key/value one) may be a better fit for this data, especially if one item differs a lot from the next.
I'm sure this has been asked here a million times before, but in addition to the options you have discussed in your question, you can look at an EAV schema, which is very flexible but has its own set of cons.
Another alternative is database systems which are not relational. There are object databases as well as various key/value stores and document databases.
Typically all these things break down to some extent when you need to query against the flexible attributes. This is kind of an intrinsic problem, however. Conceptually, what does it really mean to accurately query things that are unstructured?
First of all, do you actually need the concurrency, scalability and ACID transactions of a real database? Unless you are building an MMO, your game structures will likely fit in memory anyway, so you can search and otherwise manipulate them there directly. In a scenario like this, the "database" is just a store for serialized objects, and you can replace it with the file system.
If you conclude that you do (need a database), then the key is figuring out what "atomicity" means from the data-management perspective.
For example, if a game item has a bunch of attributes, but none of these attributes are manipulated individually at the database level (even though they could well be at the application level), then it can be considered as "atomic" from the data management perspective. OTOH, if the item needs to be searched on some of these attributes, then you'll need a good way to index them in the database, which typically means they'll have to be separate fields.
Once you have identified attributes that should be "visible" versus the attributes that should be "invisible" from the database perspective, serialize the latter to BLOBs (or whatever), then forget about them and concentrate on structuring the former.
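A small sketch of that split, with invented attribute names: the searchable attributes become ordinary indexed columns, and everything else is serialized into a blob the database never looks inside.

CREATE TABLE item (
    item_id   INT PRIMARY KEY,
    item_type VARCHAR(30),  -- "visible": searched and indexed
    owner_id  INT,
    weight    INT,
    detail    BYTEA         -- "invisible": the rest of the object, serialized (BLOB elsewhere)
);

CREATE INDEX item_type_idx  ON item (item_type);
CREATE INDEX item_owner_idx ON item (owner_id);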
That's where the fun starts and you'll probably need to use "all of the above" strategy for reasonable results.
BTW, some databases support "deep" indexes that can go into heterogeneous data structures. For example, take a look at Oracle's XMLIndex, though I doubt you'll use Oracle for a game.
You seem to be trying to solve this for a gaming context, so maybe you could consider a component-based approach.
I have to say that I personally haven't tried this yet, but I've been looking into it for a while and it seems to me something similar could be applied.
The idea would be that all the entities in your game would basically be a bag of components. These components can be Position, Energy or for your inventory case, Collectable, for example. Then, for this Collectable component you can add custom fields such as category, numItems, etc.
When you're going to render the inventory, you can simply query your entity system for items that have the Collectable component.
How can you save this into a DB? You can define the components independently in their own table and then for the entities (each in their own table as well) you would add a "Components" column which would hold an array of IDs referencing these components. These IDs would effectively be like foreign keys, though I'm aware that this is not exactly how you can model things in relational databases, but you get the idea.
Then, when you load the entities and their components at runtime, based on the component being loaded you can set the corresponding flag in their bag of components so that you know which components this entity has, and they'll then become queryable.
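As a hedged sketch of how that might land in plain SQL (names invented; a join table stands in for the "Components" array column, since a relational database can't put foreign keys on array elements):

CREATE TABLE entity (
    id   INT PRIMARY KEY,
    name VARCHAR(50)
);

CREATE TABLE component (
    id   INT PRIMARY KEY,
    type VARCHAR(30),    -- 'Position', 'Energy', 'Collectable', ...
    data VARCHAR(4000)   -- component-specific fields, serialized
);

CREATE TABLE entity_component (
    entity_id    INT REFERENCES entity (id),
    component_id INT REFERENCES component (id),
    PRIMARY KEY (entity_id, component_id)
);

-- "query your entity system for items that have the Collectable component"
SELECT e.*
FROM entity e
JOIN entity_component ec ON ec.entity_id = e.id
JOIN component c ON c.id = ec.component_id
WHERE c.type = 'Collectable';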
Here's an interesting read about component-based entity systems.
I have seen a lot of topics asking about the choice of a database for a voting mechanism, but my inputs are a bit different. I have an application with a GUI that can contain multiple fields, radio buttons, or a combination of the above. The GUI is not fixed. Based on the form submitted, the answer XML is dynamically generated.
Thus, for a given form there can be 10,000 different people submitting it, and I will have 10,000 different forms (these numbers will increase).
I now have the following 2 options: store every XML as-is in the database (I have not yet chosen between a relational DB and a NoSQL DB like MongoDB), or parse the XML and create tables for every form. That way the number of tables will be huge.
Now, I have to build a voting mechanism which basically looks at all the XMLs that have been generated for a particular form, i.e. 10,000 XMLs, extracts the answers submitted (note: the XML is complex because 1 form can have multiple answer elements), and then tallies how many people have given the same answer.
My Questions:
Should I use a relational DB or NoSQL (MongoDB/Redis or similar)?
Do I need to save the XML documents as-is in the DB, or should I parse them, convert them to tables, and save that? Is there any other approach I could follow?
I am using Java/J2EE for development currently.
If your question is about how to store data of variable structure, then a document database would be pretty handy. As it is schema-less, there will be no issues with maintaining RDBMS columns.
Logically this is pretty similar to storing XML in a relational DB. The difference is that with the RDBMS approach, each database reader needs a special XML parsing layer. (On XML, see also Why would I ever choose to store and manipulate XML in a relational database?.)
In general, if you're planning to have a single database client, you can get away with XML in an RDBMS.
By the way, instead of storing XML, you can use an RDBMS in another way: define a "generic" structure. For example, you can have an "Entities (name, type, id)" table and an "Attributes (entityId, name, type, value)" table.
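A minimal sketch of that generic layout (the column types are placeholders):

CREATE TABLE entities (
    id   INT PRIMARY KEY,
    name VARCHAR(100),
    type VARCHAR(50)
);

CREATE TABLE attributes (
    entity_id INT REFERENCES entities (id),  -- "entityId" in the description above
    name      VARCHAR(100),
    type      VARCHAR(50),
    value     VARCHAR(4000)
);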
If you store XML in the DB, you gain flexibility at the cost of performance and maintainability (XML parsing with XPath etc. can be verbose and error-prone, especially with complex and deeply nested XML structures).
If you create tables for each XML, you gain performance and ease of use, but trade away flexibility and take on complexity (a huge number of tables).
Pick a hybrid approach. Store the XMLs in an RDBMS using a generic structure (as suggested in one of the answers). This way you have fewer tables (less complexity) and avoid the performance issues of XML parsing.
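Assuming the submissions end up shredded into the generic Entities/Attributes layout sketched above (the names here are illustrative), the vote count then becomes a plain GROUP BY rather than repeated XML parsing:

-- how many people gave each answer to question 'q1' of form 42
SELECT a.value AS answer, COUNT(*) AS votes
FROM entities e
JOIN attributes a ON a.entity_id = e.id
WHERE e.type = 'form-42-submission'
  AND a.name = 'q1'
GROUP BY a.value
ORDER BY votes DESC;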
I work for a billing service that uses some complicated mainframe-based billing software for its core services. We have all kinds of codes we set up that are used for tracking things: payment codes, provider codes, write-off codes, etc. Each type of code has a completely different set of data items that control what the code does and how it behaves.
I am tasked with building a new system for tracking changes made to these codes. We want to know who requested what code, who/when it was reviewed, approved, and implemented, and what the exact setup looked like for that code. The current process only tracks two of the different types of code. This project will add immediate support for a third, with the goal of also making it easy to add additional code types into the same process at a later date. My design conundrum is that each code type has a different set of data that needs to be configured with it, of varying complexity. So I have a few choices available:
I could give each code type its own table(s) and build them independently. Considering we only have three codes I'm concerned about at the moment, this would be simplest. However, this concept has already failed or I wouldn't be building a new system in the first place. It's also weak in that writing generic presentation-level code to display request data for any code type (even those not yet implemented) is not trivial.
Build a DB schema capable of storing the data points associated with each code type: not only the values, but what type they are and how they should be displayed (dropdown list from an enum of some kind). I have a decent DB schema for this started, but it just feels wrong: overly complicated to query and maintain, and it ultimately requires a custom query to view the full data in a nice tabular form for each code type anyway.
Storing the data points for each code request as XML. This greatly simplifies the database design and will hopefully make it easier to build the interface: just set up a schema for each code type, then have code that validates requests against their schema, transforms a schema into display widgets, and maps an actual request item onto the display. What this option lacks is a way to handle changes to the schema.
My questions are: how would you do it? Am I missing any big design options? Any other pros/cons to those choices?
My current inclination is to go with the xml option. Given the schema updates are expected but extremely infrequent (probably less than one per code type per 18 months), should I just build it to assume the schema never changes, but so that I can easily add support for a changing schema later? What would that look like in SQL Server 2000 (we're moving to SQL Server 2005, but that won't be ready until after this project is supposed to be completed)?
[Update]:
One reason I'm thinking xml is that some of the data will be complex: nested/conditional data, enumerated drop down lists, etc. But I really don't need to query any of it. So I was thinking it would be easier to define this data in xml schemas.
However, le dorfier's point about introducing a whole new technology hit very close to home. We currently use very little xml anywhere. That's slowly changing, but at the moment this would look a little out of place.
I'm also not entirely sure how to build an input form from a schema, and then merge a record that matches that schema into the form in an elegant way. It will be very common to only store a partially-completed record and so I don't want to build the form from the record itself. That's a topic for a different question, though.
Based on all the comments so far Xml is still the leading candidate. Separate tables may be as good or better, but I have the feeling that my manager would see that as not different or generic enough compared to what we're currently doing.
There is no simple, generic solution to a complex, meticulous problem. You can't have both simple storage and simple app logic at the same time. Either the database structure must be complex, or else your app must be complex as it interprets the data.
I outline five solutions to this general problem in "Product table, many kinds of product, each product has many parameters."
For your situation, I would lean toward Concrete Table Inheritance or Serialized LOB (the XML solution).
The reason that XML might be a good solution is that:
You don't need to use SQL to pick out individual fields; you're always going to display the whole form.
Your XML can annotate fields for data type, user interface control, etc.
But of course you need to add code to parse and validate the XML. You should use an XML schema to help with this. In which case you're just replacing one technology for enforcing data organization (RDBMS) with another (XML schema).
You could also use an RDF solution instead of an RDBMS. In RDF, metadata is queryable and extensible, and you can model entities with "facts" about them. For example:
Payment code XYZ contains attribute TradeCredit (Net-30, Net-60, etc.)
Attribute TradeCredit is of type CalendarInterval
Type CalendarInterval is displayed as a drop-down
.. and so on
Re your comments: Yeah, I am wary of any solution that uses XML. To paraphrase Jamie Zawinski:
Some people, when confronted with a problem, think "I know, I'll use XML." Now they have two problems.
Another solution would be to invent a little Domain-Specific Language to describe your forms. Use that to generate the user-interface. Then use the database only to store the values for form data instances.
Why do you say "this concept has already failed or I wouldn't be building a new system in the first place"? Is it because you suspect there must be a scheme for handling them in common?
Otherwise, I'd say to continue the existing philosophy and establish additional tables. At least it would be sharing an existing pattern and maintaining some consistency in that respect.
Do a web search on "generalized specialized relational modeling". You'll find articles on how to set up tables that store the attributes of each kind of code, and the attributes common to all codes.
If you’re interested in object modeling, just search on “generalized specialized object modeling”.
We are evaluating technologies to store the data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20 MB per TU.
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide, I understand that HDF5 is similar to XML or a DB in that information is associated with a tag/column, so a tool built to read an older structure will just ignore the fields it is not concerned with? Is my understanding of this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical, meaning a unique path describes each one's location (in filesystems without hard links, at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute. (Or anywhere else as a string; you could have lookup tables galore if you wanted.)
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know a project that used to write both to an Oracle database and to HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.