Our product currently works with DOM Documents and uses Saxon to run XQuery/XPath expressions on them. For better performance we are looking to switch from DOM to Saxon's TinyTree.
We have a lot of operations that involve saving the Documents to a database, so we wanted to know: what is the best way to save Saxon's TinyTree to a database? The use case is to save the TinyTree to the database from one instance of the process and load it into another instance running on another machine.
We tried to find a way to serialize/deserialize a TinyTree, but could not find anything other than the XML serialization described at https://saxonica.plan.io/boards/3/topics/4630 that would work across different process instances.
Are there any other suggestions that would save space?
Well, it rather depends on the database. Is this an XML database, or a relational database? Generally you're going to have to serialize the document as lexical XML, unless the database product in question offers some other interface (for example, SAX or DOM or StAX) in which case Saxon has APIs to supply the TinyTree in that format.
If you're trying to get XML from one machine to another, then typically serializing and re-parsing is the way to do it. I'm not sure where the database comes into this.
Note that the forum post you're citing is 17 years old and things have moved on a little bit... The architecture hasn't changed much, though.
If space is your main concern, you might look at EXI (or other binary XML representations). These invariably offer SAX-to-binary and binary-to-SAX interfaces, so they integrate easily with Saxon.
To get XML out of a TinyTree into something that offers a SAX ContentHandler interface, use Processor.writeXdmValue(node, destination) where node is an XdmNode that wraps the TinyTree, and destination is a SAXDestination that wraps the supplied ContentHandler.
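For example, a minimal sketch (the input file name is a placeholder, and a JDK TransformerHandler stands in for the ContentHandler; in practice you would pass the ContentHandler of your EXI encoder or database loader instead):

```java
import java.io.File;

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.SAXDestination;
import net.sf.saxon.s9api.XdmNode;

public class TinyTreeToSax {
    public static void main(String[] args) throws Exception {
        Processor processor = new Processor(false);            // false = Saxon-HE, no license

        // Build the document as a TinyTree (Saxon's default tree model)
        XdmNode doc = processor.newDocumentBuilder()
                               .build(new File("input.xml"));  // placeholder file name

        // Any SAX ContentHandler works here; an EXI encoder's handler would be the
        // interesting choice. A JDK TransformerHandler is used so the example runs as-is.
        SAXTransformerFactory stf =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = stf.newTransformerHandler();
        handler.setResult(new StreamResult(System.out));

        // Push the TinyTree into the ContentHandler as a stream of SAX events
        processor.writeXdmValue(doc, new SAXDestination(handler));
    }
}
```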
I am not asking for opinions, but rather for documentation.
We have a lot of data files (XML, CSV, plain text, etc.) and need to process and data-mine them.
The lead database person suggested using stored procedures to accomplish the task. Basically we have a staging table where the file gets serialized and saved into a CLOB or XML column. From there he suggested further using stored procedures to process the file.
I'm an application developer with a DB background, though more focused on application development, and I might be biased, but putting this logic in the DB seems like a bad idea, and I am unable to find any documentation to prove or disprove what I'd describe as putting a car on a train track to pull a load of freight.
So my questions are:
How well do the DBs (Oracle, DB2, MySQL, SQL Server) perform when we're talking about regular-expression search, search and replace of data in a CLOB, DOM traversal, and recursion? How do they compare to a programming language such as Java, PHP, or C# on the same tasks?
Edit
So what I am looking for is documentation or runtime analysis comparing a particular programming language to a DBMS, in particular for string search and replace, regular-expression search and replace, XML DOM traversal, and memory usage of recursive calls, and especially how well they scale when faced with tens to hundreds of GB of data.
It sounds like you are going to push business logic into the storage layer. For operations like the ones you describe, you should not use the database; you may end up hunting for workarounds to showstoppers or building quirky solutions because of its inflexibility.
Also keep maintainability in mind. How many people will later be able to maintain the solution?
As for speed, with the right programming language you will be able to process the data in multiple threads. In the end, your feeling about the car on the train track is right ;)
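To illustrate the point about doing this in application code rather than in stored procedures, here is a rough sketch in plain Java (the directory names and the regular expression are invented for the example) that runs a regex search-and-replace over a directory of files using a thread pool:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelFileCleaner {

    // Example transformation: normalise dates like 2011/05/03 to 2011-05-03
    private static final Pattern DATE = Pattern.compile("(\\d{4})/(\\d{2})/(\\d{2})");

    public static void main(String[] args) throws Exception {
        Path in = Paths.get("data/incoming");    // placeholder directories
        Path out = Paths.get("data/processed");
        Files.createDirectories(out);

        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        try (Stream<Path> files = Files.list(in)) {
            List<Path> batch = files.filter(Files::isRegularFile).collect(Collectors.toList());
            for (Path file : batch) {
                pool.submit(() -> process(file, out.resolve(file.getFileName())));
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void process(Path source, Path target) {
        try {
            String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
            String cleaned = DATE.matcher(text).replaceAll("$1-$2-$3");
            Files.write(target, cleaned.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```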
It is better to pull the processing logic out of the data layer; profiling your implementation inside the database will be difficult.
If the implementation is done in a programming language, you have the freedom to choose between libraries and compare their performance.
Moreover, you can use frameworks such as Spring Batch (for Java) to process bulk volumes of data as a batch process.
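For reference, a chunk-oriented Spring Batch job for this kind of file ingestion looks roughly like the sketch below (assuming the Spring Batch 4 builder API; the bean names, file path, chunk size, and the trivial processor/writer are made up, and the DataSource/JobRepository wiring is omitted):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
@EnableBatchProcessing
public class FileIngestJobConfig {

    @Bean
    public FlatFileItemReader<String> reader() {
        return new FlatFileItemReaderBuilder<String>()
                .name("lineReader")
                .resource(new FileSystemResource("data/input.csv"))  // placeholder path
                .lineMapper((line, lineNumber) -> line)              // raw line pass-through
                .build();
    }

    @Bean
    public Step parseStep(StepBuilderFactory steps) {
        return steps.get("parseStep")
                .<String, String>chunk(1000)
                .reader(reader())
                .processor(line -> line.trim())                      // parsing/cleansing goes here
                .writer(items -> { /* e.g. insert via a JdbcBatchItemWriter */ })
                .build();
    }

    @Bean
    public Job fileIngestJob(JobBuilderFactory jobs, Step parseStep) {
        return jobs.get("fileIngestJob").start(parseStep).build();
    }
}
```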
I need to upgrade an old system based on Zope, and I need to be able to export the data to something like SQL Server. Does anyone know of a way I can open the Zope DB in .NET, or directly export it to SQL Server?
Thanks,
Kieron
I am a Plone web developer, and Jason Coombs is correct. The ZODB is an object database and contains Python objects. These objects can be Python code, data, metadata, etc., and are stored in a hierarchy. This is very different from the world of SQL tables and stored procedures. (The growing NoSQL movement shows that Zope is not the only one doing this.) Additionally, since these are complex Python objects, you really want to be working on the ZODB with the version of Python it was created with, or make sure that you can do a proper migration. I don't think that you will be able to do this with IronPython.
Without knowing what you are trying to get out of the ZODB, it is hard to give specific advice. As Jason suggested, WebDAV/FTP access to the ZODB might be all that you need. This lets you pull out the basic content of pages or image files, but you may lose much of the more complex data (for example, an event page may not have all of its datetime data included) and you will lose much of the metadata.
Here is how someone migrated from Plone to WordPress:
http://www.len.ro/2008/10/plone-to-wordpress-migration/
There are a number of articles on migrating from one Plone version to another. Some of this information may be useful to you. Stack Overflow is not allowing me to post more links, but search for:
"when the plone migration fails doing content migration only"
"Plone Product ContentMigration"
The first important thing to note is that the Zope Object Database (ZODB) stores Python objects in its hierarchy. Therefore, getting "data" out of the ZODB generally does not make sense outside of the Python language. So to some extent, it really depends on the type of data you want to get out.
If the data you're seeking is files (such as HTML, documents, etc), you might be able to stand up a Zope server and turn on something like WebDAV or FTP and extract the files that way.
The way you've described it, however, I suspect the data you seek is more fine-grained data elements (like numbers or accounts or some such thing). In that case, you'll almost certainly need Python of some kind to extract the data and transform it into some format suitable to import into SQL Server. You might be able to stay inside the .NET world by using IronPython, but to be honest, I would avoid that unless you can find evidence that IronPython works with the ZODB library.
Instead, I suggest making a copy of your Zope installation and Zope instance (so you don't break the running system), and then using the version of Python used by Zope (often installed together) to mount the database and manipulate it into a suitable format. You might even use something like PyODBC to connect to the SQL Server database to inject the data -- or you may punt, export to some file format, and use tools you're more familiar with to import the data.
It's been a while since I've interacted with a ZODB, but I remember this article was instrumental in helping me interact with the ZODB and understand its structure.
Good luck!
I'm writing a CAD (Computer-Aided Design) application. I'll need to ship a library of 3d objects with this product. These are simple objects made up of nothing more than 3d coordinates and there are going to be no more than about 300 of them.
I'm considering using a relational database for this purpose. But given my simple needs, I don't want anything complicated. So far, I'm leaning towards SQLite: it's small, runs within the client process, and is claimed to be fast. Besides, I'm a poor guy and it's free.
But before I commit myself to SQLite, I just want to ask your opinion on whether it is a good choice given my requirements. Also, is there an equivalent alternative that I should try as well before making a decision?
Edit:
I failed to mention earlier that the CAD objects I'll ship are not going to be immutable. I expect the user to edit them (change dimensions, colors, etc.) and save them back to the library. I also expect users to add their own newly created objects. Kindly consider this in your answers.
(Thanks for the answers so far.)
The real thing to consider is what your program does with the data. Relational databases are designed to handle complex relationships between sets of data. However, they're not designed to perform complex calculations.
Also, the amount of data and relative simplicity of it suggests to me that you could simply use a flat file to store the coordinates and read them into memory when needed. This way you can design your data structures to more closely reflect how you're going to be using this data, rather than how you're going to store it.
Many languages provide a mechanism, called serialization, for writing data structures to a file and reading them back in again. Python's pickle is one such library, and I'm sure you can find one for whatever language you use. Basically, just design your classes or data structures as dictated by how they're used by your program, and use one of these serialization libraries to populate instances of that class or data structure.
Edit: The requirement that the structures be mutable doesn't really change my answer - I still think that serialization and deserialization are the best solution to this problem. The fact that users need to be able to modify and save the structures necessitates a bit of planning to ensure that the files are updated completely and correctly, but ultimately I think you'll end up spending less time and effort with this approach than trying to marshal SQLite or another embedded database into doing this job for you.
The only case in which a database would be better is if you have a system where multiple users are interacting with and updating a central data repository, and for a case like that you'd be looking at a database server like MySQL, PostgreSQL, or SQL Server for both speed and concurrency.
You also commented that you're going to be using C# as your language. .NET has support for serialization built in so you should be good to go.
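To make the serialization suggestion concrete, here is a minimal round-trip sketch (shown with Java's built-in object serialization since the idea is language-agnostic; the class and file names are invented, and the same pattern maps onto .NET's built-in serialization or Python's pickle):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class ShapeLibrary {

    // A CAD object is just a named list of 3D coordinates
    static class Shape implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        List<double[]> points = new ArrayList<>();   // each entry is {x, y, z}
        Shape(String name) { this.name = name; }
    }

    static void save(List<Shape> library, String file) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(library);
        }
    }

    @SuppressWarnings("unchecked")
    static List<Shape> load(String file) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (List<Shape>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Shape cube = new Shape("unit-cube");
        cube.points.add(new double[] {0, 0, 0});
        cube.points.add(new double[] {1, 1, 1});

        List<Shape> library = new ArrayList<>();
        library.add(cube);

        save(library, "library.bin");   // after the user edits, save the whole library back
        System.out.println(load("library.bin").get(0).name);
    }
}
```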
I suggest you consider H2; it's really lightweight and fast.
When you say you'll have a library of 300 3D objects, I'll assume you mean objects for your code, not models that users will create.
I've read that object databases are well suited to help with CAD problems, because they're perfect for chasing down long reference chains that are characteristic of complex models. Perhaps something like db4o would be useful in your context.
How many objects are you shipping? Can you define each of these objects and their coordinates in an XML file, basically using a distinct XML file for each object? You could place these XML files in a directory. This can be a simple structure.
I would not use a SQL database. You can easily describe every 3D object with an XML file. Put these files in a directory and pack (zip) the whole thing. If you need easy access to the objects' metadata, you can generate an index file (with just the name or description) so that not every object has to be parsed and loaded into memory (nice if you have something like a library manager); see the sketch below.
There are quick and easy SAX parsers available, and you can easily write an XML writer (or find some free code you can use for this).
Many similar applications use XML today. It's easy to parse and write, human-readable, and doesn't need much space when zipped.
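A rough sketch of that layout, using the JDK's StAX writer and zip support (the element names and the index format are invented for illustration):

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class XmlShapeArchive {

    public static void main(String[] args) throws Exception {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("library.zip"))) {

            // One XML file per object inside the archive
            zip.putNextEntry(new ZipEntry("objects/unit-cube.xml"));
            writeShape(zip, "unit-cube", new double[][] {{0, 0, 0}, {1, 1, 1}});
            zip.closeEntry();

            // A small index entry so a library manager can list objects
            // without parsing every file
            zip.putNextEntry(new ZipEntry("index.txt"));
            zip.write("unit-cube|A 1x1x1 cube\n".getBytes("UTF-8"));
            zip.closeEntry();
        }
    }

    // Writes <object name="..."><point x=".." y=".." z=".."/>...</object>
    private static void writeShape(OutputStream out, String name, double[][] points)
            throws Exception {
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        xml.writeStartDocument("UTF-8", "1.0");
        xml.writeStartElement("object");
        xml.writeAttribute("name", name);
        for (double[] p : points) {
            xml.writeEmptyElement("point");
            xml.writeAttribute("x", String.valueOf(p[0]));
            xml.writeAttribute("y", String.valueOf(p[1]));
            xml.writeAttribute("z", String.valueOf(p[2]));
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.flush();   // flush, but do not close: the zip stream stays open for the next entry
    }
}
```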
I have used SQLite; it's easy to use and easy to integrate with your own objects. But I would prefer a SQL database like SQLite more for applications where you need good search tools for a huge number of data records.
For the specific requirement, i.e. providing a library of objects shipped with the application, a database system is probably not the right answer.
The first thing that springs to mind is that you probably want the file to be updatable, i.e. you need to be able to drop an updated file into the application without changing the rest of the application.
The second thing is that the data you're shipping is immutable - for this purpose, therefore, you don't need the capabilities of a relational DB, just the ability to access a particular model with adequate efficiency.
For simplicity (sort of) an XML file would do nicely as you've got good structure. Using that as a basis you can then choose to compress it, encrypt it, embed it as a resource in an assembly (if one were playing in .NET) etc, etc.
Obviously, SQLite stores its data in a single file per database, so if you have other reasons to need the capabilities of a DB in your storage system then yes, but I'd want to think about the utility of the DB to the app as a whole first.
SQL Server CE is free, has a small footprint (no service running), and is SQL Server compatible
A database file system is a file system that is a database instead of a hierarchy. Not too complex an idea initially, but I thought I'd ask whether anyone has thought about how they might do something like this. What are the issues that a simple plan is likely to miss? My first guess at an implementation would be something like a filesystem for a Linux platform (probably on top of an existing file system), but I really don't know much about how that would be started. It's a passing thought that I doubt I'd ever follow through on, but I'm hoping to at least satisfy my curiosity.
DBFS is a really nice PoC implementation for KDE. Instead of implementing it as a file system directly, it is based on indexing on a traditional file system, and building a new user interface to make the results accessible to users.
The easiest way would be to build it using fuse, with a database back-end.
A more difficult thing to do is to have it as a kernel module (VFS).
On Windows, you could use IFS.
I'm not really sure what you mean with "A database file system is a file system that is a database instead of a hierarchy".
Probably, using "Filesystem in Userspace" (FUSE), as mentioned by Osama ALASSIRY, is a good idea. The FUSE wiki lists a lot of existing projects about database-backed filesystems, as well as filesystems in which you can search by SQL-like queries.
Maybe this is a good starting point for getting an idea how it could work.
It's a basic overview of the Firebird architecture.
Firebird is an open-source RDBMS, so you can take a really deep look inside, too, if you're interested.
It's been a while since you asked this. I'm surprised no one suggested the obvious: look at mainframes and minis, especially IBM i on the iSeries (formerly known as i5/OS and, before that, OS/400).
Using a relational database as a mass data store is relatively easy; Oracle and MySQL can both do it. The catch is that it must be essentially ubiquitous for end-user applications.
So the steps for an app conversion are:
1) Everything in a normal hierarchical filesystem
2) Data in BLOBs with light metadata in the database. File with some catalogue information.
3) Large data in BLOBs, with extensive metadata and complex structures in the database. A file with substantial metadata associated with it that can be essential to understanding the structure.
4) Internal structures of the BLOB exposed in an object <--> relational map with extensive metadata. While there may be an exportable form, the application naturally works with the database; the notion of the file as the repository is lost.
We have a set of applications that work with multiple database engines, including SQL Server and Access. The schemas for each are maintained separately and are not stored in text form, which makes source control difficult. We are interested in moving to a system where the schema is stored in some text-based format (such as XML or YAML) with descriptions of field data types, foreign-key relationships, etc.
When all is said and done, we want to have a single text file in source control that can be used to generate a clean database that works with at least SQL Server and Access (and preferably is capable of working with Oracle, DB2, and other engines).
I'm certain that there are tools or libraries out there that can get us at least part of the way there. For one, I've found Altova MapForce that looks like it may do the trick but I'm interested in hearing about any alternative tools or libraries or even entirely different solutions for those in the same predicament.
Note: The applications are written in C++ and ORM solutions are both not readily available in C++ and would take far too long to integrate into our aging products.
If you don't use an object-relational mapper that does this (and many other things) for you, the easiest way might be to whip up a few structures that define your tables and attributes in some form of (static) code, and write little generators to create actual databases from that description.
That makes it easy for source control, and if you're careful when designing those structures, you can easily re-use them for other DBs if need arises.
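A sketch of what such structures and a small generator might look like (in Java for brevity; the same shapes translate directly to C++ structs, and the type strings here are illustrative rather than a real per-engine mapping):

```java
import java.util.Arrays;
import java.util.List;

public class SchemaGenerator {

    // Minimal, static description of the schema: tables, columns, and types
    record Column(String name, String type, boolean nullable) {}
    record Table(String name, List<Column> columns) {}

    static final List<Table> SCHEMA = Arrays.asList(
        new Table("Customer", Arrays.asList(
            new Column("Id",   "INTEGER", false),
            new Column("Name", "VARCHAR(100)", false))),
        new Table("Invoice", Arrays.asList(
            new Column("Id",         "INTEGER", false),
            new Column("CustomerId", "INTEGER", false),
            new Column("IssuedOn",   "TIMESTAMP", true)))
    );

    // Generate CREATE TABLE statements for one target dialect.
    // A real generator would map abstract types per engine (SQL Server, Access, ...).
    static String toDdl(Table t) {
        StringBuilder sb = new StringBuilder("CREATE TABLE " + t.name() + " (\n");
        for (int i = 0; i < t.columns().size(); i++) {
            Column c = t.columns().get(i);
            sb.append("  ").append(c.name()).append(' ').append(c.type())
              .append(c.nullable() ? "" : " NOT NULL")
              .append(i < t.columns().size() - 1 ? ",\n" : "\n");
        }
        return sb.append(");").toString();
    }

    public static void main(String[] args) {
        SCHEMA.forEach(t -> System.out.println(toDdl(t)));
    }
}
```

The schema description itself (the static structures) is what goes into source control; the generated DDL is a build artifact.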
The consensus when I asked a similar (if rather more naive) question seemed to be to use raw SQL, and to manage the RDBMS dependencies with an additional layer. Good luck.
The tool you're looking for is Liquibase. No support for Access, though...