DataSet size limit as an XML file

We are currently using a DataSet to load and save our data to an XML file, and there is a good possibility that the XML file could grow very large.
Either way, we are wondering whether there is any limit on the size of an XML file, so that the DataSet will not run into issues in the future because of its size. Please advise.
Thanks
N

Well, the OS maximum file size is one thing to consider (although modern operating systems won't have this problem; older ones supported only 2 GB per file, if I recall correctly).
Also, the time you will need to waste on updating the file is enormous.
If you're going for a very, very large file, use a small DB instead (MySQL, SQL Server Express, or SQLite).

DataSets are held entirely in memory, so the limit should be somewhere around the amount of memory the OS can address for a user process.

While doing work for a prior client that involved reading and parsing large XML files (over 2 GB per file), the system choked when trying to use an XML reader. Working with Microsoft, we were ultimately passed on to the person who wrote the XML engine. His recommendation was to read and process the file in smaller chunks, because the reader could not handle loading the entire thing into memory at one time. However, if you are trying to WRITE XML as a stream to a final output .XML file, you should be good to go on most current operating systems that support files over 2 GB.
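As a rough illustration of the "process it in smaller chunks" advice, here is a minimal streaming-read sketch in Java using StAX; the file name and element name are placeholders, not anything from the original system.

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingXmlRead {
    public static void main(String[] args) throws Exception {
        // Pull-parse the file one event at a time instead of loading it all into memory.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("huge-data.xml")) { // hypothetical file
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {      // hypothetical element
                    // Handle one record here, then let it go out of scope.
                }
            }
            reader.close();
        }
    }
}
```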

Related

Implementing a database in a single file

This question is about creating a new single-file database format. I am new to this!
I wonder how SQLite does this: for databases larger than the available memory, SQLite must be reading from certain parts of the file somehow, i.e. reading at position n?
Is this possible with sub-linear runtime complexity? I assume that when SQLite fetches a particular row, it first does an O(log n) index lookup (so it doesn't fetch the entire index) and then fetches the row from a particular location in the file. All of this involves not reading the whole file into memory, but FS methods appear not to provide this functionality.
Is fs.skip(n) [pseudocode] done in O(n), or does the OS skip straight to position n? Theoretically this should be possible, because in the OS files are divided into blocks, and inodes reference 1-3 levels of array-like structures that locate the blocks, so fetching a particular block in a file should be possible in sub-linear time, without reading in the entire file.
I wonder how SQLite does this: for databases larger than the available memory, SQLite must be reading from certain parts of the file somehow, i.e. reading at position n?
Yes. Almost every programming language has documentation that explains how to position the read on a file.
All of this involves not reading the whole file into memory, but FS methods appear not to provide this functionality.
Every file-system access API that I know of supports this, and it is explained in the documentation. Examples range from memory-mapped files in Windows (which are fairly advanced and not an option if you plan to stay OS-agnostic) down to something as simple as the fseek() function in C, which positions a file stream.
I suggest brushing up on your knowledge of file-system access methods in your programming language of choice.
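For example, in Java (just one possible language; the offset and record size below are made-up values), a positioned read looks like this:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class PositionedRead {
    public static void main(String[] args) throws IOException {
        // Jump straight to a byte offset without reading everything before it.
        try (RandomAccessFile file = new RandomAccessFile("data.db", "r")) { // hypothetical file
            long offset = 4096L;           // assumed page/record offset
            byte[] record = new byte[512]; // assumed record size
            file.seek(offset);             // repositions the file pointer; no O(n) scan
            file.readFully(record);
            // ... interpret the record's bytes here
        }
    }
}
```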

LoadRunner: store many XML files' contents in an array and send them one by one

I have five hundred XML files. I want to store the file contents in an array or list, have LoadRunner pull entries from that array, and send the XML files to the application server, but I am not familiar with C.
example:
Result01.xml, Result02.xml, Result03.xml, Result04.xml, Result05.xml, ... Result500.xml
Thanks!
CPU, disk, memory, network: this is your finite resource pool. Attempting, for every single virtual user, to pull all of the XML files into memory is going to drive the memory requirements for each virtual user through the roof, and you will very likely wind up in a swap-of-death situation on your load generator.
Consider storing the names of your files, located on a very fast SSD local to the load generator, in a parameter file. Select a filename from that file at random, read the file from disk, and then submit it as appropriate. This limits your in-memory need to the size of your largest XML file, which you can free() as soon as you are done using it. This does introduce a disk dependency, but note that it is a read-only dependency, and the recommendation to use SSD storage is because of the absurdly high read IOPS on that media, which reduces the window of conflict to an absolute minimum.
This is also a good time to brush up on your C programming. There are lots of great books out there. You need to be proficient with the language of your tool, no matter what the tool is, if you are going to be effective with the tool.
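LoadRunner Vuser scripts are normally written in C, but the shape of the approach above is language-neutral. Here is a rough sketch of the pattern in Java; the list file, paths, and the send() step are placeholders, not LoadRunner APIs.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class RandomFileSender {
    public static void main(String[] args) throws Exception {
        // "Parameter file": one XML file path per line, stored on a fast local SSD.
        List<String> names = Files.readAllLines(Paths.get("xml_files.txt")); // hypothetical list

        // Per iteration: pick one file at random, load only that file, send it, release it.
        String chosen = names.get(ThreadLocalRandom.current().nextInt(names.size()));
        byte[] payload = Files.readAllBytes(Paths.get(chosen));
        send(payload); // placeholder for the actual request to the application server
        // After this point the payload is eligible for collection; only one file
        // is ever held in memory at a time.
    }

    static void send(byte[] payload) { /* submit to the server here */ }
}
```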

Streaming large files in Java

We are streaming files from the server in zip format and writing them into an Oracle BLOB object using piped streams. It works fine for me up to about 300 MB, but I have a requirement to store data greater than 2 GB. When I tried to store 1 GB of data it failed. Please suggest a better way to stream larger files in Java.
-- Thanks in advance
If your code fails around 300 MB, you almost certainly have faulty code. My guess is that your JVM heap size is set to ~512 MB and you only have ~300 MB of free memory for your own purposes. That is more than enough: just stream your file in small chunks (maybe about 1 KiB, or even 1 MiB if you want) and you'll be good to go:
https://stackoverflow.com/a/55788/351861
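A minimal sketch of that chunked copy (the buffer size is the 1 MiB suggested above; wiring the output stream up to an Oracle BLOB is left out):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedCopy {
    // Copy from input to output in fixed-size chunks so that only one small buffer
    // lives in memory at a time, regardless of the total stream size.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024 * 1024]; // 1 MiB chunk
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
        }
        out.flush();
    }
}
```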

Strategy for mass storage of small files

What is a good strategy for mass storage of millions of small files (~50 KB on average) with auto-pruning of files older than 20 minutes? I need to write and access them from the web server.
I am currently using ext4, and during deletes (scheduled in cron) HDD usage spikes up to 100%, with [flush-8:0] showing up as the process that creates the load. This load interferes with other applications on the server. When there are no deletes, maximum HDD utilisation is 0-5%. The situation is the same with nested and non-nested directory structures. The worst part is that mass removal during peak load seems to be slower than the rate of insertion, so the number of files that need to be removed grows larger and larger.
I have tried changing schedulers (deadline, cfq, noop); it didn't help. I have also tried setting ionice on the removal script, but it didn't help either.
I have tried GridFS with MongoDB 2.4.3 and it performs nicely, but it is horrible during mass deletes of old files. I have tried running MongoDB with journaling turned off (nojournal) and without write confirmation for both deletes and inserts (w=0), and it didn't help. It only works fast and smoothly when there are no deletes going on.
I have also tried storing the data in MySQL 5.5, in a BLOB column in an InnoDB table, with InnoDB set to use innodb_buffer_pool=2GB, innodb_log_file_size=1GB, innodb_flush_log_on_trx_commit=2, but performance was worse: HDD load was always at 80%-100% (expected, but I had to try). The table had only the BLOB column, a DATETIME column, and a CHAR(32) latin1_bin UUID column, with indexes on the UUID and DATETIME columns, so there was no room for optimization, and all queries were using the indexes.
I have looked into pdflush settings (Linux flush process that creates the load during mass removal), but changing the values didn't help anything so I reverted to default.
It doesn't matter how often I run the auto-pruning script: every 1 second, every 1 minute, every 5 minutes, or every 30 minutes, it disrupts the server significantly either way.
I have tried storing the inode value and, when removing, removing old files sequentially by sorting them by their inode numbers first, but it didn't help.
Using CentOS 6. The storage is SSD RAID 1.
What would be a good and sensible solution for my task that solves the auto-pruning performance problem?
Deletions are kind of a performance nuisance because both the data and the metadata need to get destroyed on disk.
Do they really need to be separate files? Do the old files really need to get deleted, or is it OK if they get overwritten?
If the answer is "no" to the second of these questions, try this:
Keep a list of files that's roughly sorted by age. Maybe chunk it by file size.
When you want to write to a new file, find an old file that's preferably bigger than what you're replacing it with. Instead of blowing away the old file, truncate() it to the appropriate length and then overwrite its contents. Make sure you update your old-files list.
Clean up the really old stuff that hasn't been replaced explicitly once in a while.
It might be advantageous to have an index into these files. Try using a tmpfs full of symbolic links to the real file system.
You might or might not get a performance advantage in this scheme by chunking the files into manageably sized subdirectories.
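A minimal illustration of the truncate-and-overwrite idea above, in Java (which file to reuse and how the old-files list is kept are up to you; this shows only the in-place rewrite):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

public class ReuseOldFile {
    // Reuse an existing file instead of unlinking it and creating a new one:
    // shrink it to the new length, then overwrite its contents in place.
    static void overwrite(Path oldFile, byte[] newContents) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(oldFile.toFile(), "rw")) {
            raf.setLength(newContents.length); // analogous to truncate()
            raf.seek(0);
            raf.write(newContents);
        }
    }
}
```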
If you're OK with multiple things being in the same file:
Keep files of similar sizes together by storing each one as an offset into an array of similarly-sized files. If every file is 32k or 64k, keep a file full of 32k chunks and a file full of 64k chunks. If files are of arbitrary sizes, round up to the next power of two.
You can do lazy deletes here by keeping track of how stale each file is. If you're trying to write and something's stale, overwrite it instead of appending to the end of the file.
Another thought: Do you get a performance advantage by truncate()ing all of the files to length 0 in inode order and then unlink()ing them? Ignorance stops me from knowing whether this can actually help, but it seems like it would keep the data zeroing together and the metadata writing similarly together.
Yet another thought: XFS has a weaker write ordering model than ext4 with data=ordered. Is it fast enough on XFS?
If mass removal of millions of files results in a performance problem, you can sidestep it by "removing" all of the files at once: instead of using any per-file filesystem operation (like remove or truncate), just create a new (empty) filesystem in place of the old one.
To implement this idea you need to split your drive into two (or more) partitions. After one partition is full (or after 20 minutes), you start writing to the second partition while using the first one for reading only. After another 20 minutes you unmount the first partition, create an empty filesystem on it, mount it again, and then start writing to the first partition while using the second one for reading only.
The simplest solution is to use just two partitions, but this way you don't use disk space very efficiently: you can store only half as many files on the same drive. With more partitions you can increase the space efficiency.
If for some reason you need all your files in one place, use tmpfs to store links to the files on each partition. This still requires mass removal of millions of links from tmpfs, but it alleviates the performance problem because only the links have to be removed, not the file contents; also, those links are removed only from RAM, not from the SSD.
If you don't need to append to the small files, I would suggest that you create a big file and do a sequential write of the small files right in it, while keeping records of offsets and sizes of all the small files within that big file.
As you reach the end of the big file, start writing from its beginning again, while invalidating records of the small files in the beginning that you replace with new data.
If you choose the big file's size properly, based on the rate at which new files are saved, you get auto-pruning of files older than ~20 minutes almost exactly as you need.
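A rough sketch of that circular "one big file" idea in Java (the file name, capacity, and record bookkeeping are all assumptions; stale index entries for overwritten regions would still need pruning):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Append small payloads into one preallocated big file, wrapping around at the end.
// Overwriting old regions replaces explicit per-file deletion.
public class CircularStore {
    private static final long CAPACITY = 1L << 30; // assumed 1 GiB ring
    private final RandomAccessFile big;
    private final Map<String, long[]> index = new HashMap<>(); // key -> {offset, length}
    private long writePos = 0;

    public CircularStore(String path) throws IOException {
        big = new RandomAccessFile(path, "rw");
        big.setLength(CAPACITY); // preallocate once; no per-file create/delete afterwards
    }

    public synchronized void put(String key, byte[] data) throws IOException {
        if (writePos + data.length > CAPACITY) {
            writePos = 0; // wrap around, overwriting the oldest data
        }
        big.seek(writePos);
        big.write(data);
        index.put(key, new long[] { writePos, data.length });
        writePos += data.length;
    }

    public synchronized byte[] get(String key) throws IOException {
        long[] loc = index.get(key);
        if (loc == null) return null;
        byte[] out = new byte[(int) loc[1]];
        big.seek(loc[0]);
        big.readFully(out);
        return out;
    }
}
```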

Embedded database specialized for tiny size data & almost no writes?

I am looking for a lightweight embedded database to store (and rarely modify) a few kilobytes of data (5kb to 100kb) in Java applications (mostly Android but also other platforms).
Expected characteristics:
fast when reading, but not necessarily fast when writing
almost no size overhead (kilobytes used even when there is no data), but not necessarily very compact (kilobytes used per kilobyte of actual data)
very small database client library JAR file size
Open Source
QUESTION: Is there a database format specialized for those tiny cases?
Text-based solutions acceptable too.
If relevant: it will be this kind of data.
Stuff it in an object and serialize it out to a file. Write the new file on save, rename it on top of the old one to "commit" it so you don't have to worry about corrupting it if the write fails. No DB, no nothing. Simple.
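A minimal sketch of that save path in Java (class and file names are hypothetical); the temp-file-plus-rename step is what provides the "commit":

```java
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TinyStore {
    // Write the whole object graph to a temp file, then rename it over the real file.
    // If the write fails partway, the previous file is still intact.
    static void save(Serializable data, Path target) throws Exception {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(data);
        }
        // ATOMIC_MOVE support depends on the filesystem; REPLACE_EXISTING covers the replace intent.
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
    }
}
```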
If you can use flat (text) files, you could keep the file on disk and read/seek around in it, never reading it all in at once. If you need, e.g., faster lookups, maybe you can build an index that maps keys to record numbers, and use the record number to get the rest of the data from a constant-size-field database or as a line number in a text file.
I don't know about the Java static-initializer message you mention, but that sounds to me like a code size limit, not a data limit. Why would runtime data affect the bytecode?
Can't suggest specific libraries. Maybe there's some Berkeley DB, DSV or xBase style library around.
