As far as I can tell, it is not really possible to "update" a single portion of a file. One must overwrite the entire thing or simply append. A database, however, usually has update functionality. How would one design a database to not append - because that causes tombstones - but rather update?
Files can be overwritten in place; it can just be a bit tedious. You have to know the starting index (byte offset) of whatever you want to update and move the file pointer to that offset before you start writing.
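As a minimal sketch of that seek-then-write step (Python here for brevity; C's fseek and fwrite work the same way), assuming the replacement has exactly the same length as the bytes it replaces. The helper name overwrite_at and the demo data are made up for illustration:

```python
import os
import tempfile


def overwrite_at(path, offset, new_bytes):
    """Overwrite len(new_bytes) bytes starting at offset; the rest of the file is untouched."""
    # "r+b" opens for reading and writing without truncating the file.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(new_bytes)


# Demo on a throwaway file.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(b"name=alice;age=30;")

with open(demo_path, "rb") as f:
    data = f.read()
offset = data.index(b"alice")               # starting index of the value to update
overwrite_at(demo_path, offset, b"carol")   # same length, so nothing after it shifts

with open(demo_path, "rb") as f:
    updated = f.read()
os.remove(demo_path)
```

If the new value were longer, it would overwrite the bytes that follow; if shorter, leftover bytes of the old value would remain. That is why in-place updates are usually paired with fixed-width fields or padding.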
Databases are easier to update because they are a combination of many data structures (linked lists, trees, heaps, etc.) that each contain specific data and can be iterated through. For these structures, you only need to know which node to update, navigate to it, and overwrite its data.
I have a large blob (Azure) file with 10k JSON objects in a single array. This does not perform well because of its size. As I look to re-architect it, I can either create multiple files, each holding a single array of 500-1000 objects, or keep the one file but split the single array into an array of arrays, maybe 10 arrays of 1000 objects each.
For simplicity, I'd rather break into multiple files. However, I thought this was worth asking the question and seeing if there was something to be learned in the answers.
I would think this depends strongly on your use-case. The multiple files or multiple arrays you create will partition your data somehow: will the partitions be used mostly together or mostly separate? I.e. will there be a lot of cases in which you only read one or a small number of the partitions?
If the answer is "yes, I will usually only care about a small number of partitions" then creating multiple files will save you having to deal with most of your data on most of your calls. If the answer is "no, I will usually need either 1.) all/most of my data or 2.) data from all/most of my partitions" then you probably want to keep one file just to avoid having to open many files every time.
I'll add: in this latter case, it may well turn out that the file structure (one array vs an array-of-arrays) doesn't change things very much, since a full scan is a full scan is a full scan etc. If that's the case, then you may need to start thinking about how to move to the prior case where you partition your data so that your calls fall neatly within few partitions, or how to move to a different data format.
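To make the "multiple files" option concrete, here is a rough sketch that splits one array into files of 1000 objects and then reads a single partition. It uses local files rather than Azure blobs, and the function names, file naming scheme, and sample objects are all assumptions for illustration:

```python
import json
import os
import tempfile


def split_into_files(objects, chunk_size, out_dir, prefix="part"):
    """Write objects to numbered JSON files holding at most chunk_size objects each."""
    paths = []
    for start in range(0, len(objects), chunk_size):
        chunk = objects[start:start + chunk_size]
        path = os.path.join(out_dir, f"{prefix}-{start // chunk_size:04d}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(chunk, f)
        paths.append(path)
    return paths


def load_partition(path):
    """Read back only one partition instead of the whole data set."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


objects = [{"id": i, "value": i * i} for i in range(10_000)]  # stand-in data
out_dir = tempfile.mkdtemp()
paths = split_into_files(objects, 1000, out_dir)
third = load_partition(paths[2])  # parses only the objects with ids 2000..2999
```

This only pays off if most calls can tell which partition they need; splitting by position, as done here, helps only when lookups are by position or ID range.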
I am going to convert a text file into SQLite db form. Before putting any effort into writing code for it, I am concerned about these points:
Will the text file and its corresponding SQLite db be the same size?
Would SQLite take less space than the text file?
Or is the text file the option that takes the least space?
"Hardware is cheap" - I'd strongly recommend not worrying about size differences, which will likely be insignificant anyway, and instead pick whichever solution best meets the rest of your needs. A text file can work just fine for simple projects, but a database has many more features that can help you organize, backup, and query your data much more efficiently and robustly.
For a more in-depth look at the pros and cons of both options, check out: database vs. flat files
Some things to keep in mind:
(NOTE about this answer: Files here references to internal/external storage, not SharedPrefs)
SQL:
Databases have overhead, which takes up space
If the database or a table becomes corrupted, all of its data can be lost (how bad this is depends on your app. Losing several thousand pictures: bad. Losing a deletion log: not very bad)
Databases can be compressed (see this)
If IDs (or however you identify row X) conflict, you can split the data into different tables: one database can hold a separate table for each object type, so object X's IDs cannot collide with object Y's. That means you can keep everything in one file and still avoid naming conflicts. (Read more at the bottom of the answer)
Files:
Every single file has to be defined as its own separate file, which takes up space (the name of the file)
You cannot store all attributes in one file without setting up an advanced reader that determines the different types of data. If you don't do that, and have one file for each attribute, you will use a lot of space.
Reading thousands of lines can be slow, especially if you have several (say 100+) very big files
The OS uses space for each file beyond its content; the file name, for instance, takes up space. Keep in mind, though, that you can keep all of an app's data in a single file. If you have an app where objects of two different types may have naming issues, you create a new database.
Naming conflicts
Say you have two objects, object X and Y.
Scenario 1
Object X stores two variables. The file names are (x and y are coordinates in this case):
x.txt
y.txt
But in a later version, object Y comes in with the same two files.
So you have to assign an ID to object X and Y:
0-x.txt
0-y.txt
Every file uses 3 chars (7 total, including the extension) on the name alone. This grows the more complex the setup is. See scenario 2.
But saving in the database, you get the row with ID 0 and find column X or Y.
You do not have to worry about the file name.
Further, if every object saves a lot of files, the references needed to load or save each file will take up a lot of space in your code. That affects your APK file and slowly pushes you up towards the 50 MB limit (the Google Play limit).
You can create universal methods, but you can do the same with SQL and save space in the APK file. But compared to text files, SQL does save some space in terms of name.
Note, though, that if you save 2-3 files (just to pick a number), those few bytes going to names aren't going to matter.
It is when you start saving hundreds of files, with long names to avoid naming conflicts, that SQL saves you space. And if the table gets too big, you can compress it. You can zip text files to maybe save some space, but with one-liner files, there is not much to save.
Scenario 2
Objects X and Y have three children each.
Every child has 3 variables it saves to the file system. If there was only one object with 3 children, it could have saved it like this
[id][variable name].txt
But because there is another parent with 3 children (of the same type, saving the same files), only the children that are saved last stay saved. The first 3 get overwritten.
So you have to add the parent ID:
[parent ID][child ID][variable name].txt
And keep in mind, these examples are focused on a few objects. The amount of space saved is low, but when you save hundreds, if not thousands of files, that is when you start to save space.
Now, if you create a table, you can store your main objects (X and Y in this case). Then you can either design the first table so that it is recognisable whether an object is a parent or a child, or you can create a second table. The second table has two ID values: one to identify the parent and one to identify the child. So if you want to find all the children of object 436, you simply write this query:
SELECT * FROM childrentable WHERE `parent_id`='436'
And that will give you all the attributes for all the children with object 436 as its parent.
And everything is stored in the Cursor when returned.
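A small runnable sketch of that two-table layout, using Python's sqlite3 module as a stand-in for Android's database API (so you get a list of rows rather than a Cursor). The table name childrentable and the parent_id column come from the query above; the objects table and the x and y columns are assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE childrentable ("
    " id INTEGER PRIMARY KEY,"                               # identifies the child
    " parent_id INTEGER NOT NULL REFERENCES objects(id),"    # identifies the parent
    " x REAL, y REAL)"
)
conn.executemany(
    "INSERT INTO objects (id, name) VALUES (?, ?)",
    [(436, "X"), (437, "Y")],
)
conn.executemany(
    "INSERT INTO childrentable (parent_id, x, y) VALUES (?, ?, ?)",
    [(436, 1.0, 2.0), (436, 3.0, 4.0), (436, 5.0, 6.0)],
)


def children_of(conn, parent_id):
    """Return every child row for one parent; an empty list if it has none."""
    # A bound parameter (?) is used instead of pasting the ID into the SQL string.
    cur = conn.execute(
        "SELECT id, x, y FROM childrentable WHERE parent_id = ? ORDER BY id",
        (parent_id,),
    )
    return cur.fetchall()


kids_436 = children_of(conn, 436)  # three rows
kids_437 = children_of(conn, 437)  # no children: an empty result, not an error
```

Note that a parent without children simply yields zero rows; nothing extra has to be stored to record that fact.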
To do the same with files, you would need a line like this for every child (where Saver is the file saving and loading class):
Saver.load("0-436-file_name", context);
It is, of course, possible to use a for-loop to cycle through the child IDs (the 0 at the start), but you would also have to save how many children there are: you cannot discover the files as easily, so you have to store the number of objects and the number of children each object has.
This means you have to save more values in more files just to be able to find the files you saved in the first place, which is a really hard way to do things. A database saves you from writing files to keep track of how many files you saved: it returns [x] results for each query. So if object 436 has no children, SQL returns 0 rows. With files, you would have to save 0 as the number of children, because guessing file names leads to I/O exceptions.
I would expect the text file to be smaller as it has no overhead: all the things a Database gives you have a cost in terms of space.
It sounds like space is the only thing that matters to you, and that you expect to change the contents of the text file often (you call it a 'text file db'). Please note that there is no such thing as a 'text file db'. Reading and writing to it will be very slow compared to a proper db (such as SQLite). Adding different record types (tables in a db) will complicate your life, and I wouldn't want to try to maintain any sort of referential links between record types in a text file.
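If you want to check the size question on your own data rather than guess, a quick measurement along these lines would do. The record layout (name|phone) and the generated sample data are assumptions; substitute your real file:

```python
import os
import sqlite3
import tempfile

# Made-up sample records; replace with the contents of the real text file.
records = [(f"user{i:05d}", f"{9000000000 + i}") for i in range(5000)]

work_dir = tempfile.mkdtemp()
text_path = os.path.join(work_dir, "records.txt")
db_path = os.path.join(work_dir, "records.db")

# Plain text version: one "name|phone" line per record.
with open(text_path, "w", encoding="utf-8", newline="\n") as f:
    for name, phone in records:
        f.write(f"{name}|{phone}\n")

# SQLite version of the same data.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE records (name TEXT, phone TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)", records)
conn.commit()
conn.close()

text_size = os.path.getsize(text_path)
db_size = os.path.getsize(db_path)
print(f"text: {text_size} bytes, sqlite: {db_size} bytes")
```

The exact numbers depend on the data, the page size, and any indexes you add, so treat the result as a measurement of this one case rather than a general rule.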
Hope that helps -
I am new to databases and so am having a little trouble thinking in terms of DB design. If this is not the correct place or way to ask this question, I will be happy to move it to the correct place if told.
I am working on a problem where I go through multiple drives on multiple user machines and store a list of directories I want to use and another list I don't want to use for each volume. These directories can be quickly ascertained/verified so these will be modified over time. This is just a starting point to the actual problem I have to solve, which is doing more stuff on the files in the "to use" list above. The code that actually acts on these directories is a different service and so needs this list to use from somewhere.
I am kind of stuck coming up with a way to store this in a DB on the backend. Since these directory lists are different for different users/volumes I do not know how long or how short this list will be. So do I store these in a DB or somewhere else?
EDIT: To make the question a little more generic: if you have a variable number of data items (strings) to be shared between two services during computation, how do you store them? How do you create a table to store a variable number of items? In code it's easy to just store them in an array you can malloc as the need changes. It's much easier to store them in a file, but that does not work across services because these services can be located anywhere.
Thanks for any pointers.
For a project, I want to store the contents of my file in a database.
I am aware that using CLOB is one of the options for storing large file contents. But I have heard that it is not an efficient way to do so.
Are there other alternatives?
Thank you for your answers.
CLOBs are inefficient because every access returns the entire contents of the field, and every modification rewrites the entire contents of the field. It also makes searching on the data difficult and inefficient. If you can break the data up into smaller units to save in multiple rows in a table, that can lead to better, more efficient programs.
That said, those inefficiencies come from misusing the feature. It sounds like what you have in mind is probably just fine (provided, as you say, that you can't know where the file will end up getting stored; typically in this case what I would do would be to store a path to the file in the database rather than the contents of the file itself).
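As one possible reading of "break the data up into smaller units", the sketch below stores a file one line per row in SQLite instead of as a single large text value. The table and column names (file_lines, file_name, line_no, content) are hypothetical, and whether a line is the right unit depends entirely on the data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_lines ("
    " file_name TEXT NOT NULL,"
    " line_no INTEGER NOT NULL,"
    " content TEXT NOT NULL,"
    " PRIMARY KEY (file_name, line_no))"
)


def store_lines(conn, file_name, text):
    """Store each line of text as its own row."""
    rows = [(file_name, n, line) for n, line in enumerate(text.splitlines(), start=1)]
    conn.executemany("INSERT INTO file_lines VALUES (?, ?, ?)", rows)


store_lines(conn, "notes.txt", "alpha\nbeta\ngamma")

# Modify one unit without rewriting the whole document.
conn.execute(
    "UPDATE file_lines SET content = ? WHERE file_name = ? AND line_no = ?",
    ("BETA", "notes.txt", 2),
)
line_two = conn.execute(
    "SELECT content FROM file_lines WHERE file_name = ? AND line_no = ?",
    ("notes.txt", 2),
).fetchone()[0]

# Search inside the content and get back only the matching rows.
matches = conn.execute(
    "SELECT line_no FROM file_lines WHERE content LIKE ?", ("%amm%",)
).fetchall()
```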
I am using a text file to store my data records. The data is stored in the following format:
Antony|9876543210
Azar|9753186420
Branda|1234567890
David|1357924680
John|6767676767
Thousands of records are stored in that file. I want to delete a particular record, say "David|1357924680". I am using C; how can I delete a particular record efficiently? Currently I copy the records to a temporary file, omitting the record I want to delete. After that, I truncate the original file and copy the contents of the temp file back into it. I don't think I am doing this efficiently. Help me.
Add a column to your data indicating it is either a valid ( 1 ) or deleted ( 0 ) row:
Antony|9876543210|1
Azar|9753186420|1
Branda|1234567890|1
David|1357924680|1
John|6767676767|1
When you want to delete a record, overwrite the single byte:
Antony|9876543210|1
Azar|9753186420|1
Branda|1234567890|0
David|1357924680|1
John|6767676767|1
Branda is now deleted.
Then add a data file compaction function that rewrites the file without the deleted rows. This can run during times of low or no usage so it doesn't interfere with regular operations.
Edit
The validity column should probably be the first column so you can skip deleted rows more easily.
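A rough sketch of this scheme with the validity flag as the first column, as the edit suggests. It is written in Python for brevity (in C the same steps map to fgets, ftell, fseek, and fputc); the function names mark_deleted and compact are made up for illustration:

```python
import os
import tempfile


def mark_deleted(path, name):
    """Flip the leading flag of the first live record for `name` from 1 to 0."""
    target = b"1|" + name.encode() + b"|"
    with open(path, "r+b") as f:
        while True:
            line_start = f.tell()      # byte offset where this row begins
            line = f.readline()
            if not line:
                return False           # reached end of file without a match
            if line.startswith(target):
                f.seek(line_start)
                f.write(b"0")          # overwrite exactly one byte
                return True


def compact(path):
    """Rewrite the file without the rows whose flag is 0."""
    tmp_path = path + ".tmp"
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        for line in src:
            if not line.startswith(b"0|"):
                dst.write(line)
    os.replace(tmp_path, path)


fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(
        b"1|Antony|9876543210\n1|Azar|9753186420\n1|Branda|1234567890\n"
        b"1|David|1357924680\n1|John|6767676767\n"
    )

size_before = os.path.getsize(path)
found = mark_deleted(path, "David")
size_after_mark = os.path.getsize(path)   # unchanged: only one byte was rewritten

compact(path)                              # run later, during low usage
with open(path, "rb") as f:
    remaining = f.read().decode().splitlines()
os.remove(path)
```

Readers of the file have to skip rows starting with 0 until the next compaction, which is the cost of making the delete itself a one-byte write.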
I think your approach is a little bit wrong. If you really want to do it efficiently, use a database, for example SQLite. It is a simple-to-use database stored in a single file, but it offers much of the power of SQL and is very efficient. Adding new entries and deleting won't be a problem (searching will be easy too). So check it out: http://www.sqlite.org/ .
Here is a 3-minute tutorial that explains by example how to do everything you are trying to accomplish here: http://www.sqlite.org/quickstart.html .
Some simple ideas to improve efficiency a little bit:
Instead of copying the temp file back into the original, delete the original and rename the temp file to the original's name (assuming they are in the same directory)
Use an in-memory data structure instead of a temp file to copy the records (but you may need to limit its size and use it only as a buffer)
Mark records as deleted without removing them from the file, then physically remove the marked records after a certain number of delete operations (but you will have to rewrite your other file operations to ignore the marked records)
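The first idea could look roughly like this: write every record except the deleted one to a temp file in the same directory, then rename it over the original. The sketch is in Python (C offers the equivalent rename call); the function name delete_record is an assumption:

```python
import os
import tempfile


def delete_record(path, record):
    """Remove the first line equal to `record`; return True if one was removed."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the original so the final rename stays
    # on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    removed = False
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst, \
            open(path, encoding="utf-8", newline="") as src:
        for line in src:
            if not removed and line.rstrip("\r\n") == record:
                removed = True
                continue               # skip the record being deleted
            dst.write(line)
    os.replace(tmp_path, path)         # rename instead of copying the data back
    return removed


fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("Antony|9876543210\nAzar|9753186420\nDavid|1357924680\nJohn|6767676767\n")

was_removed = delete_record(path, "David|1357924680")
with open(path, encoding="utf-8") as f:
    lines_left = f.read().splitlines()
os.remove(path)
```

The data is written once instead of twice, and the original file is never left half-written if the program stops midway.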
I would suggest a solution similar to the one "Robert S. Barnes" gave.
I would overwrite David|1357924680 with |--------------- (the same number of bytes).
No need for extra bytes (though that is not much of a benefit)
The data is really deleted. This is useful when security requirements call for it.
Sometime later (daily, weekly, ...) do the same / similar as you do now.
Three suggestions:
1. Do it the way you describe, but instead of copying the temporary file back to the original, just delete the original and rename the temporary file. This should run twice as fast.
2. Overwrite the record with 'XXXXXXX' or whatever. This is very fast, but it may not be suitable for your project.
3. Use a balanced binary tree. This is the 'professional' solution. If possible, avoid programming it from scratch!
Since you cannot remove bytes from the middle of a file in place, you have to resort to a method similar to what you are doing now.
As mentioned by a few others, maintaining a proper data structure in memory and only writing back at intervals would improve efficiency.
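One way to read "maintain a data structure and write back at intervals" is sketched below: the records live in a dictionary, deletes touch only memory, and the file is rewritten once per flush. The class name RecordStore is hypothetical, and the sketch assumes names are unique, which the original question does not state:

```python
import os
import tempfile


class RecordStore:
    """Keeps name -> number records in memory and writes them back only on flush()."""

    def __init__(self, path):
        self.path = path
        self.records = {}   # assumption: each name appears at most once
        self.dirty = False
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    name, number = line.split("|", 1)
                    self.records[name] = number

    def delete(self, name):
        # Deleting only changes memory; no file I/O happens here.
        if self.records.pop(name, None) is not None:
            self.dirty = True

    def flush(self):
        """Write all remaining records back in one pass, if anything changed."""
        if not self.dirty:
            return
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            for name, number in self.records.items():
                f.write(f"{name}|{number}\n")
        os.replace(tmp_path, self.path)
        self.dirty = False


fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write("Antony|9876543210\nAzar|9753186420\nBranda|1234567890\n"
            "David|1357924680\nJohn|6767676767\n")

store = RecordStore(path)
store.delete("David")
store.delete("John")
store.flush()            # one rewrite covers both deletes

with open(path, encoding="utf-8") as f:
    on_disk = f.read().splitlines()
os.remove(path)
```

The trade-off is that deletes made since the last flush are lost if the program crashes, so the flush interval has to match how much loss is acceptable.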