More efficient to store text as file or in DB? - database

Imagine you're dealing with many strings of text that are about 10,000 characters long entered by users. Would it be more efficient to write those automatically onto pages or input them onto a table in a database? I hope that question is clear enough...

It depends on what sort of "efficiency" you're aiming for.
Here's what I mean:
will you be changing the content of your text strings?
what sorts of searches will you be doing?
when you extract the text, what do you do with it?
My opinion is that provided you're not going to change the content much, nor perform much analysis, you're better off with the database.

10k isn't particularly large, so either is fine. I would personally use the database, though, as it will let you search the text easily.

It depends on how you're accessing them, but using the FS directly would normally perform better. The reason is that the DB is another layer built on top of the FS, so going to the FS directly saves you the DBMS operations, assuming you don't introduce heavy processing of your own (for example, keep hundreds of named files rather than one big bloated file in a special order that you then have to parse).

I'm wondering if SQLite would be the best of both worlds, or at least, the best database for that size of job.

The real answer here depends on what you're going to do with these strings.
Databases are meant to be able to quickly return specific records. If you're just going to SELECT * FROM Table and then concat it all together, there's no point in using a database.
However, if you have a relation between your data that you want to be able to search, then a database will likely be more efficient.
E.G., do you want to be able to pull up all the text records from a set of users on a set of dates? Find all records from users who match some records?
A database will likely handle these kinds of loads more efficiently than a naive file-based implementation, and probably still faster than a decent one, even though the file approach avoids some access layers.
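As a rough illustration of the "set of users on a set of dates" lookup, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are assumptions made up for the example, not anything from the question.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries ("
    " id INTEGER PRIMARY KEY,"
    " username TEXT NOT NULL,"
    " created TEXT NOT NULL,"  # ISO date string such as '2024-01-15'
    " body TEXT NOT NULL)"
)
# An index on (username, created) lets the lookup below avoid a full table scan.
conn.execute("CREATE INDEX idx_entries_user_created ON entries (username, created)")

rows = [
    ("alice", "2024-01-10", "a" * 10_000),
    ("bob", "2024-01-11", "b" * 10_000),
    ("alice", "2024-02-01", "c" * 10_000),
]
conn.executemany(
    "INSERT INTO entries (username, created, body) VALUES (?, ?, ?)", rows
)


def entries_for(users, start, end):
    """Return (username, created, body) rows for the given users within [start, end]."""
    placeholders = ",".join("?" for _ in users)
    sql = (
        "SELECT username, created, body FROM entries "
        f"WHERE username IN ({placeholders}) AND created BETWEEN ? AND ? "
        "ORDER BY created"
    )
    return conn.execute(sql, [*users, start, end]).fetchall()


january = entries_for(["alice", "bob"], "2024-01-01", "2024-01-31")
```

Doing the same with one file per entry would mean maintaining your own mapping from users and dates to file names, which is the part the database gives you for free.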

There are a lot of considerations. As others said - either approach would work fine for a small number of 10k rows (thousands).
But what does the rest of your app do? If it does everything in the database, then I'd be inclined to put this there as well; the opposite is true as well.
And how will you be selecting these? Do you need to do complex text searches? If so, a database might not be the best. Or, would you be adding new attributes, searching on those attributes - or matching them against data in other tables? In this common case a database would be better.
And if your data is really vast (many millions of 10k rows) and your performance requirements aren't terribly high - you may want to compress them and store them in the file system.
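A minimal sketch of the compress-and-store-on-disk idea, assuming zlib compression and one file per entry. The helper names, the `.z` suffix, and the key format are hypothetical; with many millions of entries you would probably also want to spread the files across subdirectories.

```python
import tempfile
import zlib
from pathlib import Path


def save_text(root: Path, key: str, text: str) -> Path:
    """Compress text with zlib and write it to <root>/<key>.z."""
    path = root / f"{key}.z"
    path.write_bytes(zlib.compress(text.encode("utf-8")))
    return path


def load_text(root: Path, key: str) -> str:
    """Read <root>/<key>.z and return the decompressed text."""
    return zlib.decompress((root / f"{key}.z").read_bytes()).decode("utf-8")


root = Path(tempfile.mkdtemp())
original = "User-entered text that repeats a lot. " * 250  # roughly 10k characters
stored = save_text(root, "entry-000001", original)
restored = load_text(root, "entry-000001")
```

How much space this saves depends entirely on the text; the repetitive sample above compresses far better than typical user input would.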
Lastly, how important is data quality? Given the features of a good database it's much easier to guarantee good data quality with a database.

Related

Choosing the right DBM-like C++ library for sequential data

I am trying to choose a database for a new application under development. There are so many alternatives, and it’s so easy to choose the wrong one. First of all, there is a requirement not to use database servers: the database must be a static or dynamic C++ library. The data that needs to be stored is an array of records. The record layouts vary between datasets but are fixed for a given dataset (so they can be stored in a table). Each row could hold from several hundred bytes up to several megabytes. The number of rows may be in the millions for now and is expected to grow.
The index of the row could be used as a key. No need to maintain a separate key column.
Data is inserted sequentially. Read access will be performed only by iterating sequentially over all the data or some segment of it (we may need to iterate in steps, such as every 5th row).
I don’t think that relational DBs are a good fit, for several reasons.
a. They are mostly server-based. I know about SQLite but as far as I know, it stores data in one file which I assume may lead to issues related to maximum file size.
b. We don’t need the power that SQL provides; instead, we would like more flexibility in the stored data types.
There are key/value non-SQL DBMSs like BerkeleyDB and RocksDB, or something like luxio for a lighter alternative. The functionality they provide is more than enough for the task, and this might be the right choice. However, I don’t know how well they are optimized for a case like ours, where the keys are continuous integers. The associative key access (which we don’t require) may carry some performance overhead.
I know there is a type of non-SQL database called “wide-column”, which I am not familiar with. However, the name sounds like it is perfect for our task. All the databases of this type that I can find are server or cloud based. If you know of a DBM-like library for this type of database, please advise.
I am not experienced in databases, so please correct me if I am wrong in any of the 3 statements above.
If your row data can grow to megabytes, and you're talking about only millions of records, maybe just figure out a way to lay it out in a filesystem? If you need a more database-like index, use SQLite for the keys, and have the data records point to a location on the filesystem. This kind of thing will be far quicker to implement and get right than trying to do it all in one giant database.
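The "SQLite for the keys, data on the filesystem" layout could look roughly like the sketch below. It is written in Python for brevity even though the question targets C++; the class name, file names, and index schema are all assumptions for illustration. Records are appended to one binary file, and SQLite only stores each record's byte offset and length.

```python
import os
import sqlite3
import tempfile


class RecordStore:
    """Append-only blob file plus a SQLite index of row number -> (offset, length)."""

    def __init__(self, directory):
        self.data_path = os.path.join(directory, "records.bin")
        self.index = sqlite3.connect(os.path.join(directory, "index.sqlite"))
        self.index.execute(
            "CREATE TABLE IF NOT EXISTS record_index ("
            " row_no INTEGER PRIMARY KEY,"
            " byte_offset INTEGER NOT NULL,"
            " byte_length INTEGER NOT NULL)"
        )

    def append(self, payload: bytes) -> int:
        """Append one record and return its row number (numbering starts at 1)."""
        with open(self.data_path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(payload)
        cur = self.index.execute(
            "INSERT INTO record_index (byte_offset, byte_length) VALUES (?, ?)",
            (offset, len(payload)),
        )
        self.index.commit()
        return cur.lastrowid

    def scan(self, start=1, stop=None, step=1):
        """Yield records for rows start, start+step, ... up to stop (inclusive)."""
        sql = "SELECT byte_offset, byte_length FROM record_index WHERE row_no >= ?"
        params = [start]
        if stop is not None:
            sql += " AND row_no <= ?"
            params.append(stop)
        sql += " AND (row_no - ?) % ? = 0 ORDER BY row_no"
        params += [start, step]
        with open(self.data_path, "rb") as f:
            for offset, length in self.index.execute(sql, params).fetchall():
                f.seek(offset)
                yield f.read(length)


store = RecordStore(tempfile.mkdtemp())
for i in range(10):
    store.append(f"record-{i}".encode())

every_fifth = list(store.scan(step=5))        # rows 1 and 6
segment = list(store.scan(start=3, stop=5))   # rows 3, 4 and 5
```

Committing after every append keeps the sketch simple but would be slow for bulk loads; batching inserts in one transaction is the usual remedy. Splitting the data across several files once one grows too large would also sidestep the single-file size concern raised in the question.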

Database Design Questions - Need Clarifications

I'm designing a database using SQL Server 2005.
The main concept of our site is to import XML feeds from suppliers.
Different suppliers can have different representations of the data.
The problem is that I need to design tables to store the imported information.
Some of the columns are fixed, meaning all supplier products must have the same data coming from the feed, like name, code, price, status, etc.
But some products have optional details.
For example, one product might have a color property while another might not.
What is the best way to store this kind of scenario in the database?
Should I create one table for the mandatory columns and other tables to hold the optional columns?
Or should I list all the columns first and put them into one table? (There might be a lot of null values.)
There will be thousands of products, and database speed is essential.
We will be doing a lot of product comparisons across different suppliers.
Our database will be something like www.pricerunner.co.uk
I hope I've explained the concept well.
Thousands of products (so thousands of rows). That's really not many at all, so you could normalize the optional data into a few separate tables without a dramatic effect on query time.
I would say put your indexes in the correct place, optimize your queries, make sure you have filegroups split up nicely, etc (just the usual regular old database stuff) and you should be good.
Depends on how you want to access it.
As you say, speed is important - but what are you going to do with those extra, optional, bits of information? Do you need to store them at all? Assuming you do, how often do you need to access them?
Essentially, if you will always need to at least check if they're there, probably better to put them into one table. If you need to check anyway, might as well get it over with as part of the initial query.
If, on the other hand, you can usually run without bothering to check for these extra pieces, and only need to bother when specially requested, then it might be better to put them into a different table. The join (or subsequent lookup) will be expensive - much more expensive than pulling nulls for empty columns - but if it's very infrequent, it would probably cost less in runtime execution in the long run.
Also bear in mind the tradeoff in storage and transport terms - storing lots of empty fields does take some space, and sending back lots of empty fields takes network bandwidth.
If disk space is not a concern, but bandwidth is, make sure the application is carefully designed to minimise unnecessary lookups; then, with tight queries, you can store the extra (optional) data but not pass it back unless it's requested.
So, it really all depends on what's important to you. Once you know what your overriding design concerns are, you will know which compromises to make to address those concerns at the expense of others. A balancing act.
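To make the "separate table for optional data" option concrete, here is a small sketch. It uses Python's sqlite3 as a stand-in for SQL Server 2005, and every table and column name is hypothetical. The fixed columns live in one table; optional properties go into a second table that is joined in only when needed.

```python
import sqlite3

# sqlite3 stands in for SQL Server here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        supplier TEXT NOT NULL,
        code TEXT NOT NULL,
        name TEXT NOT NULL,
        price REAL NOT NULL
    );
    -- One row per optional property that a product actually has.
    CREATE TABLE product_attributes (
        product_id INTEGER NOT NULL REFERENCES products(id),
        attr_name TEXT NOT NULL,
        attr_value TEXT NOT NULL,
        PRIMARY KEY (product_id, attr_name)
    );
""")
conn.executemany(
    "INSERT INTO products (id, supplier, code, name, price) VALUES (?, ?, ?, ?, ?)",
    [(1, "acme", "K-100", "Kettle", 24.99), (2, "bolt", "T-200", "Toaster", 19.50)],
)
conn.execute("INSERT INTO product_attributes VALUES (1, 'color', 'red')")

# Common path: compare prices using only the fixed columns, no join needed.
cheapest = conn.execute(
    "SELECT name, price FROM products ORDER BY price LIMIT 1"
).fetchone()

# Occasional path: LEFT JOIN the optional property; products lacking it yield NULL.
with_color = conn.execute(
    "SELECT p.name, a.attr_value FROM products p "
    "LEFT JOIN product_attributes a "
    "  ON a.product_id = p.id AND a.attr_name = 'color' "
    "ORDER BY p.id"
).fetchall()
```

The single-wide-table alternative would simply add a nullable `color` column to `products`; which one is cheaper depends on how often the optional data is read, as discussed above.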

Is it advisable to store things such as list of cities on the db?

Hi, I'm using CakePHP and I'm wondering if it's advisable to store things that don't change a lot, like the list of cities, in the database?
If your application already needs a database, why would you keep data anywhere else?
If the list doesn't change (per installation) and it's reasonably small and frequently used, then it might be worth reading it once on initialization and caching the result to improve performance and reduce the load on the database.
You get all sorts of queries and retrievals out of the box, the same way you access any other of your data. Databases are as cheap as flat files today, but you get a full service.
I see this question has had an answer accepted - I still want to chime in with my $0.02
What I typically do for arrays of static data (country list, timezone list, immutable sets you would use an enum for...) is use this array datasource.
It allows you to map relationships between db models and array based models and to use the usual find syntax / Containable on the relationships.
http://github.com/jrbasso/array_datasource
If it is pretty much a static list, then you can store it either in the db or a file, but keep it in memory for use. In other words, load it once whether from db or file. What you don't want to do is keep taking a hit loading it. Especially if you use it on most page views. Those little bits of time add up if you have a large number of visitors.
The flip side, of course, is if you find yourself doing this for large lists or lots and lots of little lists. Then you could run into problems of keeping too much in memory.
Bill the Lizard is right about it being important whether or not the list links to other tables. If it does, then you will need it in the db if you need queries that will include it.
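The "load it once, then keep it in memory" advice is language-neutral. Here is a minimal sketch in Python rather than CakePHP, with a made-up `cities` table and a counter that exists only to show how often the database is actually queried.

```python
import sqlite3
from functools import lru_cache

# Hypothetical table and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO cities (name) VALUES (?)", [("Cebu",), ("Manila",), ("Davao",)]
)

stats = {"queries": 0}  # only here to show how often the database is hit


@lru_cache(maxsize=1)
def city_names():
    """Load the city list once; later calls return the cached tuple."""
    stats["queries"] += 1
    rows = conn.execute("SELECT name FROM cities ORDER BY name").fetchall()
    return tuple(name for (name,) in rows)


first = city_names()   # hits the database
second = city_names()  # served from memory
```

If the list ever does change, calling `city_names.cache_clear()` forces the next call to reload it.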

database vs flat file, which is a faster structure for "regex" matching with many simultaneous requests

Which structure returns results faster and/or is less taxing on the host server: a flat file or a database (MySQL)?
Assume many users (100 users) are simultaneously querying the file/db.
Searches involve pattern matching against a static file/db.
File has 50,000 unique lines (same data type).
There could be many matches.
There is no writing to the file/db, just read.
Is it possible to have a duplicate of the file/db and write a logic switch to use the backup file/db if the main one is in use?
Which language is best for the type of structure? Perl for flat and PHP for db?
Additional info:
Say I want to find all the cities that have the pattern "cis" in their names.
Which is better/faster, using regex or string functions?
Please recommend a strategy
TIA
I am a huge fan of simple solutions, and thus prefer -- for simple tasks -- flat file storage. A relational DB with its indexing capabilities won't help you much with arbitrary regex patterns at all, and the filesystem's caching ensures that this rather small file is in memory anyway. I would go the flat file + perl route.
Edit: (taking your new information into account) If it's really just about finding a substring in one known attribute, then using a fulltext index (which a DB provides) will help you a bit (depending on the type of index applied) and might provide an easy and reasonably fast solution that fits your requirements. Of course, you could implement an index yourself on the file system, e.g. using a variation of a Suffix Tree, which is hard to beat speed-wise.
Still, I would go the flat file route (and if it fits your purpose, have a look at awk), because if you had started implementing it, you'd be finished already ;) Further I suspect that the amount of users you talk about won't make the system feel the difference (your CPU will be bored most of the time anyway).
If you are uncertain, just try it! Implement that regex+perl solution, it takes a few minutes if you know perl, loop 100 times and measure with time. If it is sufficiently fast, use it, if not, consider another solution. You have to keep in mind that your 50,000 unique lines are really a low number in terms of modern computing. (compare with this: Optimizing Mysql Table Indexing for Substring Queries )
HTH,
alexander
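The "just try it, loop 100 times and measure" suggestion can be sketched as follows, in Python rather than Perl. The 50,000 lines are synthetic stand-ins for the real file, and the two planted city names are assumptions so that the "cis" pattern has something to match.

```python
import re
import timeit

# Synthetic stand-in for the 50,000-line file; real data would be read from disk.
lines = [f"city{i:05d}" for i in range(50_000)]
lines[123] = "san francisco"
lines[45_678] = "cisco"

pattern = re.compile("cis")


def find_regex():
    """Full scan using a compiled regular expression."""
    return [line for line in lines if pattern.search(line)]


def find_substring():
    """Full scan using a plain substring test."""
    return [line for line in lines if "cis" in line]


regex_hits = find_regex()
substring_hits = find_substring()

# Time 100 full scans of each approach, as the answer suggests.
regex_seconds = timeit.timeit(find_regex, number=100)
substring_seconds = timeit.timeit(find_substring, number=100)
print(f"regex: {regex_seconds:.3f}s  substring: {substring_seconds:.3f}s for 100 scans")
```

The absolute numbers depend on the machine and the data, so treat the output as a measurement to make rather than a result to expect.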
Depending on what your queries and your data look like, a full text search engine like Lucene or Sphinx could be a good idea.

Store static data in an array, or in a database?

We always have some static data which can be stored in a file as an array or stored in a database table in our web based project. So which one should be preferred?
In my opinion, arrays have some advantages:
More flexible (it can be any structure, which specifies a really complex relation)
Better performance (it will be loaded in memory, which will have better read/write performance compared with a database's I/O operations)
But my colleague argued that he prefers the DB approach, since it keeps a uniform data persistence interface and is more flexible.
So which should be preferred? Or how can we choose? Or we should prefer one in some scenario and another in other scenarios? what are the scenarios?
EDIT:
Let me clarify something. Just as Benjamin's change to the title indicates, the data we want to store in an array (file) won't change frequently, which means the code won't change the values of the array at runtime. If the data changed very frequently, I would undoubtedly use a DB. That's why I made this post.
And sometimes it's hard to store some really complex relations like:
Task = {
    "1": {
        "name": "xx",
        "requirement": {
            "level": 5,
            "money": 100,
        },
        # ...
    },
}
As in the code sample above (a Python dict, or you can think of it as an array), the requirement field is hard to store in a DB (store a structure like a pickled object directly in the DB? Not so good, I think). So in such a condition, I would prefer arrays.
So what's your idea? In such scenario, we should prefer arrays to DB, right?
Regards.
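One middle ground for a nested field like requirement is to serialize it as JSON text in an ordinary column instead of pickling it. The sketch below assumes a hypothetical `tasks` table and uses Python's sqlite3 and json modules; it illustrates the option and is not a claim about what the poster's project uses.

```python
import json
import sqlite3

# Hypothetical table layout; the nested "requirement" field is stored as JSON text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks ("
    " id TEXT PRIMARY KEY, name TEXT NOT NULL, requirement TEXT NOT NULL)"
)

task = {"1": {"name": "xx", "requirement": {"level": 5, "money": 100}}}

for task_id, fields in task.items():
    conn.execute(
        "INSERT INTO tasks (id, name, requirement) VALUES (?, ?, ?)",
        (task_id, fields["name"], json.dumps(fields["requirement"])),
    )


def load_tasks():
    """Rebuild the nested dict from the table."""
    rows = conn.execute("SELECT id, name, requirement FROM tasks")
    return {
        task_id: {"name": name, "requirement": json.loads(req)}
        for task_id, name, req in rows
    }


loaded = load_tasks()
```

The trade-off is that the database cannot easily filter or index on values inside the JSON text, so this suits fields that are read as a whole.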
Let's be pragmatic/objective:
Do you write to your data at runtime? Yes: DB, No: File
Do you update your data more than once per week? Yes: DB, No: File
Is it a pain to release an updated data file? Yes: DB, No: File
Do you read that data often? Yes: File/Cache, No: DB
Is it a pain to update that data file, and do you need extra tools to do it? Yes: DB, No: File
For sure I've forgotten other points, but I guess the basics are there.
The "flexible" array in a file is fraught with a zillion issues already dealt with by using a DB. Unless you can prove that the DB is really going to be way slower than the other approach, use a DB. Move on and start solving business problems.
Edit
Comment from OP asks what the issues with using a file might be, here are a handful (pause to take a deep breath).
Concurrency: You have to manage the situation where multiple requests may be trying to write back to the file. Not too hard but it becomes a bottleneck.
Performance: Yes, modifying an in-memory array is quicker, but how do you determine how much of the array needs to be persisted to a file, and when? Note that using a DB doesn't preclude the use of an appropriate in-memory cache. Writing a file back each time a small modification is made isn't going to perform that well.
Scalability: Really a function of the first two. In order to achieve any scalability goals you need to be able to quickly modify small bits of the data that is persisted. IOW, if you don't use a DB you would end up writing one. If you find you need more than one webserver to support growing demand, where are you going to store the file(s)? Now you've got file I/O over a network (albeit likely a very quick one).
Structure: Your code will be responsible for managing the structure of the data, querying it, etc. if you use an array. How will you do that in a way which achieves greater "flexibility" than using a DB? All manner of choices and complexity are needed here.
Reliability: You need to ensure the integrity of your persisted data. In the event of some failure, your array/file code would need to ensure that the data is at least not so corrupt that the application can't continue.
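To give a feel for the work the reliability point implies, here is a minimal sketch of one common technique: write the new contents to a temporary file, then rename it over the old one, so a crash mid-write never leaves a half-written file. The function names and the JSON format are assumptions for the example. It does nothing about concurrent writers, which would still need locking on top.

```python
import json
import os
import tempfile


def save_atomically(path, data):
    """Write data as JSON to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        # Atomic on POSIX: readers see the old file or the new one, never half.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load(path):
    """Read the JSON file back into a Python object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


target = os.path.join(tempfile.mkdtemp(), "tasks.json")
save_atomically(target, {"1": {"name": "xx"}})
save_atomically(target, {"1": {"name": "yy"}})
current = load(target)
```

Even this small helper rewrites the whole file on every save, which is the performance problem described in the list above.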
Your colleague is correct, BUT here's where you need to put aside the comp sci textbook and be pragmatic. How often will you be accessing this data from your application? If it's fairly frequently, then don't incur the cost of access overhead every time. Instead of reading from a flat file, you could still gain the advantages of a DB by using a caching strategy in your application. Depending on your development language, you could look at something like memcache or jtreecache.
It depends on what kind of data you are looking at, and whether or not it needs to be updated regularly.
I tend to keep most things (non-config data) in the database, even if the data isn't going to be repeating (e.g. thousands of rows). Databases will scale so much more easily than a flat file; if your system starts to grow fast, your flat file might become a burden on it.
If the data doesn't change very often, and you're programming in Java, why not use Spring to hold the values?
They can be injected into your bean, and changed easily.
But that's only if you're developing in Java.
Yeah, I agree with your implied assessment that databases are overused and that basic flat files may work in a multitude of scenarios. If your application is read-only (and writes are done by the admin when the app restarts), I would definitely go with the file. Even if the application writes to the file, but only in append mode (vs. random inserts/updates) and from one thread, I would also use a file. For anything else, you need a real database with random updates, queries, concurrency control, etc.

Resources