I am working on a program that requires me to input values for 12 objects, each with 4 arrays, each holding 100 values (4,800 values in total). The 4 arrays represent possible outcomes based on 2 boolean values, i.e. YY, YN, NY, NN, and the 100 values in each array are what I want to extract based on another inputted variable.
I already have all possible outcomes in a CSV file and have imported them into SQLite, where I can query them for the value using SQL. However, it has been suggested to me that an SQLite database is not the way to go, and that I should instead populate hardcoded arrays.
Which would be better during run time and for memory management?
If you only need to query the data (no update/delete/insert), I wouldn't suggest using SQLite. I think the hardcoded version beats SQLite in both run time and memory efficiency.
Most likely SQLite will always be less efficient than hardcoded variables, but SQLite offers other advantages down the road, potentially making maintenance of the code easier. For the amount of data you are talking about, I think it would be hard to notice any real difference between 4,800 values stored in the code and 4,800 values stored in a database.
SQLite would easily beat your CSV as far as processing time goes, though, and memory management would depend a lot on how efficiently your language of choice handles CSV parsing versus SQLite connectivity.
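For what it's worth, if you do stay with SQLite, the whole 4,800-value lookup stays a single indexed query. A minimal sketch, assuming a schema along these lines (the table and column names are my guesses, not yours):

    CREATE TABLE outcomes (
        object_id INTEGER NOT NULL,   -- 1 to 12
        combo     TEXT    NOT NULL,   -- 'YY', 'YN', 'NY' or 'NN'
        idx       INTEGER NOT NULL,   -- 0 to 99, driven by the other input variable
        value     REAL    NOT NULL,
        PRIMARY KEY (object_id, combo, idx)
    );

    -- one lookup = one read through the primary key index
    SELECT value
    FROM outcomes
    WHERE object_id = 7 AND combo = 'YN' AND idx = 42;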
Usually a database is used when you want to handle a lot of data (or could potentially end up handling a lot of data) and you want a faster way to search part of it.
If you just need to save a few values, then you probably don't need a database engine.
Related
I am working with an organisation that is using Excel as their data storage center right now, their data might scale up in the near future, so I was wondering when the data size is too large for Excel to handle efficiently. The data stored is just strings and integers of limited lengths (below 50 char and integer values between 1 and 100000).
Is there some kind of general guideline to how big an Excel sheet can be before it becomes inefficient and you should use a database like SQL Server instead?
I have usually found that once the Excel file size crosses 30 MB it becomes very slow and unresponsive. To manipulate larger data sets I started using an add-in called Power Pivot, which is very fast. See instructions on how to enable it here - https://support.office.com/en-us/article/Start-the-Power-Pivot-in-Microsoft-Excel-add-in-a891a66d-36e3-43fc-81e8-fc4798f39ea8
Efficiency is subjective and depends on your goals. The thing is that an RDBMS like SQL Server will always be more efficient, because you get optimized searches instead of reading a textual file. Relations between tables are also handled well, and if you design your DB schema well you get benefits like concurrent access handling and the avoidance of redundancy and inconsistency. It is worth the effort in the long run.
Currently I am storing JSON in my database as VARCHAR(max); it is JSON that has had some transformations applied to it. One of our techs is asking to store the original JSON it was transformed from.
I'm afraid that if I add another JSON column it is going to bloat the page size and lead to slower access times. On the other hand, this table is not going to be really big (about 100 rows max, with each JSON column taking 4-6 KB) and could get accessed as much as 4 or 5 times a minute.
Am I being a stingy gatekeeper mercilessly abusing our techs or a sagacious architect keeping the system scalable?
Also, I'm curious about the (relatively) new FILESTREAM/BLOB types. From what I've read, I get the feeling that BLOBs are stored in some separate place, such that relational queries aren't slowed down at all. Would switching from varchar to FILESTREAM help?
Generally, BLOB storage is preferred when the objects being stored are, on average, larger than 1 MB.
I think you should be fine keeping them in the same database. 100 rows is not much for a database.
Also, what is the use case for keeping the original as well as the transformed JSON? If the original JSON is not going to be used as part of normal processing and just needs to be kept for reference, I would suggest keeping a separate table, dumping the original JSON there with a reference key, and reading the original only when needed.
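A rough sketch of that layout, with made-up table and column names (the foreign key assumes your existing transformed-JSON table is called dbo.Payload with a PayloadId key):

    -- side table for the untransformed JSON; nothing in normal processing reads it
    CREATE TABLE dbo.PayloadOriginal (
        PayloadId    INT          NOT NULL
            CONSTRAINT FK_PayloadOriginal_Payload REFERENCES dbo.Payload (PayloadId),
        OriginalJson VARCHAR(MAX) NOT NULL
    );

    -- reference lookup, run only when someone actually needs the original
    SELECT OriginalJson
    FROM dbo.PayloadOriginal
    WHERE PayloadId = 42;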
Your use case doesn't sound too demanding. 4-6 KB per column and fewer than 100 (or even 1,000) rows is still pretty light, though I know the expected use case almost never ends up being the actual use case. If people use the table for things other than the JSON field, you might not want them pulling back the JSON because of the potential size and unnecessary bloat.
The good thing is that SQL Server has some less complex options to help us out: https://msdn.microsoft.com/en-us/library/ms173530(v=sql.100).aspx
I would suggest looking at the 'large value types out of row' table option, as it is forward compatible and the 'text in row' option is deprecated. Essentially these options store large text fields off the primary data page, letting the regular row data live where it needs to live and the oversized stuff have a different home.
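Turning it on is a one-liner (the table name is a placeholder); as far as I recall, existing rows only move off-page the next time they are updated:

    -- push varchar(max) / nvarchar(max) / varbinary(max) values off the in-row data page
    EXEC sp_tableoption 'dbo.PayloadTable', 'large value types out of row', 1;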
I'm trying to figure out how to manage and serve a lot of numerical data. Not sure an SQL database is the right approach. Scenario as follows:
10000 sets of time series data collected per hour
5 floating point values per set
About 5000 hours worth of data collected
So that gives me about 250 million values in total. I need to query this set of data by set ID and by time. If possible, also filter by one or two of the values themselves. I'm also continuously adding to this data.
This seems like a lot of data. Assuming 4 bytes per value, the raw data comes to about 1 GB. I don't know what a general "overhead multiplier" for an SQL database is; let's say it's 2, then that's about 2 GB of disk space.
What are good approaches to handling this data? Some options I can see:
Single PostgreSQL table with indices on ID, time
Single SQLite table -- this seemed to be unbearably slow
One SQLite file per set -- lots of .sqlite files in this case
Something like MongoDB? Don't even know how this would work ...
Appreciate commentary from those who have done this before.
Mongo is a document store; it might work for your data, but I don't have much experience with it.
I can tell you that PostgreSQL will be a good choice. It will be able to handle that kind of data. SQLite is definitely not optimized for those use-cases.
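A sketch of the single-Postgres-table option (column names and types are assumptions). The composite primary key already gives you the (set ID, time) index, and the value filter just runs over the rows that index returns:

    CREATE TABLE samples (
        set_id integer     NOT NULL,   -- 1 .. 10000
        ts     timestamptz NOT NULL,
        v1 real, v2 real, v3 real, v4 real, v5 real,
        PRIMARY KEY (set_id, ts)       -- covers "query by set ID and by time"
    );

    SELECT ts, v1, v2, v3, v4, v5
    FROM samples
    WHERE set_id = 1234
      AND ts >= '2015-01-01' AND ts < '2015-01-08'
      AND v1 > 0.5;                    -- optional filter on one of the values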
I am designing a new laboratory database. For some tests, I have several waveforms with ~10,000 data points acquired simultaneously. In the application (written in C), the waveforms are stored as an array of floats.
I believe I would like to store each waveform as a BLOB.
Questions:
Can the data in a BLOB be structured in such a way that Oracle can work with the data itself using only SQL or PL/SQL?
Determine max, min, average, etc
Retrieve index when value first exceeds 500
Retrieve 400th number
Create BLOB which is a derivative of first BLOB
NOTE: This message is a sub-question of Storing Waveforms in Oracle.
Determine max, min, average, etc
Retrieve index when value first exceeds 500
Retrieve 400th number
The relational data model was designed for this kind of analysis - and Oracle's SQL is more than capable of doing this, if you model your data correctly. I recommend you focus on transforming the array of floats into tables of numbers - I suspect you'll find that the time taken will be more than compensated for by the speed of performing these sorts of queries in SQL.
The alternative is to try to write SQL that will effectively do this transformation at runtime anyway - every time the SQL is run; which will probably be much less efficient.
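For example, with each sample stored as its own row (names are mine, not from your schema), the operations in your list become plain SQL:

    CREATE TABLE waveform_point (
        waveform_id  NUMBER       NOT NULL,
        sample_idx   NUMBER       NOT NULL,   -- 1 .. ~10000
        sample_value BINARY_FLOAT NOT NULL,
        CONSTRAINT waveform_point_pk PRIMARY KEY (waveform_id, sample_idx)
    );

    -- max, min, average
    SELECT MAX(sample_value), MIN(sample_value), AVG(sample_value)
    FROM   waveform_point
    WHERE  waveform_id = :id;

    -- index where the value first exceeds 500
    SELECT MIN(sample_idx)
    FROM   waveform_point
    WHERE  waveform_id = :id AND sample_value > 500;

    -- the 400th number
    SELECT sample_value
    FROM   waveform_point
    WHERE  waveform_id = :id AND sample_idx = 400;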
You may also wish to consider the VARRAY type. You do have to work with the entire array (no retrieval of subsets, partial updates, etc.), but you can define a max length and Oracle will store only what you use. You can declare VARRAYs of almost any datatype, including BINARY_FLOAT or NUMBER. BINARY_FLOAT will minimize your storage, but it has some minor precision issues (which would matter in financial applications). It is stored in IEEE 754 format.
Since you're planning to manipulate the data with PL/SQL, I might back off from the BLOB design. VARRAYs will be more convenient to use. BLOBs would be very convenient for storing an array of raw C floats for later use in another C program.
See PL/SQL Users Guide and Reference for how to use them.
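A minimal sketch of the VARRAY route (the type and table names are mine):

    -- up to 10,000 single-precision samples per waveform
    CREATE TYPE waveform_samples_t AS VARRAY(10000) OF BINARY_FLOAT;
    /

    CREATE TABLE waveform (
        waveform_id NUMBER PRIMARY KEY,
        samples     waveform_samples_t
    );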
I think that you could probably create PL/SQL functions that take the BLOB as a parameter and return information about it.
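Something along these lines, for instance, would pull a single sample out of the BLOB. It is only a sketch: it assumes the BLOB is the raw packed array of 4-byte C floats and that the byte order matches what UTL_RAW expects on your platform.

    CREATE OR REPLACE FUNCTION waveform_sample (
        p_blob IN BLOB,
        p_idx  IN PLS_INTEGER            -- 1-based sample index
    ) RETURN BINARY_FLOAT
    IS
        l_raw RAW(4);
    BEGIN
        -- each sample is 4 bytes; DBMS_LOB offsets are 1-based
        l_raw := DBMS_LOB.SUBSTR(p_blob, 4, (p_idx - 1) * 4 + 1);
        RETURN UTL_RAW.CAST_TO_BINARY_FLOAT(l_raw);
    END;
    /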
If you could use XMLType for the field, then you can definitely parse it in PL/SQL and write the functions you want.
http://www.filibeto.org/sun/lib/nonsun/oracle/11.1.0.6.0/B28359_01/appdev.111/b28369/xdb10pls.htm
Of course, XML will be quite a bit slower, but if you can't parse the binary data, it's an alternative.
Suppose you have a dense table with an integer primary key, where you know the table will contain 99% of all values from 0 to 1,000,000.
A super-efficient way to implement such a table is an array (or a flat file on disk), assuming a fixed record size.
Is there a way to achieve similar efficiency using a database?
Clarification - when stored in a simple table / array, access to an entry is O(1) - just a memory read (or a read from disk). As I understand it, databases store their nodes in trees, so they cannot achieve identical performance - access to an average node will take a few hops.
Perhaps I don't understand your question, but a database is designed to handle data. I work all day long with databases that have millions of rows. They are efficient enough.
I don't know exactly what you mean by "achieve similar efficiency using a database". In a database (in my experience), what exactly you are trying to do is what matters for performance.
If you simply need a single record based on a primary key, then the database should naturally be efficient enough, assuming it is properly structured (for example, 3NF).
Again, you need to design your database to be efficient for what you need. Furthermore, consider how you will write queries against the database in a given structure.
In my work, I've been able to cut query execution time from >15 minutes to 1 or 2 seconds simply by optimizing my joins, the where clause and overall query structure. Proper indexing, obviously, is also important.
Also, consider the database engine you are going to use. I've been assuming SQL Server or MySQL, but those may not be right. I've heard (but have never tested the idea) that SQLite is very quick - faster than either of the aforementioned. There are also many other options, I'm sure.
Update: based on your explanation in the comments, I'd say no - you can't. You are asking about mechanisms designed for two completely different things. A database persists data over a long period of time and is usually optimized for many connections and many reads/writes. In your description, the array in memory is accessed by a single program, and that program owns the memory; it's not (usually) shared. I do not see how you could achieve the same performance.
Another thought: the absolute closest thing you could get to this, in SQL Server specifically, is a table variable. A table variable (in theory) is held in memory only; I've heard people refer to table variables as SQL Server's "array". Any regular table write or create statement prompts the RDBMS to write to disk (first to the log and then to the data files, I think), and large data reads can also cause the DB to write to private temp tables to store data for later use.
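Roughly like this (only a sketch):

    -- table variable: what people sometimes call SQL Server's "array"
    DECLARE @dense TABLE (
        Id  INT NOT NULL PRIMARY KEY,
        Val BIT NOT NULL
    );

    INSERT INTO @dense (Id, Val) VALUES (42, 1);

    SELECT Val FROM @dense WHERE Id = 42;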
There is not much you can do to specify how data will be physically stored in a database. The most you can do is specify whether data and indices are stored separately or the data is stored inside one index tree (a clustered index, as Brian described).
But in your case this does not matter at all, because:
All databases make heavy use of caching. 1,000,000 records can hardly exceed 1 GB of memory, so your complete database will quickly end up in the database cache.
If you are reading a single record at a time, the main overhead you will see is accessing the data over the database protocol. The process goes something like this:
connect to database - open communication channel
send SQL text from application to database
database analyzes the SQL (parses it, checks whether the command has been compiled before, compiles it if this is the first time it is issued, ...)
database executes the SQL. After a few executions the data from your example will be cached in memory, so execution will be very fast.
database packs fetched records for transport to application
data is sent over communication channel
database component in application unpacks received data into some dataset representation (e.g. ADO.Net dataset)
In your scenario, executing the SQL and finding the records takes very little time compared to the total time needed to get the data from the database to the application. Even if you could force the database to store the data in an array, there would be no visible gain.
If you've got a decent amount of records in a DB (and 1MM is decent, not really that big), then indexes are your friend.
You're talking about old fixed-record-length flat files. And yes, they are super-efficient compared to databases, but like struct/value arrays vs. classes, they just do not have the kind of features that we typically expect today.
Things like:
searching on different columns/combinations
variable length columns
nullable columns
editability
restructuring
concurrency control
transaction control
etc., etc.
Create a DB table with an ID column and a bit column. Use a clustered index on the ID column (the ID column is your primary key). Insert all 1,000,000 elements (do so in ID order or it will be slow). This is somewhat inefficient in terms of space, since you pay the B-tree index overhead on top of the raw data.
I don't claim this is efficient, but it will be stored in a similar manner to how an array would have been stored.
Note that the ID column can be marked as an identity/counter column in most DB systems, in which case you can just insert the 1,000,000 items and it will do the counting for you. I am not sure whether such a DB avoids explicitly storing the counter's value, but if it does you'd only end up using roughly n space.
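In SQL Server terms, the table above might look like this (sketch only; the identity column does the counting for you):

    CREATE TABLE dbo.DenseFlags (
        Id  INT IDENTITY(0, 1) NOT NULL
            CONSTRAINT PK_DenseFlags PRIMARY KEY CLUSTERED,   -- clustered on the "array index"
        Val BIT NOT NULL
    );

    -- insert in order; the identity column counts 0, 1, 2, ... on its own
    INSERT INTO dbo.DenseFlags (Val) VALUES (0);   -- repeat (or loop) up to 1,000,000 rows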
When your primary key is an integer sequence, it can be a good idea to have a reverse key index. This makes sure that contiguous key values are spread apart in the index tree.
However, there is a catch - with reverse key indexes you will not be able to do range searches.
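In Oracle, for example, a reverse-key primary key index might be set up like this (sketch; the names are mine):

    CREATE TABLE dense_rows (
        id  NUMBER NOT NULL,
        val NUMBER(1)
    );

    -- the reverse-key index scatters sequential key values across the leaf blocks
    CREATE UNIQUE INDEX dense_rows_pk ON dense_rows (id) REVERSE;

    ALTER TABLE dense_rows
        ADD CONSTRAINT dense_rows_pk PRIMARY KEY (id) USING INDEX dense_rows_pk;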
The big question is: efficient for what?
For Oracle, ideas might include:
read access by id: index organized table (this might be what you are looking for; see the sketch at the end of this answer)
insert only, no update: no indexes, no spare space
read access full table scan: compressed
high concurrent write when id comes from a sequence: reverse index
For the actual question, precisely as asked: write all rows into a single BLOB (the table then contains one column and one row). You might be able to access this like an array, but I am not sure, since I don't know what operations are possible on BLOBs. Even if it works, I don't think this approach would be useful in any realistic scenario.
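To make the first bullet concrete: an index organized table keeps the rows inside the primary key B-tree itself, so a read by id is a single index walk. A sketch (names are mine):

    CREATE TABLE dense_iot (
        id  NUMBER PRIMARY KEY,
        val NUMBER(1)
    )
    ORGANIZATION INDEX;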