I have two databases in Excel that were extracted outside of MATLAB. How can I compare them within MATLAB? Is storing them in a structure the right way? How can I find the similarities and differences?
Each database has 4 columns and around 400 rows of data.
If you are interested in keeping the structure, you could create a struct for each database where each field name is equal to a column name.
In that case you can use visdiff.
However, if you are going to compare a lot of numbers, this is not practical.
To confirm that they are equal, you can use something like isequal.
To see how they differ, plot them or plot the difference between them.
To see whether the data behaves differently, calculate some basic statistics such as max, min, mean, and std; you may also be interested in the correlation between the columns of the two datasets.
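Since the data is coming from Excel anyway, here is a rough sketch of those checks in Python with pandas (file names, sheet layout, and matching column order are assumptions; in MATLAB the rough equivalents would be readtable, isequal, and the basic statistics functions):

    import pandas as pd

    # File names are placeholders.
    a = pd.read_excel("database_a.xlsx")
    b = pd.read_excel("database_b.xlsx")

    # Quick equality check, analogous to isequal.
    print("identical:", a.equals(b))

    # Element-wise differences of the numeric columns.
    diff = a.select_dtypes("number") - b.select_dtypes("number")
    print(diff.describe())    # min, max, mean, std of the differences

    # Correlation between matching columns of the two datasets.
    for col in a.select_dtypes("number").columns:
        print(col, a[col].corr(b[col]))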
So, I have astronomical spectroscopy data in the following format:
{
"molecule": "CO2",
"blahblah":
"5 more simple fields"
"arrayofvalues": [lengths can go up to 2 million]
}
Of this data, I have 600,000 files, which means there are roughly 1 trillion individual data points that I want to search through and do computations with.
Can someone please point me to resources (big data, BigQuery, or similar) on how I can efficiently look up this data for computations and graphing? For example, I want to search for certain molecules under certain conditions and see what data they show.
I want to make a website where people can pick some variables and a value range, and get graphical or textual data back.
I tried putting some of this into PostgreSQL, but when I do a GET request (even with just 5 files stored) it crashes Postman, because it's too much data.
Without knowing more details, you can take advantage of the data modeling options available in BigQuery, such as:
nested data
arrays and structs
partitioned tables
clustering
Take a look at the data types: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
And also the partitioning and clustering techniques.
https://towardsdatascience.com/how-to-use-partitions-and-clusters-in-bigquery-using-sql-ccf84c89dd65?gi=cd1bc7f704cc
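As an illustration, here is a sketch with the BigQuery Python client of one possible layout: the simple fields as ordinary columns, the spectrum as a REPEATED field, partitioned by an (assumed) observation timestamp, and clustered on the molecule. Project, dataset, table, and field names are all made up:

    from google.cloud import bigquery

    client = bigquery.Client()

    # One row per file: simple fields as columns, the spectrum as a REPEATED field.
    schema = [
        bigquery.SchemaField("molecule", "STRING"),
        bigquery.SchemaField("observed_at", "TIMESTAMP"),
        bigquery.SchemaField("values", "FLOAT64", mode="REPEATED"),
    ]

    table = bigquery.Table("my-project.spectra.measurements", schema=schema)
    table.time_partitioning = bigquery.TimePartitioning(field="observed_at")
    table.clustering_fields = ["molecule"]
    client.create_table(table)

Queries can then prune on the partition and cluster columns, and UNNEST the value array only where the individual data points are actually needed, instead of scanning all 600,000 arrays.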
I am looking for a good solution to represent a 2-D array in a database. With a few catches of course. For instance, the array could grow in terms of columns and rows (which could be inserted at any point). Also important is that people (users) can manipulate specific cells in this array (and hopefully without having to update the entire array). Also there may be multiple arrays to store.
I have thought about using JSON, but that would require writing the entire JSON document back to the database whenever a specific cell is updated, which would not be ideal, especially when multiple people could manipulate the array at once.
The typical data structure would be three columns:
row
column
value
Of course, you might have other columns if you have a separate array for each user (say a userid or arrayid column). Or, you might have multiple values for each cell.
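A rough sketch of that layout (SQLite is used here only to keep the example self-contained; table and column names are arbitrary). Each cell is its own row, so updating one cell touches exactly one row:

    import sqlite3

    con = sqlite3.connect("arrays.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS cells (
            array_id INTEGER NOT NULL,   -- which array the cell belongs to
            row_num  INTEGER NOT NULL,
            col_num  INTEGER NOT NULL,
            value    TEXT,
            PRIMARY KEY (array_id, row_num, col_num)
        )
    """)

    # Update (or create) a single cell without rewriting the rest of the array.
    con.execute(
        "INSERT OR REPLACE INTO cells (array_id, row_num, col_num, value) VALUES (?, ?, ?, ?)",
        (1, 5, 3, "hello"),
    )
    con.commit()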
For data assurance, my task is to compare two datasets from different databases. Currently I am performing a cell-by-cell value comparison, which is a brute-force method and is consuming a lot of time.
I would like to know if there are any methods that would save time and memory and simply report "Tables are identical" or "Tables are not identical".
Thank you for your assistance.
How about creating a checksum for each table and comparing them?
Something like:
SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM TableX
This might need an ORDER BY to be more precise.
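Since the two tables live in different databases, CHECKSUM_AGG may not be available on both sides; the same idea can be done client-side. A minimal Python sketch, assuming both drivers return rows as plain tuples in the same column order and that the first column gives a stable sort:

    import hashlib

    def table_fingerprint(cursor, table):
        """Hash every row of a table into one digest (order-sensitive)."""
        h = hashlib.sha256()
        cursor.execute(f"SELECT * FROM {table} ORDER BY 1")
        for row in cursor:
            h.update(repr(row).encode("utf-8"))
        return h.hexdigest()

    # identical = table_fingerprint(cursor_a, "TableX") == table_fingerprint(cursor_b, "TableX")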
If they are from different sources, there is no other way than comparing them cell by cell, AFAIK. However, I can suggest something that will probably increase the comparison speed many times over. If your DataTables have identical structures (which they presumably do, since you're already comparing them cell by cell), try comparing the ItemArray of each pair of rows instead of accessing cells by column index or column name (or by row properties if you're using strongly typed DataSets). This should give you much better results.
If you're using .NET 3.5 or above, this line should do it:
Enumerable.SequenceEqual(row1.ItemArray, row2.ItemArray);
Context: SQL Server 2008, C#
I have an array of integers (0-10 elements). Data doesn't change often, but is retrieved often.
I could create a separate table to store the numbers, but for some reason it feels like that wouldn't be optimal.
Question #1: Should I store my array in a separate table? Please give reasons for one way or the other.
Question #2: (regardless of the answer to Q#1) what's the "best" way to store an int[] in a database field? XML? JSON? CSV?
EDIT:
Some background: the numbers being stored are just some coefficients that don't participate in any relationship, and they are always used as an array (i.e. a value is never retrieved or used in isolation).
Separate table, normalized
Not as XML or JSON, but as separate numbers in separate rows.
No matter what you think, it's the best way. You can thank me later.
The "best" way to store data in a database is the way that is most conducive to the operations that will be performed on it and the one which makes maintenance easiest. It is this latter requirement that should lead you to a normalized solution, which means storing the integers in a table with a relationship. Beyond being easier to update, it is easier for the next developer who comes after you to understand what information is stored and how.
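A sketch of what that normalized layout could look like (SQLite is used here purely for brevity; the table and column names are invented):

    import sqlite3

    con = sqlite3.connect("app.db")
    con.executescript("""
        CREATE TABLE IF NOT EXISTS things (
            thing_id INTEGER PRIMARY KEY
        );
        CREATE TABLE IF NOT EXISTS coefficients (
            thing_id INTEGER NOT NULL REFERENCES things(thing_id),
            position INTEGER NOT NULL,   -- index within the array
            value    INTEGER NOT NULL,
            PRIMARY KEY (thing_id, position)
        );
    """)

    # Store the array [3, 1, 4] for thing 1, one row per element.
    con.execute("INSERT OR IGNORE INTO things (thing_id) VALUES (1)")
    con.executemany(
        "INSERT INTO coefficients (thing_id, position, value) VALUES (?, ?, ?)",
        [(1, i, v) for i, v in enumerate([3, 1, 4])],
    )
    con.commit()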
Store it as a JSON array but know that all accesses will now be for the entire array - no individual read/writes to specific coefficients.
In our case, we're storing them as a JSON array. Like your case, there is no relationship between the individual array numbers: the array only makes sense as a unit, and as a unit it DOES have a relationship with other columns in the table. By the way, everything else IS normalized. I liken it to this: if you were going to store a 10-byte chunk, you'd save it packed in a single column of VARBINARY(10). You wouldn't shard it into 10 bytes, store each in a column of VARBINARY(1), and then stitch them together with a foreign key. I mean you could, but it wouldn't make any sense.
YOU as the developer will need to understand how 'monolithic' that array of ints really is.
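For comparison, a small sketch of the JSON-array approach (again SQLite just for brevity; on SQL Server 2008 the column would simply be an NVARCHAR, since built-in JSON support only arrived in SQL Server 2016). The whole array is written and read back as a single value:

    import json
    import sqlite3

    con = sqlite3.connect("app.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS items (item_id INTEGER PRIMARY KEY, coefficients TEXT)"
    )

    con.execute(
        "INSERT OR REPLACE INTO items (item_id, coefficients) VALUES (?, ?)",
        (1, json.dumps([3, 1, 4, 1, 5])),
    )
    con.commit()

    row = con.execute("SELECT coefficients FROM items WHERE item_id = ?", (1,)).fetchone()
    print(json.loads(row[0]))   # [3, 1, 4, 1, 5]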
A separate table would be the most "normalized" way to do this. And it is better in the long run, probably, since you won't have to parse the value of the column to extract each integer.
If you want you could use an XML column to store the data, too.
Sparse columns may be another option for you, too.
If you want to keep it really simple you could just delimit the values: 10;2;44;1
Since you are talking about SQL Server, your app is probably a data-driven application. If that is the case, I would definitely keep the array in the database as a separate table with a record for each value. It will be normalized and optimized for retrieval. Even if you only have a few values in the array, you may need to combine that data with other retrieved data that has to be joined with your array values, and that kind of join is exactly what SQL is optimized for through indexes, foreign keys, etc. (normalized).
That being said, you can always hard code the 10 values in your code and save the round trip to the DB if you don't need to change the values. It depends on how your application works and what this array is going to be used for.
I agree with all the others that a separate normalized table is best. But if you insist on having it all in the same table, don't place the array in a single column. Instead, create 10 columns and store each array value in a different column. It will save you the parsing and update problems.
I'm looking for a database with good support for set operations (more specifically: unions).
What I want is something that can store sets of short strings and calculate the union of such sets. For example, I want to add A, B, and C to a set, then D, and A to another and then get the cardinality of the union of those sets (4), but scaled up a million times or so.
The values are 12 character strings and the set sizes range from single elements to millions.
I have experimented with Redis, and it's fantastic in every respect, except that for the amount of data I have it's tricky with something that is memory-based. I've tried using the VM feature, but that makes it use even more memory; it is geared more towards large values, and I have small values (or so say the helpful people on the Redis mailing list). The jury is still out, though, I might get it to work.
I've also sketched out implementing it on top of a relational database, which would probably work, but what I'm asking for is something that I wouldn't have to hack to make work. Redis would be a good answer, but as I mentioned above, I've tried it.
My current, Redis-based, implementation works more or less like this: I parse log files, and for each line I extract an API key, a user ID, and the values of a number of properties like site domain, time of day, etc. I then formulate keys that look somewhat like this (each line results in many keys, one for each property):
APIKEY:20101001:site_domain:stackoverflow.com
The key points to a set, and to this set I add the user ID. When I've parsed all the log files, I want to know the total number of unique user IDs for a property over all time, so I ask Redis for the cardinality of the union of all keys that match
APIKEY:*:site_domain:stackoverflow.com
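For reference, that flow looks roughly like this with redis-py (the key layout follows the example above; the user ID is a placeholder, and KEYS is used only for clarity, SCAN would be kinder to a production instance):

    import redis

    r = redis.Redis()

    # While parsing a log line: add the user ID to the per-day, per-property set.
    r.sadd("APIKEY:20101001:site_domain:stackoverflow.com", "user-42")

    # Afterwards: cardinality of the union of all days for that property.
    keys = r.keys("APIKEY:*:site_domain:stackoverflow.com")
    print(len(r.sunion(keys)) if keys else 0)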
Is there a database, besides Redis, that has good support for this use case?
It sounds like you need something like boost::disjoint_sets, which is a data structure specifically optimized for taking unions of large sets.
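For what it's worth, the structure behind boost::disjoint_sets is plain union-find; a tiny sketch of the idea (illustrative only, not the Boost API):

    class DisjointSets:
        """Minimal union-find with path compression."""

        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]   # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    ds = DisjointSets()
    for a, b in [("A", "B"), ("B", "C"), ("D", "A")]:
        ds.union(a, b)
    print(ds.find("C") == ds.find("D"))   # True: all four elements share one set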