Reducing the file size of MATLAB variables [closed] - arrays

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
Process of extracting data,
I am analyzing 4000 to 8000 DICOM files using matlab codes. DICOM files are read using dicomread() function. Each DICOM file contains 932*128 photon count data coming from 7 detectors. While reading DICOM files, I convert data into double and stored in 7 cell array variables (from seven detectors). So each cell contains 128*128 photon counting data and cell array contain 4000 to 8000 cells.
Question.
When I save each variable separately, size of each variable is 3GB. So for 7 variables it will be 21GB, Saving them and reading back takes awful lot of time. (RAM of my computer is 4GB)
Is there a way to reduce the size of variable?
Thanks.

Different data type will help. You can save data as float instead of double, as DICOM files have it as float too (from http://northstar-www.dartmouth.edu/doc/idl/html_6.2/DICOM_Attributes.html; Graphic Data). This halves size at no loss. You might want to expand to double when doing operations on data to avoid inaccuracies creeping up.
Additional compression by saving it as uint16 (additional x2 space saving) or even uint8 (x4) might be possible, but I would be wary of this - it might work great in all test cases but make problems when you least expect it.
Cell array is not problematic in terms of speed or size - you will not gain (much) by switching to something else. Your data gobbles up memory, not the cell array itself. If you wish, you can save data in a 128x128x7x8000 float array - it should work just fine too.
But if the number of images (this 4000-8000) can increase at any point, rescaling the array will be a pretty costly operation in terms of space and time. Cell arrays are much easier to extend - 8k values to move around instead of 8k*115k=900M values.
Another option is to separate data in chunks. You probably don't need to be working on all 4000 images at once. You can load 500 images, finish your work on them, move on to next 500 images etc. Batch size obviously depends on your hardware and what processing you do with data, but I guess about 500 could be a pretty reasonable starting point.

Related

Array or Map is better in database? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed last month.
Improve this question
I am using AWS dynamoDB.
option 1
[{"id":"01","avaliable":true},
{"id":"02","avaliable":true},
{"id":"03","avaliable":false},
{"id":"04","avaliable":true}
{"id":"05","avaliable":false}]
option 2
"avaliable":[true,true,false,true,false]
id will always in sequence and start with 0 so I think it is a waste to include "id" as attribute. I can just update avaliabe in option 2 using {id-1} as array index. But I am not sure will there be any other issue if I use option 2. I am orginally using option 1 and will check whether the id correct before update. I am afraid option 2 will have mistake easily.
Which structure do you think is better?
Personally I prefer to use Map type in DynamoDB because it allows you to update on a key versus guessing what index you need in an array. However that would be option 3:
"mymap":{
"id01":{"avaliable":true},
"id02":{"avaliable":true},
"id03":{"avaliable":true},
"id04":{"avaliable":true},
"id05":{"avaliable":true}
}
This allows you up modify elements without first trying to figure out what position in an array it might be, which sometimes requires you to first read the item and can cause concurrency issues.
I do notice you mention that you equate the position of the item in the array, however I feel this is a more fool-proof way for general implementation. For example, if you need to remove a value from the middle of the list, it would not cause any issues.
That is one thing that can influence your decision, the other 2 being item size and total storage.
If your item size is substantially less than 1KB then you will have no issue using option 1 or 3 which will increase your item size slightly compared to option 2. As long as the extra characters do not push your average item size over the nearest 1KB value as that will mean that you will have increased your capacity consumption for write requests.
The other being the total storage size. DynamoDB provides a free tier of 25GB of storage. If you have millions of items causing you to increase your storage size substantially, then you may decide to use option 2.

How to stuff any number of values in 8-10 bytes of data for n number of 16 bit values? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 3 years ago.
Improve this question
I am working on algorithm where i can have any number of 16 bit values(For instance i have 1000 16 bit values , and all are sensor data, so no particular series or repetition). I want to stuff all of this data into an 8 or a 10 byte array(each and every value of the 1000 16 bits numbers should be inside the 10 byte array) . The information should be such that i can also easily decode to read each and every value from the 1000 values.
I have thought of using sin function by dividing the values by 100 so every data point would always be in 8 bits(0-1 sin value range) , but that only covers up small range of data and not huge number of values.
Pardon me if i am asking for too much. I am just curious if its possible or not.
The answer to this question is rather obvious with a little knowledge in information sciences. It is not possible to store that much information in so little memory, and the data you are talking about just contains too much information.
Some data, like repetitive data or data which is following some structure (like constantly rising values), contains very little information. The task of compression algorithms is to figure out the structure or repetition and instead of storing the pure data to store the structure or rule how to reproduce the data instead.
In your case, the data is coming from sensors and unless you are willing to lose a massive amount of information, you will not be able to generate a compressed version of it with a compression factor in the magnitude your are talking about (1000 × 2 bytes into 10 bytes). If your sensors more or less produce the same values all the time with just a little jitter, a good compression can be achieved (but for this your question is way to broad to be answered here) but it will probably never be in the range of reducing your 1000 values to 10 bytes.

Improving performance when looping in a big data set

I am making some spatio-temporal analysis (with MATLAB) on a quite big data set and I am not sure what is the best strategy to adopt in terms of performance for my script.
Actually, the data set is split in 10 yearly arrays of dimension (latitude,longitude,time)=(50,60,8760).
The general structure of my analysis is:
for iterations=1:Big Number
1. Select a specific site of spatial reference (i,j).
2. Do some calculation on the whole time series of site (i,j).
3. Store the result in archive array.
end
My question is:
Is it better (in terms of general performance) to have
1) all data in big yearly (50,60,8760) arrays as global variables loaded for once. At each iteration the script will have to extract one particular "site" (i,j,:) from those arrays for data process.
2) 50*60 distinct files stored in a folder. Each file containing a particular site time series (a vector of dimension (Total time range,1)). The script will then have to open, data process and then close at each iteration a specific file from the folder.
Because your computations are computed on the entire time series, I would suggest storing the data that way in a 3000x8760 vector and doing the computations that way.
Your accesses then will be more cache-friendly.
You can reformat your data using the reshape function:
newdata = reshape(olddata,50*60,8760);
Now, instead of accessing olddata(i,j,:), you need to access newdata(sub2ind([50 60],i,j),:).
After doing some experiments it is clear that the second proposition with 3000 distinct files is much slower than having to manipulate big arrays loaded in workspace. But I didn't try to load all the 3000 files in workspace before computing (A tad to much).
It looks like Reshaping data help's a little bit.
Thanks to all contributors for your suggestions.

clustering a SUPER large data set [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I am working on a project as a part of my class curriculum . Its a project for Advanced Database Management Systems and it goes like this.
1)Download large number of images (1000,000) --> Done
2)Cluster them according to their visual Similarity
a)Find histogram of each image --> Done
b)Now group (cluster) images according to their visual similarity.
Now, I am having a problem with part 2b. Here is what I did:
A)I found the histogram of each image using matlab and now have represented it using a 1D vector(16 X 16 X 16) . There are 4096 values in a single vector.
B)I generated an ARFF file. It has the following format. There are 1000,000 histograms (1 for each image..thus 1000,000 rows in the file) and 4097 values in each row (image_name + 4096 double values to represent the histogram)
C)The file size is 34 GB. THE BIG QUESTION: HOW THE HECK DO I CLUSTER THIS FILE???
I tried using WEKA and other online tools. But they all hang. Weka gets stuck and says "Reading a file".
I have a RAM of 8 GB on my desktop. I don't have access to any cluster as such. I tried googling but couldn't find anything helpful about clustering large datasets. How do I cluster these entries?
This is what I thought:
Approach One:
Should I do it in batches of 50,000 or something? Like, cluster the first 50,000 entries. Find as many possible clusters call them k1,k2,k3... kn.
Then pick the the next 50,000 and allot them to one of these clusters and so on? Will this be an accurate representation of all the images. Because, clustering is done only on the basis of first 50,000 images!!
Approach Two:
Do the above process using random 50,000 entries?
Any one any inputs?
Thanks!
EDIT 1:
Any clustering algorithm can be used.
Weka isn't your best too for this. I found ELKI to be much more powerful (and faster) when it comes to clustering. The largest I've ran are ~3 million objects in 128 dimensions.
However, note that at this size and dimensionality, your main concern should be result quality.
If you run e.g. k-means, the result will essentially be random because of you using 4096 histogram bins (way too much, in particular with squared euclidean distance).
To get good result, you need to step back an think some more.
What makes two images similar. How can you measure similarity? Verify your similarity measure first.
Which algorithm can use this notion of similarity? Verify the algorithm on a small data set first.
How can the algorithm be scaled up using indexing or parallelism?
In my experience, color histograms worked best on the range of 8 bins for hue x 3 bins for saturation x 3 bins for brightness. Beyond that, the binning is too fine grained. Plus it destroys your similarity measure.
If you run k-means, you gain absolutely nothing by adding more data. It searches for statistical means and adding more data won't find a different mean, but just some more digits of precision. So you may just as well use a sample of just 10k or 100k pictures, and you will get virtually the same results.
Running it several times for independent sets of pictures results in different cluster clusters which are difficult to merge. Thus two similar images are placed in different clusters. I would run the clustering algorithm for a random set of images (as large as possible) and use these cluster definitions to sort all other images.
Alternative: Reduce the compexity of your data, e.g. to a histogram of 1024 double values.

Loading tiles for a 2D game

Im trying to make an 2D online game (with Z positions), and currently im working with loading a map from a txt file. I have three different map files. One contains an int for each tile saying what kind of floor there is, one saying what kind of decoration there is, and one saying what might be covering the tile. The problem is that the current map (20, 20, 30) takes 200 ms to load, and I want it to be much much bigger. I have tried to find a good solution for this and have so far come up with some ideas.
Recently I'v thought about storing all tiles in separate files, one file per tile. I'm not sure if this is a good idea (it feels wrong somehow), but it would mean that I wouldn't have to store any unneccessary tiles as "-1" in a text file and I would be able to just pick the right tile from the folder easily during run time (read the file named mapXYZ). If the tile is empty I would just be able to catch the FileNotFoundException. Could anyone tell me a reason for this being a bad solution? Other solutions I'v thought about would be to split the map into smaller parts or reading the map during startup in a BackgroundWorker.
Try making a much larger map in the same format as your current one first - it may be that the 200ms is mostly just overhead of opening and initial processing of the file.
If I'm understanding your proposed solution (opening one file per X,Y or X,Y,Z coordinate of a single map), this is a bad idea for two reasons:
There will be significant overhead to opening so many files.
Catching a FileNotFoundException and eating it will be significantly slower - there is actually a lot of overhead with catching exceptions, so you shouldn't rely on them to perform application logic.
Are you loading the file from a remote server? If so, that's why it's taking so long. Instead you should embed the file into the game. I'm saying this because you probably take 2-3 bytes per tile, so the file's about 30kb and 200ms sounds like a reasonable download time for that size of file (including overhead etc, and depending on your internet connection).
Regarding how to lower the filesize - there are two easy techniques I can think of that will decrease the filesize a bit:
1) If you have mostly empty squares and only some significant ones, your map is what is often referred to as 'sparse'. When storing a sparse array of data you can use a simple compression technique (formally known as 'run-length encoding') where each time you come accross empty squares, you specify how many of them there are. So for example instead of {0,0,0,0,0,0,0,0,0,0,1,1,2,3,0,0,0,0,0,0,0,0,0,0,0,0,1} you could store {10 0's, 1, 1, 2, 3, 12 0's, 1}
2) To save space, I recommend that you store everything as binary data. The exact setup of the file mainly depends on how many possible tile types there are, but this is a better solution than storing the ascii characters corresponding to the base-10 representation of the numers, separated by delimiters.
Example Binary Format
File is organized into segments which are 3 or 4 bytes long, as explained below.
First segment indicates the version of the game for which the map was created. 3 bytes long.
Segments 2, 3, and 4 indicate the dimensions of the map (x, y, z). 3 bytes long each.
The remaining segments all indicate either a tile number and is 3 bytes long with an MSB of 0. The exception to this follows.
If one of the tile segments is an empty tile, it is 4 bytes long with an MSB of 1, and indicates the number of empty tiles including that tile that follow.
The reason I suggest the MSB flag is so that you can distinguish between segments which are for tiles, and segments which indicate the number of empty tiles which follow that segment. For those segments I increase the length to 4 bytes (you might want to make it 5) so that you can store larger numbers of empty tiles per segment.

Resources