HDF5 input dataset usage in NVIDIA DIGITS

HDF5 input dataset usage in NVIDIA DIGITS - artificial-intelligence

I am new with using NVIDIA DIGITS. My train dataset has following structure and its format is .hdf5 .
crops Dataset {27482, 3, 128, 192}
labels Dataset {27482, 12}
mean Dataset {3, 128, 192}
pids Dataset {27482}
I know how to feed model with simpler formats like .txt or .jpg. My question is how can i feed my model with .hdf5 format in NVIDIA DIGITS

HDF5 datasets are only used for image classification datasets in DIGITS, and even then the support isn't very full-featured.
Why?
Caffe doesn't support HDF5 nearly as well as it supports LMDBs:
For large datasets, you have to break them up into separate files (see here)
Data is not prefetched - the whole dataset is read into memory at once (see here)
Data transformations are not supported with the HDF5Data layer (see here)
Since DIGITS is primarily Caffe-based for now, our main dataset format is LMDB. If/when we support more backend frameworks, we may decide to standardize on a more generic format like HDF5 or zipfiles.

Related

Can the .fbx file format or the .dwf file format hold Product Manufacturing Information (PMI) data?

Is there somewhere I can find information about these 3D file formats or other ones? I am particularly interested in seeing which file formats can hold PMI data?

Product Manufacturing Information (PMI) is usually defined by the following components:
GD&T (geometrical dimensioning and tolerancing).
3D annotations / symbols.
Captured views.
These might be also represented in two ways in format:
Graphical representation (how to actually display information in 3D viewer - usually in form of pre-tessellated geometry).
Semantic representation (just marks / annotates real geometry with extra information; it is up to the application how to use / display this information to user).
There are two key ISO standardized file formats that mentions PMI explicitly:
STEP (most noticeably AP242). Here is a live example of GD&T in STEP.
JT (mostly promoted by a single vendor).
Proprietary and native file formats (specific to one product) might also support PMI (like Catia V5 / SOLIDWORKS), but they are not expected to be used for data exchange. I don't think that FBX has anything directly related to PMI.
There are might be more formats supporting PMI (3D PDF?), but I haven't seen much information about that.

Finding a handwritten dataset with an already extracted features

I want to test my clustering algorithms on data of handwritten text, so I'm searching for a dataset of handwritten text (e.g. words) with already extracted features (the goal is to test my clustering algorithms on, not to extract features). Does anyone have any information on that ?
Thanks.

There is a dataset of images of handwritten digits : http://yann.lecun.com/exdb/mnist/ .

Texmex has 128d SIFT vectors
"to evaluate the quality of approximate
nearest neighbors search algorithm on different kinds of data and varying database sizes",
but I don't know what their images are of; you could try asking the authors.

Meaning of samples in Libsvm dataset format (Mnist in particular)

I downloaded Mnist dataset from Libsvm's dataset page.
All samples are like the following:
5 153:3 154:18 155:18 156:18 157:126 ...
Does anyone knows what that means? 5 is the class label, but what is 153:3 pair for example? Also I couldn't find the meaning from mnist's own web page.

This is the way libsvm encodes (sparse) vectors. As you said 5 is the label, and the following pairs i:v say that the i-th entry of the vector is v. So you would encode a 3-dim vector (a,b,c) as
1:a 2:b 3:c
Which is inefficient for dense vectors but a good and established format for sparse data. As it is plain text, the storage space is not optimal, but good enough for most applications. Whereas the files are easy to write and to read.

Most efficient way to store a big DNA sequence?

I want to pack a giant DNA sequence with an iOS app (about 3,000,000,000 base pairs). Each base pair can have a value A, C, T or G. Storing each base pair in one bytes would give a file of 3 GB, which is way too much. :)
Now I though of storing each base pair in two bits (four base pairs per octet), which gives a file of 750 MB. 750 MB is still way too much, even when compressed.
Are there any better file formats for efficiently storing giant base pairs on disk? In memory is not a problem as I read in chunks.

I think you'll have to use two bits per base pair, plus implement compression as described in this paper.
"DNA sequences... are not random; they contain
repeating sections, palindromes, and other features that
could be represented by fewer bits than is required to spell
out the complete sequence in binary...
With the proposed algorithm, sequence will be compressed by 75%
irrespective of the number of repeated or non-repeated
patterns within the sequence."
DNA Compression Using Hash Based Data Structure, International Journal of Information Technology and Knowledge Management
July-December 2010, Volume 2, No. 2, pp. 383-386.
Edit: There is a program called GenCompress which claims to compress DNA sequences efficiently:
http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/
Edit: See also this question on BioStar.

If you don't mind having a complex solution, take a look at this paper or this paper or even this one which is more detailed.
But I think you need to specify better what you're dealing with. Some specifics applications can lead do diferent storage. For example, the last paper I cited deals with lossy compression of DNA...

Base pairs always pair up, so you should only have to store one side of the strand. Now, I doubt that this works if there are certain mutations in the DNA (like a di-Thiamine bond) that cause the opposite strand to not be the exact opposite of the stored strand. Beyond that, I don't think you have many options other than to compress it somehow. But, then again, I'm not a bioinformatics guy, so there might be some pretty sophisticated ways to store a bunch of DNA in a small space. Another idea if it's an iOS app is just putting a reader on the device and reading the sequence from a web service.

Use a diff from a reference genome. From the size (3Gbp) that you post, it looks like you want to include a full human sequences. Since sequences don't differ too much from person to person, you should be able to compress massively by storing only a diff.
Could help a lot. Unless your goal is to store the reference sequence itself. Then you're stuck.

consider this, how many different combinations can you get? out of 4 (i think its about 16 )
actg = 1
atcg = 2
atgc = 3 and so on, so that
you can create an array like [1,2,3] then you can go one step further,
check if 1 is follow by 2, convert 12 to a, 13 = b and so on...
if I understand DNA a bit it means that you cannot get a certain value
as a must be match with c, and t with g or something like that which reduces your options,
so basically you can look for a sequence and give it a something you can also convert back...

You want to look into a 3d space-filling curve. A 3d sfc reduces the 3d complexity to a 1d complexity. It's a little bit like n octree or a r-tree. If you can store your full dna in a sfc you can look for similar tiles in the tree although a sfc is most likely to use with lossy compression. Maybe you can use a block-sorting algorithm like the bwt if you know the size of the tiles and then try an entropy compression like a huffman compression or a golomb code?

You can use the tools like MFCompress, Deliminate,Comrad.These tools provides entropy less than 2.That is for storing each symbol it will take less than 2 bits

Any strategies for assessing the trade-off between CPU loss and memory gain from compression of data held in a datastore model's TextProperty?

Are very large TextProperties a burden? Should they be compressed?
Say I have a information stored in 2 attributes of type TextProperty in my datastore entities.
The strings are always the same length of 65,000 characters and have lots of repeating integers, a sample appearing as follows:
entity.pixel_idx = 0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5....etc.
entity.pixel_color = 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,...etc.
So these above could also be represented using much less storage memory by compressing say using only each integer and the length of its series ( '0,8' for '0,0,0,0,0,0,0,0') but then its takes time and CPU to compress and decompress?
Any general ideas?
Are there some tricks for testing different attempts to the problem?

If all of your integers are single-digit numbers (as in your example), then you can reduce your storage space in half by simply omitting the commas.
The Short Answer
If you expect to have a lot of repetition, then compressing your data makes sense - your data is not so small (65K) and is highly repetitive => it will compress well. This will save you storage space and will reduce how long it takes to transfer the data back from the datastore when you query for it.
The Long Answer
I did a little testing starting with the short example string you provided and that same string repeated to 65000 characters (perhaps more repetitive than your actual data). This string compressed from 65K to a few hundred bytes; you may want to do some additional testing based on how well your data actually compresses.
Anyway, the test shows a significant savings when using compressed data versus uncompressed data (for just the above test where compression works really well!). In particular, for compressed data:
API time takes 10x less for a single entity (41ms versus 387ms on average)
Storage used is significantly less (so it doesn't look like GAE is doing any compression on your data).
Unexpectedly, CPU time is about 50% less (130ms versus 180ms when fetching 100 entities). I expected CPU time to be a little worse since the compressed data has to be uncompressed. There must be some other CPU work (like decoding the protocol buffer) which is even more CPU work for the much larger uncompressed data.
These differences mean wall clock time is also significantly faster for the compressed version (<100ms versus 426ms when fetching 100 entities).
To make it easier to take advantage of compression, I wrote a custom CompressedDataProperty which handles all of the compressing/decompressing business so you don't have to worry about it (I used it in the above tests too). You can get the source from the above link, but I've also included it here since I wrote it for this answer:
from google.appengine.ext import db
import zlib
class CompressedDataProperty(db.Property):
"""A property for storing compressed data or text.
Example usage:
>>> class CompressedDataModel(db.Model):
... ct = CompressedDataProperty()
You create a compressed data property, simply specifying the data or text:
>>> model = CompressedDataModel(ct='example uses text too short to compress well')
>>> model.ct
'example uses text too short to compress well'
>>> model.ct = 'green'
>>> model.ct
'green'
>>> model.put() # doctest: +ELLIPSIS
datastore_types.Key.from_path(u'CompressedDataModel', ...)
>>> model2 = CompressedDataModel.all().get()
>>> model2.ct
'green'
Compressed data is not indexed and therefore cannot be filtered on:
>>> CompressedDataModel.gql("WHERE v = :1", 'green').count()
0
"""
data_type = db.Blob
def __init__(self, level=6, *args, **kwargs):
"""Constructor.
Args:
level: Controls the level of zlib's compression (between 1 and 9).
"""
super(CompressedDataProperty, self).__init__(*args, **kwargs)
self.level = level
def get_value_for_datastore(self, model_instance):
value = self.__get__(model_instance, model_instance.__class__)
if value is not None:
return db.Blob(zlib.compress(value, self.level))
def make_value_from_datastore(self, value):
if value is not None:
return zlib.decompress(value)

I think this should be pretty easy to test. Just create 2 handlers, one that compresses the data, and one that doesn't, and record how much cpu each one uses (using the appstats package for whichever language you are developing with.) You should also create 2 entity types, one for the compressed data, one for the uncompressed.
Load in a few hundred thousand or a million entities (using the task queue perhaps). Then you can check the disk space usage in the administrator's console, and see how much each entity type uses. If the data is compressed internally by app engine, you shouldn't see much difference in the space used (unless their compression is significantly better than yours) If it is not compressed, there should be a stark difference.
Of course, you may want to hold off on this type of testing until you know for sure that these entities will account for a significant portion of your quota usage and/or your page load time.
Alternatively, you could wait for Nick or Alex to pop in and they could probably tell you whether the data is compressed in the datastore or not.