According to the specification, AllJoyn limits the array size to ALLJOYN_MAX_ARRAY_LEN = 2^17 bytes, whereas D-Bus has a greater limit of 2^26.
I have a case where I need to send images as an array of bytes ('ay' signature) that, even compressed, can be bigger than 2^17 bytes.
There are different ways to achieve this, for example splitting the array into chunks, adding an array ID and a chunk number to each piece, and rebuilding the original array on the other side.
I was wondering if anyone came into a similar problem and what approach would work best.
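For concreteness, here is a minimal sketch of the split-and-reassemble approach I have in mind (plain Python, not tied to any AllJoyn API; all names and sizes are made up for illustration):

CHUNK_SIZE = 2**17 - 1024  # stay under ALLJOYN_MAX_ARRAY_LEN, leaving headroom for metadata

def split_into_chunks(transfer_id, payload):
    # Yield (transfer_id, index, total, chunk) tuples small enough to send one by one.
    total = (len(payload) + CHUNK_SIZE - 1) // CHUNK_SIZE
    for index in range(total):
        yield transfer_id, index, total, payload[index * CHUNK_SIZE:(index + 1) * CHUNK_SIZE]

def reassemble(chunks):
    # Rebuild the original byte array, tolerating out-of-order arrival.
    ordered = sorted(chunks, key=lambda c: c[1])
    return b"".join(chunk for _, _, _, chunk in ordered)

image = bytes(1024 * 1024)  # stand-in for a compressed image
assert reassemble(split_into_chunks(7, image)) == image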
It looks like a Bulk Data Transfer interface has been specified and is being reviewed by the AllSeen Alliance interface review board. It does not look like it has been approved yet, and it does not look like anyone has implemented the interface at this time.
https://git.allseenalliance.org/gerrit/#/c/5977
I was trying to collect statistics of a 6D vector and plot a 1D histogram for each coordinate. I get 729000000 different copies of this vector (each 6-dimensional). For this I create an array of zeros of size 729000000x6 before I get any of the actual W's, and this seems to be a problem in MATLAB, since it says:
Error using zeros
Requested 729000000x6 (32.6GB) array exceeds maximum array size preference. Creation of arrays
greater than this limit may take a long time and cause MATLAB to become unresponsive. See array
size limit or preference panel for more information.
The reason I did this at first was because it was easy to fill W_history and then just feed it to the histogram plotter:
histogram(W_history(:,d),nbins,'Normalization','probability')
However, filling W_history seemed impossible for a high number of copies of W. Is there a way to do this in MATLAB automatically? It feels like there should be, and I didn't want to re-invent the wheel.
I am sure I could create, for each coordinate, an array of counters that counts how many times a specific value of that coordinate of W falls into each bin. However, implementing that, including the checks for which bin each value belongs to, seemed inefficient or even unnecessary. Is this really the only solution, or what do MATLAB experts recommend? Is this re-inventing the wheel? It also seems inefficient if I implement it myself.
I also thought I could manually have MATLAB write things to disk and bring them back as needed (store W_history on disk as it fills up, then somehow feed it to the histogram plotter), but that seemed like overkill. I hope I can avoid a solution like that; it feels wrong, since using MATLAB should be "easy" and high level, and going down to disk and memory management doesn't seem to be what MATLAB is intended for.
Currently, based on the comment that was given, the best solution I have so far uses histcounts as follows:
W_hist_counts = zeros(1, numel(edges) - 1);   % one running counter per bin
for i = 2:iter+1
    W = get_new_W(W);                                         % produce the next batch of values
    W_hist_counts_current = histcounts(W, edges);             % bin this batch only
    W_hist_counts = W_hist_counts + W_hist_counts_current;    % accumulate
end
However, after this it seems difficult to convert W_hist_counts to pdf/probability or other normalizations, since they apparently have to be computed manually. Is there no built-in way to do this processing without the user having to implement the normalizations again?
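For reference, here is a minimal sketch of doing those normalizations by hand (in NumPy rather than MATLAB, and with made-up data; counts and edges stand in for W_hist_counts and the fixed bin edges), in case there really is no built-in:

import numpy as np

# Stand-ins for the accumulated W_hist_counts and the fixed bin edges above.
rng = np.random.default_rng(0)
counts, edges = np.histogram(rng.normal(size=100_000), bins=50)

probability = counts / counts.sum()              # bar heights sum to 1 ('probability')
pdf = counts / (counts.sum() * np.diff(edges))   # area under the bars integrates to 1 ('pdf')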
I am using Redis to store some information and detect changes in that information over time (for example, think users and locations). What is the value of using a longer or shorter key name? Using a longer key is clearer, but is there much memory or performance cost to using longer key names?
Here are examples:
SET L:123456 "<name> <latitude> <longitude> ..."
HSET U:987654321 loc 123456 time <epoch>
or
SET loc:{123456} "<name> <latitude> <longitude> ..."
HSET user:{U987654321} loc 123456 time <epoch>
It all depends on how you are going to use it.
If every byte counts, for example when you have to pay for each kB transferred to a cloud service, you can calculate the costs. The maths is simple: a byte is a byte 'on the wire'. Inside Redis, for larger values it is equally simple; for smaller values, Redis does some memory optimization.
In your HSET example, you split out the members, which only makes sense if you need them separated from each other most of the time. A better approach might be: HSET user:data 987654321 '{"loc": "123456", "time": "2014-01-01T13:00:00"}'. Separate keys/members 'cost' a lot more than longer strings, performance-wise. You can even put a whole table or dataset in one member if it's only going to be used as one complete, semi-static entity.
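A rough sketch of that pattern with redis-py (a running local Redis instance is assumed; the IDs and values are just the ones from this thread):

import json
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# One hash holds all users; each field is a user ID whose value is a JSON blob.
# Far fewer keys/members than one entry per attribute.
r.hset("user:data", "987654321",
       json.dumps({"loc": "123456", "time": "2014-01-01T13:00:00"}))

user = json.loads(r.hget("user:data", "987654321"))
print(user["loc"])  # 123456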
Speed and Size: There is a notable difference between keys and values.
Keys:
Shorter is generally more memory efficient as well as speed efficient. If you use a Redis Sorted Set you can even use 'numbers' as keys (sorted set 'members' plus 'scores'). I say 'numbers' because a score is technically a float64, but to be used as an ID it has to be between -999999999999999 and 999999999999999 inclusive (that's 15 digits), without any fractional part. This can be really helpful, since Redis does fast and scalable O(log(n)) on-the-fly sorting of Sorted Sets (using skip lists, simplified).
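For example, a quick redis-py sketch (the key name, IDs, and scores are made up) using location IDs as members and an epoch timestamp as the score:

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Location IDs as members, a last-seen epoch as the score; Redis keeps the set
# ordered, so range queries stay O(log(n)).
r.zadd("LocationsBySeen", {"123456": 1400000000, "123457": 1400003600})
recent = r.zrangebyscore("LocationsBySeen", 1400000000, "+inf")
print(recent)  # [b'123456', b'123457']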
Values:
The MsgPack format (uncompressed) takes up the least space, especially if you store the definitions once and the values many times. JSON is a bit less memory efficient, but is of course such a common IPC format that it should not be left out. Raw strings, character-separated, fixed length (ugh), whatever your desire, it's possible to use. You can always compress your data before storing it in Redis. So much for memory efficiency. When it comes to speed, it's less simple. If you want to use Lua server-side scripting (which you should), you can't do anything with compressed data. JSON and MsgPack can be deserialized, but only 'as a whole', which is fine in most scenarios. Most flexible is storing separate values (for example as members of a HSET), but this comes at a price as well (most of the time: too high a price). You can also combine all of these. What we use most: a prefix of two or three delimiter-separated values, followed by a MsgPack payload.
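To get a feel for the size differences, a quick comparison sketch (the record is made up; msgpack is the third-party MessagePack package for Python):

import json
import zlib

import msgpack  # third-party: pip install msgpack

record = {"loc": "123456", "time": "2014-01-01T13:00:00", "name": "somewhere"}

as_json = json.dumps(record, separators=(",", ":")).encode()
as_msgpack = msgpack.packb(record)

# MsgPack is the smallest, JSON a bit larger; compression can shrink either,
# but compressed values can no longer be touched from Lua scripts.
print(len(as_msgpack), len(as_json), len(zlib.compress(as_json)))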
My general advice is: start with using only HSET's and ZSET's, don't split out data that belongs together, use descriptive PascalCased names for your keys between 10-25 chars, use ':' if you need delimiters in your keys (namespaces), serialize as JSON (for simplicity, but code for easy switching to MsgPack), use Lua scripting (even if you don't know Lua, the subset you use in Redis is tiny).
I wouldn't worry about it too much in the startup phase of your project, you can always change it later on and do some A/B comparisons as soon as you have some interpolatable data.
Hope this helps, TW
Now that Redis v3.2 is almost here, you should consider switching to the new geo hashing functionality: http://redis.io/commands/geoadd
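A quick sketch of the geo commands (coordinates are made up; this goes through redis-py's generic execute_command so it works regardless of the helper-method signatures in your redis-py version):

import redis

r = redis.Redis()  # assumes a Redis server >= 3.2 on localhost:6379

# GEOADD key longitude latitude member; GEODIST key member1 member2 [unit]
r.execute_command("GEOADD", "locations", -122.2711, 37.8044, "loc:123456")
r.execute_command("GEOADD", "locations", -122.4194, 37.7749, "loc:654321")
print(r.execute_command("GEODIST", "locations", "loc:123456", "loc:654321", "km"))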
I came across some existing code in our production environment given to us by our vendor. They use a string of comma-separated values to store filtered results from a DB. Keep in mind that this is for a proprietary scripting language called PowerOn that interfaces with a database residing on an AIX system, but it's a language that supports strings, integers, and arrays.
For example, we have:
Account
----------------
123
234
3456
28390
The pseudocode might look like:
Define accounts As String
For Each Account
    accounts = accounts + CharCast(Account) + ","
End
as opposed to something I would expect to see like
Define accounts As Integer Array(99)
Define index As Integer = 0
For Each Account
    accounts(index) = Account
    index = index + 1
End
By the time the loop is done, accounts will look like 123,234,3456,28390, (note the trailing comma). The string is later used to test whether a specific instance exists, like so:
If CharSearch("28390", accounts) > 0 Then Call DoSomething
In the example, the statement evaluates to true and DoSomething gets called. Given the option of arrays, why would one want to store integer values within a string of comma-separated values? In every language I've come across, it's almost always more expensive to perform string-based operations than integer-based operations.
Considering I haven't seen this technique before and my experience is somewhat limited, is there a name for this? Is this common practice, or is this just another example of being too stringly typed? To extend the existing code, should I continue using the string method? Did we get cruddy code from our vendor?
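For concreteness, here is the same comparison rendered in Python (not PowerOn); note that the plain substring test can match partial values unless the delimiters are included in the search:

account_numbers = [123, 234, 3456, 28390]

# Vendor-style: one comma-separated string, membership via substring search.
accounts_csv = "".join(str(a) + "," for a in account_numbers)  # "123,234,3456,28390,"
print("28390," in accounts_csv)         # True
print("839" in accounts_csv)            # also True -- a false positive from 28390
print(",839," in "," + accounts_csv)    # False -- searching with delimiters is exact

# Array-style: membership on the collection itself.
print(28390 in account_numbers)         # True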
What I put in the comment still holds, but my real answer is: it's probably a design decision with respect to compatibility/portability. In your integer-array case (at a low enough level of the API) you'd typically find yourself asking questions like: what's a safe guess for the size of an integer on today's machines? What about endianness?
The most portable and most flexible of all data formats always has been, and always will be, printed representation. It may not be as fast to process, but that's where adapters/converters kick in. I wouldn't be surprised to find a (human-readable) printed representation of data, especially in database APIs like the one you describe.
If you want something fast, just take whatever is given to you, convert it to a more efficient internal format, do your processing, and convert it back.
There's nothing inherently wrong with using comma-separated strings instead of arrays. Sure, you can't readily access a random n-th element of such a collection, but if such random access is not needed then there's no penalty for it, right?
As far as I know Oracle DB stores NUMBER values as strings (and if my memory is correct - for DATEs as well) for very practical reasons.
In your specific example it looks like using strings is overkill when passing data around without crossing process boundaries. But could it be that the choice of a string data type makes more sense when sending data over the wire or storing it on disk?
I've been looking around for the best way to store a large string value (like a blog post, a text description, etc.) with Fluent NHibernate, and the answer I keep seeing is to use nvarchar(MAX), which, if my reading is correct (it very often isn't), means a length of 4000+. So I have a field like so...
Map(x => x.Description)
    .Column("[description]")  // map to the [description] column
    .Length(4001)             // length above 4000, intended to map to nvarchar(max)
    .Access.Property()
    .Not.Nullable();
In theory, this should do it, right? I'm a little confused about this though. In school, we were taught pretty clearly that you want to make each column size as small as possible.
If I make that column max size, doesn't that go against that very principle and make the table very large and wasteful? Can anyone shed some very clear, foolproof light on this for me? I've been left with a lot of confusion over the whole ordeal.
Have a look at this, maybe it will help: http://www.crankingoutcode.com/post/Storing-large-strings-with-Fluent-NHibernate-(automapping).aspx
And the gotchas: http://ayende.com/blog/1969/nhibernate-and-large-text-fields-gotchas
Note that max means you can store up to 2^31-1 bytes of data; however, it only consumes space based on the actual length of the data.
I want to pack a giant DNA sequence into an iOS app (about 3,000,000,000 base pairs). Each base pair can have the value A, C, T or G. Storing each base pair in one byte would give a file of 3 GB, which is way too much. :)
Now I thought of storing each base pair in two bits (four base pairs per octet), which gives a file of 750 MB. 750 MB is still way too much, even when compressed.
Are there any better file formats for efficiently storing giant base pairs on disk? In memory is not a problem as I read in chunks.
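For reference, a minimal sketch (plain Python, purely illustrative) of the two-bits-per-base packing mentioned above:

# Two bits per base, four bases per byte; a trailing partial group is left-aligned.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {v: k for k, v in ENCODE.items()}

def pack(sequence):
    out = bytearray()
    for i in range(0, len(sequence), 4):
        group = sequence[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | ENCODE[base]
        out.append(byte << (2 * (4 - len(group))))
    return bytes(out)

def unpack(packed, length):
    bases = []
    for byte in packed:
        bases.extend(DECODE[(byte >> shift) & 0b11] for shift in (6, 4, 2, 0))
    return "".join(bases[:length])

seq = "ACGTACGTAC"
assert unpack(pack(seq), len(seq)) == seq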
I think you'll have to use two bits per base pair, plus implement compression as described in this paper.
"DNA sequences... are not random; they contain
repeating sections, palindromes, and other features that
could be represented by fewer bits than is required to spell
out the complete sequence in binary...
With the proposed algorithm, sequence will be compressed by 75%
irrespective of the number of repeated or non-repeated
patterns within the sequence."
DNA Compression Using Hash Based Data Structure, International Journal of Information Technology and Knowledge Management
July-December 2010, Volume 2, No. 2, pp. 383-386.
Edit: There is a program called GenCompress which claims to compress DNA sequences efficiently:
http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/
Edit: See also this question on BioStar.
If you don't mind having a complex solution, take a look at this paper or this paper or even this one which is more detailed.
But I think you need to specify better what you're dealing with. Some specific applications can lead to different storage; for example, the last paper I cited deals with lossy compression of DNA...
Base pairs always pair up, so you should only have to store one side of the strand. Now, I doubt that this works if there are certain mutations in the DNA (like a thymine dimer) that cause the opposite strand to not be the exact complement of the stored strand. Beyond that, I don't think you have many options other than to compress it somehow. But, then again, I'm not a bioinformatics guy, so there might be some pretty sophisticated ways to store a bunch of DNA in a small space. Another idea, if it's an iOS app, is just putting a reader on the device and reading the sequence from a web service.
Use a diff from a reference genome. From the size (3 Gbp) that you post, it looks like you want to include a full human sequence. Since sequences don't differ much from person to person, you should be able to compress massively by storing only a diff.
That could help a lot, unless your goal is to store the reference sequence itself; then you're stuck.
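A toy illustration of the idea in Python (same-length sequences and substitutions only; real tools also handle insertions and deletions):

def diff_against_reference(reference, sample):
    # Record only the positions where the sample differs from the reference.
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

def apply_diff(reference, edits):
    genome = list(reference)
    for position, base in edits:
        genome[position] = base
    return "".join(genome)

reference = "ACGTACGTACGT"
sample    = "ACGTTCGTACGA"
edits = diff_against_reference(reference, sample)  # [(4, 'T'), (11, 'A')]
assert apply_diff(reference, edits) == sample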
Consider this: how many different combinations can you get out of the four letters? (I think it's about 16.) You could map each combination to a symbol, e.g.
actg = 1
atcg = 2
atgc = 3, and so on,
so that you can turn the sequence into an array like [1, 2, 3]. Then you can go one step further: check whether 1 is followed by 2 and convert "12" to a, "13" to b, and so on.
If I understand DNA a bit, you also cannot get every value, since A must be matched with T and C with G on the opposite strand, which reduces your options.
So basically you can look for a sequence, give it a code, and convert it back later...
You want to look into a 3D space-filling curve. A 3D SFC reduces 3D complexity to 1D complexity; it's a little bit like an octree or an R-tree. If you can store your full DNA in an SFC, you can look for similar tiles in the tree, although an SFC is most likely to be used with lossy compression. Maybe you can use a block-sorting algorithm like the BWT if you know the size of the tiles, and then try an entropy coder such as Huffman coding or a Golomb code?
You can use tools like MFCompress, Deliminate, and Comrad. These tools achieve an entropy of less than 2, that is, they take less than 2 bits to store each symbol.