I'm trying to understand the internals of Redis. It uses a simple implementation of a dictionary as its in-memory data store. Moreover, data transferred from the client to the server is serialized using its own RESP protocol.
What I haven't figured out so far is how the data is stored in Redis. Does it store the corresponding RESP value as a simple dynamic string (sds), or does it first parse the value from RESP, e.g. as an integer, and store it as an int (possibly from the shared integers array), which then becomes an sds again? I'm curious because in dict.c, e.g. in int dictAdd(dict *d, void *key, void *val){...}, data is passed as void *, which could indicate that data is stored as a string, an int or anything else, but tracing it down I didn't find any piece of code converting sds into objects.
But if it stores the data as sds, how does it store lists and sets?
Each data type in Redis has its own encoding, and most of them have several encodings for different scenarios. Even sds strings (and yes, string keys are usually sds strings) can have multiple encodings.
Sets, sorted sets, lists and hashes use a compact "ziplist" encoding in memory when they are small, but move to a memory-wasteful yet faster encoding when they grow.
The most complex object is the sorted set, which is a combination of a skiplist and a hash table. And the new streams object also has a very interesting representation.
In RDB though, they get serialized into a compact representation and not kept as they are in memory.
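As a rough illustration of the "several encodings" idea for strings, here is a simplified Python model of how a string value might be classified. This is not Redis code: the function name choose_string_encoding and the 44-byte cutoff are assumptions made for the sketch (the cutoff has differed between Redis versions), and the real logic in the C source is more involved.

```python
def choose_string_encoding(value: bytes, embstr_limit: int = 44) -> str:
    """Toy model of picking an encoding for a string value.

    Not Redis code: the names and the embstr_limit default are assumptions.
    """
    # Values that round-trip as a signed 64-bit integer can be kept as an int.
    try:
        text = value.decode("ascii")
        number = int(text)
        if str(number) == text and -(2**63) <= number < 2**63:
            return "int"
    except (UnicodeDecodeError, ValueError):
        pass
    # Short strings can share one allocation with the object header.
    if len(value) <= embstr_limit:
        return "embstr"
    # Everything else is a separately allocated dynamic string.
    return "raw"
```

On a live server, the OBJECT ENCODING command reports which encoding a given key actually uses.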
I want to store byte arrays (less than 1 MB) as a field value. I know about ByteArrayDocument and storing binary data as an independent non-JSON object.
To store a field as a byte array, do I just use com.couchbase.client.core.utils.Base64 to build a string value?
Or is some other approach recommended?
If you want to store it as an attribute in your JSON document, base64 would be the right approach.
However, unless your document contains only metadata about the file itself, I don't recommend using this strategy. Documents are automatically cached, and if your document is big, the cache memory will be filled quite easily.
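The base64 approach can be sketched as follows. This is a minimal Python illustration using only the standard library, not the Couchbase SDK; the field names "name" and "data" and both helper functions are hypothetical.

```python
import base64
import json

def encode_document(name: str, payload: bytes) -> str:
    """Build a JSON document with the bytes stored as a base64 string."""
    doc = {
        "name": name,
        # base64 turns arbitrary bytes into ASCII text that JSON can carry.
        "data": base64.b64encode(payload).decode("ascii"),
    }
    return json.dumps(doc)

def decode_payload(document: str) -> bytes:
    """Recover the original bytes from the JSON document."""
    return base64.b64decode(json.loads(document)["data"])
```

Keep in mind that base64 grows the payload by roughly a third, which adds to the caching concern mentioned above.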
Well, somehow even after reading in a lot of textbooks (really a lot) and on the Internet for a long while I still can’t completely comprehend what the difference between the two mentioned things is.
To simplify the question, according to let’s say Wikipedia a data type is:
a classification identifying one of various types of data, such as real, integer or Boolean, that determines the possible values for that type; the operations that can be done on values of that type; the meaning of the data; and the way values of that type can be stored.
with it being mostly the implementation of some Abstract data type like the real numbers or the whole numbers.
All good, then comes the data structure:
is a particular way of organizing data in a computer so that it can be used efficiently. Data structures can implement one or more particular abstract data types, which are the means of specifying the contract of operations and their complexity. In comparison, a data structure is a concrete implementation of the contract provided by an ADT.
So a data structure is an implementation of an ADT like say a stack or queue.
But wouldn’t that make it a data type as well??
All that I can truly see is that a data type can range from really simple things without any structural organization to sophisticated structures of data. What really counts is that they are an implementation of an ADT mirroring its important aspects, and that they can be envisioned as a single entity (like a list or tree). Data structures, however, must contain at least some sort of logical or mathematical organization to classify as a data structure. Sadly, this difference would make many entities both a data structure and a data type simultaneously.
So what is the solid difference between a simple plain (data type) and a (data structure)?
I would gladly accept an answer that names a specific book which goes deep enough to explain all these matters. I would also appreciate recommendations for good books about data structures in C.
In C, a data type is a language-level construct. There are a limited number of predefined types (int, char, double, etc.), and a practically unlimited number of derived types (array types, structure types, union types, function types, pointer types, atomic types (the latter are new in C11)).
Any type may be given a simple name via a typedef declaration. For any type other than a function type or an incomplete type, you can have objects of that type; each object occupies a contiguous region of memory.
The types that can exist in C are completely described in section 6.2.5 of the C standard; see, for example, the N1570 draft.
A data structure, on the other hand, is a construct defined by your own code. The language does not define the concept of a linked list, or a binary tree, or a hash table, but you can implement such a data structure, usually by building it on top of derived data types. Typically there is no such thing as an object that's a linked list. An instance of a linked list data structure consists of a collection of related objects, and only the logic of your code turns that collection into a coherent entity. But you'll typically have an object of some data type that your program uses to refer to the linked list data structure, perhaps a structure or a pointer to a structure.
You'll typically have a set of functions that operate on instances of a data structure. Whether those functions are part of the data structure is a difficult question that I won't try to answer here.
An array, for example, can be considered both a data type and a data structure; more precisely, you can think of it as a data structure implemented using the existing array type.
Referring to C99 and later:
There are two kinds of data types:
intrinsic: char, int, float, double, _Complex, _Bool, void (for some of them there are variations such as long and unsigned)
derived: arrays, structures, unions, pointers, functions
The latter are built from the former and/or the latter.
So to answer your question:
So what is the solid difference between a simple plain (data type) and a (data structure)?
A "data structure [type]" is derived from "simple plain data type"(s) and/or other "data structure [type]"(s).
A data type specifies the values and operations allowed for a single expression or object; a data structure is a region of storage and algorithms that organize objects in that storage.
An example of a data type is int; objects of this type may store whole number values in at least the range [-32767, 32767], and the usual arithmetic operations may be performed on these objects (although the result of integer division is also an integer, which trips people up the first time around). You can't use the subscript operator [] on an int, nor may you use the function call () operator on an int object.
For an example of a data structure, we can look at a simple stack. We'll use an array as our region of storage. We'll define an additional integer item to serve as a stack pointer - it will contain the index of the element most recently added to the array. We define two algorithms - push and pop - that will add and remove items to the stack in a specific order.
push: if sp is less than stack size then
        add 1 to sp
        write input to array[sp]
      else
        stack overflow
      end if
pop:  if sp is greater than 0 then
        get value from array[sp]
        subtract 1 from sp
        return value
      else
        stack underflow
      end if
Our stack data structure stores a number of objects of some data type such that the last item added is always the first item removed, a.k.a. a last-in first-out (LIFO) queue. If we push the values 1, 2, and 3 onto the stack, they will be popped off in the order 3, 2, and 1.
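The push and pop algorithms above can be sketched in Python as follows. This is only an illustration: the class name ArrayStack is made up, and unlike the pseudocode it uses 0-based indexing, so sp starts at -1 for an empty stack.

```python
class ArrayStack:
    """Fixed-capacity stack backed by a preallocated list (0-based)."""

    def __init__(self, capacity: int):
        self.array = [None] * capacity
        self.sp = -1  # index of the most recently pushed item; -1 means empty

    def push(self, value):
        if self.sp + 1 >= len(self.array):
            raise OverflowError("stack overflow")
        self.sp += 1
        self.array[self.sp] = value

    def pop(self):
        if self.sp < 0:
            raise IndexError("stack underflow")
        value = self.array[self.sp]
        self.sp -= 1
        return value
```

Pushing 1, 2 and 3 and then popping three times yields 3, 2 and 1, matching the LIFO behaviour described above.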
Note that it's the algorithms that distinguish data structures from each other, not the type of storage. If your algorithms add an item to one end of the array and pull them from the other end, you have a first-in first-out (FIFO) queue. If your algorithms add items to the array such that for each element i in the array a[i] >= a[2*i] and a[i] >= a[2*i+1] are both true, you have a heap.
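To make the heap condition concrete, here is a small Python check of exactly that property. The function name is_max_heap is hypothetical, and slot 0 of the list is left unused so that the indices line up with the 1-based formula in the text.

```python
def is_max_heap(a):
    """Check a[i] >= a[2*i] and a[i] >= a[2*i+1] with 1-based indices.

    a[0] is an unused placeholder so the indices match the text.
    """
    n = len(a) - 1  # number of real elements
    for i in range(1, n + 1):
        for child in (2 * i, 2 * i + 1):
            # Only compare against children that actually exist.
            if child <= n and a[i] < a[child]:
                return False
    return True
```

The storage is still a plain array; only the ordering rule the algorithms maintain makes it a heap.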
Basically,
A data type defines a certain domain of values and the operations allowed on those values. The basic data types defined by the language are called primitive data types.
A data structure is rather a user-defined data type and is a systematic way to organize data so that it can be used efficiently. Its operations and values are not specified by the language itself, but by the user.
The book to learn more about this is "The C Programming Language" by Brian Kernighan and Dennis Ritchie.
A data type represents the type of data that is going to be stored in a variable. It specifies that a variable can only be assigned values of a specific type.
A data structure is a collection that holds various forms of data.
I'm writing a Dart library in which I'm very regularly dealing with byte arrays or byte strings. Since Dart has neither a byte type nor an array type, I'm using List<int> for all byte arrays.
Is this good practice? I only recently found out about the existence of Uint8List in the dart:typed_data package. It's clear that this class aims to be the go-to implementation for byte arrays.
But does it have any direct advantages?
I can imagine that it always performs checks on new items so that the user can be sure no non-byte-value integers are inside the list. But are there other advantages or differences?
There is also a class named ByteArray, but it seems to be a quite inefficient alternative to List<int>...
The advantage should be that a Uint8List consumes less memory than a normal List, because it is known from the beginning that each element's size is a single byte.
Uint8List can also be mapped directly to underlying optimized Uint8List types (e.g. in JavaScript).
Copies of list slices are also easier to perform, because all bytes are laid out contiguously in memory and the slice can therefore be copied in a single operation to another Uint8List (or equivalent) type.
However, whether this advantage is fully realized depends on how good the implementation of Uint8List in Dart is.
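The same trade-off exists in other languages. As an analogy only, Python's standard array module offers a typed byte buffer next to the general-purpose list. The sketch below is Python, not Dart, and the helper try_append is hypothetical; Dart's Uint8List may handle out-of-range values differently (for example by truncating rather than raising).

```python
from array import array

typed = array("B", [0, 127, 255])   # unsigned bytes, one byte per element
plain = [0, 127, 255]               # general list of object references

def try_append(buf, value):
    """Return True if the value was accepted by the typed buffer."""
    try:
        buf.append(value)
        return True
    except OverflowError:
        # array("B") rejects integers outside the range 0..255.
        return False
```

Here typed.itemsize is 1, whereas each slot in the plain list holds a reference to a separate integer object.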
John McCutchan of the Dart team explains that the Dart VM relies on three different integer representations. Rather like the Three Musketeers, there is the small machine integer (smi), the medium integer (mint) and the big heavy integer (bigint). The VM automatically switches between the three depending on the size of the integer in play.
Within the smi range, which depends on the CPU architecture, integers fit in a register and can therefore be loaded and stored directly in the field instead of being fetched from memory. They also never require memory allocation. This leads to the performance side of the story: within the smi range, storing integers in object lists is faster than putting them in a typed list.
Typed lists have to tag and untag values, the VM operations that box and unbox smi values without allocating memory or loading the value from an object. The leaner, the better.
On the other hand, typed lists have two big advantages to consider. Garbage-collection overhead is very low, because typed lists can never store object references, only numbers. Typed lists can also be much denser, so an Int8List requires much less memory and makes better use of the CPU's cache. The smi range principle also applies to typed lists, so working with numbers within that range gives the best performance.
All in all, the takeaway is that we need to benchmark each approach to find which works best in a given situation.
My model has different entities that I'd like to calculate once, like the employees of a company. To avoid making the same query again and again, the calculated list is saved in Memcache (duration = 1 day). The problem is that the app sometimes gives me an error saying that more bytes are being stored in Memcache than is permissible:
Values may not be more than 1000000 bytes in length; received 1071339 bytes
Is storing a list of objects something that you should be doing with Memcache? If so, what are best practices in avoiding the error above? I'm currently pulling 1000 objects. Do you limit values to < 200? Checking for an object's size in memory doesn't seem like too good an idea because they're probably being processed (serialized or something like that) before going into Memcache.
David, you don't say which language you use, but in Python you can do the same thing as Ibrahim suggests using pickle. All you need to do is write two little helper functions that read and write a large object to memcache. Here's an (untested) sketch:
import pickle
from google.appengine.api import memcache  # assumes the App Engine memcache API

def store(key, value, chunksize=950000):
    # Pickle the value and split it into chunks below the 1 MB limit.
    serialized = pickle.dumps(value, 2)
    values = {}
    for i in range(0, len(serialized), chunksize):
        values['%s.%s' % (key, i // chunksize)] = serialized[i:i + chunksize]
    return memcache.set_multi(values)

def retrieve(key):
    # Fetch up to 32 chunks and join them in numeric (not lexicographic) order.
    keys = ['%s.%s' % (key, i) for i in range(32)]
    result = memcache.get_multi(keys)
    serialized = b''.join(result[k] for k in keys if result.get(k) is not None)
    return pickle.loads(serialized)
I frequently store objects several megabytes in size in memcache. I cannot comment on whether this is good practice or not, but my opinion is that sometimes we simply need a relatively fast way to transfer megabytes of data between our App Engine instances.
Since I am using Java, I serialize my raw objects with Java's serializer, producing a serialized array of bytes. Since the size of the serialized object is then known, I can cut it into byte-array chunks of 800 KB. I then encapsulate each byte array in a container object and store that object instead of the raw objects.
Each container object can have a pointer to the next memcache key from which the next byte-array chunk can be fetched, or null if there are no more chunks to fetch (i.e. just like a linked list). I then re-merge the chunks into one large byte array and deserialize it using Java's deserializer.
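The linked-list chunking idea can be sketched in Python as follows. This is an illustration, not the Java code described above: a plain dict stands in for memcache, pickle stands in for Java serialization, and the function names and key scheme are made up.

```python
import pickle

def store_linked(cache: dict, key: str, value, chunk_size: int = 800_000) -> None:
    """Store value as a chain of (chunk, next_key) entries in cache."""
    data = pickle.dumps(value)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for n, chunk in enumerate(chunks):
        entry_key = key if n == 0 else "%s.%d" % (key, n)
        is_last = n == len(chunks) - 1
        # The last entry points to None, like the tail of a linked list.
        next_key = None if is_last else "%s.%d" % (key, n + 1)
        cache[entry_key] = (chunk, next_key)

def load_linked(cache: dict, key: str):
    """Follow the next_key pointers and rebuild the original value."""
    parts = []
    while key is not None:
        chunk, key = cache[key]
        parts.append(chunk)
    return pickle.loads(b"".join(parts))
```

With a real cache any entry may be evicted at any time, so a missing chunk anywhere in the chain has to be treated as a cache miss for the whole object.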
Do you always need to access all the data which you store? If not then you will benefit from partitioning the dataset and accessing only the part of data you need.
If you display a list of 1000 employees you probably are going to paginate it. If you paginate then you definitely can partition.
You can make two lists from your dataset: a lighter one with just the most essential information, which can fit into 1 MB, and another list with the full information, divided into several parts. On the light list you can apply the most essential operations, for example filtering by employee name or pagination. Then, when you need the heavy dataset, you can load only the parts you really need.
But these suggestions take time to implement. If you can live with your current design, then just divide your list into lumps of ~300 items, or whatever number is safe, then load them all and merge.
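Dividing the list into lumps and merging them again can be sketched like this. The function names are hypothetical and the default of 300 simply follows the number suggested above; each lump would then be stored under its own cache key.

```python
def split_into_lumps(items, lump_size=300):
    """Split a list into consecutive sublists of at most lump_size items."""
    return [items[i:i + lump_size] for i in range(0, len(items), lump_size)]

def merge_lumps(lumps):
    """Concatenate the lumps back into one list, preserving order."""
    merged = []
    for lump in lumps:
        merged.extend(lump)
    return merged
```

The safe lump size depends on the serialized size of each object, so it is worth measuring with real data rather than relying on a fixed item count.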
If you know how large the objects will be, you can use the memcached option that allows larger objects:
memcached -I 10m
This will allow objects up to 10MB.
I want to store data in C in tabular format. I am having difficulty in relating the following. Can someone help?
For example:
I want to store the following entries. What would be the ideal way of storing them in C?
    IP Address   Domain Name
1.) 10.1.1.2     www.yahoo.com
2.) 20.1.1.3     www.google.com
Should I use structures? Say, for example:
struct table
{
    unsigned char ip_address;
    char domain_name[20];
};
If not, please clarify?
You are probably mixing two different questions:
How to organize data in your program (in memory): this is the part about using structures.
How to serialize data, that is, to store it in external storage, e.g. in a file: this is the part about the "tabular" format, which implies text with fields delimited by tabs.
If IP and domain often come together in your program, then it is reasonable to use a structure (or a class in C++) for that. Regarding your example, I do not know the restrictions on domain name length, but "20" would definitely be insufficient. I'd suggest using dynamically allocated strings here. For storing an IP (v4) address you may use a 32-bit unsigned int; a char is insufficient. Do you intend to support IPv6 as well? Then you need 128 bits for the address.
In C (and C++) there is no built-in serialization facility like the one in virtually every dynamic (or "managed") language such as C#, Java or Python. So by defining a structure you do not automatically get methods for writing/reading your data. You should either use some library for serialization or write your own code for reading/writing your data.
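To see why 32 bits are enough for an IPv4 address, here is a short Python sketch that converts between the dotted text form and the integer form using the standard ipaddress module. The two helper names are made up; in C the inet_aton() family serves a similar purpose.

```python
import ipaddress

def ipv4_to_u32(text: str) -> int:
    """Convert dotted-quad text to its 32-bit unsigned integer value."""
    return int(ipaddress.IPv4Address(text))

def u32_to_ipv4(value: int) -> str:
    """Convert a 32-bit unsigned integer back to dotted-quad text."""
    return str(ipaddress.IPv4Address(value))
```

For example, 10.1.1.2 corresponds to 10 * 2^24 + 1 * 2^16 + 1 * 2^8 + 2 = 167837954, which fits comfortably in an unsigned 32-bit integer.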
The method of storage depends at least partially on what you're going to do with the information. If it's simply to read it in and then print it out again, you could process it strictly as text.
However, network programs often make use of this type of data. See the structures in the system header files netinet/in.h, arpa/inet.h and sys/socket.h, or see the man page for inet_aton().
Structures are the way to go. Use sufficiently sized arrays. IPv4 addresses in dotted text form take 16 chars (including the terminating null character) and domain names take a maximum of 255 chars.
struct table
{
    char ip_addr[16];
    char domain_name[255];
};
Unfortunately I cannot make comments. But with respect to Amarghosh's answer, this problem would be perfectly solved using fixed-length arrays for the fields, since both sets of data (if the domain is top-level only) are of limited length: 15 characters for the IP address (assuming IPv4), and there is a 63-character ASCII limit per label for domain names.
There are two issues in representing tabular data:
1. Representing a row
2. Representing many rows.
In your example, the row can be represented by:
struct Table_Record
{
    unsigned char ip_address[4];
    char domain_name[MAX_DOMAIN_LENGTH];
};
I've decided to use a fixed field length for the domain name. This will make processing simpler.
The next question is how to structure the rows. This is a decision you will have to make. The simplest suggestion is to use an array. However, an array has a fixed size and needs to be reallocated if there are more entries than the array can hold:
struct Table_Record table[MAX_ROWS];
Another data structure for the table is a linked list (single or double, your choice). Unfortunately, the C language does not provide a list data structure, so you will have to either write your own or obtain a library.
Alternative useful data structures are maps (associative arrays) and trees (although many maps are implemented using trees). A map would allow you to retrieve the value for a given key. If the key is the IP address, the map would return the domain name.
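The row-plus-map idea can be sketched as follows. Python is used here only to keep the example short; in C the same roles would be played by an array of structs plus a hash table or tree that you write yourself or take from a library. The names TableRecord, by_ip and lookup_domain are hypothetical.

```python
from typing import Dict, NamedTuple, Optional

class TableRecord(NamedTuple):
    ip_address: str
    domain_name: str

rows = [
    TableRecord("10.1.1.2", "www.yahoo.com"),
    TableRecord("20.1.1.3", "www.google.com"),
]

# Index the rows by IP address so a lookup does not scan every row.
by_ip: Dict[str, TableRecord] = {row.ip_address: row for row in rows}

def lookup_domain(ip: str) -> Optional[str]:
    """Return the domain name for ip, or None if it is not in the table."""
    record = by_ip.get(ip)
    return record.domain_name if record else None
```

The list keeps the rows in insertion order, while the map gives direct access by key, so the two structures complement each other.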
If you are going to read and write this data using files, I suggest using a database rather than writing your own. Many people recommend SQLite.