Let's say we have a lot of instances of a class like this (millions of them):
class WordInfo
{
    public string Value;
    public string SomeOtherFeatures;
    public List<Point> Points;
}
And the following code:
private Dictionary<string, WordInfo> _dict;

public void ProcessData(IEnumerable<Tuple<string, int, int>> words)
{
    foreach (var word in words)
    {
        if (_dict.ContainsKey(word.Item1))
        {
            _dict[word.Item1].Points.Add(new Point(word.Item2, word.Item3));
        }
        else
        {
            _dict.Add(word.Item1, new WordInfo(....));
        }
    }
}
Main()
{
    while (true)
    {
        IEnumerable<Tuple<string, int, int>> data = GetDataSomewhere();
        ProcessData(data);
    }
}
As you can see, this code must work 24/7. The main problem is that I don't know how to represent _dict (the place where I store the information) in a database. I need to process 1000-5000 words per second. A relational DB is not good for my task, right? What about NoSQL? I need fast UPDATE and INSERT operations, and I also need a fast check (SELECT) for whether a word already exists. Since I have millions of records, that is not trivial either. What can you suggest? Maybe I should write my own custom solution based on files?
A relational database should be able to insert/update 1000-5000 words per second easily, assuming you don't create too many transactions.
Transactions are ACID and "D" means durable: when the client receives a notification that the transaction is committed, it is guaranteed that the effects of the transaction are already in the permanent storage (so even if a power cut happens at that exact moment, the transaction won't be "erased"). In practice, this means the DBMS must wait for the disk to finish the physical write.
If you wrap each and every insert/update in its own transaction, you'll also have to perform this wait for each and every one of them. OTOH, if you wrap many inserts/updates in a single transaction, you'll have to pay this price only once per whole "chunk".
Also, checking for the existence of a specific row within millions of others is a task databases are very good at, thanks to the power of B-Tree indexes.
As for the database structure, you'd need something similar to this: a WORD table whose primary key is WORD_VALUE (and which also carries SOME_OTHER_FEATURES), plus a child POINT table whose primary key is (WORD_VALUE, X, Y), where WORD_VALUE also references WORD.
And you'd process it like this (pseudocode):
BEGIN TRANSACTION;
foreach(var word in words)
{
try {
INSERT INTO WORD (WORD_VALUE, SOME_OTHER_FEATURES) VALUES (word.Item1, ...);
}
catch (PK violation) {
// Ignore it.
}
try {
INSERT INTO POINT (WORD_VALUE, X, Y) VALUES (word.Item1, word.Item2, word.Item3);
}
catch (PK violation) {
// Ignore it.
}
}
COMMIT;
(NOTE: I'm assuming you never update the SOME_OTHER_FEATURES after it has been initially inserted. If you do, the logic above would be more complicated.)
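For illustration only, here is roughly what that chunked loop could look like as real code, using Go's database/sql package. This is a sketch under assumptions I'm adding myself: the table and column names follow the pseudocode above, the $1-style placeholders and the ON CONFLICT DO NOTHING clause assume a PostgreSQL-flavored DBMS, and the Word struct is just a stand-in for your Tuple<string,int,int> plus the word's features. On a DBMS without ON CONFLICT you'd catch the PK violation per statement, exactly as in the pseudocode.

package wordstore

import "database/sql"

// Word is a stand-in for one incoming tuple (value, x, y) plus its features.
type Word struct {
    Value    string
    Features string
    X, Y     int
}

// InsertChunk writes one chunk of words inside a single transaction,
// so the durability wait is paid once per chunk rather than once per row.
func InsertChunk(db *sql.DB, words []Word) error {
    tx, err := db.Begin()
    if err != nil {
        return err
    }
    defer tx.Rollback() // harmless no-op once Commit has succeeded

    for _, w := range words {
        // Insert the word if it isn't there yet; ignore it otherwise.
        if _, err := tx.Exec(
            `INSERT INTO word (word_value, some_other_features)
             VALUES ($1, $2) ON CONFLICT DO NOTHING`,
            w.Value, w.Features); err != nil {
            return err
        }
        // Add the point for this occurrence of the word.
        if _, err := tx.Exec(
            `INSERT INTO point (word_value, x, y)
             VALUES ($1, $2, $3) ON CONFLICT DO NOTHING`,
            w.Value, w.X, w.Y); err != nil {
            return err
        }
    }
    return tx.Commit()
}

The ON CONFLICT DO NOTHING clause keeps the "insert and ignore PK violations" idea from the pseudocode without needing a try/catch per row.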
If your DBMS supports it, consider making both of these tables clustered (aka. index-organized). Also, if your DBMS supports it, compress the leading edge of the POINT's primary index (WORD_VALUE), since all points related to the same word contain the same value there.
BTW, the model above uses so-called identifying relationships and natural keys. An alternate model that uses surrogate keys and non-identifying relationships is possible, but would complicate the kind of processing you need.
I am building a distributed KV store, just to learn a bit more about distributed systems and concurrency. The KV store I am building is completely transactional, with an in-memory transaction log. The storage is also completely in-memory, just for simplicity. The API exposes GET, INSERT, UPDATE, and REMOVE. Note that all endpoints operate on a single key, not a range of keys.
I am managing concurrency via locks. However, I have a single global lock that locks the entire datastore. This sounds terribly inefficient: if I want to read the value for K1 while I update K2, I must wait for the K2 update to finish even though the two keys are unrelated.
I know that there are DBs that use more granular locking; for example, MySQL has row-level locks. How can key-level locks be implemented?
I have
type Storage struct {
    store map[string]int32
}
Should I add something like this?
type Storage struct {
    store map[string]int32
    locks map[string]*sync.Mutex
}
The issue if I do this is that the locks map has to be kept in sync with store. Another option would be to combine the two maps, but even then I run into the problem of deleting a map entry while its lock is still held, e.g. when a REMOVE request comes in before a GET.
Conceptual part
Transactions
First, transaction logs are not needed for strong consistency. Transaction logs are useful for upholding ACID properties.
Transactions are also not strictly required for strong consistency in a database, but they can be a useful tool for ensuring consistency in many situations.
Strong consistency refers to the property that ensures that all reads of a database will return the most recent write, regardless of where the read operation is performed. In other words, strong consistency guarantees that all clients will see the same data, and that the data will be up-to-date and consistent across the entire system.
You can use a consensus algorithm, such as Paxos or Raft, to ensure strong consistency. When storing data, you can store each value together with a version and use that as the ID in Paxos.
Locking in KV Stores
In a key-value (KV) store, keys are typically locked using some kind of locking mechanism, such as a mutex or a reader-writer lock (as suggested by @paulsm4). This allows multiple threads or processes to access and modify the data in the KV store concurrently, while still ensuring that the data remains consistent and correct.
For example, when a thread or process wants to read or modify a particular key in the KV store, it can acquire a lock for that key. This prevents other threads or processes from concurrently modifying the same key, which can lead to race conditions and other problems. Once the thread or process has finished reading or modifying the key, it can release the lock, allowing other threads or processes to access the key.
The specific details of how keys are locked in a KV store can vary depending on the implementation of the KV store. Some KV stores may use a global lock (as you were already doing, which is sometimes inefficient) that locks the entire data store, while others may use more granular locking mechanisms, such as row-level or key-level locks, to allow more concurrent access to the data.
So, tl;dr: conceptually you were right. The devil is in the details of how the locking is implemented.
Coding
To strictly answer the question about locking, one can consider readers-writer locks, as suggested by @paulsm4. In Go, this is sync.RWMutex. The standard library also provides sync.Map, a concurrent map that takes care of per-key synchronization internally.
Here is a short example:
package storage

import (
    "fmt"
    "sync"
)

type Storage struct {
    store sync.Map // a concurrent map; it synchronizes access to each key internally
}

// GET retrieves the value for the given key.
func (s *Storage) GET(key string) (int32, error) {
    v, ok := s.store.Load(key)
    if !ok {
        return 0, fmt.Errorf("key not found: %s", key)
    }
    return v.(int32), nil
}

// INSERT inserts the given key-value pair into the data store.
// (As written, INSERT and UPDATE behave identically; sync.Map's
// LoadOrStore could be used if INSERT must fail on an existing key.)
func (s *Storage) INSERT(key string, value int32) error {
    s.store.Store(key, value)
    return nil
}

// UPDATE updates the value for the given key.
func (s *Storage) UPDATE(key string, value int32) error {
    s.store.Store(key, value)
    return nil
}

// REMOVE removes the key-value pair for the given key from the data store.
func (s *Storage) REMOVE(key string) error {
    s.store.Delete(key)
    return nil
}
You would need Paxos on top of this to ensure consistency across replicas.
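If you do want explicit key-level locking rather than sync.Map (the question as originally asked), a common pattern is lock striping / sharding: split the keyspace into a fixed number of shards, each with its own map and its own RWMutex. The sketch below is just one way to do that (the type and function names, the shard count and the FNV hash are my own choices, not from the question); because the set of locks is fixed up front, there is no separate locks map that has to be kept in sync with the store.

package storage

import (
    "hash/fnv"
    "sync"
)

const numShards = 64 // fixed number of shards; tune for your workload

// shard pairs one slice of the keyspace with its own lock, so locking
// one shard never blocks keys that hash to a different shard.
type shard struct {
    mu    sync.RWMutex
    items map[string]int32
}

type ShardedStorage struct {
    shards [numShards]shard
}

func NewShardedStorage() *ShardedStorage {
    s := &ShardedStorage{}
    for i := range s.shards {
        s.shards[i].items = make(map[string]int32)
    }
    return s
}

// shardFor hashes the key to one of the shards.
func (s *ShardedStorage) shardFor(key string) *shard {
    h := fnv.New32a()
    h.Write([]byte(key))
    return &s.shards[h.Sum32()%numShards]
}

// Get takes only the read lock of the key's shard, so reads of
// unrelated keys are not blocked by a write to another shard.
func (s *ShardedStorage) Get(key string) (int32, bool) {
    sh := s.shardFor(key)
    sh.mu.RLock()
    defer sh.mu.RUnlock()
    v, ok := sh.items[key]
    return v, ok
}

// Put takes the write lock of the key's shard only.
func (s *ShardedStorage) Put(key string, value int32) {
    sh := s.shardFor(key)
    sh.mu.Lock()
    defer sh.mu.Unlock()
    sh.items[key] = value
}

// Remove deletes the key under its shard's write lock, so a concurrent
// Get on the same key sees either the old value or a clean "not found".
func (s *ShardedStorage) Remove(key string) {
    sh := s.shardFor(key)
    sh.mu.Lock()
    defer sh.mu.Unlock()
    delete(sh.items, key)
}

This is still only node-local locking; as noted above, you would layer Paxos or Raft on top for cross-replica consistency.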
When running a CREATE OR REPLACE TABLE AS statement in one session, are other sessions able to query the existing table, before the transaction opened by CORTAS is committed?
From reading the usage notes section of the documentation, it appears this is the case. Ideally I'm looking for someone who's validated this in practice and at scale, with a large number of read operations on the target table.
Using OR REPLACE is the equivalent of using DROP TABLE on the existing table and then creating a new table with the same name; however, the dropped table is not permanently removed from the system. Instead, it is retained in Time Travel. This is important to note because dropped tables in Time Travel can be recovered, but they also contribute to data storage for your account. For more information, see Storage Costs for Time Travel and Fail-safe.
In addition, note that the drop and create actions occur in a single atomic operation. This means that any queries concurrent with the CREATE OR REPLACE TABLE operation use either the old or new table version.
Recreating or swapping a table drops its change data. Any stream on the table becomes stale. In addition, any stream on a view that has this table as an underlying table becomes stale. A stale stream is unreadable.
I have not "proving it via performance tests to prove it happens" but we did run for 5 years, where we read from tables of on set of warehouses and rebuilts underlying tables of overs and never noticed "corruption of results".
I always thought of Snowflake like the double buffer in computer graphics: you have the active buffer that the video signal is reading from (the existing table state), you write to the back buffer while a MERGE/INSERT/UPDATE/DELETE is running, and when that write transaction completes, the "current page/files/buffer" pointer is flipped, so all reads from then on come from the "new" state.
Given that the files are immutable, the double-buffer analogy holds really well (this is also how Time Travel works). Thus there is just a global "what is current" state maintained in the metadata.
As for CORTAS and transactions: since it is a DDL operation, I would assume it completes any open transactions, like all DDL operations do. So that is one hiccup to keep in mind on top of my double-buffer story.
Can I use Elasticsearch to store statistics information with less overhead?
It should record, for example, how often a function was called and how much time it took, or how many requests were made to a specific endpoint and how long each took, and so on.
My idea would be to store a key, a timestamp and takenTime, and then query the results in different ways.
This is simply handled by functions like profile_start and profile_done:
void endpointGetUserInformation()
{
profile_start("requests.GetUserInformation");
...
// profile_done stores to the database
profile_done("requests.GetUserInformation");
}
In a normal SQL database I would make a table which holds all the keys, and a second table that holds key_ids, timestamps and timeTaken values. That layout would need less space on disk.
When I store to Elasticsearch, it stores a lot of additional data, and the key is stored redundantly. Is there a solution to store this in a simpler way as well?
What I'm doing is creating a transaction where:
1) entity A has a counter incremented by 1, and
2) a new entity B is written to the datastore.
It looks like this:
WrappedBoolean result = ofy().transact(new Work<WrappedBoolean>() {
    @Override
    public WrappedBoolean run() {
        // Increment count of EntityA.numEntityBs by 1 and save it
        entityA.numEntityBs = entityA.numEntityBs + 1;
        ofy().save().entity(entityA).now();

        // Create a new EntityB and save it
        EntityB entityB = new EntityB();
        ofy().save().entity(entityB).now();

        // Return that everything is ok
        return new WrappedBoolean("True");
    }
});
What I am doing is keeping a count of how many EntityBs entityA has. The two operations need to be in a transaction so that either both saves happen or neither happens.
However, it is possible that many users will be executing the api method that contains the above transaction. I fear that I may run into problems of too many people trying to update entityA. This is because if multiple transactions try to update the same entity, the first one to commit wins but all the others fail.
This leads me to two questions:
1) Is the transaction I wrote a bad idea, destined to cause writes to be lost if a lot of calls are made to the API method? Is there a better way to achieve what I am trying to do?
2) What if there are a lot of updates being made to an entity not in a transaction (such as updating a counter the entity has) - will you eventually run into a scaling problem if a lot of updates are being made in a short period of time? How does datastore handle this?
Sorry for the long-winded question, but I hope someone can shed some light on how this system works. Thanks.
Edit: By a lot of updates being made to an entity over a short period of time, I mean something like Instagram, where you want to keep track of how many "likes" a picture has. Some users have millions of followers, and when they post a new picture they can get something like 10-50 likes a second.
The datastore allows about 1 write/second per entity group. What might not appear obvious is that standalone entities (i.e. entities with no parent and no children) still belong to one entity group - their own. Repeated writes to the same standalone entity are thus subject to the same rate limit.
Exceeding the write limit will eventually cause write ops to fail with something like TransactionFailedError(Concurrency exception.)
Repeated writes to the same entity done outside transactions can overwrite each other. Transactions can help with this - conflicting writes would be automatically retried a few times. Your approach looks OK from this perspective. But it only works if the average write rate remains below the limit.
You probably want to read Avoiding datastore contention. You need to shard your counter to be able to count events at more than 1/second rates.
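To make the sharding idea concrete, here is a minimal sketch of a sharded counter. It is deliberately not written against the real App Engine/Objectify API: the Datastore interface, the shard count and the key naming are hypothetical stand-ins I'm introducing just to show the shape of the technique (each shard is its own entity and hence its own entity group; writes increment a randomly chosen shard inside a transaction; reads sum all shards).

package counter

import (
    "fmt"
    "math/rand"
)

// Datastore is a hypothetical stand-in for your persistence API;
// it is NOT the App Engine SDK.
type Datastore interface {
    RunInTransaction(fn func() error) error
    GetInt(key string) (int64, error) // returns 0 if the key does not exist yet
    PutInt(key string, value int64) error
}

const numShards = 20 // more shards => higher sustainable write rate

func shardKey(counterName string, shard int) string {
    return fmt.Sprintf("%s-shard-%d", counterName, shard)
}

// Increment bumps one randomly chosen shard. Because each shard is a
// separate entity (its own entity group), concurrent increments mostly
// land on different shards and stay under the ~1 write/sec/group limit.
func Increment(ds Datastore, counterName string) error {
    key := shardKey(counterName, rand.Intn(numShards))
    return ds.RunInTransaction(func() error {
        n, err := ds.GetInt(key)
        if err != nil {
            return err
        }
        return ds.PutInt(key, n+1)
    })
}

// Total reads and sums all shards. If even that is too slow or too
// costly, the sum can be cached (e.g. in memcache) for a few seconds.
func Total(ds Datastore, counterName string) (int64, error) {
    var sum int64
    for i := 0; i < numShards; i++ {
        n, err := ds.GetInt(shardKey(counterName, i))
        if err != nil {
            return 0, err
        }
        sum += n
    }
    return sum, nil
}

For an Instagram-style "likes" counter, you would pick the shard count based on the peak likes/second you expect for a single picture.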
Let's say below is the GAE Datastore kind, where the need is to store two different players' info in a single entity; how can data correctness be maintained?
@Entity
public class Game
{
    id
    player1Id
    player2Id
    score1
    score2
}
Let's say player1 wants to modify score1: he first reads the entity, modifies score1, and saves it back into the datastore.
Similarly, player2 wants to modify score2: he likewise reads the entity first, modifies score2, and saves it back into the datastore.
So with this functionality, there is a chance that player1 and player2 both try to modify the same entity at the same time and end up writing incorrect data because of a dirty read.
Is there a way to avoid the dirty reads with the GAE Datastore, so as to ensure data correctness and avoid the concurrent modification?
Transactions:
https://developers.google.com/appengine/docs/java/datastore/transactions
When two or more transactions simultaneously attempt to modify entities in one or more common entity groups, only the first transaction to commit its changes can succeed; all the others will fail on commit.
You need to use Transactions, and then you have two options:
Implement your own logic for transaction failures (how many times to retry, etc.; see the sketch after this list).
Instead of writing to the datastore directly, create a task to modify an entity. Run a transaction inside a task. If it fails, the App Engine will retry this task until it succeeds.
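For option 1, the retry loop itself is simple. The sketch below is not App Engine-specific: TransactWithRetry, the fn callback and ErrConcurrentModification are hypothetical stand-ins for your datastore client's transaction call and its "commit lost the race" error, and the attempt count and backoff are arbitrary choices.

package retry

import (
    "errors"
    "time"
)

// ErrConcurrentModification is a hypothetical error your datastore
// client would return when a transaction loses a commit race.
var ErrConcurrentModification = errors.New("concurrent modification")

// TransactWithRetry runs fn (your transactional work) and retries it a
// few times with a small backoff when it fails only because another
// transaction committed first.
func TransactWithRetry(fn func() error) error {
    const maxAttempts = 5
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        err = fn()
        if err == nil {
            return nil
        }
        if !errors.Is(err, ErrConcurrentModification) {
            return err // a genuine failure; do not retry
        }
        // Back off a little so colliding writers don't immediately collide again.
        time.Sleep(time.Duration(attempt+1) * 50 * time.Millisecond)
    }
    return err
}

Option 2 achieves the same effect by letting the task queue do the retrying for you.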