I am aware of how concurrency can easily create issues in a program. I have a program that requires an asynchronous task that saves information to a database. To avoid making many calls to the database, I cache the data and only save to the database as needed. Before I commit to adding this functionality, I just want a bit of input on this method.
My question is: Is it safe to read static data in an asynchronous task, or is it more likely to produce bugs? Should I go about doing this another way?
My apologies if this is a novice question. I searched for this question and couldn't find the information I needed.
Yes, it's pretty safe if done right. Just try to follow several rules:
Use a readers-writer lock if you want your collection to be consistent at every moment. The benefit of this lock type is that it allows readers to read the data with almost no performance penalty (see the sketch after these rules).
Have only one entry point for modifying the collection or the actual data. If you use any locking technique then, of course, don't allow data to be read through entry points that bypass the lock.
If you use any locking technique, use one lock per resource (collection, variable, property). Don't expose this lock to the outside world; use it privately.
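A minimal sketch of those rules in Java (the class and method names are mine, and this assumes the asynchronous task shares a simple key-value cache):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical cache: one private lock per resource, one entry point for writes.
public class RecordCache {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> data = new HashMap<>();

    // Readers take the shared read lock; many readers can hold it at once,
    // so reads stay cheap.
    public String get(String key) {
        lock.readLock().lock();
        try {
            return data.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The single entry point for modification takes the exclusive write lock.
    public void put(String key, String value) {
        lock.writeLock().lock();
        try {
            data.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}

Note that the lock is private, so no caller can read or write the map without going through these two methods.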
My goal is to achieve the highest available performance for copying a block of data from the database into a C function, where it is processed and returned as the result of a query.
I am new to PostgreSQL and am currently researching possible ways to move the data. Specifically, I am looking for nuances or keywords particular to PostgreSQL for moving big data fast.
NOTE:
My ultimate goal is speed, so I am willing to accept answers outside the exact question I have posed, as long as they get big performance results. For example, I have come across the COPY keyword (PostgreSQL only), which quickly moves data from tables to files and vice versa. I am trying to stay away from processing that is external to the database, but if it provides a performance improvement that outweighs the obvious drawback of external processing, then so be it.
It sounds like you probably want to use the server programming interface (SPI) to implement a stored procedure as a C language function running inside the PostgreSQL back-end.
Use SPI_connect to set up the SPI.
Now SPI_prepare_cursor a query, then SPI_cursor_open it. SPI_cursor_fetch rows from it and SPI_cursor_close it when done. Note that SPI_cursor_fetch allows you to fetch batches of rows.
Use SPI_finish to clean up when you're done.
You can return the result rows into a tuplestore as you generate them, avoiding the need to build the whole table in memory. See examples in any of the set-returning functions in the PostgreSQL source code. You might also want to look at the SPI_returntuple helper function.
See also: C language functions and extending SQL.
If maximum speed is of interest, your client may want to use the libpq binary protocol via libpqtypes so it receives the data produced by your server-side SPI-using procedure with minimal overhead.
I am doing an information retrieval project in C++. What are the advantages of using a database to store terms, as opposed to storing them in a data structure such as a vector? More generally, when is it a good idea to use a database rather than a data structure?
(Shawn): Whenever you want to keep the data beyond the lifetime of a single run of the program. (persistence across time)
(Michael Kjörling): Whenever you want many instances of your program, whether on the same computer or on many computers (such as across a network or the Internet), to access and manipulate (share) the same data. (persistence across network space)
Whenever you have a very large amount of data that does not fit into memory.
Whenever you have very complex data structures and would prefer not to rewrite the code to manipulate them (e.g., to search or update them), when the DB programmers have already written such code, and probably made it much faster than the code you (or I) would write.
Whenever you want to keep the data beyond the lifetime of the program instance?
In addition to Shawn pointing out persistence: whenever you want multiple instances of the program to be able to easily share data?
In-memory data structures are great, but they are not a replacement for persistence.
It really depends on the scope. For example, if you're going to have multiple applications accessing the data, then a database is better because you won't have to worry about file locks, etc. Also, you'd use a database when you need to do things like joining other data, sorting, etc.... unless you like to implement Quicksort yourself.
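For illustration (in Java/JDBC; the table, columns, and connection string are all made up), pushing the filtering and sorting to the database looks like this, rather than re-implementing it over an in-memory vector:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TermQuery {
    public static void main(String[] args) throws SQLException {
        // Assumed connection string; requires the PostgreSQL JDBC driver on the classpath.
        String url = "jdbc:postgresql://localhost/ir";
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT term, frequency FROM terms WHERE frequency > ? ORDER BY frequency DESC")) {
            ps.setInt(1, 10); // filter parameter; the database does the sorting work
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("term") + ": " + rs.getInt("frequency"));
                }
            }
        }
    }
}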
Using the net-snmp API, with mib2c to generate the skeleton code, is it possible to support delayed initialization of tables? What I mean is, the table would not be initialized until any of its members were queried directly. The reason for this is that the member data is obtained from another server, and I'd like to be able to start the snmpd daemon without requiring the other server to be online/ready for requests. I thought of maybe initializing the table with dummy data that gets updated with the real values when a member is queried, but I'm not sure if this is the best way.
The table also has only one row of entries, so using mib2c.iterate.conf to generate table iterators and dealing with all of that just seems unnecessary. I thought of maybe just implementing the sequence defined in the MIB and not the actual table, but that's not usually how it's done in the examples I've seen. I looked at /mibgroup/examples/delayed_instance.c, but that's not quite what I'm looking for. Using mib2c with the mib2c.create-dataset.conf config file was the closest I got to getting this to work easily, but that config file assumes the data is static and not external (neither of which is true in my case), so it won't work. If it's not easily done, I'll probably just implement the sequence and not the table, but I'm hoping there's an easy way. Thanks in advance.
The iterator method will work just fine. It won't load any data until it calls your _first and _next routines. So it's up to you, in those routines and in the _handler routine, to request the data from the remote server. In fact, by default, it doesn't cache data at all so it'll make you query your remote server for every request. That can be slow if you have a lot of data in the table, so adding a cache to store the data for N seconds is advisable in that case.
If two processes modify the same entity concurrently, but only modify different properties, can they potentially overwrite the changes made by the other process when calling DatastoreService.put?
Process A:
theSameEntity.setProperty("foo", "abc");
DatastoreService.put(theSameEntity);
Process B:
theSameEntity.setProperty("bar", 123);
DatastoreService.put(theSameEntity);
Yes, it's possible they'll overwrite each other's changes, since the entire entity is sent to the datastore (serialized using protocol buffers) with each write (not just a diff).
You'll need to use transactions if you want to avoid this.
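For example, with the low-level Java Datastore API, a read-modify-write inside a transaction might look like the sketch below (the kind and key name are made up). The point is that the entity is re-read inside the transaction, so a conflicting write by the other process makes the commit fail (with a ConcurrentModificationException) instead of being silently overwritten:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

public class SafeUpdate {
    public void updateFoo() throws EntityNotFoundException {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Key key = KeyFactory.createKey("MyKind", "theSameEntity"); // hypothetical kind/name
        Transaction txn = ds.beginTransaction();
        try {
            // Read the current version inside the transaction...
            Entity entity = ds.get(txn, key);
            // ...modify only this process's property...
            entity.setProperty("foo", "abc");
            // ...and write it back; commit() fails if the entity changed meanwhile.
            ds.put(txn, entity);
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback(); // clean up if the commit didn't happen
            }
        }
    }
}

A failed commit can then be retried, re-reading the entity so the other process's change is preserved.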
Yes, I have observed this (though in my case the concurrent requests modified the same property).
I don't think transactions will help, because they don't lock the datastore; they guarantee that the operations in the transaction will see the same data. I would like to know if anyone has found a solution to this.
In our web-based project, we always have some static data, which can be stored in a file as an array or stored in a database table. So which should be preferred?
In my opinion, arrays have some advantages:
More flexible (it can be any structure, expressing a really complex relation)
Better performance (it is loaded into memory, which gives better read/write performance than a database's I/O operations)
But my colleague argued that he preferred DB approach, since it can keep a uniform data persistence interface, and be more flexible.
So which should be preferred? How can we choose? Or should we prefer one in some scenarios and the other in others? If so, what are those scenarios?
EDIT:
Let me clarify something. As Benjamin's change to the title reflects, the data we want to store in an array (file) won't change frequently, which means the code won't change the values of the array at runtime. If the data changed frequently, I would undoubtedly use a DB. That's why I made this post.
And sometimes it's hard to store some really complex relations like:
Task = {
    "1" : {
        "name" : "xx",
        "requirement" : {
            "level" : 5,
            "money" : 100,
        },
        ...
    },
}
As in the code sample above (a Python dict, or you can think of it as an array), the requirement field is hard to store in a DB (store a structure like a pickled object directly in the DB? Not so good, I think). So under such conditions, I would prefer arrays.
So what do you think? In such a scenario, we should prefer arrays to a DB, right?
Regards.
Let's be pragmatic/objective:
Do you write to your data at runtime? Yes: DB, No: File
Do you update your data more than once per week? Yes: DB, No: File
Is it a pain to release an updated data file? Yes: DB, No: File
Do you read that data often? Yes: File/Cache, No: DB
Is it a pain to update that data file, and do you need extra tools for it? Yes: DB, No: File
For sure I've forgotten other points, but I guess the basics are there.
The "flexiable" array in a file is fraught with a zillion issues already delt with by using a DB. Unless you can prove that the DB is really going to way slower than using the other approach use a DB. Move on and start solving business problems.
Edit
A comment from the OP asks what the issues with using a file might be; here are a handful (pause to take a deep breath).
Concurrency: You have to manage the situation where multiple requests may be trying to write back to the file. Not too hard, but it becomes a bottleneck.
Performance: Yes, modifying an in-memory array is quicker, but how do you determine how much of the array needs to be persisted to a file, and when? Note that using a DB doesn't preclude the use of an appropriate in-memory cache. Writing the file back each time a small modification is made isn't going to perform well.
Scalability: Really a function of the first two. To achieve any scalability goals you need to be able to quickly modify small bits of the persisted data. In other words, if you don't use a DB you would end up writing one. If you find you need more than one web server to support growing demand, where are you going to store the file(s)? Now you've got file I/O over a network (albeit likely a very fast one).
Structure: If you use an array, your code will be responsible for managing the structure of the data, querying it, etc. How will you do that in a way that achieves greater "flexibility" than using a DB? All manner of choices and complexity arise here.
Reliability: You need to ensure the integrity of your persisted data. In the event of some failure, your array/file code would need to ensure that the data is at least not so corrupt that the application cannot continue.
Your colleague is correct, BUT that's where you need to put aside the comp sci textbook and be pragmatic. How often will you be accessing this data from your application? If fairly frequently, then don't incur the access overhead on every read. You can still gain the advantages of a DB, but put a caching strategy in your application in front of it. Depending on your development language you could look at something like memcache or jtreecache.
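A minimal sketch of that caching idea in Java (no particular cache library, just a map in front of a hypothetical DB lookup):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical read-through cache: hit memory first, fall back to the DB once.
public class ReadThroughCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> dbLookup; // e.g. a DAO method reference

    public ReadThroughCache(Function<String, String> dbLookup) {
        this.dbLookup = dbLookup;
    }

    public String get(String key) {
        // Loads from the database only on the first miss; later reads are in-memory.
        return cache.computeIfAbsent(key, dbLookup);
    }
}

A real deployment would add expiry/invalidation (which is exactly what memcache-style tools provide), but the shape is the same: frequent reads never touch the database.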
It depends on what kind of data you are looking at, and whether or not it needs to be updated regularly.
I tend to keep most things (non-config data) in the database, even if the data isn't going to be repeating (e.g. thousands of rows). Databases will scale so much more easily than a flat file; if your system starts to grow fast, your flat file might become a burden to your system.
If the data doesn't change very often, and you're programming in Java, why not use Spring to hold the values?
They can be injected into your bean, and changed easily.
But that's only if you're developing in Java.
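A minimal sketch of that Spring idea (the property names and values are made up, and this assumes a standard property placeholder is configured so ${...} expressions resolve against a properties file):

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Hypothetical bean: the values live in a properties file, so changing them
// means editing the file, not the code.
@Component
public class TaskConfig {
    @Value("${task.requirement.level}")   // e.g. task.requirement.level=5
    private int requirementLevel;

    @Value("${task.requirement.money}")   // e.g. task.requirement.money=100
    private int requirementMoney;

    public int getRequirementLevel() { return requirementLevel; }
    public int getRequirementMoney() { return requirementMoney; }
}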
Yeah, I agree with your implied assessment that databases are overused and basic flat files may work in a multitude of scenarios. If your application is read-only (and writes are done by the admin when the app restarts), I would definitely go with a file. Even if the application writes to the file, but only in append mode (vs. random inserts/updates) and from one thread, I would also use a file. Anything else: you need a real database with random updates, queries, concurrency control, etc.