Is it possible to guarantee transactional integrity when storing information in a SharePoint list (SP 2010)?
Underneath the covers, a single SharePoint operation like adding a list item can involve multiple database operations, and they will all be protected by a single database transaction. That said, the product doesn't expose that transactional capability to you, so you cannot perform multiple SharePoint operations under the aegis of a single transaction. To be safe, you'll need to implement carefully coded error handlers.
According to this, SharePoint 2010 does not offer any transactional support out of the box.
The underlying database does support transactions, so a single insert will either succeed or fail as a unit, but if an error occurs during a complex routine involving multiple database operations, the data can end up partially modified.
SharePoint does not offer transaction support out of the box.
Here is a good resource on Building a System.Transactions resource manager for SharePoint.
Though I would save the effort and store any critical data directly in an RDBMS.
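If you stay in SharePoint, those "carefully coded error handlers" usually boil down to manual compensation: when a later step fails, you undo the earlier steps yourself. Below is a minimal sketch of that pattern. It's in Java purely to illustrate the shape of the code (the SharePoint object model is .NET), and addItem/deleteItem are hypothetical stand-ins for your real list operations:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class CompensationSketch {

        // Hypothetical stand-ins for real list operations.
        static String addItem(String list, String data) { return "id-of-" + data; }
        static void deleteItem(String list, String id) { /* ... */ }

        public static void main(String[] args) {
            // Record an "undo" action for every step that succeeds.
            Deque<Runnable> undo = new ArrayDeque<>();
            try {
                String orderId = addItem("Orders", "order-row");
                undo.push(() -> deleteItem("Orders", orderId));

                String lineId = addItem("OrderLines", "line-row");
                undo.push(() -> deleteItem("OrderLines", lineId));
            } catch (RuntimeException e) {
                // Compensate by hand, newest step first. Unlike a real
                // transaction this is best-effort: a compensating delete
                // can itself fail and leave partial state behind.
                while (!undo.isEmpty()) {
                    undo.pop().run();
                }
                throw e;
            }
        }
    }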
I know SQL Server is very robust in this sense (transactions and locking), but how does that work with NoSQL databases like Amazon DocumentDB with the MongoDB API?
There's no shortcut to diving in and learning each individual system's concurrency model and offerings :/
These guarantees can be found by searching for "Isolation Levels" or "Default Isolation Levels" for your target database.
https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/
https://www.postgresql.org/docs/7.2/xact-read-committed.html
https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
One thing to note: PostgreSQL's default isolation level is "Read Committed", while MySQL (InnoDB) defaults to "Repeatable Read". Neither default, on its own, prevents a common class of bugs in concurrent applications for everyday read-modify-write queries.
For example, suppose you have a multi-threaded web application that lets users update their account balance. If two threads both fetch the balance, compute a new value, and write it back, the last thread to write overwrites the first thread's result: a classic lost update. This is described in detail in each of the documents above.
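As a hedged sketch of one way to avoid that lost update with plain JDBC: lock the row when you read it, so the second writer waits for the first to commit. The accounts table and its columns here are made up for illustration:

    import java.sql.*;

    public class DepositSketch {
        // Adds `amount` to a balance without losing concurrent updates.
        static void deposit(Connection conn, long accountId, long amount) throws SQLException {
            conn.setAutoCommit(false);
            try {
                long balance;
                try (PreparedStatement lock = conn.prepareStatement(
                        "SELECT balance FROM accounts WHERE id = ? FOR UPDATE")) {
                    lock.setLong(1, accountId);
                    try (ResultSet rs = lock.executeQuery()) {
                        rs.next();
                        balance = rs.getLong(1); // row stays locked until commit
                    }
                }
                try (PreparedStatement upd = conn.prepareStatement(
                        "UPDATE accounts SET balance = ? WHERE id = ?")) {
                    upd.setLong(1, balance + amount);
                    upd.setLong(2, accountId);
                    upd.executeUpdate();
                }
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }

For this simple case a single atomic statement (UPDATE accounts SET balance = balance + ? WHERE id = ?) avoids the race entirely; the locking version matters once the new value has to be computed in application code.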
In Amazon DocumentDB, all CRUD statements (findAndModify, update, insert, delete) guarantee atomicity and consistency, even for operations that modify multiple documents. For more information, see Implicit Transactions.
Additionally, reads from an Amazon DocumentDB cluster’s primary instance are strongly consistent under normal operating conditions and have read-after-write consistency. For more information, see Read Preference Options.
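To make "implicit transactions" concrete, here is a hedged sketch using the MongoDB Java driver, which DocumentDB's Mongo API accepts; the connection string, database, and collection names are placeholders:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.eq;
    import static com.mongodb.client.model.Updates.inc;

    public class DocDbSketch {
        public static void main(String[] args) {
            // Placeholder endpoint; use your cluster's connection string.
            try (MongoClient client = MongoClients.create("mongodb://my-docdb-cluster:27017")) {
                MongoCollection<Document> accounts =
                        client.getDatabase("bank").getCollection("accounts");

                // updateMany touches multiple documents; per the AWS docs
                // quoted above, the statement is applied atomically.
                accounts.updateMany(eq("status", "active"), inc("bonus", 1));
            }
        }
    }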
I am tasked with putting together a solution that can handle a high level of inserts into a database. There will be many AJAX-type calls from web pages, and not from only one web site/page, but from several different ones.
It will be dealing with tracking people's behavior on a web site, triggered by various javascript events, etc.
It is important for the solution to be able to handle the heavy database inserting load.
After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.
We are initially looking at using the MEAN stack with MongoDB and migrating some data to MySQL for reporting purposes. I am also wondering about some sort of queueing before the insert into the database, or caching with something like memcached.
I didn't manage to find much help on this elsewhere. I did see this post, but it is now close to 5 years old, feels a bit outdated, and doesn't quite ask the same questions.
Your thoughts and comments are most appreciated. Thanks.
Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?
It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.
Some initial theory
To understand how you can optimize for a heavy insert workload, I suggest first understanding the main overheads involved in inserting data into a database. Once the various overheads are understood, all kinds of optimizations will come to you naturally. The bonus is that you will have more confidence in the solution, you will know more about databases, and you can apply these optimizations to multiple engines (MySQL, PostgreSQL, Oracle, etc.).
I'll first make a non-exhaustive list of insertion overheads and then show simple solutions to avoid them.
1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on.
And once the query is accepted, it needs to be parsed, its grammar validated, data types parsed and validated, the objects (tables, indexes, etc.) referenced by the query looked up, access permissions checked, and so on. All of these steps (and I'm sure I forgot quite a few) represent significant overhead when inserting a single value. The overheads are so large that some databases, e.g. Oracle, keep a SQL cache to avoid some of them.
Solution: Reuse database connections, use prepared statements, and insert many values with every SQL statement (thousands to hundreds of thousands); a sketch follows.
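A hedged sketch of all three suggestions together with JDBC: one reused connection, one prepared statement, and batched inserts committed in chunks. The events table and its columns are hypothetical:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class BulkInsertSketch {
        // Inserts all rows through a single prepared statement, in batches,
        // over a connection the caller keeps open and reuses.
        static void insertEvents(Connection conn, List<String[]> rows) throws SQLException {
            conn.setAutoCommit(false); // one commit per batch, not per row
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events (user_id, event_type) VALUES (?, ?)")) {
                int pending = 0;
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                    if (++pending == 10_000) { // flush every 10k rows
                        ps.executeBatch();
                        conn.commit();
                        pending = 0;
                    }
                }
                ps.executeBatch(); // flush the remainder
                conn.commit();
            }
        }
    }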
2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging all logical and physical modifications to the database ahead of time, and they require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The actual time spent providing the ACID guarantees can be several orders of magnitude higher than the time it takes to actually copy a 200-byte row into a database page.
Solution: Disable undo/redo logging when you import data into a table. Alternatively, you could (1) lower the isolation level to trade weaker ACID guarantees for lower overhead, or (2) use asynchronous commit, a feature that lets the DB engine acknowledge an insert before the redo logs are safely flushed to disk (see the sketch below).
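As one concrete example of asynchronous commit, PostgreSQL exposes it as a per-session setting; here is a sketch of enabling it over JDBC (assuming the connection points at PostgreSQL):

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class AsyncCommitSketch {
        // PostgreSQL-specific: commits on this session return before the WAL
        // is flushed to disk. A crash can lose the last few transactions,
        // but it cannot corrupt the database.
        static void enableAsyncCommit(Connection conn) throws SQLException {
            try (Statement s = conn.createStatement()) {
                s.execute("SET synchronous_commit TO OFF");
            }
        }
    }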
3. Updating the physical design / database constraints: Inserting a value in a table usually requires updating multiple indexes, materialized views, and/or executing various triggers. These overheads can again easily dominate over the insertion time.
Solution: Consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done, you can re-create them; for example, it is significantly faster to build an index from scratch than to populate it through individual insertions. A sketch follows.
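A hedged sketch of that drop-load-recreate pattern over JDBC; the table and index names are hypothetical and the exact DDL varies by engine:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class DropAndRebuildSketch {
        static void bulkLoad(Connection conn) throws SQLException {
            try (Statement s = conn.createStatement()) {
                // 1. Drop secondary structures so inserts don't maintain them row by row.
                s.execute("DROP INDEX idx_events_user");
            }

            // 2. Run the bulk insert here (e.g. the batched version sketched earlier).

            try (Statement s = conn.createStatement()) {
                // 3. Rebuild from scratch: one bulk sort instead of millions
                //    of incremental B-tree insertions.
                s.execute("CREATE INDEX idx_events_user ON events (user_id)");
            }
        }
    }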
In practice
Now let's see how we can apply these concepts to your particular design. The main issue I see in your case is that the insert requests are sent by many distributed clients, so there is little chance to process the inserts in bulk.
You could consider adding a caching layer in front of whatever database engine you end up with. I don't think memcached is a good fit for such a layer: memcached is typically used to cache query results, not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source, but not all features are free, so I'm not sure whether you would need to pay for a license. If you cannot use VoltDB, you could look at the MEMORY engine for MySQL or other similar in-memory engines.
Another optimization to consider is using a different database for the analytics. Most likely, a database with a high data-ingest volume is quite bad at executing OLAP-style queries, and the other way around. Coming back to my recommendation, VoltDB is no exception: it is suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (i.e. this would be a VoltDB cluster) and moves it in bulk to the backend DB for the analytics (MongoDB or maybe something more efficient). You can then apply all the optimizations above for the bulk data movement, create a rich set of additional index structures to speed up data access, run your favourite analytical queries, and save the results as a new set of tables/materialized views for later access. The import/analysis process can be repeated continuously in the background.
Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts. So the table is optimized for queries with indexes and such. If you have a table where DML (particularly Inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including a primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.
Fronting your web application with a NoSQL table to handle the high insert rate then moving the data more or less at your leisure to a standard relational db for further processing is a good idea.
Reading about transaction managers and database managers, it seems that they have similar responsibilities (managing the sharing and integrity of resources, as well as prioritizing execution), but I cannot work out how they differ. Can someone clear up this misunderstanding?
Thank You
In addition to what Oded already said:
A transaction manager manages transactions, and a transaction can include/address resources other than just databases. I have given the example of a printer on some occasions before.
A database manager manages data, and not necessarily in a transactional way. There is a very popular SQL system whose 1.0 version did not have commit/rollback; in other words, it did not offer transactional functionality and thus did not offer much support for data integrity.
In practice the distinction is rather blurred, however, because:
a great many real-life transactions involve no other recoverable resources than just the database,
in order to guarantee data consistency, DBMSs cannot avoid offering most if not all of the functionality of transactions.
A transaction manager manages transactions - these can be distributed (i.e. involving several databases/systems).
A database manager deals with a single database - managing it on the disk, memory consumption, query parsing etc...
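To make "distributed" concrete, here is a hedged sketch of what a transaction manager's client-facing API looks like in Java: JTA's UserTransaction, obtained from a container via JNDI, bracketing work on two separate databases. The JNDI names and the container wiring are assumptions:

    import java.sql.Connection;
    import java.sql.Statement;
    import javax.naming.InitialContext;
    import javax.sql.DataSource;
    import javax.transaction.UserTransaction;

    public class TwoResourceSketch {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();
            // Supplied by the application server / transaction manager (assumed names).
            UserTransaction utx = (UserTransaction) ctx.lookup("java:comp/UserTransaction");
            DataSource orders   = (DataSource) ctx.lookup("jdbc/ordersDb");
            DataSource shipping = (DataSource) ctx.lookup("jdbc/shippingDb");

            utx.begin();
            try {
                try (Connection c1 = orders.getConnection();
                     Connection c2 = shipping.getConnection();
                     Statement s1 = c1.createStatement();
                     Statement s2 = c2.createStatement()) {
                    s1.executeUpdate("INSERT INTO orders (id) VALUES (42)");
                    s2.executeUpdate("INSERT INTO shipments (order_id) VALUES (42)");
                }
                utx.commit();   // the transaction manager coordinates both databases (2PC)
            } catch (Exception e) {
                utx.rollback(); // neither database keeps its insert
                throw e;
            }
        }
    }

Each database manager here only handles its own branch of the work; the transaction manager is the piece that makes the two inserts stand or fall together.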
Just to ensure understanding:
A transaction manager operates at a higher level of control, coordinating work that may span several resources, including the physical database.
A database manager deals with direct access to the physical database.
I would also add to both of these answers that the transaction manager is responsible for enforcing ACID (Atomicity, Consistency, Isolation, and Durability). I was pretty confused about this as well.
Situation: a bank has an old legacy ABS (automated banking system).
The bank wants to:
notify the old legacy CRM system about clients' account changes (Publish operation).
check the PIN codes of client cards (Request/Response operation), in synchronous mode.
The ABS is implemented in very old proprietary technologies with stored-procedure calls, so I can only connect to this system via the database.
Which ways do you know of integrating a Java/.Net (ESB) application with an old/legacy database system?
Write/Publish operation
For any vendor's database server:
Scan tables for new entries - too slow.
A trigger (if triggers are supported) that handles SQL updates and inserts and writes event information to some table; an application listener then polls this table for events.
Oracle server: PL/SQL triggers + Oracle AQ, with a JMS listener (sketched below).
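A hedged sketch of the listener side in plain javax.jms; the JNDI names are placeholders, and with Oracle AQ you would obtain the connection factory from AQ's JMS interface instead:

    import javax.jms.*;
    import javax.naming.InitialContext;

    public class ChangeEventListener {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();
            // Placeholder JNDI names; wire these to your AQ/JMS provider.
            ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
            Queue queue = (Queue) ctx.lookup("jms/AccountChanges");

            Connection conn = cf.createConnection();
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(queue);

            // Each message is one change event enqueued by the PL/SQL trigger.
            consumer.setMessageListener(message -> {
                try {
                    String payload = ((TextMessage) message).getText();
                    System.out.println("Account change: " + payload);
                    // ... forward to the CRM system here ...
                } catch (JMSException e) {
                    e.printStackTrace();
                }
            });
            conn.start(); // begin delivery; the listener thread keeps the JVM alive
        }
    }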
Reading operation
Just write the result into the ABS tables - dangerous.
...
How can the legacy database system be notified about responses in synchronous mode? How can write/read be implemented in synchronous mode?
Again: which ways do you know of integrating a Java/.Net (ESB) application with an old/legacy database system?
Lots of vendors hype Data Services. I think the main value of these products is in integrating different data sources.
I would consider making a simple "application" that exposes this data as a service.
It depends on many factors, particularly read/write throughput and the performance sensitivity of the database.
Databases tend to be sensitive things and are often very fragile under general-purpose access from arbitrary other systems when they have been finely tuned for production use in a specific system; so folks often replicate the database to a read-only slave that can then be used for integration work, querying, and so forth.
You can then use triggers/polling/JMS based on whatever you need without impacting the original database.
Depending on the database replication technology used, you can then often install triggers in the replica database (which can afford to fall a little behind the master from time to time) to minimise the impact on the production database.
I propose using Mule as the ESB in your bank (see also http://www.mulesource.org/display/MULE/Home).
It lets you communicate with the database directly (at the JDBC level, which should work with stored procedures as well as tables/views). I have positive experience using it to integrate a core banking system (database level, Oracle) with a standalone application (web services level).
Frankly, I didn't get all of your questions (you can ask me in Russian directly if you prefer), but IMO Mule is your way: it can consume JMS, JDBC, files, and many other transports, and it can process synchronous and asynchronous events as well (see also http://www.mulesource.org/display/MULE2USER/Available+Transports).
Regards.
P.S. To be clearer for the English-speaking audience, I propose using the more standard term core banking system instead of ABS (which means the same thing in ex-USSR countries).
I need to audit all database activity regardless of whether it came from the application or from someone issuing SQL via other means, so the auditing must be done at the database level. The database in question is Oracle. I looked at doing it via triggers and also via something called Fine-Grained Auditing that Oracle provides. In both cases, we turned on auditing for specific tables and specific columns. However, we found that performance really sucks when we use either of these methods.
Since auditing is an absolute must due to regulations around data privacy, I am wondering what the best way to do this is without significant performance degradation. Oracle-specific experience would be helpful, but general practices around database activity auditing are welcome as well.
I'm not sure whether it's a mature enough approach for a production system, but I had quite a lot of success monitoring database traffic using a network traffic sniffer: send the raw data between the application and database off to another machine, and decode and analyse it there.
I used PostgreSQL, and decoding the traffic and turning it into a stream of database operations that could be logged was relatively straightforward. I imagine it would work on any database whose packet format is documented, though.
The main point was that it put no extra load on the database itself. Also, it was passive monitoring: it recorded all activity but couldn't block any operations, so it might not be quite what you're looking for.
There is no need to "roll your own". Just turn on auditing:
Set the database parameter AUDIT_TRAIL = DB.
Start the instance.
Log in with SQL*Plus.
Enter the statement:

    audit all;

This turns on auditing for many critical DDL operations, but DML and some other DDL statements are still not audited.
To enable auditing on these other activities, try statements like these:

    audit alter table;                                             -- DDL audit
    audit select table, update table, insert table, delete table;  -- DML audit
Note: All "as sysdba" activity is ALWAYS audited to the O/S. In Windows, this means the Windows event log. In UNIX, this is usually $ORACLE_HOME/rdbms/audit.
Check out the Oracle 10g R2 Audit Chapter of the Database SQL Reference.
The database audit trail can be viewed in the SYS.DBA_AUDIT_TRAIL view.
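For instance, here is a hedged sketch of pulling the last day's entries out of that view over JDBC (connection details are placeholders; column names are per the Oracle docs, so verify against your version):

    import java.sql.*;

    public class AuditTrailSketch {
        public static void main(String[] args) throws SQLException {
            // Requires the Oracle JDBC driver and a user with SELECT on DBA_AUDIT_TRAIL.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:oracle:thin:@dbhost:1521/ORCL", "auditor", "secret");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT username, action_name, obj_name, timestamp " +
                         "FROM sys.dba_audit_trail WHERE timestamp > SYSDATE - 1");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s %s on %s at %s%n",
                            rs.getString("username"), rs.getString("action_name"),
                            rs.getString("obj_name"), rs.getTimestamp("timestamp"));
                }
            }
        }
    }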
It should be pointed out that Oracle's internal auditing is designed to be high-performance; it is very hard to imagine anything else rivaling it. There is also a high degree of fine-grained control over Oracle auditing: you can make it just as precise as you want. Finally, the SYS.AUD$ table, along with its indexes, can be moved to a separate tablespace to prevent filling up the SYSTEM tablespace.
If you want to record copies of changed records on a target system, you can do this with GoldenGate and not incur much in the way of source-side resource drain. Also, you don't have to make any changes to the source database to implement this solution.
Golden Gate scrapes the redo logs for transactions referring to a list of tables you are interested in. These changes are written to a 'Trail File' and can be applied to a different schema on the same database, or shipped to a target system and applied there (ideal for reducing load on your source system).
Once you get the trail file to the target system, there are some configuration tweaks: you can set an option to perform auditing, and if needed you can invoke two GoldenGate functions to get info about the transaction:
1) Set the INSERTALLRECORDS replication parameter to insert a new record into the target table for every change operation made to the source table. Beware that this can eat up a lot of space, but if you need comprehensive auditing this is probably expected.
2) If you don't already have a CHANGED_BY_USERID and CHANGED_DATE attached to your records, you can use GoldenGate functions on the target side to get this info for the current transaction. Check out the following functions in the GG Reference Guide:
GGHEADER("USERID")
GGHEADER("TIMESTAMP")
So no, it's not free (it requires licensing through Oracle) and it will take some effort to spin up, but probably a lot less effort/cost than implementing and maintaining a custom solution of your own, and you get the added benefit of shipping the data to a remote system so you can guarantee minimal impact on your source database.
If you are using Oracle, there is a feature called CDC (Change Data Capture), which is a more performance-efficient solution for audit-type requirements.