SQL to text archiving using .net and parallel processing

SQL to text archiving using .net and parallel processing - sql-server

I have a table that has millions of rows. It has logging data. I want to move the data to text files. Each day's worth of data should go into its own text file. I'm in .net environment. What is the efficient way to achieve it ?
I want to use parallel processing because we have beefy servers with many cores. Some choices I can think of are :
Have parallel data readers. Each reader queries a portion of the data. How do I manage the total connections with this approach ? Also if I went this route, I will have to not disrupt the normal usage for the users. The other problem I can see with this approach is managing my own threads and setting an upper limit, whereas Parallel.ForEach would be much simpler.
Producer-consumer pattern: One thread reads the data and queues it in memory. Multiple writers consume the data from memory and write it out to text files.
I'm open to PetaPoco/NPoco. Ideally I want to use Parallel.ForEach without complicating the threading code too much.

Parallel processing helps when there is a lot of computing involved. However, here, you have mainly I/O involved. Harddisks can only write to one file at a time. So multithreading will not bring the hoped-speed growth. It could, in contrary, reduce speed, since the harddisk could be forced to move back and fourth when writing to the different files.

Related

MPI how to send and receive SQLite database

I have a big SQLite database to process, so I would like to use MPI for parallelization to accelerate the speed. What I want to do is sending a database from root to every slave, and sending the modified databases to root after slave add some table into it. I want to use MPI_Type_create_struct to create a datatype to store database, but the database is too complicated. IS there any other way to handle this situation? Thank you in advance!

I recently dealt with a similar problem - I have a large MPI application that uses SQLite as a configuration store. Handling multi-process writes is a challenge with an embedded SQL database. My experience with this involves a massively parallel application (running up to 65,535 ranks) with a shared filesystem.
Based on the FAQ from SQLite and some experience with database engines, there are a few ways to approach this problem. I am making the assumption that you are operating with a shared distributed file system, and multiple separate computers (a standard HPC cluster setup).
Since SQLite will block when multiple processes write to the database (but not read), reads will most likely not be an issue. Each process can run multiple SELECT commands at the same time without issue.
The challenge will be in the writing. Disk I/O is several orders of magnitude slower than computation, so generally this will be the bottleneck. Having said that, network communication may also be a significant slowdown, so how you approach the problem really depends on where the weakest link of your running environment will be.
If you have a fast network and slow disk speed, or if you want to implement this in the most straightforward way possible, your best bet is to have a single MPI rank in charge of writing to the database. Your compute processes would independently run SELECT commands until computation was complete, then send the new data to the MPI database process. The database control process would then write the new data to disk. I would not try to send the structure of the database across the network, rather I would send the data that should be written, along with (possibly) a flag that would identify what table/insert query the data should be written with. This technique is sort of similar to how a RDBMS works - while RDBMS servers do support concurrent writes, there is a "central" process in control of the ordering of write operations.
One thing to note is that if a process writes to the SQLite database, the file is locked for all processes that are trying to read or write to it. You will need to either handle the SQLITE_BUSY return code in your worker processes, register a callback to handle this, change the busy behavior, or use an alternate technique. In my application, I found that loading the database as an in-memory database, (https://www.sqlite.org/inmemorydb.html) for the readers provided a good workaround. Readers access the in-memory database, but sent results to the controlling process for writes. The downside is that you will have multiple copies of the database in memory.
Another option that might be less network intensive is to do the reads concurrently and have each worker process write out to their own file. You could write out to separate SQLite database files, or even export something like CSV (depending on the complexity of the data). When writes are complete, you would then have a single process merge the individual files into a single result database file - see How can I merge many SQLite databases?. This method has its own issues, but depending on where your bottlenecks are and how the system as a whole is laid out, this technique may work.
Finally, you might consider reading from the SQLite database and saving the data to a proper distributed file format, such as HDF5 (or using MPI IO). Once the computation is done, it would be pretty straightforward to write a script that would create a new SQLite database from this foreign file format.

How expensive is access to database? How often do we access to it?

I'm about to write an application for Android, and it will use Mysql.
I know that access to DB is really expensive in terms of time, and would like to know how often do applications like instant messaging, online gaming access to databases?
For example in a game, we would like to save the positions of a player in the world, when he's moving all the time.
Is the database access actually not expensive, and there is a way to be connected to it all the time and just do request that are actually not expensive?
Or is IT really expensive in anyway, and there are techniques to access to it for example every X interval of time, and saving it locally in the meantime?
I Know that my question is really general, and it depends always on what we need and want.
My question came out because i made a really simple login application that connects and does 1 request to database, and it takes 1 second (a lot!!) to get the result, so how online applications can be so fast?
Thank you

Before answering this I would recommend simulating the process as much as possible, benchmarking and you can work towards the best solution for your use case.
e.g. If I have an application submitting data to a database simulate the submission so I can easily run multiple submissions at the same time and see what the bottle neck is...and see how it compares when I using caching, replication, indexes, etc.
Also reading company blogs can be helpful as they often share success stories that support the usage of a particular approach
How expensive is access to database?
Accessing a database can be a pretty quick operation
SELECT 1; // 0.005 Secs :D
However there are situations that can lead to poor performance (slow reads, writes and updates) but there are some relatively simple ways to combat this
Indexes
The best way to improve the performance of SELECT operations is to
create indexes on one or more of the columns that are tested in the
query. The index entries act like pointers to the table rows, allowing
the query to quickly determine which rows match a condition in the
WHERE clause, and retrieve the other column values for those rows.
Replication
spreading the load among multiple slaves to improve performance. In
this environment, all writes and updates must take place on the master
server. Reads, however, may take place on one or more slaves. This
model can improve the performance of writes (since the master is
dedicated to updates), while dramatically increasing read speed across
an increasing number of slaves.
How often do we access to it?
If you are solely using a database you will access it every time you n position and every time you need to find out their position.
This is where you would explore options to prevent accessing the database.
Memory caches such as redis or memcache
Replication - Only read from slaves

It depends on your design and requirement.
1) Most of the applications manage Connection Pools to minimize the initialization time.
2) Most of the ORM frameworks have external Cache to improve the reading performance. So if you do heavy data reading in your application then don't worry about storing it in locally. The Cache will be effective in this case.
3) When you store locally either in File (or) some format, then it will also add extra performance delay.
4) If you keep the data in primary memory, then obviously Game performance would be better. That's why Gamers prefer high end graphics card, and huge RAM.

For most databases there is the option of batch insertions. Obviously even a small overhead will accumulate if you have to many connections over time. And performing single insertions will have a greater overhead than on batch. The only issue is how often?.... And you should test how often you wan't to insert and how much information you should store locally before doing a batch insertion.

What is the best way to do simultaneous file access on AS400?

I am trying to write a program which will be run in batch on AS400. This program is going to write a record into a file to reflect its processing status, say, when it is just submitted it adds a record saying it is currently running, and when it is done it updates the same record saying it has finished. If I want to submit this program into batch for multiple times, what is the best way to cope with this type of simultaneous file access to increase the efficiency? I don't want a job to lock the whole file and stops others from updating it in the same time. It can lock the record it needs and leave the rest to others. How to achieve this? RPGLE or QMQRY? Or any other methods?

RPG will not lock the entire file, only the record.

Personally, I'd recommend SQL for (pretty much) all file access, even through RPG. IBM hasn't been updating their Native I/O for a while, just concentrating on the SQL side of things.
Because, during normal use, record locks in RPG are released once the write or update has been performed, you should probably just have your SQL run WITH NC (no commit). You need a way to tie a processing job with the data it's processing anyways (assuming stuff is long-running enough that things are in files outside of QTEMP) - you want to be able to pick up where you left off, if your job dies (so, you can't rely on holding the lock as a control mechanism). So don't forget that you're going to need some sort of monitor job (that can at least report the status, if not resubmit things - look at the QUSRJOBI API).
If you're doing this because you're using all Native I/O, and processing huge sets of data (not huge, processor intensive calculations), consider re-writing everything to SQL. Seriously. You can get way better performance - we've taken a process that used to run for 25+ hours, to something that runs in 2.5ish.

What's more costly on every page view - Database Writes or File Writes?

What is the most efficient solution when you need to record some data on every page view in your application - should you write to a file or write to the database?
Or maybe neither - perhaps you should cache the data in memory or a file and only write it to the database (or file system if you use a memory cache) occasionally?

If it's purely recording a small amount of data with no subsequent lookups, straight file I/O is almost guaranteed to be more efficient. You're losing all the advantages of a DBMS though -- indexing, transactional integrity (really, ACID in general), concurrent access, etc..
It almost sounds like you're talking about what amounts to simple logging. If that's the case, and you don't need to do frequent complex queries on the resulting data, you're probably better off with straight file I/O if performance is a serious issue. Be careful of concurrent-write issues, though.
If the properties of an RDBMS are desirable, you might think about using SQLite, which for simplistic loads will get you better performance than most RDBMSs with less overhead, at the cost of some of the benefits (highly concurrent access and availability over the network to other machines are a couple of the "biggies"). It still wouldn't be as fast as straight file I/O in the general case, though.
Your later mention of it being for page view tracking causes me to ask: Are you incrementing a counter, rather than logging data about the page view? If so, I'd strongly suggest going with something like SQLite (doing something like UPDATE tbl SET counter = counter+1). You really don't want to get into the timing issues involved in doing this by hand -- if you don't do it right, you'll start losing counts on simultaneous access (A reads "100", B reads "100", A writes "101", B writes "101"; B should have written 102, but has no way of knowing that).

Conceptually, writing to the database is always slower than writing to a file.
The database has to write to a file too, with the extra overhead of communication to get the data to the database, so it can write it to a file. Therefore, it must be slower.
That said, databases do disk I/O very well, probably better than you will. Don't be surprised if you find out that a simple file logger is slower than writing it to a database. The database has a lot of I/O optimizations, and has some tricks available that you may not (depending on your web lanaguage and environment).
Don't be surprised if the answer changes over time. When your site is small, logging to a database is very fast. As your site grows, the logging table can become a major pain: It uses a lot of disk space, makes the backups take forever, and consumes all the disk I/O when you try to query it. This is why you should benchmark both methods yourself. Then you can re-test in the future, when conditions change.

Hitting the database is most likely going to be more expensive than writing to a file.
If your pageviews per second are high, and if the data doesn't need to be available in the database right away, then writing to a file and periodically loading the data into the DB will be a more optimal solution.
However it all depends on the nature of the data you're recording per page view and how critical it is to whatever business function it serves.

That highly depends on your needs for data safety. If you can afford to lose some data in case of a crash then keeping the data in memory and writing it periodically to a persistent store is certainly the most efficient way to go.
Edit: You mentioned pageviews. In that case I would keep the counters in memory and periodically update a database table (like every minute or so).

That depends.
Ands it really does: it depends on the DBMS and/or the OS+filesystem you use. In other words: your mileage varies.
If you just append data somewhere modern DBMS/OS+filesystems should handle this equally well and fast. Problems arise when you want to change data.
Caching - depends too on what kind of caching granularity you can afford (need to have every stepped logged crash-safe versus potential saving).

Use a hybrid solution like redis its designed for this sort of stuff

performance of web app with high number of inserts

What is the best IO strategy for a high traffic web app that logs user behaviour on a website and where ALL of the traffic will result in an IO write? Would it be to write to a file and overnight do batch inserts to the database? Or to simply do an INSERT (or INSERT DELAYED) per request? I understand that to consider this problem properly much more detail about the architecture would be needed, but a nudge in the right direction would be much appreciated.

By writing to the DB, you allow the RDBMS to decide when disk IO should happen - if you have enough RAM, for instance, it may be effectively caching all those inserts in memory, writing them to disk when there's a lighter load, or on some other scheduling mechanism.
Writing directly to the filesystem is going to be bandwidth-limited more-so than writing to a DB which then writes, expressly because the DB can - theoretically - write in more efficient sizes, contiguously, and at "convenient" times.

I've done this on a recent app. Inserts are generally pretty cheap (esp if you put them into an unindexed hopper table). I think that you have a couple of options.
As above, write data to a hopper table, if what ever application framework supports batched inserts, then use these, it will speed it up. Then every x requests, do a merge (via an SP call) into a master table, where you can normalize off data that has low entropy. For example if you are storing if the HTTP type of the request (get/post/etc), this can only ever be a couple of types, and better to store as an Int, and get improved I/O + query performance. Your master tables can also be indexed as you would normally do.
If this isn't good enough, then you can stream the requests to files on the local file system, and then have an out of band (i.e seperate process from the webserver) suck these files up and BCP them into the database. This will be at the expense of more moving parts, and potentially, a greater delay between receiving requests and them finding their way into the database
Hope this helps, Ace

When working with an RDBMS the most important thing is optimizing write operations to disk. Something somewhere has got to flush() to persistant storage (disk drives) to complete each transaction which is VERY expensive and time consuming. Minimizing the number of transactions and maximizing the number of sequential pages written is key to performance.
If you are doing inserts sending them in bulk within a single transaction will lead to more effecient write behavior on disk reducing the number of flush operations.
My recommendation is to queue the messages and periodically .. say every 15 seconds or so start a transaction ... send all queued inserts ... commit the transaction.
If your database supports sending multiple log entries in a single request/command doing so can have a noticable effect on performance when there is some network latency between the application and RDBMS by reducing the number of round trips.
Some systems support bulk operations (BCP) providing a very effecient method for bulk loading data which can be faster than the use of "insert" queries.
Sparing use of indexes and selection of sequential primary keys help.
Making sure multiple instances either coordinate write operations or write to separate tables can improve throughput in some instances by reducing concurrency management overhead in the database.

Write to a file and then load later. It's safer to be coupled to a filesystem than to a database. And the database is more likely to fail than the your filesystem.

The only problem with using the filesystem to back writes is how you extend the log.
A poorly implemented logger will have to open the entire file to append a line to the end of it. I witnessed one such example case where the person logged to a file in reverse order, being the most recent entries came out first, which required loading the entire file into memory, writing 1 line out to the new file, and then writing the original file contents after it.
This log eventually exceeded phps memory limit, and as such, bottlenecked the entire project.
If you do it properly however, the filesystem reads/writes will go directly into the system cache, and will only be flushed to disk every 10 or more seconds, ( depending on FS/OS settings ) which has a negligible performance hit compared to writing to arbitrary memory addresses.
Oh yes, and whatever system you use, you'll need to think about concurrent log appending. If you use a database, a high insert load can cause you to have deadlock conditions, and on files, you need to make sure that you're not going to have 2 concurrent writes cancel each other out.

The insertions will generally impact the (read/update) performance of the table. Perhaps you can do the writes to another table (or database) and have batch job that processes this data. The advantages of the database approach is that you can query/report on the data and all the data is logically in a relational database and may be easier to work with. Depending on how the data is logged to text file, you could open up more possibilities for corruption.

My instinct would be to only use the database, avoiding direct filesystem IO at all costs. If you need to produce some filesystem artifact, then I'd use a nightly cron job (or something like it) to read DB records and write to the filesystem.
ALSO: Only use "INSERT DELAYED" in cases where you don't mind losing a few records in the event of a server crash or restart, because some records almost certainly WILL be lost.

There's an easier way to answer this. Profile the performance of the two solutions.
Create one page that performs the DB insert, another that writes to a file, and another that does neither. Otherwise, the pages should be identical. Hit each page with a load tester (JMeter for example) and see what the performance impact is.
If you don't like the performance numbers, you can easily tweak each page to try and optimize performance a bit or try new solutions... everything from using MSMQ backed by MSSQL to delayed inserts to shared logs to individual files with a DB background worker.
That will give you a solid basis to make this decision rather than depending on speculation from others. It may turn out that none of the proposed solutions are viable or that all of them are viable...

Hello from left field, but no one asked (and you didn't specify) how important is it that you never, ever lose data?
If speed is the problem, leave it all in memory, and dump to the database in batches.

Do you log more than what would be available in the webserver logs? It can be quite a lot, see Apache 2.0 log information for example.
If not, then you can use the good old technique of buffering then batch writing. You can buffer at different places: in memory on your server, then batch insert them in db or batch write them in a file every X requests, and/or every X seconds.
If you use MySQL there are several different options/techniques to load efficiently a lot of data: LOAD DATA INFILE, INSERT DELAYED and so on.
Lots of details on insertion speeds.
Some other tips include:
splitting data into different tables per period of time (ie: per day or per week)
using multiple db connections
using multiple db servers
have good hardware (SSD/multicore)
Depending on the scale and resources available, it is possible to go different ways. So if you give more details, i can give more specific advices.

If you do not need to wait for a response such as a generated ID, you may want to adopt an asynchronous strategy using either a message queue or a thread manager.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight