What is the best way to do simultaneous file access on AS400? - file

I am trying to write a program which will be run in batch on AS400. This program is going to write a record into a file to reflect its processing status, say, when it is just submitted it adds a record saying it is currently running, and when it is done it updates the same record saying it has finished. If I want to submit this program into batch for multiple times, what is the best way to cope with this type of simultaneous file access to increase the efficiency? I don't want a job to lock the whole file and stops others from updating it in the same time. It can lock the record it needs and leave the rest to others. How to achieve this? RPGLE or QMQRY? Or any other methods?

RPG will not lock the entire file, only the record.

Personally, I'd recommend SQL for (pretty much) all file access, even through RPG. IBM hasn't been updating their Native I/O for a while, just concentrating on the SQL side of things.
Because, during normal use, record locks in RPG are released once the write or update has been performed, you should probably just have your SQL run WITH NC (no commit). You need a way to tie a processing job with the data it's processing anyways (assuming stuff is long-running enough that things are in files outside of QTEMP) - you want to be able to pick up where you left off, if your job dies (so, you can't rely on holding the lock as a control mechanism). So don't forget that you're going to need some sort of monitor job (that can at least report the status, if not resubmit things - look at the QUSRJOBI API).
If you're doing this because you're using all Native I/O, and processing huge sets of data (not huge, processor intensive calculations), consider re-writing everything to SQL. Seriously. You can get way better performance - we've taken a process that used to run for 25+ hours, to something that runs in 2.5ish.

Related

Multiple Biztalk host instances writing to single file

We have four Biztalk servers on production envionment. The sendport is configured to write incoming message in one textfile. This port receives thousands of messages in a day. So multiple host instances tries to write to file at single time, before one instance finishes writing complete record another instances starts writing new record causing data scattered all over the file.
What can we do resolve this issue?
...before one instance finishes writing complete record another instances starts writing new record causing data scattered all over the file.
What can we do resolve this issue?
The easy way is to only use a single Host Instance to write data to the file, however you may then start to experience throttling issues. Alternatively, you could explore using the 'Allow Cache on write' option on the File Adapter which may offer some improvements.
However, I think your approach is wrong. You cannot expect four separate and totally disconnected processes (across 4 servers no-less) to reliably append to a single file - IN ORDER.
I therefore think you should look re-architecting this solution:
As each message is received, write the contents of the message to a database table (a simple INSERT) with an 'unprocessed' flag. You can reliably have four Host Instances banging data into SQL without fear of them tripping over each other.
At a scheduled time, have BizTalk extract all of the records that are marked as unprocessed in that SQL Table (the WCF-SQL Adapter can help you here). Once you have polled the records, mark them as 'in-process'.
You should now have a single message containing all of the currently unprocessed records (as retrieved from SQL). Using a single (or multiple) Host Instance/s, write the message to disk, appending each of the records to the file in a single write. The key here is that you are only writing a single message to the one file, not lots and lots and lots :-)
If the write is successful, update each of the records in the SQL table with a 'processed' flag so they are not picked-up again on the next poll.
You might want to consider a singleton orchestration for this piece to ensure that there is only ever one poll-write-update process taking place at the same time.
If FIFO is important, BizTalk has ordered delivery mechanism (FILE adapter supported) but it comes at performance cost.
The better solution would be let instances writing to individual files and then have another scheduled process (or orchestration) to combine them in one file. You can enforce FIFO using timestamps. This would provide better performance and resource utilization vs. mentioned earlier singleton orchestration. Other option may be using any suitable implementation of a queue.
You can move to a database system instead of a file. That would be very simply solution and also very efficient.
If you don't want to go that way, you must implement file locking or a semaphore inside of your application so the new threads will wait for other threads to finish writing.

How often should I have my server sync to the database?

I am developing a web-app right now, where clients will frequently (every few seconds), send read/write requests on certain data. As of right now, I have my server immediately write to the database when a user changes something, and immediately read from the database when they want to view something. This is working fine for me, but I am guessing that it would be quite slow if there were thousands of users online.
Would it be more efficient to save write requests in an object on the server side, then do a bulk update at a certain time interval? This would help in situations where the same data is edited multiple times, since it would now only require one db insert. It would also mean that I would read from the object for any data that hasn't yet been synced, which could mean increased efficiency by avoiding db reads. At the same time though, I feel like this would be a liability for two reasons: 1. A server crash would erase all data that hasn't yet been synced. 2. A bulk insert has the possibility of creating sudden spikes of lag due to mass database calls.
How should I approach this? Is my current approach ok, or should I queue inserts for a later time?
If a user makes a change to data and takes an action that (s)he expects will save the data, you should do everything you can to ensure the data is actually saved. Example: Let's say you delay the write for a while. The user is in a hurry, makes a change then closes the browser. If you don't save right when they take an action that they expect saves the data, there would be a data loss.
Web stacks generally scale horizontally. Don't start to optimize this kind of thing unless there's evidence that you really have to.

Implementing multithreaded application under C

I am implementing a small database like MySQL.. Its a part of a larger project..
Right now i have designed the core database, by which i mean i have implemented a parser and i can now execute some basic sql queries on my database.. it can store, update, delete and retrieve data from files.. As of now its fine.. however i want to implement this on network..
I want more than one user to be able to access my database server and execute queries on it at the same time... I am working under Linux so there is no issue of portability right now..
I know i need to use Sockets which is fine.. I also know that i need to use a concept like Thread Pool where i will be required to create a maximum number of threads initially and then for each client request wake up a thread and assign it to the client..
As for now what i am unable to figure out is how all this is actually going to be bundled together.. Where should i implement multithreading.. on client side / server side.? how is my parser going to be configured to take input from each of the clients separately?(mostly via files i think?)
If anyone has idea about how i can implement this pls do tell me bcos i am stuck here in this project...
Thanks.. :)
If you haven't already, take a look at Beej's Guide to Network Programming to get your hands dirty in some socket programming.
Next I would take his example of a stream client and server and just use that as a single threaded query system. Once you have got this down, you'll need to choose if you're going to actually use threads or use select(). My gut says your on disk database doesn't yet support parallel writes (maybe reads), so likely a single server thread servicing requests is your best bet for starters!
In the multiple client model, you could use a simple per-socket hashtable of client information and return any results immediately when you process their query. Once you get into threading with the networking and db queries, it can get pretty complicated. So work up from the single client, add polling for multiple clients, and then start reading up on and tackling threaded (probably with pthreads) client-server models.
Server side, as it is the only person who can understand the information. You need to design locks or come up with your own model to make sure that the modification/editing doesn't affect those getting served.
As an alternative to multithreading, you might consider event-based single threaded approach (e.g. using poll or epoll). An example of a very fast (non-SQL) database which uses exactly this approach is redis.
This design has two obvious disadvantages: you only ever use a single CPU core, and a lengthy query will block other clients for a noticeable time. However, if queries are reasonably fast, nobody will notice.
On the other hand, the single thread design has the advantage of automatically serializing requests. There are no ambiguities, no locking needs. No write can come in between a read (or another write), it just can't happen.
If you don't have something like a robust, working MVCC built into your database (or are at least working on it), knowing that you need not worry can be a huge advantage. Concurrent reads are not so much an issue, but concurrent reads and writes are.
Alternatively, you might consider doing the input/output and syntax checking in one thread, and running the actual queries in another (query passed via a queue). That, too, will remove the synchronisation woes, and it will at least offer some latency hiding and some multi-core.

what is faster database querys or file writing/reading

I know that in normal cases is faster to read/write from a file, but if I created a chat system:
Would it be faster to write and read from a file or to insert/select data in a db and cahe results?
Database is faster. AND importantly for you, deals with concurrent access.
Do you really want a mechanical disk action every time someone types? Writing to disk is a horrible idea. Cache messages in memory. Clear the message once it is sent to all users in the room. The cache will stay small, most of the time empty. This is your best option if you don't need a history log.
But if you need a log....
If you write a large amount of data in 1 pass, I guarantee the file will smoke database insert performance. A bulk insert feature of the database may match the file, but it requires a file data source to begin with. You would need to queue up a lot of messages in memory, then periodically flush to the file.
For many small writes the gap will close and the database will pull ahead. Indexes will influence the insert speed. If thousands of users are inserting to a heavily indexed table you may have problems.
Do your own tests to prove what is faster. Simulate a realistic load, not a 1 user test.
Databases by far.
Databases are optimized for data storage which is constantly updated and changed as in your case. File storage is for long-term storage with few changes.
(even if files were faster I would still go with databases because it's easier to develop and maintain)
Since I presume your system would write/read data continuously (as people type their messages), writing them to a file would take longer time because of the file handling procedure, i.e.
open file for writing
lock file
write & save
unlock file
I would go with db.

performance of web app with high number of inserts

What is the best IO strategy for a high traffic web app that logs user behaviour on a website and where ALL of the traffic will result in an IO write? Would it be to write to a file and overnight do batch inserts to the database? Or to simply do an INSERT (or INSERT DELAYED) per request? I understand that to consider this problem properly much more detail about the architecture would be needed, but a nudge in the right direction would be much appreciated.
By writing to the DB, you allow the RDBMS to decide when disk IO should happen - if you have enough RAM, for instance, it may be effectively caching all those inserts in memory, writing them to disk when there's a lighter load, or on some other scheduling mechanism.
Writing directly to the filesystem is going to be bandwidth-limited more-so than writing to a DB which then writes, expressly because the DB can - theoretically - write in more efficient sizes, contiguously, and at "convenient" times.
I've done this on a recent app. Inserts are generally pretty cheap (esp if you put them into an unindexed hopper table). I think that you have a couple of options.
As above, write data to a hopper table, if what ever application framework supports batched inserts, then use these, it will speed it up. Then every x requests, do a merge (via an SP call) into a master table, where you can normalize off data that has low entropy. For example if you are storing if the HTTP type of the request (get/post/etc), this can only ever be a couple of types, and better to store as an Int, and get improved I/O + query performance. Your master tables can also be indexed as you would normally do.
If this isn't good enough, then you can stream the requests to files on the local file system, and then have an out of band (i.e seperate process from the webserver) suck these files up and BCP them into the database. This will be at the expense of more moving parts, and potentially, a greater delay between receiving requests and them finding their way into the database
Hope this helps, Ace
When working with an RDBMS the most important thing is optimizing write operations to disk. Something somewhere has got to flush() to persistant storage (disk drives) to complete each transaction which is VERY expensive and time consuming. Minimizing the number of transactions and maximizing the number of sequential pages written is key to performance.
If you are doing inserts sending them in bulk within a single transaction will lead to more effecient write behavior on disk reducing the number of flush operations.
My recommendation is to queue the messages and periodically .. say every 15 seconds or so start a transaction ... send all queued inserts ... commit the transaction.
If your database supports sending multiple log entries in a single request/command doing so can have a noticable effect on performance when there is some network latency between the application and RDBMS by reducing the number of round trips.
Some systems support bulk operations (BCP) providing a very effecient method for bulk loading data which can be faster than the use of "insert" queries.
Sparing use of indexes and selection of sequential primary keys help.
Making sure multiple instances either coordinate write operations or write to separate tables can improve throughput in some instances by reducing concurrency management overhead in the database.
Write to a file and then load later. It's safer to be coupled to a filesystem than to a database. And the database is more likely to fail than the your filesystem.
The only problem with using the filesystem to back writes is how you extend the log.
A poorly implemented logger will have to open the entire file to append a line to the end of it. I witnessed one such example case where the person logged to a file in reverse order, being the most recent entries came out first, which required loading the entire file into memory, writing 1 line out to the new file, and then writing the original file contents after it.
This log eventually exceeded phps memory limit, and as such, bottlenecked the entire project.
If you do it properly however, the filesystem reads/writes will go directly into the system cache, and will only be flushed to disk every 10 or more seconds, ( depending on FS/OS settings ) which has a negligible performance hit compared to writing to arbitrary memory addresses.
Oh yes, and whatever system you use, you'll need to think about concurrent log appending. If you use a database, a high insert load can cause you to have deadlock conditions, and on files, you need to make sure that you're not going to have 2 concurrent writes cancel each other out.
The insertions will generally impact the (read/update) performance of the table. Perhaps you can do the writes to another table (or database) and have batch job that processes this data. The advantages of the database approach is that you can query/report on the data and all the data is logically in a relational database and may be easier to work with. Depending on how the data is logged to text file, you could open up more possibilities for corruption.
My instinct would be to only use the database, avoiding direct filesystem IO at all costs. If you need to produce some filesystem artifact, then I'd use a nightly cron job (or something like it) to read DB records and write to the filesystem.
ALSO: Only use "INSERT DELAYED" in cases where you don't mind losing a few records in the event of a server crash or restart, because some records almost certainly WILL be lost.
There's an easier way to answer this. Profile the performance of the two solutions.
Create one page that performs the DB insert, another that writes to a file, and another that does neither. Otherwise, the pages should be identical. Hit each page with a load tester (JMeter for example) and see what the performance impact is.
If you don't like the performance numbers, you can easily tweak each page to try and optimize performance a bit or try new solutions... everything from using MSMQ backed by MSSQL to delayed inserts to shared logs to individual files with a DB background worker.
That will give you a solid basis to make this decision rather than depending on speculation from others. It may turn out that none of the proposed solutions are viable or that all of them are viable...
Hello from left field, but no one asked (and you didn't specify) how important is it that you never, ever lose data?
If speed is the problem, leave it all in memory, and dump to the database in batches.
Do you log more than what would be available in the webserver logs? It can be quite a lot, see Apache 2.0 log information for example.
If not, then you can use the good old technique of buffering then batch writing. You can buffer at different places: in memory on your server, then batch insert them in db or batch write them in a file every X requests, and/or every X seconds.
If you use MySQL there are several different options/techniques to load efficiently a lot of data: LOAD DATA INFILE, INSERT DELAYED and so on.
Lots of details on insertion speeds.
Some other tips include:
splitting data into different tables per period of time (ie: per day or per week)
using multiple db connections
using multiple db servers
have good hardware (SSD/multicore)
Depending on the scale and resources available, it is possible to go different ways. So if you give more details, i can give more specific advices.
If you do not need to wait for a response such as a generated ID, you may want to adopt an asynchronous strategy using either a message queue or a thread manager.

Resources