So I have a Drupal 7 database with 2 million users that need to move to Drupal 8 with a minimum of downtime (the target is an hour). The Drupal Migrate module appears to solve this problem, but it writes new rows one item at a time, and in my tests, 4,000 users plus related data took 20 minutes on frankly beastly AWS instances. Extrapolating to the full dataset, the migration would take 7 days to run, and that amount of downtime is not reasonable.
I've made a feature request against Drupal core, but I also wanted to see if the community has any ideas I've missed, and to spawn some discussion about the issue.
If anyone still cares about this, I have resolved this issue. Further research showed that not only does the Drupal migration module write new rows one at a time, but it also reads rows from the source one at a time. Further, for each row Drupal will write to a mapping table for the source table so that it can support rollback and update.
Since a user's data is stored in one separate table per custom field, this results in something like 8 reads and 16 writes for each user.
I ended up extending Drupal's MigrateExecutable to run the process. I overrode both the part that reads data and the part that writes it so that they do their work in batches and don't write to the mapping tables. I believe my projected time is now down to less than an hour (a speedup of 168 times!).
Still, trying to use the Drupal infrastructure was more trouble than it was worth. If you are doing this yourself, just write a command-line application and do the SQL queries manually.
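For anyone going the manual route, here is a minimal sketch of what a set-based copy can look like in plain SQL, assuming a Drupal 7 source and a Drupal 8 target on the same MySQL server. The schema names and exact column lists are illustrative only; check them against your actual tables, and remember that each custom field lives in its own field data table that can be copied the same way.

-- Set-based copy of user accounts instead of one row at a time.
-- "drupal7" and "drupal8" are hypothetical schema names.
INSERT INTO drupal8.users (uid, uuid, langcode)
SELECT uid, UUID(), 'en'
FROM drupal7.users
WHERE uid > 0;

INSERT INTO drupal8.users_field_data
  (uid, langcode, preferred_langcode, name, pass, mail, timezone,
   status, created, changed, access, login, default_langcode)
SELECT uid, 'en', 'en', name, pass, mail, timezone,
       status, created, created, access, login, 1
FROM drupal7.users
WHERE uid > 0;

Skipping the per-row migrate map bookkeeping (at the cost of losing rollback and update support) is where the rest of the time goes.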
EDIT This is not a duplicate, as I do not have any Memo fields. I am also not grouping anything. The corruption is always found within the Prime table.
Lately, I am often getting a single line of data in my Access 2010 DB coming up with a load of Chinese characters. It has happened before, but lately it is becoming a regular occurrence and I would very much like it to stop.
Here is what I have going on and what limitations I have.
Access split database. Multiple users. Users only have an *.accdr front end to work in, stored locally on their desktops (only about 6 users in total). They all use the Access 2010 runtime; very few have full MS Access on their machines.
The back end is stored on a large shared sitewide drive (or series of drives) that on all users' machines is simply "G:". This drive, it should be noted, occasionally has issues like being too full. I have no means to put the back end on a dedicated machine, and other software is out of the question. IT support is offsite, and frankly they are about as clued in as AOL tech support was in the 90's.
Normal daily procedure is to load the output from another program into a merge table. This merge table is kept so we can spot changes and duplication. The merge table is then appended into the Prime table. The primary key in the Prime table prevents overwriting existing information. The primary key is on 5 different columns in the Prime table; each column may have legitimate repeating values, but the combination of those values is unique. I have no pre-defined relationships; all relationships are defined at the query level. A backup of the data in the Prime table is made by exporting it to an Excel file once per day. I run Compact and Repair on the database every couple of weeks.
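(For reference, the append step itself is just an ordinary append query, something along these lines; the column names here are generic placeholders. Rows whose five-column key already exists in Prime get rejected as key violations and the rest go in.)

INSERT INTO Prime (Key1, Key2, Key3, Key4, Key5, OtherData)
SELECT Key1, Key2, Key3, Key4, Key5, OtherData
FROM MergeTable;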
Every once in a long while, some hiccup in the universe, or data collision, or strange hard drive problem would cause a line in the Prime table to turn into Chinese characters. When that happens, I check the backup Excel file to make sure the corruption is not there. I then have everyone get out of the database, run a Compact and Repair, remove the offending line, C&R again, and get on with my day. This used to happen maybe once every 2 months.
Now I am getting this corruption on what seems to be an accelerating cycle: once a week, then 3 times a week, and now it seems to be daily.
The recent changes to the front end have all been form-level stuff, nothing in the queries themselves.
My boss won't accept the "Unusual sunspot and solar flare activity" excuse anymore.
What should I do to prevent this (within my limitations)?
Thanks in advance folks.
EDIT 2
The last few days we have been systematically testing various things to try to reproduce and isolate the corruption. I have an additional person who normally runs the daily update per my instructions. We reviewed the process and found no problems or deviations. I have access to 4 different machines I can run the updates on, so on day one we used my daily-use computer (Access 2013), checking for corruption step by step. No corruption. Day 2 was on a machine that only has Access 2010, with the same step-by-step checks. No corruption. Day 3 will be on my co-worker's machine with the same checks. I'll update as I go. I wonder if the problem could be machine-specific.
After some careful testing, incorporating the advice in all of the comments, we determined that the problem most likely rests with the fact that the drive the DB resides on was getting full. The problem started when the drive was approaching about 90% of capacity. Since then, the drive has been somewhat cleaned of old files, it is now at about 60% of capacity, and the corruption problems have gone away. We'll keep monitoring.
Thanks again for all the advice, and I hope this helps others in the future!
The setup:
A CouchDB 2.0 instance running in Docker on a Raspberry Pi 3
A Node application that uses PouchDB, also in Docker on the same Pi 3
The scenario:
At any given moment, CouchDB has at most 4 databases with a total of about 60 documents
The Node application purges (using PouchDB's destroy) and recreates these databases periodically (some of them every two seconds, others every 15 minutes)
The databases are always recreated with the newest entries
The reason for purging the databases instead of deleting their documents is that I'd otherwise have a huge number of deleted documents, and my web client can't handle syncing all of those deleted documents
The problem:
The file var/lib/couchdb/_dbs.couch always keeps growing; it never shrinks. The last time I left it alone for three weeks, it grew to 37 GB. Fauxton showed that CouchDB only contains these (at most 60) documents, but the file still keeps growing until it fills all the available space.
What I tried:
running everything on an x86 machine (OS X)
running CouchDB without Docker (because of this info)
using CouchDB 2.1
running compaction manually (which didn't do anything)
googling for about 3 days now
Whatever I do, I always get the same result: _dbs.couch keeps growing. I also wasn't really able to find out what that file's purpose is; googling that specific filename only yields two pages of search results, none of which are specific.
The only thing I can currently do is manually delete this file from time to time and restart the Docker container. That deletes all my databases, but it's not a problem, as the Node application recreates them soon after.
The _dbs database is a meta-database. It records the locations of all the shards of your clustered databases, but since it's a CouchDB database too (though not a sharded one), it also needs compacting from time to time.
Try:
curl localhost:5986/_dbs/_compact -XPOST -Hcontent-type:application/json
You can enable the compaction daemon to do this for you, and we enable it by default in the recent 2.1.0 release.
Add this to the end of your local.ini file and restart CouchDB:
[compactions]
_default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}]
I have a site in Drupal; at times it's really slow and I have to reboot the server.
I see MySQL is consuming too many resources.
I have a large table in my Drupal database, devel_times: it contains over 846,000,000 rows, and the table itself is about 30 GB. Could that be causing the problem? I see it logs new entries all the time. Can I empty that table?
Even if this issue is related to Drupal 5, I think the answer is still valid. So I would definitely say yes, you can delete those records :)
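If it helps, emptying it is a one-liner, assuming nothing else depends on those rows; it also reclaims the space more cleanly than a row-by-row delete. And if the table is being filled by the Devel module's timer/query collection, turning that off should keep it from growing back.

-- Empties the devel_times table in one statement; the table itself remains.
TRUNCATE TABLE devel_times;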
I am not a great VB programmer, but I am tasked with maintaining/enhancing a VB6 desktop application that uses Sybase ASE as a back-end. This app has about 500 users.
Recently, I added functionality to this application which performs an additional insert/update to a single row in the database, the key field being the transaction number, which is indexed. The table being updated generally has about 6,000 records in it, as records are removed when transactions are completed. After deployment, the app worked fine for a day and a half before users started reporting slow performance.
Eventually, we traced the performance issue to a table lock in the database and had to roll back to the previous version of the app. The first day of use was on Monday, which is generally a very heavy day for system use, so I'm confused why the issue didn't appear on that day.
In the code that was in place, there is a call to start a Sybase transaction. Within the block between the BeginTrans and CommitTrans, there is a call to a DLL file that updates the database. I placed my new code in a class module in the DLL.
I'm confused as to why a single insert/update to a single row would cause such a problem, especially since the system had been working okay before the change. Is it possible I've exposed a larger problem here? Or that I just need to reconsider my approach?
Thanks in advance to anyone who has been in a similar situation and can offer advice.
It turns out that the culprit was a message box that appears within the scope of the BeginTrans and CommitTrans calls. The user with the message box would maintain a blocking lock on the database until they acknowledged the message. The solution was to move the message box outside of the aforementioned scope.
I am not able to understand the complete picture without the SQL code that you are using.
Also, if it is a single insert OR update, why are you using a transaction? Is it possible that many users will try to update the same row?
It would be helpful if you posted both the VB code and your SQL (with the query plan if possible). However, with the information we have, I would run update statistics table_name against the table to make sure that the query plan is up to date.
If you're sure that your code has to run within a transaction, have you tried adding your own transaction block containing your SQL rather than using the one already there?
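For what it's worth, here is a sketch of what that could look like on the Sybase side: keep the insert/update in its own short transaction so the lock is released immediately instead of living inside the app's long BeginTrans/CommitTrans block. Table, column, and variable names here are invented.

-- Hypothetical upsert kept in its own short transaction.
DECLARE @txn_number int
SELECT @txn_number = 12345

BEGIN TRANSACTION
    UPDATE txn_status
    SET status = 'PROCESSED'
    WHERE txn_number = @txn_number

    IF @@rowcount = 0
        INSERT INTO txn_status (txn_number, status)
        VALUES (@txn_number, 'PROCESSED')
COMMIT TRANSACTION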
Where I'm at, there is a main system that runs on a big AIX mainframe. To facilitate reporting and operations, there is a nightly dump from the mainframe into SQL Server, such that each of our 50-ish clients is in their own database with identical schemas. This dump takes about 7 hours to finish each night, and there's not really anything we can do about it: we're stuck with what the application vendor has provided.
After the dump into SQL Server, we use that data to run a number of other daily procedures. One of those procedures imports data into a kind of management reporting sandbox table, which combines records from a particularly important table across the different databases into one table that managers who don't know SQL can use to run ad-hoc reports without hosing up the rest of the system. This, again, is a business thing: the managers want it, and they have the power to see that we implement it.
The import process for this table takes a couple of hours on its own. It filters about 40 million records spread across 50 databases down to about 4 million records, and then indexes them on certain columns for searching. Even at a couple of hours it's still less than a third as long as the initial load, but we're running out of time for overnight processing, we don't control the mainframe dump, and we do control this. So I've been tasked with looking for ways to improve the existing procedure.
Currently, the philosophy is that it's faster to load all the data from each client database and then index it afterwards in one step. Also, in the interest of avoiding bogging down other important systems in case it runs long, a couple of the larger clients are set to always run first (the main index on the table is by a clientid field). One other thing we're starting to do is load data from a few clients at a time in parallel, rather than each client sequentially.
So my question is, what would be the most efficient way to load this table? Are we right in thinking that indexing later is better, or should we create the indexes before importing data? Should we be loading the table in index order, to avoid massive re-ordering of pages, rather than big clients first? Could loading in parallel make things worse by causing too much disk access at once, or by removing our ability to control the order? Any other ideas?
Update
Well, something is up. I was able to do some benchmarking during the day, and there is no difference at all in the load time whether the indexes are created at the beginning or at the end of the operation, but creating them at the beginning saves the time of building the index itself afterwards (it of course builds nearly instantly with no data in the table).
I have worked with loading bulk sets of data in SQL Server quite a bit and did some performance testing comparing loading with the index in place versus adding it afterwards. I found that it was BY FAR more efficient to create the index after all the data was loaded. In our case it took 1 hour to load with the index added at the end, and 4 hours to load with the index still on.
I think the key is to get the data moved as quickly as possible. I am not sure whether loading it in order really helps; do you have any stats on load time vs. index time? If you do, you could start to experiment a bit on that side of things.
Loading with the indexes dropped is better, as a live index will generate several I/Os for every row inserted. 4 million rows is small enough that you would not expect a significant benefit from table partitioning.
You could get a performance win by using bcp to load the data into the staging area and running several tasks in parallel (SSIS will do this). Write a generic batch file wrapper for bcp that takes the file path (and table name if necessary) and invoke a series of jobs in half a dozen threads with 'Execute Process' tasks in SSIS. For 50 jobs it's probably not worth trying to write a data-driven load controller process. Wrap these tasks up in a sequence container so you don't have to maintain all of the dependencies explicitly.
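As a rough T-SQL illustration of the per-client load (using BULK INSERT here instead of a bcp batch file, with made-up file and table names), each parallel task boils down to something like:

-- Bulk-load one client's extract into its staging table.
-- Path, table name, and terminators are placeholders.
BULK INSERT dbo.Staging_Client01
FROM 'D:\exports\client01.dat'
WITH (
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 100000,
    TABLOCK
);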
You should definitely drop and re-create the indexes as this will greatly reduce the amount of I/O during the process.
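Concretely, that pattern is just the following (index, table, and column names are placeholders):

-- Drop the search index, run all of the per-client loads, then rebuild once.
DROP INDEX IX_Sandbox_ClientId ON dbo.ReportingSandbox;
-- ... per-client loads go here ...
CREATE INDEX IX_Sandbox_ClientId ON dbo.ReportingSandbox (ClientId);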
If the 50 sources are being treated identically, try loading them into a common table or building a partitioned view over the staging tables.
Index at the end, yes. Also consider switching the database's recovery model to BULK_LOGGED to minimize writes to the transaction log. Just remember to set it back to FULL after you've finished.
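In T-SQL that's roughly the following, with a placeholder database name:

-- Bulk loads and index builds are minimally logged in this mode.
ALTER DATABASE ReportingSandbox SET RECOVERY BULK_LOGGED;
-- ... load and index here ...
ALTER DATABASE ReportingSandbox SET RECOVERY FULL;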
To the best of my knowledge, you are correct - it's much better to add the records all at once and then index once at the end.