This is a rudimentary web indexer.
I have 2 database tables:
domainList:
PK domainName
UI domainNumber
status ... start, indexing, completed.
and
domainPages
PK pageNumber
FK domainNumber
pageHTML
pageTitle
I have several "indexer" servers that load each website's HTML and store it in the database.
As the database gets bigger, it is slowing down considerably.
"INSERT INTO domainPages (domainNumber, pageHTML, pageTitle) VALUES ('" & domainNumber & "', N'" & domainPageHTML & "', N'" & domainPageTitle & "')"
This is taking a long time as there are many rows. Reading from the table is taking a long time too.
I could create a new table for each set of domainPages, but I'd rather try something new: I'm looking at database partitioning to help.
All the examples on the 'net about partitioning use a date field, whereas here I need to partition on the domainNumber in the domainPages table (which is a logical foreign key to domainList, as I believe an actual relationship will fail with partitioning).
So I think I'm looking at a partition per unique domain? If that is correct, how would I do this? Are there any examples online that don't involve a date field, but a logical foreign key on a table?
I had no answers to this question, so I had to use separate tables for each domain. Which means it takes a while to browse all the tables!
I did spot this, however, which is what I would have needed: http://blog.sqlauthority.com/2008/01/25/sql-server-2005-database-table-partitioning-tutorial-how-to-horizontal-partition-database-table/ It is there for anyone else looking in future.
I've not tried it, but I assume that for higher IDs you would run another function, e.g. once IDs reach 500, like in this example:
--Determine where values live before new partition
SELECT $PARTITION.Left_Partition (501) --should return a value of 4
--Create new partition
ALTER PARTITION FUNCTION Left_Partition ()
SPLIT RANGE(500)
--Determine where values live after new partition
SELECT $PARTITION.Left_Partition (501) --should return a value of 5
According to this article (see the section "Consider the table created in Figure 2"), you can add a new partition to this table to contain values greater than 500, as shown above:
http://technet.microsoft.com/en-gb/magazine/2007.03.partitioning.aspx
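For anyone needing a starting point that does not involve a date field, here is a minimal, untested sketch of partitioning domainPages on domainNumber. The function/scheme names, filegroup, boundary values, and column types are illustrative assumptions, not the original setup:
-- Range-right partition function on the domain number (boundary values are examples)
CREATE PARTITION FUNCTION pfDomainNumber (int)
AS RANGE RIGHT FOR VALUES (100, 200, 300, 400, 500);
-- Map every partition to the PRIMARY filegroup to keep the example simple
CREATE PARTITION SCHEME psDomainNumber
AS PARTITION pfDomainNumber ALL TO ([PRIMARY]);
-- domainPages is created on the partition scheme, partitioned by domainNumber
CREATE TABLE domainPages (
    pageNumber   int IDENTITY(1,1) NOT NULL,
    domainNumber int NOT NULL,          -- logical FK to domainList
    pageHTML     nvarchar(max) NULL,
    pageTitle    nvarchar(400) NULL,
    CONSTRAINT PK_domainPages PRIMARY KEY (domainNumber, pageNumber)
) ON psDomainNumber (domainNumber);
-- As domain numbers approach the last boundary, add a new one (mirrors the SPLIT RANGE example above)
ALTER PARTITION SCHEME psDomainNumber NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfDomainNumber () SPLIT RANGE (600);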
Hello, I am currently trying different data automation processes with Python and PostgreSQL. I automated the cleaning and upload of a dataset with 40,000 data entries into my database. Due to some flaws in my process I had to truncate some tables or data entries.
I am using: Python 3.9.7 / PostgreSQL 13.3 / pgAdmin 4 v5.7
Problem
Currently I have tables whose IDs start at 44700 instead of 1 (due to my editing).
For example, a table of train stations begins with ID 41801 and ends with ID 83599.
Question
How can I reorganize my IDs so that they run from 1 instead of starting at 41801?
After looking online I found topics like "bloat" and "reindex". I tried VACUUM and REINDEX but neither really showed a difference in my tables. As of now my tables have no relations to each other. What would be the approach to solve my problem in PostgreSQL? Is there some hidden function I overlooked? Maybe it's not a problem at all, but it definitely looks weird. At some point I end up with an ID of 250,000 while only having 40,000 data entries in my table.
Do you use a Sequence to generate the ID column of your table? You can check in pgAdmin whether you have a Sequence object in your database: Schemas -> public -> Sequences.
You can change the current sequence number with a right-click on the Sequence and set it to '1'. But only do this if you have deleted all rows in the table, and before you start to import your data again.
As long as you do not have any other table that references the ID column of your train station table, you can even update the IDs with an UPDATE statement like:
UPDATE trainStations SET ID = ID - 41800;  -- 41800, not 41801, so the lowest ID (41801) becomes 1
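If the ID column is backed by a sequence, the same reset can also be done in SQL. A sketch, assuming the sequence is named trainstations_id_seq (the actual name is visible in pgAdmin or via \d trainstations):
-- Restart the sequence so newly inserted rows get IDs from 1 again
ALTER SEQUENCE trainstations_id_seq RESTART WITH 1;
-- Or align it with whatever data is currently in the table
SELECT setval('trainstations_id_seq', (SELECT COALESCE(MAX(id), 0) + 1 FROM trainstations), false);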
I'm new to DynamoDB but not to NoSQL (I've already done some projects using Firebase).
Having read that a DynamoDB best practice is one table per application, I've been having a hard time working out how to design my 1-to-N relationship.
I have this entity (pseudo-json):
{
machineId: 'HASH_ID'
machineConfig: /* a lot of fields */
}
A machineConfig is unique to each machine and changes rarely, and only by an administrator (no consistency issue here).
The issue is that I have to manage a log of data from the sensors of each machine. The log is described as:
{
machineId: 'HASH_ID',
sensorsData: [
/* Huge list of: */
{ timestamp: ..., data: /* lot of fields */ },
...
]
}
I want to keep my machineConfig in one place. The log list can't be inserted into the machine entity because it's a continuous stream of data taken over time.
Furthermore, I don't understand what the composite key could be. The partition key is obviously the machineId, but what about the sort key?
How to design this relationship taking into account the potential dimensions of data?
You could do this with 1 table. The primary key could be (machineId, sortKey) where machineId is the partition key and sortKey is a string attribute that is going to be used to cover the 2 cases. You could probably come up with a better name.
To store the machineConfig you would insert an item with primary key (machineId, "CONFIG"). The sortKey attribute would have the constant value CONFIG.
To store the sensorsData you could use the timestamp as the sortKey value. You would insert a new item for each piece of sensor data. You would store the timestamp as a string (time since the epoch, ISO 8601, etc.).
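A sketch of what items might look like under this scheme (pseudo-json as above; timestamps and field contents are illustrative):
{ machineId: 'HASH_ID', sortKey: 'CONFIG', /* machineConfig fields */ }
{ machineId: 'HASH_ID', sortKey: '2021-06-01T10:00:00Z', data: { /* lot of fields */ } }
{ machineId: 'HASH_ID', sortKey: '2021-06-01T10:00:05Z', data: { /* lot of fields */ } }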
Then to query everything about a machine you would run a Dynamo query specifying just the machineId partition key - this would return many items including the machineConfig and the sensor data.
To query just the machineConfig you would run a Dynamo query specifying the machineId partition key and the constant CONFIG as the sortKey value.
To query the sensor data you could specify an exact timestamp or a timestamp range for the sortKey. If you need to query the sensor data by other values then this design might not work as well.
Editing to answer follow up question:
You would have to resort to a scan with a filter to return all machines with their machineId and machineConfig. If you end up inserting a lot of sensor data then this will be a very expensive operation to perform as Dynamo will look at every item in the table. If you need to do this you have a couple of options.
If there are not a lot of machines you could insert an item with a primary key like ("MACHINES", "ALL") and a list of all the machineIds. You would query on that key to get the list of machineIds, then you would do a bunch of queries (or a batch get) to retrieve all the related machineConfigs. However since the max Dynamo item size is 400KB you might not be able to fit them all.
If there are too many machines to fit in one item you could alter the above approach a bit and have ("MACHINES", $machineIdSubstring) as a primary key and store chunks of machineIds under each sort key. For example, all machineIds that start with 0 go in ("MACHINES", "0"). Then you would query by each primary key 0-9, build a list of all machineIds and query each machine as above.
Alternatively, you don't have to put everything in 1 table - it is just a guideline that fits a lot of use cases. If there are too many machines to fit in less than 400KB but there aren't tens of thousands and you aren't trying to query all of them all the time, you could have a separate table of machineId and machineConfig that you resort to scanning when necessary.
I have seen something like this asked a number of times but not quite in this configuration. I have a table that has a one-to-many relation.
Let's say I have a computer table and a parts table. The user enters generic info in the computer table, then selects parts that are stored in the parts table with a relationship to the computer table via computerId. So the original write is a simple insert. Now let's say the user selects the computer again and changes the parts on the PC: adds some new ones, removes some, and updates a few. Then the user hits save to save the changes. I run a simple update on the computer table, but now comes the issue with the parts table.
Would it be better to delete all the records from the parts table for the computerId and then do a clean insert of all the parts selected?
Or run some method that looks at the existing parts in the table and, where a part has been updated, updates the record; where a part no longer exists, deletes it; and then inserts the remaining parts?
Clearly the simple solution is to delete all and then insert all.
The downside of this is SQL traffic, locks, and table fragmentation.
If it is a small table and there are only a few concurrent users, then that's fine.
In a high-volume environment I do the following. There is no update - that case is just an ignore:
- delete items that are gone
- ignore any items not changed
- insert new items
And you can do that in one pass, with two or three statements.
Or you could define a stored procedure.
Do the delete before the insert to clear space first.
You can get real fancy and use an update for delete / insert but that just gets more complex than it is worth in my mind. You would still have an insert or a delete if the item count is not the same.
delete comp_part
where compID = #compID and partID not in (....);
Insert is a little more tricky:
You can do it with a series of inserts and, if you have a PK, just let the duplicate inserts fail.
The other way is to create a #table and use it for both the delete and the insert, as sketched below.
This is only worth the hassle if you have a REALLY busy table.
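A rough sketch of the #table approach, assuming the comp_part(compID, partID) association from the delete above, a @compID parameter, and that the newly selected part IDs have been loaded into the temp table:
-- Stage the parts the user selected this time
CREATE TABLE #newParts (partID int NOT NULL PRIMARY KEY);
-- (populate #newParts with the selected partIDs here)
-- Remove associations that are no longer selected
DELETE cp
FROM comp_part cp
WHERE cp.compID = @compID
  AND NOT EXISTS (SELECT 1 FROM #newParts np WHERE np.partID = cp.partID);
-- Add associations that are newly selected
INSERT INTO comp_part (compID, partID)
SELECT @compID, np.partID
FROM #newParts np
WHERE NOT EXISTS (SELECT 1 FROM comp_part cp
                  WHERE cp.compID = @compID AND cp.partID = np.partID);
DROP TABLE #newParts;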
It all depends upon the business model. If you want to track the transactions, then it is not a good option to delete them. If you keep all your old transactions with your customers, that would be beneficial for tracking purposes. Your CustomerID would be the primary key, and you can have another unique key, PartOrderID, which would be a unique value for each insert.
Hope this helps
Really you should have three tables: Product, Part, and ProductPart; the ProductPart table would store the association "this product has these parts". As far as updating goes, the simplest thing would be to delete all ProductParts for a given Product and re-insert the records you want.
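A minimal sketch of that delete-and-reinsert, assuming a ProductPart(ProductID, PartID) association table and that @productID and the part IDs come from the user's current selection (values are illustrative):
-- Wipe the existing associations for this product
DELETE FROM ProductPart WHERE ProductID = @productID;
-- Re-insert the parts currently selected
INSERT INTO ProductPart (ProductID, PartID)
VALUES (@productID, 101), (@productID, 102), (@productID, 105);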
I am looking for a pattern, framework, or best practice to handle the generic problem of application-level data synchronisation.
Let's take an example with only 1 table to make it easier.
I have an unreliable datasource of product catalog data. Data can occasionally be unavailable, incomplete, or inconsistent (issues might come from manual data entry errors, ETL failures, ...).
I have a live copy in a MySQL table in use by a live system. Let's say a website.
I need to implement a safety mechanism when updating the MySQL table to "synchronize" with the original data source. Here are the safety criteria and the solutions I am suggesting:
avoid deleting records when they temporarily disappear from the datasource => use a "deleted" boolean/date column or an archive/history table.
check for inconsistent changes => configure rules per column, such as: should never change, should only increment, ...
check for integrity issues => (standard problem, no point discussing the approach)
ability to roll back the last sync => restore from a history table? use a version increment/date column?
What I am looking for is best practices and patterns/tools to handle such a problem. Even if you are not pointing to THE solution, I would be grateful for any keyword suggestions that would help me narrow down which field of expertise to explore.
We have the same problem importing data from web analytics providers - they suffer the same problems as your catalog. This is what we did:
Every import/sync is assigned a unique id (auto_increment int64)
Every table has a history table that is identical to the original but has an additional column "superseded_id", which gets the import id of the import that changed the row (deletion is a change); the primary key is (row_id, superseded_id) - see the sketch after this list
Every UPDATE copies the row to the history table before changing it
Every DELETE moves the row to the history table
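A minimal sketch of the table pair under this scheme, using hypothetical names and a cut-down catalog table (the real tables will have more columns):
-- One row per import/sync run
CREATE TABLE imports (
  import_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  started_at DATETIME NOT NULL
);
-- Live table used by the website
CREATE TABLE products (
  product_id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  price       DECIMAL(10,2) NOT NULL,
  description TEXT
);
-- Identical structure plus superseded_id: the import that changed or deleted this version
CREATE TABLE products_history (
  product_id    BIGINT UNSIGNED NOT NULL,
  price         DECIMAL(10,2) NOT NULL,
  description   TEXT,
  superseded_id BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (product_id, superseded_id)
);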
This makes rollback very easy:
Find out the import_id of the bad import
REPLACE INTO main_table SELECT <everything but superseded_id> FROM history_table WHERE superseded_id=<bad import id>
DELETE FROM history_table WHERE superseded_id>=<bad import id>
For databases where performance is a problem, we do this in a secondary database on a different server, then copy the found-to-be-good main table to the production database into a new table main_table_$id, with $id being the highest import id, and have main_table be a trivial view doing SELECT * FROM main_table_$someid. Now by redefining the view to SELECT * FROM main_table_$newid we can atomically switch the table.
I'm not aware of a single solution to all this - probably because each project is so different. However, here are two techniques I've used in the past:
Embed the concept of version and validity into your data model
This is a way to deal with change over time without having to resort to history tables; it does complicate your queries, so you should use it sparingly.
For instance, instead of having a product table as follows
PRODUCTS
Product_ID primary key
Price
Description
AvailableFlag
In this model, if you want to delete a product, you execute "delete from products where product_id = ..."; modifying the price would be "update products set price = 1 where product_id = ....".
With the versioned model, you have:
PRODUCTS
product_ID primary key
valid_from datetime
valid_until datetime
deleted_flag
Price
Description
AvailableFlag
In this model, deleting a product requires you to update products set valid_until = getdate() where product_id = xxx and valid_until is null, and then insert a new row with the "deleted_flag = true".
Changing price works the same way.
This means that you can run queries against your "dirty" data and insert it into this table without worrying about deleting items that were accidentally missed off the import. It also allows you to see the evolution of the record over time, and roll-back easily.
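A hedged sketch of those two operations, using getdate() as in the text and illustrative values. Note that holding several versions of the same product assumes the key is effectively (product_ID, valid_from) rather than product_ID alone:
-- "Delete" product 42: close the current version, then add a row flagged as deleted
UPDATE products
SET valid_until = getdate()
WHERE product_id = 42 AND valid_until IS NULL;

INSERT INTO products (product_id, valid_from, valid_until, deleted_flag, Price, Description, AvailableFlag)
VALUES (42, getdate(), NULL, 1, 9.99, 'Widget', 0);

-- Changing the price is the same pattern: close the current version, insert the new one
UPDATE products
SET valid_until = getdate()
WHERE product_id = 42 AND valid_until IS NULL;

INSERT INTO products (product_id, valid_from, valid_until, deleted_flag, Price, Description, AvailableFlag)
VALUES (42, getdate(), NULL, 0, 12.50, 'Widget', 1);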
Use a ledger-like mechanism for cumulative values
Where you have things like "number of products in stock", it helps to create transactions to modify the amount, rather than take the current amount from your data feed.
For instance, instead of having an amount_in_stock column on your products table, have a "product_stock_transactions" table:
product_stock_transactions
product_id (FK)   transaction_date   transaction_quantity   transaction_source
1                 1 Jan 2012         100                    product_feed
1                 2 Jan 2012         -3                     stock_adjust_feed
1                 3 Jan 2012         10                     product_feed
On 2 Jan, the quantity in stock was 97; on 3 Jan, 107.
This design allows you to keep track of adjustments and their source, and is easier to manage when moving data from multiple sources.
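The current stock level then falls out of a simple aggregate over the ledger, for example:
-- Quantity currently in stock per product, derived from the transactions
SELECT product_id, SUM(transaction_quantity) AS quantity_in_stock
FROM product_stock_transactions
GROUP BY product_id;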
Both approaches can create large amounts of data - depending on the number of imports and the amount of data - and can lead to complex queries to retrieve relatively simple data sets.
It's hard to plan for performance concerns up front - I've seen both "history" and "ledger" work with large amounts of data. However, as Eugen says in his comment below, if you get to an excessively large ledger, it may be necessary to clean up the ledger table by summarizing the current levels and deleting (or archiving) old records.
For example, I have a table which stores details about properties, which could have owners, value, etc.
Is there a good design to keep the history of every change to owner and value? I want to do this for many tables. Kind of like an audit of the table.
What I thought was keeping a single table with fields
table_name, field_name, prev_value, current_val, time, user.
But it looks kind of hacky and ugly. Is there a better design?
Thanks.
There are a few approaches
Field based
audit_field (table_name, id, field_name, field_value, datetime)
This one can capture the history of all tables and is easy to extend to new tables. No changes to the structure are necessary for new tables.
Field_value is sometimes split into multiple fields to natively support the actual field type from the original table (but only one of those fields will be filled, so the data is denormalized; a variant is to split the above table into one table for each type).
Other metadata such as field_type, user_id, user_ip, action (update, delete, insert), etc. can be useful.
The structure of such records will most likely need to be transformed to be used.
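A minimal, MySQL-flavoured sketch of such a field-based audit table (names and types are illustrative, and the value is stored as plain text rather than in per-type columns):
CREATE TABLE audit_field (
    audit_id    BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    table_name  VARCHAR(64) NOT NULL,   -- table the change happened in
    record_id   BIGINT NOT NULL,        -- PK of the changed row in that table
    field_name  VARCHAR(64) NOT NULL,
    field_value TEXT NULL,              -- new value, serialized as text
    action      VARCHAR(10) NOT NULL,   -- insert / update / delete
    changed_by  VARCHAR(64) NULL,
    changed_at  DATETIME NOT NULL
);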
Record based
audit_table_name (timestamp, id, field_1, field_2, ..., field_n)
For each record type in the database create a generalized table that has all the fields as the original record, plus a versioning field (additional meta data again possible). One table for each working table is necessary. The process of creating such tables can be automated.
This approach provides you with semantically rich structure very similar to the main data structure so the tools used to analyze and process the original data can be easily used on this structure, too.
Log file
The first two approaches usually use tables which are very lightly indexed (or not indexed at all, with no referential integrity) so that the write penalty is minimized. Still, sometimes a flat log file might be preferred, though of course functionality is greatly reduced. (It basically depends on whether you want an actual audit/log that will be analyzed by some other system, or whether the historical records are part of the main system.)
A different way to look at this is to time-dimension the data.
Assuming your table looks like this:
create table my_table (
my_table_id number not null primary key,
attr1 varchar2(10) not null,
attr2 number null,
constraint my_table_ak unique (attr1, attr2) );
Then if you changed it like so:
create table my_table (
my_table_id number not null,
attr1 varchar2(10) not null,
attr2 number null,
effective_date date not null,
is_deleted number(1,0) default 0 not null,
constraint my_table_ak unique (attr1, attr2, effective_date),
constraint my_table_pk primary key (my_table_id, effective_date) );
You'd be able to have a complete running history of my_table, online and available. You'd have to change the paradigm of the programs (or use database triggers) to intercept UPDATE activity and turn it into INSERT activity, and to change DELETE activity into UPDATEs of the IS_DELETED boolean.
Unreason:
You are correct that this solution similar to record-based auditing; I read it initially as a concatenation of fields into a string, which I've also seen. My apologies.
The primary differences I see between the time-dimensioning the table and using record based auditing center around maintainability without sacrificing performance or scalability.
Maintainability: One needs to remember to change the shadow table if making a structural change to the primary table. Similarly, one needs to remember to make changes to the triggers which perform change-tracking, as such logic cannot live in the app. If one uses a view to simplify access to the tables, you've also got to update it, and change the instead-of trigger which would be against it to intercept DML.
In a time-dimensioned table, you make the structural change you need to, and you're done. As someone who's been the FNG on a legacy project, such clarity is appreciated, especially if you have to do a lot of refactoring.
Performance and Scalability: If one partitions the time-dimensioned table on the effective/expiry date column, the active records are in one "table", and the inactive records are in another. Exactly how is that less scalable than your solution? "Deleting" an active record involves row movement in Oracle, which is a delete-and-insert under the covers - exactly what the record-based solution would require.
The flip side of performance is that if the application is querying for a record as of some date, partition elimination allows the database to search only the table/index where the record could be; a view-based solution to search active and inactive records would require a UNION-ALL, and not using such a view requires putting the UNION-ALL in everywhere, or using some sort of "look-here, then look-there" logic in the app, to which I say: blech.
In short, it's a design choice; I'm not sure either's right or either's wrong.
In our projects we usually do it this way:
You have a table
properties(ID, value1, value2)
then you add table
properties_audit(ID, RecordID, timestamp or datetime, value1, value2)
ID is the id of the history record (not really required)
RecordID points to the record in the original properties table.
When you update the properties table, you add a new record to properties_audit with the previous values of the updated record. This can be done using triggers or in your DAL.
After that you have the latest value in properties and all the history (previous values) in properties_audit.
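For the trigger route, a MySQL-flavoured sketch (assuming the two tables above, an auto-increment ID on properties_audit, and that its timestamp column is called changed_at):
-- Copy the old values into properties_audit before they are overwritten
CREATE TRIGGER properties_before_update
BEFORE UPDATE ON properties
FOR EACH ROW
  INSERT INTO properties_audit (RecordID, changed_at, value1, value2)
  VALUES (OLD.ID, NOW(), OLD.value1, OLD.value2);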
I think a simpler schema would be
table_name, field_name, value, time, userId
No need to save both current and previous values in the audit table. When you make a change to any of the fields, you just add a row to the audit table with the changed value. This way you can always sort the audit table on time and know what the value of the field was prior to your change.
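The history of one field can then be read back with an ordinary ordered query; a sketch, assuming the audit table is called audit_log:
-- Full change history of the 'owner' field of the properties table, newest first
SELECT value, time, userId
FROM audit_log
WHERE table_name = 'properties'
  AND field_name = 'owner'
ORDER BY time DESC;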