Vendor will not create index because of data integrity - database

I am trying to get a vendor to create an index on a Progress 10.2B database to aid in migrating data from said database; however, the vendor is reluctant to create an index, saying it could impact data integrity. Their response is below. Is there any truth/merit in what is being said?
There are a number of reasons we will not add indices, but the main
reason is, as you have outlined, that Progress selects the index it uses
based on the parameters in the query. So, for example, if we had code
that does the following:
Find first record where a = 1 and b = 2
As the existing index stands this would find the record using index
‘M’ and it would find record ‘X’
If we add a new index to the table there is a chance that this code
could decide to use the new index to find the record and return record
‘Y’ instead.
Sure, creating indices is a core part of any database, but proper
development practices would require heaps of testing before applying
an index change to a production system. Without testing, the integrity
of the system cannot be guaranteed.
So my thoughts on this are:
Progress selects the index it uses based on the parameters in the query
Isn't this how any database usually selects an index? Based on the required columns/where clause, it can decide on the appropriate index (if any) to use.
If we add a new index to the table there is a chance that this code
could decide to use the new index to find the record and return record
‘Y’ instead.
To me, it almost sounds like they have written their application to rely on "grabbing the first record out of the database". If it were to use a different index, then sure, it might order the results differently if no order by has been specified. If this is the case, then that is just poor programming.

I pretty much agree with you:
To me, it almost sounds like they have written their application to rely on "grabbing the first record out of the database". If it were to use a different index, then sure, it might order the results differently if no order by has been specified. If this is the case, then that is just poor programming.
If they wrote their query correctly, an index doesn't change the result, just the speed. If they left out the order by and just rely on the index returning rows in the right order anyway, another index could cause problems.
However, to emphasize this: the bug is then in the query.
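To illustrate the point in generic SQL terms (an analogy only, not Progress ABL; the table and column names here are hypothetical): without an explicit sort, "first" just means whichever row the chosen access path happens to return first, so a new index can silently change the result. With an explicit sort, the result is the same no matter which index is used; only the speed differs.

-- Nondeterministic: the row returned depends on whichever index the optimizer picks
SELECT TOP 1 * FROM order_line WHERE a = 1 AND b = 2;

-- Deterministic: an explicit sort fixes the result regardless of the access path used
SELECT TOP 1 * FROM order_line WHERE a = 1 AND b = 2 ORDER BY id;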


How does the SETCURRENTKEY() C/AL function in Navision work?

I have the following questions:
1. What does SETCURRENTKEY actually do?
2. What is the benefit of SETCURRENTKEY?
3. Why would I use SETCURRENTKEY?
4. When would I use SETCURRENTKEY?
5. What is the advantage of using an index, and how do I tie this to the analogy of the old sorting system of a library?
6. What type of database querying efficiency problems does this function solve?
I have been searching all over the internet and the 'IT Pro Developer Help' internal Navision documentation for this poorly documented function, and I cannot find a proper answer to my questions.
The only thing I know is that SETCURRENTKEY sets the current key for a record variable and it sorts the recordset based on it. When SETCURRENTKEY is used with only a few keys, it can improve query performance. I have no idea what is actually happening when a database uses an index versus not using an index.
Someone told me this is how SETCURRENTKEY works:
It is like the old card index system in a library: without SETCURRENTKEY you would have to go through each shelf and manually sift through for the book you want. You would find a mix of random books and you would have to say: "No, not this one. Yes, this one". With SETCURRENTKEY you have an index analogous to the old system, where you can go straight to a book or music CD based on its 'Author' or 'Artist' etc.
That's all fine, but I still can't properly answer my questions.
1. With SETCURRENTKEY you declare the key (the table index, which can consist of many fields) to be used when querying the database with FINDSET/FINDFIRST/FINDLAST statements, and the order in which records will be returned while iterating the recordset with the NEXT statement.
2. Performance. The database server uses the selected key (table index) to retrieve the record set. You are always better off explicitly stating SETCURRENTKEY in your code, as it makes you think about your database structure and the indices it requires.
3. Performance, and so that you know in advance the order in which records will be returned when iterating through a recordset.
4. When to use it:
The typical use is this:
RecordVar.SETCURRENTKEY(...);        // declare the key (index) to use, and therefore the sort order
RecordVar.SETRANGE(Field, ...);      // narrow the set with filters on individual fields
RecordVar.SETFILTER(Field, ...);
RecordVar.SETRANGE(Field, ...);
...
IF RecordVar.FINDSET THEN REPEAT     // the query is executed here, using the filters and key above
  // do something with records
UNTIL RecordVar.NEXT = 0;
SETCURRENTKEY is declarative, and comes into effect only when FINDSET is executed. At the moment FINDSET is executed, the database will be queried on the table represented by RecordVar, using the filters declared by SETRANGE/SETFILTER, and the key/index declared by SETCURRENTKEY.
As for 5 and 6, and generally, I would really recommend that you familiarize yourself with basic database index theory. That is exactly what this is, and you explained it pretty well yourself with the library/book analogy.
If modifying key fields (or filtered fields, even if not in the key) in a loop, the standard way to do this in NAV is to declare a second record variable, do a GET on it using the primary key fields from the record variable you are looping through, then change and MODIFY the second record variable.

Alternatives to isActive

Marginally related to Should I delete or disable a row in a relational database?
Given that I am going to go with the strategy of warehousing changes to my tables in a history table, I am faced with the following options for implementing a status for a given row in MySQL:
An isActive boolean
An activeStatus enum
An activeStatus INT referencing a small ActiveStatus lookup table
An activeStatus INT not referencing another table
The first approach is rather inflexible in my opinion, since I might need more booleans in the future to support other types of active statuses (I'm not sure what they would be, but maybe something like "being phased out" or "active for a random group of users", etc).
I'm told that MySQL enum is bad, so the second approach probably won't fly.
I like the third approach, but I'm wondering if it is a heavy handed solution to a relatively small problem.
The fourth approach requires that we know in advance what each status INT means and seems like an outdated way to do things.
Is there a canonical right answer? Am I ignoring another approach?
Personally I would go with your third option.
Boolean values often turn out to be more complex in reality, as you suggested. ENUMs can be nice, but they have the downside that as soon as you want to store additional information about each value - who added it, when, is it only valid for a certain time period or source system, comments etc. - it becomes difficult, whereas with a lookup table those data can easily be maintained in additional columns. ENUMs are a good tool to constrain data to certain values (like a CHECK constraint), but not such a good tool if those values have significant meaning and need to be exposed to users.
It's not entirely clear from your question if you plan to treat your history table like a fact table and use it in reports, but if so then you could consider the ActiveStatus lookup table as a dimension. In this case a table is much easier, because your reporting tool can read the possible values from the dimension table in order to let the user choose his query conditions; such tools generally don't know anything about ENUMs.
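As a rough sketch of what option 3 could look like in MySQL (the Product table and all names here are only illustrative), the lookup table also leaves room for the extra per-value metadata mentioned above:

CREATE TABLE ActiveStatus (
    id          TINYINT UNSIGNED PRIMARY KEY,
    name        VARCHAR(50) NOT NULL UNIQUE,   -- 'Active', 'Inactive', 'Phased out', ...
    description VARCHAR(255) NULL              -- extra metadata is easy to add as columns
) ENGINE=InnoDB;

CREATE TABLE Product (
    id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name         VARCHAR(100) NOT NULL,
    activeStatus TINYINT UNSIGNED NOT NULL,
    FOREIGN KEY (activeStatus) REFERENCES ActiveStatus (id)
) ENGINE=InnoDB;

Adding a new status is then just an INSERT into ActiveStatus rather than a schema change, which is the main practical difference from the ENUM approach.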
From my point of view your 2nd approach is better if you have more than two statuses, because ENUM is great for data that you know will fall within a static set. But if you have only two statuses, active and inactive, then it is always better to use a boolean.
EDIT:
If you are sure that you are not going to change the values of your ENUM in the future, then ENUM is a great fit for such a field.

Implementing a database -- How to get started

I've been trying to learn programming for a while. I've studied Java and Python, and I'm comfortable with their syntax. Recently, I wanted to use what I've learnt to build a tangible piece of software from the ground up.
I want to implement a database engine, sort of a NoSQL database. I've put together a small document, sort of a specification to follow throughout my adventure of coding it. But all I know is a bunch of keywords. I don't know where to start.
Can someone help me find out how to gather the knowledge I need for this kind of work and in what order to learn things? I have searched for documents, but I feel like I'll end up finding unrelated/erroneous content or starting from the wrong point, because implementing a complete database engine is (it seems) a truly complicated task.
I want to make clear that I'd prefer theses, whitepapers and (e)books to the code of other projects, because I've asked the kind of question that usually gets answered with "read project X's source code". I'm not yet at the level of comfortably reading and understanding source code.
First, you may have a look at the answers to How to write a simple database engine. While it focuses on a SQL engine, there is still a lot of good material in the answers.
Otherwise, a good project tutorial is Implementation of a B-Tree Database Class. The example code is in C++, but the description of what is done and why is probably what you'll want to look at anyway.
Also, there is Designing and Implementing Structured Storage (Database Engine) over at MSDN. Plenty of information there to help you in your learning project.
Because the accepted answer only offers (good) links to other resources, I thought I'd share my experience writing webdb, a small experimental database for browsers. I also invite you to read the source code; it's pretty small, and you should be able to read through it and get a basic understanding of what it's doing in a couple of hours. Warning: I am a n00b at this, and since writing it I have learned a lot more and can see that I did some things wrong. It can help you get started though.
The basics: BTree
I started out by adapting an AVL tree to suit my needs. An AVL tree is a kind of self-balancing binary search tree. You store the key K and any related data in a node, then all items with key < K go in the left subtree and all items with key > K in the right subtree. You can use an array to store the data items if you want to support non-unique keys.
This tree will give you the basics: Create, Update, Delete and a way to quickly get an item by key, or all items with key < x, or with key between x and y etc. It can serve as the index for our table.
A schema
As a next step I wrote code that lets the client code define a schema, with methods like createTable() etc. Schemas are typically associated with SQL, but even no-SQL databases sort of have a schema; they usually require you to mark the ID field and any other fields you want to search on. You can make your schema as fancy as you want, but you typically want to model at least which column(s) serve as the primary key and which fields will be searched on frequently and need an index.
Creating a data structure to store a table
I decided to use the tree I created in the first step to store my items. These were simple JS objects. Having defined which field contains the PK, I could simply insert the item into the tree using that field's value as the key. This gives me quick lookup by ID (range).
Next I added another tree for every column that needs an index. In these trees I did not store the full record, but only the key. So to fetch a customer by last name, I would first use the index on last name to get the ID, then the primary key index to get the actual record. The reason I did not just store the (reference to the) actual object is because it makes set operations a little bit simpler (see next step)
Querying
Now that we have a table with indexes for PK and search fields, we can implement querying. I did not take this very far as it becomes complicated quickly, but you can get some nice functionality with just some basics. WebDB does not implement joins; all queries operate only on a single table. But once you understand this you see a pretty clear (though long and winding) path to doing joins and other complicated stuff as well.
In WebDB, to get all customers with firstName = 'John' and city = 'New York' (assuming those are two search fields), you would write something like:
var webDb = ...
var johnsFromNY = webDb.customers.get({
  firstName: 'John',
  city: 'New York'
});
To solve it, we first do two lookups: we get the set X of all IDs of customers named 'John' and we get the set Y of all IDs of customers from New York. We then perform an intersection on these two sets to get all IDs of customers that are both named 'John' AND from New York. We then run through our set of resulting IDs, getting the actual record for each one and adding it to the result array.
Using the set operators like union and intersection we can perform AND and OR searches. I only implemented AND.
Doing joins would (I think) involve creating temporary tables in memory, then populating them as the query runs with the joined results, then applying the query criteria to the temp table. I never got there. I attempted some syncing logic next but that was too ambitious and it went downhill from there :)

Database Normalization and User Defined Data Storage

I am looking to let the users of my web application define their own attributes for products and then enter data for those products. I have found out that this technique is called n(th) normal form.
The following is the DB structure I am currently considering deploying, and I was wondering what the positives and negatives would be in regard to integrity and scalability (and any other -ities you can think of).
EDIT
(Sorry, This is more what I mean)
I have been staring at this for the last 15 minutes, and I know that the part where the red arrow is induces duplication, and hence you would have to have integrity checks. But I just don't understand how else what I want could be done.
The products would number no more than 10. The variables would number no more than 200 (max 20 per product). The number of product instances would not exceed 100,000; therefore the maximum size of pVariable_data would not exceed 2 million rows.
This model is called a database in a database, and it is not nice. Though it is sometimes unavoidable, first check whether you really need it and whether your database is really the right database for the job.
With PostgreSQL you could use hstore: http://www.postgresql.org/docs/8.4/static/hstore.html which is a standard solution for this kind of issue.
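For illustration, a minimal hstore sketch (the products table and attribute names are hypothetical; note that on 8.4 hstore is installed from contrib, while the CREATE EXTENSION syntax shown here is 9.1+):

CREATE EXTENSION hstore;

CREATE TABLE products (
    id    serial PRIMARY KEY,
    name  text NOT NULL,
    attrs hstore            -- free-form, user-defined attributes
);

INSERT INTO products (name, attrs)
VALUES ('T-shirt', 'size => XL, colour => red');

-- read a single attribute
SELECT name, attrs -> 'size' AS size FROM products;

-- find every product whose user-defined colour attribute is 'red';
-- the @> containment operator can use the GIN index below
SELECT name FROM products WHERE attrs @> 'colour => red';

CREATE INDEX products_attrs_idx ON products USING gin (attrs);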
Assuming that pVariable is more of a pVariable type, drop the reference to product_fk. It would mean that you need a new entry in that table for every Product record. Maybe try something like this:
Product(id, active, allow_new)
pVariable_type(id, name)
pVariable_data(id, product_fk, pvariable_fk, non_typed_value, bool, int, etc)
I would use the non_typed_value as your text value, and (unless you are keeping streams) write a record into that field along with the typed value. It will mean storing the value twice (and it is more of a pain on updates etc.), but it will make querying easier, along with reporting (anything where you just need to display the value).
Note: it would also be a good idea to pull out anything that is common to all products and put it in the product table. For example, all products will most likely have a name, a suggested price, etc.
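A rough sketch of that structure in generic SQL (the column types, and the *_value columns standing in for "bool, int, etc", are illustrative assumptions only):

CREATE TABLE Product (
    id              INT PRIMARY KEY,
    name            VARCHAR(100) NOT NULL,       -- attributes common to all products live here
    suggested_price DECIMAL(10,2),
    active          BOOLEAN NOT NULL,
    allow_new       BOOLEAN NOT NULL
);

CREATE TABLE pVariable_type (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL                   -- the user-defined attribute definition
);

CREATE TABLE pVariable_data (
    id              INT PRIMARY KEY,
    product_fk      INT NOT NULL REFERENCES Product (id),
    pvariable_fk    INT NOT NULL REFERENCES pVariable_type (id),
    non_typed_value TEXT,                        -- always written, for display and generic querying
    bool_value      BOOLEAN,                     -- typed copies; only the matching one is populated
    int_value       INT
);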

Creating an efficient search capability using SQL Server (and/or coldfusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a table of at least 750k records, containing product SKUs, sizing, material type, create date, etc.
Is anyone aware of a 'plugin' solution for ColdFusion to do this? I envision a Google-like single-entry search where a customer can type in the part number, or the sizing, etc., and get hits on any or all relevant results.
Currently, if I run a 'LIKE' comparison query, it seems to take ages (OK, a few seconds, but still), and that is too long, at times making a user sit there and wait up to 10 seconds for queries and page loads.
Or are there any SQL techniques to help accomplish this? I want to use a proven method to search the data, not just a simple SQL LIKE or = comparison operation.
So this is a multi-approach question: should I attack this at the SQL level (as it ultimately looks like I should), or is there a plug-in/module for ColdFusion that I can grab that will give me speedy, advanced search capability?
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether even trying it would be worthwhile depends a lot on how often you update the records you need to search. If you update them rarely, you could do a Verity index update whenever you update them. If you update the records constantly, that's going to be a drag on the web server, and would likely cancel out any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically in the search of textual fields (as I surmise from your mention of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).
Build an index table mapping the unique ID of your records from the main table to the set of words (one word per row) of the textual field. If it matters, add the field of origin as a third column in the index table, and if you want "relevance" features you may want to record a word count as well.
Populate the index table either with a trigger (using string splitting) or from your app - the latter might be better: simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately and drastically speed up textual search, as it will no longer do "LIKE", and it will be able to use indexes on the index table (no pun intended) without interfering with the indexing on SKU and the like on the main table.
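A sketch of that index table in SQL Server syntax; the main table Products(ProductID, ...) and every other name here is a hypothetical stand-in for your schema:

CREATE TABLE ProductWordIndex (
    ProductID   INT          NOT NULL,   -- unique ID of the row in the main table
    Word        VARCHAR(100) NOT NULL,   -- one word per row
    SourceField VARCHAR(50)  NULL        -- optional: which column the word came from
);

CREATE INDEX IX_ProductWordIndex_Word ON ProductWordIndex (Word, ProductID);

-- A text search becomes an index seek plus a join instead of a LIKE '%term%' scan:
SELECT DISTINCT p.*
FROM Products AS p
JOIN ProductWordIndex AS w ON w.ProductID = p.ProductID
WHERE w.Word = 'widget';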
Also, ensure that all the relevant fields are fully indexed - not necessarily in the same compound index (SKU, sizing etc.) - and any field that is searched as a range (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximately increasing order of that field, or you don't care as much about insert/update speed).
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow, and the query plans you currently have for those slow queries.
Another item is to ensure that as few of the fields as possible are textual, especially ones that are "decodable" - your comment mentioned "is it boxed" among the text fields. If so, I assume the values are "yes"/"no" or some other very limited data set; in that case, simply store a numeric code for the valid values, do the en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement, but still an improvement.
I've done this using SQL Server's full text indexes. It requires very few application changes and no changes to the database schema except for the addition of the full text index.
First, add the full text index to the table. Include in the full text index all of the columns the search should be performed against. I'd also recommend having the index update automatically; this shouldn't be a problem unless your SQL Server is already heavily taxed.
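A minimal setup sketch; the dbo.Bugs table comes from the example further down, while its columns and the primary key index name PK_Bugs are assumptions about your schema:

CREATE FULLTEXT CATALOG SearchCatalog AS DEFAULT;

-- index every text column the search should cover; KEY INDEX names the table's unique key index
CREATE FULLTEXT INDEX ON dbo.Bugs (Title, Description)
    KEY INDEX PK_Bugs
    WITH CHANGE_TRACKING AUTO;   -- keep the full text index updated automatically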
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
Next, to actually execute the full text search, use the CONTAINSTABLE function in your query:
SELECT *
FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The value of the column identified as the unique key of the full text index
Rank - A relative rank of the match (0 - 1000, with a higher rank meaning a better match).
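To get the matching rows themselves rather than just keys and ranks, join the CONTAINSTABLE result back to the table on that key column (Bugs.Id here is an assumed primary key column name; adjust it to your schema):

SELECT b.*, ft.RANK
FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"') AS ft
JOIN Bugs AS b ON b.Id = ft.[KEY]
ORDER BY ft.RANK DESC;   -- best matches first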
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution then you should just go with Google itself. It sounds like you're doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup then you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding, and it will not be dependent on your database, or even ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours, done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a user's role, permissions or group, then it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data is, that is where your search performance issue is likely to be. Make sure you have indexes on the columns you are searching on. If you are using LIKE, the query cannot use an index if you write it like this:
SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But it can use an index if you write it like this:
SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'
The key here is to allow as many of the first characters as possible to not be wildcards.
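As a small illustration (the index name is made up), the prefix form of the search can only be helped by an index that leads with the searched column:

CREATE INDEX IX_TABLEX_last_name ON TABLEX (last_name);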
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173
