Database search best performance

How can I do a search based on combinations of around 50 parameters used as filters?
These filters can be price, color, size, brand, etc.
So we can get different pages based on these params.
One link can have price, brand, and size; another one size, brand, and color; and so on.
My question is: what is the best practice for querying the database based on these params?
One idea I have is to encode them into a sequence of 1s and 0s like 101101101 and search by that.
There are more than 2 million possible combinations, and I want to reduce the query time.
I have heard about B-trees but I don't know how to use them. I have given my table columns the proper indexes, but from this point I don't know which direction to go in, or what my query should look like.

I think it is a good idea to encode the params, but don't do it as the string "10100010", because then you'd have to store these values as strings.
Rather, encode them as an ordinary number. That means 100101 = 1*32 + 0*16 + 0*8 + 1*4 + 0*2 + 1*1 = 37.
Of course, with 50 flags you'd exceed what a standard 32-bit integer can hold, so try to logically group the parameters and use 2-3 fields for them.
The problem with this approach is querying the data - you would have to write a function that extracts a single flag from the number, to be able to query the data by only one parameter and not all of them.
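To make the flag idea concrete, here is a minimal sketch in Python, with made-up filter names and bit positions, of packing filters into one integer and testing a single flag with a bitwise AND. Most databases expose a comparable bitwise operator in SQL (e.g. BITAND in Oracle, & in MySQL), but a predicate like that generally cannot use an ordinary index, so it mainly saves storage and parsing rather than lookup time.

    # Hypothetical bit positions for a few of the ~50 filters.
    FILTER_BITS = {"price": 0, "color": 1, "size": 2, "brand": 3}

    def encode_filters(active):
        """Pack the set of active filter names into one integer."""
        value = 0
        for name in active:
            value |= 1 << FILTER_BITS[name]
        return value

    def has_filter(value, name):
        """Test a single flag without decoding the rest of the number."""
        return bool(value & (1 << FILTER_BITS[name]))

    encoded = encode_filters({"price", "brand"})   # 0b1001 = 9
    print(encoded, has_filter(encoded, "brand"))   # 9 True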

Related

graph database physical distribution and indexing

My question is not on the query language but on the physical distribution of data in a graph database.
Let's assume a simple user/friendship model. In RDBs you would create a table storing IDUserA/IDUserB to represent a friendship.
If we assume, for example, a bunch of IT girls with the Facebook limit of 5k friends, we quickly get to huge amounts of data. If GirlA (ID 1) simply likes GirlB (ID 2), it would be an entry [1][2] in the table.
With this model we cannot avoid either double work or data redundancy for friendships: we either have to do two queries (is there an entry with ID = 1 in IDUserA or in IDUserB, which means physically searching both columns), or we store both [1][2] and [2][1], which ends up in data redundancy. For a heavy user this means checks against 5,000/10,000 entries on an indexed column, which is astronomically big.
So OK, use graph DBs. We model the girls as nodes. GirlA is the first one ever entered into the DB, so her ID is simply 0. Her entry starts with a one-byte isUsed flag for the data chunk, which is 1 if the chunk is in use. The next 4 bytes are an index for the file her node is stored in (which allows nearly 4.3 billion possible files), and if we assume a file size of 16.7 MB we can use 3 more bytes to declare the offset inside that file.
Let's assume we define the username datatype as a chunk of 256 bytes (and be, for this example, that rigid).
For GirlA it is [1]0.0.0.0-0.0.0
= Her User ID 0 times 256 = 0
For GirlB it is [1]0.0.0.0-0.1.0
= Her User ID 1 times 256 = 256,
so her username data starts in file 0_0_0_0.dat at offset 256 from the start. We don't have to search for her data; we can simply calculate where it is. A user with ID 100 would be stored in the same file at offset 25600, and so forth. User 65536 would be stored in file 0_0_0_1.dat at offset 0. Loaded into RAM this is only a pointer, and pretty fast.
With this method we could store more nodes than humans have ever lived.
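(As a sanity check on the arithmetic, here is a small sketch of the addressing described above; record size and file capacity follow the example, and the file naming is simplified to the last index byte only.)

    RECORD_SIZE = 256                          # bytes reserved per user record
    RECORDS_PER_FILE = 2 ** 24 // RECORD_SIZE  # 65,536 records in a ~16.7 MB file

    def locate(user_id):
        """Pure calculation of (file, offset) for a user id - no searching involved."""
        file_index = user_id // RECORDS_PER_FILE
        offset = (user_id % RECORDS_PER_FILE) * RECORD_SIZE
        return f"0_0_0_{file_index}.dat", offset

    print(locate(1))       # ('0_0_0_0.dat', 256)
    print(locate(100))     # ('0_0_0_0.dat', 25600)
    print(locate(65536))   # ('0_0_0_1.dat', 0)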
BUT: how do we find relationships? OK, with edges. But how do we store them? All in one "column" is stupid, because then we are back at relational models. In a hashtable? OK, we could store 0_0_0_0.frds as a hashtable containing all friends of User 0, kick off a new instance of a User class object, add the friends to a binary list or tree reachable via the pointer cUser.pFriendlist, and we would be done. But I think I am making a mistake.
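(For illustration, the per-user friend list described above boils down to something like this in-memory sketch, with the file layout left out; note that each edge is stored once per endpoint, which is exactly the redundancy trade-off mentioned earlier.)

    friends = {}   # user_id -> set of friend ids, e.g. what 0_0_0_0.frds would hold

    def add_friendship(a, b):
        # Store the edge under both endpoints so either direction is one lookup.
        friends.setdefault(a, set()).add(b)
        friends.setdefault(b, set()).add(a)

    add_friendship(0, 1)
    print(1 in friends[0], 0 in friends[1])   # True True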
Shouldn't graph databases be something different from mathematical nodes connected by hash tables filled with edges?
The use of nodes and edges is clear, because it allows connecting anything to anything through relationships. But what about the queries and their speed?
Keeping different edges in different types of files seems somehow wrong, even if access is really fast on SSDs.
Sure, I could use a simple relational table to store an edge-type/data-ending pair, but please help me: where am I going wrong?

How can I sort facets by their tf-idf score, rather than popularity?

For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, TF should be calculated using all the documents that participate in the results list.
I assume that the only problem with this approach would be when no query is made, i.e. when one searches for *:*. Then no term will prevail over the others in terms of interestingness. Please correct me if I am wrong here.
Anyway, is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first)
index - return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0, index otherwise.
Prior to Solr 1.4, one needed to use true instead of count and false instead of index.
This parameter can be specified on a per-field basis.
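For reference, a rough sketch of how the two built-in orderings are requested from a client; the URL, core name, and field name here are assumptions.

    import requests

    params = {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": "category",
        "facet.sort": "count",   # or "index"; per-field form: f.category.facet.sort
        "wt": "json",
    }
    response = requests.get("http://localhost:8983/solr/mycore/select", params=params)
    print(response.json()["facet_counts"]["facet_fields"]["category"])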
It looks like you can't do it out of the box without some serious changes on the client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness score can be computed on the client side: for a given result set based on a filter, we can exclude that filter for the facet using the local-params syntax (!tag and !ex). On the client side we can then compute the counts relative to the complete index (or to another sub-part defined by a filter). This would probably not work for result sets built by a query parameter.
However, for an indexed text field with many potential values, such as a full-text field, one would have to retrieve df counts for all terms. I imagine this could be done efficiently using the terms component, and the counts should probably be cached on the client side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases it would probably be better to implement this within Solr as a new option for facet.sort, because the information needed is readily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the larger flexibility of json.facet, e.g. sorting on stats facets (such as avg(price)) of another field, I guess this could be implemented as an additional sort option. At least for facets of type term, the result count (the df for the current result set) only needs to be divided by the df of that term in the whole index (docfreq). If the current result set is the complete index, facets should simply be sorted by count.
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cached query against the complete index. However, for term fields and similar this might not scale.
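A hedged sketch of that client-side ranking: the facet count in the current result set divided by the term's document frequency in the whole index. The numbers below are made up and would in practice come from two facet queries, one filtered and one against *:*.

    # Facet counts for the current (filtered) result set and for the whole index.
    result_counts = {"solr": 120, "the": 4900, "tfidf": 45}
    index_counts  = {"solr": 800, "the": 95000, "tfidf": 90}

    def interestingness(term):
        # Terms that are common everywhere score low; terms over-represented
        # in the result set score high.
        return result_counts[term] / index_counts[term]

    for term in sorted(result_counts, key=interestingness, reverse=True):
        print(term, round(interestingness(term), 3))
    # tfidf 0.5, solr 0.15, the 0.052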

Is it better to maintain 3 small columns or 1 large column in a Table?

Three small number columns [Number(1)] >>
OptionA | 0/1
OptionB | 0/1
OptionC | 0/1
or one larger string column [Varchar2(29)] >>
Options | OptionA=0/1|OptionB=0/1|OptionC=0/1
I'm not sure about the way the database handles tables, but I think that maintaining three columns as Number(1) is better than one column as Varchar2(29)!
-EDIT-
Let me explain the situation a bit more:
I am working on a common framework where all incoming/outgoing requests/responses are tracked. These interactions can be channeled to a DB, a file, or JMS. All the configuration is loaded from a table that has a column corresponding to the output type; currently I'm using "DB=1|FILE=1|JMS=0" as the value of that column so that later, if anyone wants to add this for their module, they can easily understand what is going on. In my code I've written simple logic that splits the string by "|" and then uses the exclusive-or operator to switch between choices in a switch case.
Everything is already done, but I'm not convinced that one large column is better than three small ones; plus, three columns would remove the string splitting I'm doing.
-EDIT-
I finally got it clarified: there may be a situation where we have to add more options. In that case, if we add the data column-wise, it will mean modifying the table, changing the entity, and adding more ifs. Instead, I ended up making an enum out of it with simple bit-wise logic to switch between options; this way I only need to modify the enum and add a new handler for the new option, and we are good to go.
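(A rough sketch of that bit-flag enum idea; the actual implementation isn't shown in the question, so the names and the language here are illustrative only.)

    from enum import IntFlag

    class Output(IntFlag):
        DB = 1
        FILE = 2
        JMS = 4

    def route(config, interaction):
        # Adding a new output type means adding an enum member and one branch,
        # not altering a table definition.
        if Output.DB in config:
            print("write to DB:", interaction)
        if Output.FILE in config:
            print("write to file:", interaction)
        if Output.JMS in config:
            print("send to JMS:", interaction)

    route(Output.DB | Output.FILE, "request #1")   # goes to DB and file only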
Using a single column to store multiple pieces of data is probably the worst thing you can do in a database.
Violating first normal form has at least the following disadvantages:
More difficult to query. OptionA = 1 and OptionB = 1 and OptionC = 0 versus substr(options, 9, 1) = '1' and substr(options, 19, 1) = '1' and substr(options, 29, 1) = '0'.
Less flexible. What happens when you need to add another option? Adding a new column is easy; changing the string format could mess up old queries. For example, if someone tries to read OptionC with substr(options, -1, 1). (Although this is a good reason to use a third option - a separate table.)
No type safety. This can be a very subtle and tricky problem. Let's say you write substr(options, 9, 1) = 1 instead of substr(options, 9, 1) = '1'. If anyone ever gets the format wrong, a single value could ruin lots of queries. Or worse, it only intermittently crashes a small number of queries, because the access paths keep changing. (Although you can prevent this with a check constraint.)
Slower queries. Normally the amount of work done in an expression or condition isn't a significant cost for a query. But adding a lot of unnecessary string manipulation can make a difference.
Less optimizing. Oracle can only build efficient query plans if it can understand your data. For example, let's say that OptionA is "0" 99.9% of the time. When you filter on OptionA = 0, Oracle can use a histogram to make a very accurate prediction about the number of rows returned. But for substr(options, 9, 1) = '1' you'll only get a wild guess. If you have complicated queries using this column you may spend a lot of time trying to "fix" the cardinality estimates. (Although maybe expression statistics could help with this?)
There are times when denormalizing is a good idea. For example, if you have terabytes of data, and compress the table, the single column may take up less space. (But if you're trying to save space, why not use a format like "000" instead?).
If there really is a good reason for this, it definitely needs to be documented. Perhaps add a comment on the column.
For a start, if I am reading your question right, you want each of the options to have one of just two possible values, correct?
If so then you could:
have a separate integer (or boolean) column for each option
have an options column that is a string of 1's and 0's, one digit for each options e.g. "001"
use an 'options' column that is an integer and use a bit value for each option, e.g. optionA == options & 1, optionB == options & 2, etc.
some databases have a bit vector data type which you could use. For mysql there is the BIT data type, which can store bit strings up to 64 bits long.
There will be a trade-off between code complexity and efficiency for each of these. Ask yourself, how much of the machine's time or storage will be saved by employing each of these options? And how much of your time will be saved?
In this instance the 3-column approach is the one I would recommend: not only does it keep things simple in terms of extracting data, but should you ever wish, you can set values against all 3 columns rather than being limited to one Varchar2 field. If you opt for the single Varchar2 column, it is fairly simple to extract the info you need using the substr function or some variation, and although this isn't heavy work for an Oracle DB, it does put extra, unnecessary work on the server.

How do I store data with a query that's approximated?

I'm trying to find a way to store my data with fast access (better than O(n)).
My database consists of data (4096-byte strings) that represents some information about some items.
The problem is, that the query is never exact. I get one Item, and then need to find the closest match using a function F(a,b).
just an example:
1234
3456
6466
F(a,b) = return % of similar digits
GetClosest(1233,F) = 1234
The problem is that F(a,b) is a complicated algorithm (not a proper metric).
What I have now is just going over the whole database to search for the best match.
Is there a kind of tree or other clustered database structure that can give me a faster lookup complexity?
More information:
F gives back a similarity value as a percentage, where 100% is a perfect match.
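(For clarity, the brute-force baseline described above looks roughly like this, with F standing in for the toy "% of similar digits" example rather than the real function.)

    def F(a, b):
        """Toy similarity: percentage of positions with the same digit."""
        return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

    def get_closest(query, database, similarity=F):
        # O(n): score every item and keep the best one.
        return max(database, key=lambda item: similarity(query, item))

    print(get_closest("1233", ["1234", "3456", "6466"]))   # 1234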
Sorry, the answer is "probably not" unless there is some more structure to your problem that you haven't described. With 4096 byte strings you're suffering from the curse of dimensionality.
If you had shorter strings and enough data that there was a high likelihood of the nearest match being identical over a large chunk of the string, then you could store your data with multiple tree-like structures indexed over different chunks of the string. With high likelihood the nearest would be close enough that you could prove it was nearest based only on close elements in those trees. However with the size of your strings and the limited data that can be stored in a computer, there is no way this is possibly going to work.
That said, do you need the exact closest match, or only a somewhat close one? If only a likely close one, then you could index the data by several random sparse samples of bits. In your search you then only check elements that match exactly in at least one of those samples. This will greatly reduce the search space, while rejecting few of the close neighbors, and may produce reasonable (even though frequently wrong) answers.
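A sketch of that sparse-sampling idea under the stated assumptions; byte positions are used instead of single bits for brevity, and the number and size of the samples are made up.

    import random
    from collections import defaultdict

    random.seed(0)
    # Four samples, each a fixed set of 8 byte positions out of the 4096.
    SAMPLES = [random.sample(range(4096), 8) for _ in range(4)]

    def keys(item):
        """One bucket key per sample: the item's bytes at the sampled positions."""
        return [tuple(item[p] for p in positions) for positions in SAMPLES]

    index = defaultdict(list)

    def add(item):
        for k in keys(item):
            index[k].append(item)

    def candidates(query):
        """Items that agree with the query on at least one full sample;
        the expensive F(a, b) is then run only on this reduced set."""
        found = set()
        for k in keys(query):
            found.update(index[k])
        return found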
Is there some way you could assign a 'score' to each datum?
You could index/sequence the data by your score.
When you search you assign a score to your search criteria, and look for the item with the closest score.
Whether this will work depends very much on your data and your definition of "difference".

Should I start user IDs from 1 or 1000 in database? Why?

Should I use high starting numbers for user IDs in the database?
Are there any benefits to starting user_id from 1000 (for a project with fewer than 9,000 users), or from 10,000 for larger ones?
The advantage of starting user IDs from 1000 (when you will have fewer than 9,000 of them) is that they will all have the same number of digits, so that files, for example, suffixed with the UID will sort in numeric order automatically, even if the sorter only sorts alphabetically. And you don't have to pad the numbers with leading zeroes to get there.
The converse is that if you only have 1000 users, numbers starting at 1,000,000,000 would look a little silly: 1,000,000,001 then 1,000,000,002 and so on.
For many purposes, therefore, it doesn't matter which you do. A uniform number of digits has some advantages, and that is why a value other than zero or one is often used as the starting point.
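To make the sorting point concrete, a quick illustration (the file names are made up):

    print(sorted(["user_9.log", "user_10.log", "user_100.log"]))
    # ['user_10.log', 'user_100.log', 'user_9.log']   <- alphabetic, not numeric

    print(sorted(["user_1009.log", "user_1010.log", "user_1100.log"]))
    # ['user_1009.log', 'user_1010.log', 'user_1100.log']   <- already numeric order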
Not really. I would just start from 1. If you ever need to put anything in before 1, there are no issues with using negative numbers, so you can just do an insert and manually specify the ID. At my company, we start all the users at one, auto-incrementing, and our global admin user is ID 0.
I know this answer comes late, but there is still something to add, in my opinion:
1) Starting at an ID like 1000 can be an advantage if you do not want to make it obvious how many users you have (in case the ID is visible somewhere in a URL or similar), and in addition
2) it can be useful if you want to make IDs harder to guess, because usually the first IDs belong to admins or moderators, so if you pick an arbitrary starting ID (e.g. 1421), you add another small security tweak to your DB...
