If I have entities with the attribute :fruit:
apple
banana
grapes
tomato
and a feature allowing a user to order his fruits:
1 grapes
2 apple
3 tomato
4 banana
Is there a good way to store fruit order in the database, with the expectation that a fruit may be deleted, a fruit added, and fruits reordered?
A naive solution is to add an order column. The problem with that is expensive updates. Say I have the entity 1000000 durian and suddenly decide it's my favorite fruit, so I move it to the top. That causes 999999 fruits to require an order update.
The short answer is no: Datomic doesn't have this built in, and to be fair, neither do many other databases.
You have the "order" column approach that you mentioned, which also has the problem you mentioned. Gaps aren't the worst part, really, since you can still get correct sorting with some gaps; it gets worse when you want to insert an item in the middle, because then you have to update all the following entities. And you should probably do it all in a transaction function unless you're certain your peer is single-threaded.
There's also the linked-list approach, where each entity points to the next and the last doesn't point to anything. Appending, prepending and splicing in the middle become constant-time operations.
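To make that concrete, here is a minimal relational sketch of the linked-list idea (table and column names are hypothetical, and the syntax is PostgreSQL-flavored; in Datomic the pointer would be a ref attribute and the two updates would go in one transaction function):
-- Each fruit points at the fruit that follows it; NULL marks the tail.
CREATE TABLE fruit (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    next_id INTEGER REFERENCES fruit(id)
);
-- Splice fruit 7 (assumed not currently linked) in right after fruit 3:
BEGIN;
UPDATE fruit SET next_id = (SELECT next_id FROM fruit WHERE id = 3)
WHERE id = 7;
UPDATE fruit SET next_id = 7 WHERE id = 3;
COMMIT;
Only two rows are touched no matter how long the list is; the trade-off is that reading the list back in order means walking the chain (for example with a recursive query).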
There is no built-in way to accomplish your goal in any database, whether PostgreSQL, Datomic, or anything else. However, there is an easy answer.
Just convert your proposed "priority" column from an integer to a floating-point value. Then you can always insert a new entry between any two existing items without the need to change anything. Suppose you start with
1.0 grape
2.0 apple
3.0 tomato
4.0 banana
and you then decide to add a pear between grape and apple. Just insert it as:
1.0 grape
1.5 pear
2.0 apple
3.0 tomato
4.0 banana
You then decide to insert cherry between grape and pear, so you get:
1.0 grape
1.25 cherry
1.5 pear
2.0 apple
3.0 tomato
4.0 banana
Then, whenever you want to examine your list you simply fetch both the priority column and the name column, sort by priority, and you are done.
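In SQL terms (a sketch with hypothetical table and column names), the insertion is just an average of the two neighbouring priorities:
CREATE TABLE fruit (
    name     TEXT PRIMARY KEY,
    priority DOUBLE PRECISION NOT NULL
);
-- Insert 'pear' halfway between 'grape' (1.0) and 'apple' (2.0):
INSERT INTO fruit (name, priority)
SELECT 'pear', (g.priority + a.priority) / 2
FROM fruit g, fruit a
WHERE g.name = 'grape' AND a.name = 'apple';
-- Reading the list back in order:
SELECT name FROM fruit ORDER BY priority;
One caveat: repeatedly bisecting the same gap eventually exhausts floating-point precision (a double gives you roughly 50 halvings of a unit-sized gap), at which point you renumber the whole list once, which is cheap if it happens rarely.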
Currently I have an Excel table that looks like this:
    A     B       C           D             E  F     G
1   ID    NAME    DATE        ITEM             2020  3
2   1234  Alex    09-20-2020  Carrot           2019  2
3   1234  Alex    09-20-2020  Onion
4   1234  Alex    09-20-2019  Carrot
5   1234  Alex    09-20-2019  Mushroom
6   1234  Alex    09-20-2020  Pasta
7   1345  Morgan  09-20-2020  Pasta
8   1345  Morgan  09-20-2020  Tomato Sauce
9   1145  Jayson  09-20-2020  Tomato Sauce
10  1145  Jayson  09-20-2020  Cream Sauce
11  1345  Morgan  09-20-2019  Pasta
12  1345  Morgan  09-20-2019  Tomato Sauce
I want to be able to count the unique customers for each year using Excel functions, so that the formulas can be transferred to a different computer without setting up custom functions.
The process can currently be done in Excel without formulas by adding a filter to each column, filtering to show only the intended year, using Remove Duplicates to remove duplicates in NAME, and finally counting the rows (giving the results seen in G1 and G2). However, I want to be able to do that through Excel functions. So far, I am able to count unique values through
{=SUM(IF(FREQUENCY(IF(LEN(B2:B12)>0,MATCH(B2:B12,B2:B12,0),""),IF(LEN(B2:B12)>0,MATCH(B2:B12,B2:B12,0),""))>0,1))}
Additionally, I am also able to use SUMPRODUCT() to count an array with multiple conditions, so for now I have combined the above formula with
SUMPRODUCT((YEAR(C2:C12)=G1)+0)
My initial idea was to add the first function into the SUMPRODUCT(), since the first function could also produce an array for it to count. However, that did not work: it did not count the unique values corresponding to the year.
My question is whether there is some way to apply what would be a grouping function, so that I can take unique values within a year, without transforming the data (through filters or deletion of duplicates). My current understanding of SUMPRODUCT() is that it will only look for unique values in the entire column, not within the range given for the first array.
You have numeric IDs, which you should make use of. If you have Excel O365, in G1 use:
=COUNT(UNIQUE(FILTER(A$2:A$12,YEAR(C$2:C$12)=F1)))
With older versions, use this CSE-entered formula (confirm with Ctrl+Shift+Enter):
=SUM(--(FREQUENCY(IF(YEAR(C$2:C$12)=F1,A$2:A$12),A$2:A$12)>0))
And drag down.
If I have the following data:
Results Table ([Required] column):
I want one grape
I want one orange
I want one apple
I want one carrot
I want one watermelon
Fruit Table ([Name] column):
grape
orange
apple
What I want to do is essentially say: give me all results where users are looking for a fruit. This is all just an example; I am looking at a table with roughly 1 million records and a string field of 4000+ characters. I am expecting a somewhat slow result, and I know that the table could DEFINITELY be structured better, but I have no control over that. Here is the query I would essentially have, but it doesn't seem to do what I want; it gives every record. And yes, [#Fruit] is a temp table.
SELECT * FROM [Results]
JOIN [#Fruit] ON
'%'+[Results].[Required]+'%' LIKE [#Fruit].[Name]
Ideally my output should be the following 3 rows:
I want one grape
I want one orange
I want one apple
If that kind of thing is doable, I would try it the other way round:
SELECT * FROM [Results]
JOIN [#Fruit] ON
[Results].[Required] LIKE '%'+[#Fruit].[Name]+'%'
This topic interests me, so I did a little bit of searching.
Suggestion 1: Full-Text Search
I think what you are trying to do is Full-Text Search.
You will need a Full-Text Index created on the table if it is not already there (Create FULLTEXT Index).
This should be faster than performing LIKE.
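On SQL Server, for example, that could look roughly like this (a sketch; the catalog name and the key index name are placeholders for whatever your table actually has):
-- One-time setup: a full-text catalog and an index on the big string column.
CREATE FULLTEXT CATALOG ResultsCatalog;
CREATE FULLTEXT INDEX ON [Results]([Required])
    KEY INDEX PK_Results ON ResultsCatalog;
-- Then search with CONTAINS instead of LIKE:
SELECT *
FROM [Results]
WHERE CONTAINS([Required], '"grape" OR "orange" OR "apple"');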
Suggestion 2: Metadata Search
Another approach I'd take is to create a metadata table and maintain the information myself whenever the [Result].Required values are updated (or created).
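A sketch of that idea (assuming [Results] has some key column, here called id, to reference):
-- Maintained by application code whenever [Required] is created or updated:
CREATE TABLE ResultFruit (
    result_id  INT          NOT NULL,
    fruit_name VARCHAR(100) NOT NULL,
    PRIMARY KEY (result_id, fruit_name)
);
-- The fruit lookup then becomes a plain indexed join, with no LIKE involved:
SELECT r.*
FROM [Results] r
JOIN ResultFruit rf ON rf.result_id = r.id
JOIN [#Fruit] f ON f.[Name] = rf.fruit_name;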
This looks more or less doable, but I'd start from the Fruit table just for conceptual clarity.
Here's roughly how I would structure this, ignoring all performance / speed / normalization issues (note also that I've switched around the variables in the LIKE comparison):
SELECT f.name, r.required
FROM fruits f
JOIN results r ON r.required LIKE CONCAT('%', f.name, '%')
...and perhaps add a LIMIT 10 to keep the query from wasting time while you're testing it out.
This structure will:
give you one record per "match" (per Result row that matches a Fruit)
exclude Result rows that don't have a Fruit
probably be ungodly slow.
Good luck!
Is it possible to do a compound sort in solr without Field Collapsing?
If I have two car models, Ford and Chevy, can I sort first on Ford where price is less than 2,000, then Ford > 2,000, then the Chevy models? I would like to do this without grouping, and without applying a price sort to the Chevy models.
For example, something like &sort=Model:"Ford" AND price:[0 TO 2000]
so that I get:
Ford 1, $1000
Ford 2, $500
Ford 2, $1500
_________
Ford 3, $3000
Ford 3, $5000
_______
Chevy 1
Chevy 2
Chevy 3
I've tinkered a bit with this, and I've come up with a solution based on the query() function, since you can use that together with sorting. I'm not sure about the performance, and depending on the number of documents in your index that might not matter, so the only way to know is to try it and see how it performs. I've used name and price as my two fields in the schema, which I think map to your Model and price fields.
The way sort works is that each clause is evaluated in order, so that the first sort description is performed first, then the next one if there's a draw, and so on.
I've removed url escaping and formatted everything a bit:
sort=query($sq1,0) asc,query($sq2,0) asc
&sq1=name:Ford* AND price:[0 TO 1500]
&sq2=name:Ford*
This means that the first sort is performed on the query given in the sq1= URL parameter, but if there's a draw (which there will be if there isn't a match), the query given under sq2= is used ($sq1 and $sq2 refer to these two queries; Solr makes a simple substitution before evaluating the query() function).
I haven't provided a default sort order, but you could add name asc as a default sort. The 0 as the second argument to query() is the value the sort will use if there isn't a match from the query (otherwise it'll use the score from the query). You could feed this value into product() and multiply it by the price, to sort each of the "buckets" by price as well if needed.
I'm looking for an efficient way of storing sets of objects that have occurred together during events, in such a way that I can generate aggregate stats on them on a day-by-day basis.
To make up an example, let's imagine a system that keeps track of meetings in an office. For every meeting we record how many minutes long it was and in which room it took place.
I want to get stats broken down both by person as well as by room. I do not need to keep track of the individual meetings (so no meeting_id or anything like that), all I want to know is daily aggregate information. In my real application there are hundreds of thousands of events per day so storing each one individually is not feasible.
I'd like to be able to answer questions like:
In 2012, how many minutes did Bob, Sam, and Julie spend in each conference room (not necessarily together)?
Probably fine to do this with 3 queries:
>>> query(dates=2012, people=[Bob])
{Board-Room: 35, Auditorium: 279}
>>> query(dates=2012, people=[Sam])
{Board-Room: 790, Auditorium: 277, Broom-Closet: 71}
>>> query(dates=2012, people=[Julie])
{Board-Room: 190, Broom-Closet: 55}
In 2012, how many minutes did Sam and Julie spend MEETING TOGETHER in each conference room? What about Bob, Sam, and Julie all together?
>>> query(dates=2012, people=[Sam, Julie])
{Board-Room: 128, Broom-Closet: 55}
>>> query(dates=2012, people=[Bob, Sam, Julie])
{Board-Room: 22}
In 2012, how many minutes did each person spend in the Board-Room?
>>> query(dates=2012, rooms=[Board-Room])
{Bob: 35, Sam: 790, Julie: 190}
In 2012, how many minutes was the Board-Room in use?
This is actually pretty difficult since the naive strategy of summing up the number of minutes each person spent will result in serious over-counting. But we can probably solve this by storing the number separately as the meta-person Anyone:
>>> query(dates=2012, rooms=[Board-Room], people=[Anyone])
865
What are some good data structures or databases that I can use to enable this kind of querying? Since the rest of my application uses MySQL, I'm tempted to define a string column that holds the (sorted) ids of each person in the meeting, but the size of this table will grow pretty quickly:
2012-01-01 | "Bob" | "Board-Room" | 2
2012-01-01 | "Julie" | "Board-Room" | 4
2012-01-01 | "Sam" | "Board-Room" | 6
2012-01-01 | "Bob,Julie" | "Board-Room" | 2
2012-01-01 | "Bob,Sam" | "Board-Room" | 2
2012-01-01 | "Julie,Sam" | "Board-Room" | 3
2012-01-01 | "Bob,Julie,Sam" | "Board-Room" | 2
2012-01-01 | "Anyone" | "Board-Room" | 7
What else can I do?
Your question is a little unclear: you say you don't want to store each individual meeting, but then how are you getting the current meeting stats (dates)? In addition, any table, given the right indexes, can be very fast even with a lot of records.
You should be able to use a table like log_meeting. I imagine it could contain something like:
employee_id, room_id, date (as timestamp), time_in_meeting
where employee_id is a foreign key to the employee table, and room_id is a foreign key to the room table.
If you index employee id, room id, and date, you should have pretty quick lookups, as MySQL multiple-column indexes go left to right, so you gain indexes on (employee id), (employee id + room id), and (employee id + room id + timestamp) when doing searches. This is explained further in the multi-index part of:
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
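A sketch of what that might look like in MySQL (column names assumed; time_in_meeting in minutes):
CREATE TABLE log_meeting (
    employee_id     INT       NOT NULL,
    room_id         INT       NOT NULL,
    meeting_date    TIMESTAMP NOT NULL,
    time_in_meeting INT       NOT NULL,
    KEY idx_emp_room_date (employee_id, room_id, meeting_date)
);
-- "How many minutes did employee 1 spend in each room in 2012?"
SELECT room_id, SUM(time_in_meeting) AS minutes
FROM log_meeting
WHERE employee_id = 1
  AND meeting_date >= '2012-01-01'
  AND meeting_date <  '2013-01-01'
GROUP BY room_id;
The range predicate on meeting_date, rather than YEAR(meeting_date), keeps the date comparison index-friendly.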
By refusing to store meetings (and related objects) individually, you are losing the original source of information.
You will not be able to compensate for this loss of data unless you precompute, on a regular basis, the extensive list of all potential daily (or weekly or monthly or ...) aggregates that you might need to query later on!
Believe me, it's going to be a nightmare ...
If the number of people is constant and not very large, you can assign a column to each person marking whether they were present, and store the room, date and time in 3 more columns; this removes the string-splitting problems.
Also, by the nature of your question, I feel you first of all need to assign IDs to everything: rooms, people, etc. There is no need for long repetitive strings in the DB. Try to reduce string operations and work with individual data in each column for better intersection performance. You could also store every permutation of people in a table, assign an id to each, and then use one of those ids in the actual date-and-time table. But all of these techniques require that something be constant, either the people or the rooms.
I do not understand whether you know all the "questions" at design time or whether it's possible to add new ones during development/production; the latter approach would require keeping all the data all the time.
If you do know all your questions, this seems like a classic "banking system" that recalculates data on a daily basis.
Here is how I think about it.
It seems you have a limited number of rooms, people, days, etc.
Gather logging data on a daily basis, one table per day: one event, one database row, with every field you need.
Start analysing the data using some cron script at "midnight".
Update the stats for people, rooms, etc.: just increment the number of hours spent by Bob in room xyz, and so on, for everything your requirements need.
Since the analysed (compressed) data are limited and relatively small, your system can also support various ad-hoc queries, as the indexes would be relatively small.
You might even be able to use a scalable map/reduce algorithm.
You can't avoid storing the atomic facts: (the meeting room, the people, the duration, the day). That is probably only a weak consolidation anyway, applying when the same people meet multiple times in the same room on the same day. Maybe that happens a lot in your office :).
Making groups comparable is an interesting problem, but as long as you always compose the member strings the same way, you can probably do it with string comparisons. This is not "normal", however. To normalise, you'll need a relation table (many-to-many) and either compose a temporary table out of your query set so it joins quickly, or use an IN clause and a count aggregate to ensure everyone is there (you'll see what I mean when you try it; there is a sketch below).
I think you can derive the minutes the board room was in use, since meetings shouldn't overlap, so a sum will work.
For storage efficiency, use integer keys for everything, with lookup tables. Dereference the integers during query parsing, or just use good old joins if you are feeling traditional.
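Here is a rough sketch of the relation-table idea combined with the IN-clause-and-count trick (all table and column names, and the person ids, are hypothetical):
-- Assumed schema:
--   meeting(id, room_id, meeting_date, minutes)
--   meeting_person(meeting_id, person_id)
-- Minutes Sam (id 2) and Julie (id 3) spent meeting TOGETHER, per room, in 2012:
SELECT m.room_id, SUM(m.minutes) AS total_minutes
FROM meeting m
JOIN (
    SELECT meeting_id
    FROM meeting_person
    WHERE person_id IN (2, 3)              -- the group we care about
    GROUP BY meeting_id
    HAVING COUNT(DISTINCT person_id) = 2   -- everyone must be there
) both_present ON both_present.meeting_id = m.id
WHERE m.meeting_date >= '2012-01-01'
  AND m.meeting_date <  '2013-01-01'
GROUP BY m.room_id;
The HAVING count must match the size of the group you're querying for; that is the "count aggregate to ensure everyone is there".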
That's how I would do it anyway :).
You'll probably have to store individual meetings to get the data you need anyway.
However you'll have to make sure you aggregate and anonymise it properly before creating your reports. Make sure to separate concerns and access levels to stay within the proper legal limits on data.
I have imperfectly clustered string data, where the items in one cluster might look like this:
[
Yellow ripe banana very tasty,
Yellow ripe banana with little dots,
Green apple with little dots,
Green ripe banana - from the market,
Yellow ripe banana,
Nice yellow ripe banana,
Cool yellow ripe banana - my favourite,
Yellow ripe,
Yellow ripe
],
where the optimal title would be 'Yellow ripe banana'.
Currently, I am using a simple heuristic, choosing the most common name (or the shortest if tied) with the help of SQL GROUP BY. My data contains a large number of such clusters, they change frequently, and every time a fruit is added to or removed from a cluster, the title of the cluster has to be recalculated.
I would like to improve two things:
(1) Efficiency - e.g., compare the new fruit name to the title of the cluster only, and avoid grouping / phrase clustering of all fruit titles each time.
(2) Precision - instead of looking for the most common complete name, I would like to extract the most common phrase. The current algorithm would choose 'Yellow ripe', which repeats twice and is the most common complete name; however, as a phrase, 'Yellow ripe banana' is the most common in the given set.
I am thinking of using Solr + Carrot2 (I have no experience with the latter). At this point, I do not need to cluster the documents - they are already clustered based on other parameters - I only need to choose the central phrase as the center/title of each cluster.
Any input is very appreciated, thanks!
Solr provides an analysis component called a ShingleFilter that you can use to create tokens from groups of adjacent words. If you put that in your analysis chain (i.e., apply it to incoming documents when you index them), and then compute facets for the resulting field with a query restricted to the "fruit cluster", you will be able to get a list of all distinct shingles along with their occurrence frequencies - I think you can even retrieve them sorted by frequency - which I think you can easily use to derive the title you want. Then, when you add a new fruit, its shingles will automatically be included in the facet computations the next time around.
Just a bit more concrete version of this proposal:
create two fields: fruit_shingle, and cluster_id.
Configure fruit_shingle with the ShingleFilter and any other processing you might want (like tokenizing at word boundaries, with maybe StandardTokenizer, prior to the ShingleFilter).
Configure cluster_id as a unique id, using whatever data you use to identify the clusters.
For each new fruit, store its text in fruit_shingle and its id in cluster_id.
Then retrieve facets for the query cluster_id:<your cluster id>, and you will get a list of words, word pairs, word triplets, etc. (shingles). You can configure the ShingleFilter to have a max length, I believe. Sort the facets by some combination of length and/or frequency that you deem appropriate, and use that as the "title" of the fruit cluster.
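Assuming the field names above and a hypothetical cluster id of 42, the facet request could look something like this (URL escaping removed for readability):
q=cluster_id:42
&rows=0
&facet=true
&facet.field=fruit_shingle
&facet.sort=count
&facet.limit=20
The returned facet counts are then your candidate phrases, already sorted by frequency.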