How to do a birthday (not birth-date) search in Solr?

How to do a birthday (not birth-date) search in Solr? - solr

I have an index that stores birth-dates, and I would like to search for anybody whose birth-date is within X days of a certain month/day. For example, I'd like to know if anybody's birthday is coming up within a certain number of days, regardless of what year they were born. How would I perform this query this using Solr? (on the "birthdate" field)
As a follow-up, assuming this query is executed very often, should I be indexing something other than the birth-date? Such as just the month-day pair? What is the most efficient way to do such a query (from the query and indexing standpoint)?

You need to remember that Solr uses Lucene, and that as of now - everything is stored and indexed as a string.
Range query as is won't work because the dates are usually internally indexed as YYYYMMDD
Having a seperate field in the index that just stores MMDD strings would be easily searchable. Or if you don't want an extra field, and are willing to index the dates differently, rearrange the order when indexing so that birthdates are indexed MMDDYYY
Then you can construct rangequeries, because everything you need to match against is in the front of the string, and lucene matches lexiographically
(A rangequery that was ba -> bc would match BAt, BAseball, but not BEcause.)
Indexing like this is a onetime fixed cost, and doesnt destroy anything other than internal arrangement chronologically. If that's a problem, use two fields, disk space is cheap!)

If a day/month pair is tricky (I don't know whether it is or not) why not have a field of "their birthday in 1980" (whether they were alive then or not). Then you just need to do the search against 1980. This is effectively a day/month pair, but stored in a type you can use easily.
Note that 1980 is a leap year, which is why I chose it - otherwise those with a birthday of February 29th could be hard to represent.
Alternatively, a "day/month" pair in the form of an integer:
(100 * month) + day
would give you a simple representation which would be easy to search and index. I've usually found that storing data in a single field is simpler than using two fields. Then again, I've never used Solr...
EDIT: I've had another idea. It's a bit balmy, but even so...
Store the birth date in a format which is effectively month, day, year. I don't know if Solr could easily do it in MM/dd/yyyy format and then do a lexicographic order search, but the alternative is
(100000 * month) + (1000 * dayOfMonth) + (year - 1900)
(This is assuming you don't need it to store birth dates earlier than 1900. I'm sure you can tailor it.)
You can still recover the original birth date, but the ordering will be in birthday order, with the oldest person first for any particular date.
It does mean it's hard to sort people by their actual age though. I don't know if that's an issue for you.
Anyway, as I said it's a bit off-the-wall, but it might help :)

You could store the birthday as a number from 1 to 366. Then search that value. The advantage is that you can then search with day ranges quite easily. The disadvantage is that you can't easily use this field for finding people whose birthday is this month.

Related

Count down a column until value in column meets condition

I have a simple daily rainfall data set and would like to calculate the antecedent dry period for each day. Here, I'm defining a dry day to be "<10". I'm fairly unfamiliar with INDEX(), MATCH(), and other fancy array functions but feel like I'll need to use them.
For example, in the image, for 1/17/2020, the values in cells C3:C9=0, C10=1, C11:C13=0. I've tried various versions of COUNTIF(), COUNTIFS(), and IF() functions but I cannot get the step-wise + re-set functionality necessary when extended "dry spells" or brief rain periods occur with gaps. Thanks!

You are right, you need to use Match. Basically you need to search for the next antecedent wet day (of which there are many here in Manchester England at the moment) and subtract 1 (Formula 1):
=MATCH(TRUE,INDEX(B15:B$1000>=10,0),0)-1
where B$1000 may need changing to include all of your data. The use of Index here is just a bit of a hack to avoid having to enter the formula as an array formula.
As you can see there is an issue when you come to the end of the range which I will come to in a minute.
In this case, we want to count the number of antecedent dry days to the end of the range (Formula 2):
=IFERROR(MATCH(TRUE,INDEX(B4:B$1000>=10,0),0)-1,COUNTIF(B4:B$1000,"<10"))
If the range ended with a dry spell, you would get this:

Return the smallest value from a list, in which only certain values are eligeble - excel

I am having some troubles formulating my problem but I hope you understand!
I have a table of firms building production plants in foreign countries in certain years. (Columns A to C).
In a seperate table i have so-called cross-national distance measures (based on the difference in gdp of the countries). (Columns G to M). Note that the distances change per year.
A simplified version of the excel would look like this:
https://new.wu.ac.at/fileadmin/wu/d/i/iib/photo/stack.JPG
What I want is a formula for the manually entered results in column D. It shall give me a result which is the following:
It shall look in which countries the specific company has previously (years before) built plants
It shall find the smallest cross-national distance from the current country to any of the countries previously entered
The value should be for the year of the current plant-construction
Let me illustrate my request with the example result i would want in cell D8:
The formula would have to find a list of countries that were previously entered in this case Turkey and Bulgaria
It would then have to into the second table and give me the minimum of the distances from Kosovo but only to Turkey and Bulgaria
This would have to be done in the rows for 2008 (current year)
I really hope you guys can help me, i figured out a way to find a minimum in a list and i can do it for certain years as well but the issue i am having that excel first needs to find the previously entered countries, memorize them in some kind of array and then use only these countries to consider the minimum distance.
Thank you very much!

Try this "array formula" for D2 copied down
=IFERROR(SMALL(IF(COUNTIFS(A$2:A$11,A2,B$2:B$11,"<"&B2,C$2:C$11,"<>"&C2,C$2:C$11,I$1:M$1)*(G$2:G$31=B2)*(H$2:H$31=C2),I$2:M$31),1),"N/A")
confirmed with CTRL+SHIFT+ENTER
That checks three conditions for your larger table - that the header row matches a qualifying country (using COUNTIFS function based on criteria in the small table), that column G matches the current year and column H matches the current country.
If all those criteria are satisfied then the relevant values in the table are returned, and SMALL finds the smallest. If there's an error (because there are no qualifying values) then N/A is returned
In Excel 2010 or later versions you can use AGGREGATE function instead of SMALL - this is useful because it doesn't require "array entry"
=IFERROR(AGGREGATE(15,6,I$2:M$31/(COUNTIFS(A$2:A$11,A2,B$2:B$11,"<"&B2,C$2:C$11,"<>"&C2,C$2:C$11,I$1:M$1)>0)/(G$2:G$31=B2)/(H$2:H$31=C2),1),"N/A")

How to represent end-of-time in a database?

I am wondering how to represent an end-of-time (positive infinity) value in the database.
When we were using a 32-bit time value, the obvious answer was the actual 32-bit end of time - something near the year 2038.
Now that we're using a 64-bit time value, we can't represent the 64-bit end of time in a DATETIME field, since 64-bit end of time is billions of years from now.
Since SQL Server and Oracle (our two supported platforms) both allow years up to 9999, I was thinking that we could just pick some "big" future date like 1/1/3000.
However, since customers and our QA department will both be looking at the DB values, I want it to be obvious and not appear like someone messed up their date arithmetic.
Do we just pick a date and stick to it?

Use the max collating date, which, depending on your DBMS, is likely going to be 9999-12-31. You want to do this because queries based on date ranges will quickly become miserably complex if you try to take a "purist" approach like using Null, as suggested by some commenters or using a forever flag, as suggested by Marc B.
When you use max collating date to mean "forever" or "until further notice" in your date ranges, it makes for very simple, natural queries. It makes these kind of queries very clear and simple:
Find me records that are in effect as of a given point in time.
... WHERE effective_date <= #PointInTime AND expiry_date >= #PointInTime
Find me records that are in effect over the following time range.
... WHERE effective_date <= #StartOfRange AND expiry_date >= #EndOfRange
Find me records that have overlapping date ranges.
... WHERE A.effective_date <= B.expiry_date AND B.effective_date <= A.expiry_date
Find me records that have no expiry.
... WHERE expiry_date = #MaxCollatingDate
Find me time periods where no record is in effect.
OK, so this one isn't simple, but it's simpler using max collating dates for the end point. See: this question for a good approach.
Using this approach can create a bit of an issue for some users, who might find "9999-12-31" to be confusing in a report or on a screen. If this is going to be a problem for you then drdwicox's suggestion of using a translation to a user-friendly value is good. However, I would suggest that the user interface layer, not the middle tier, is the place to do this, since what may be the most sensible or palatable may differ, depending on whether you are talking about a report or a data entry form and whether the audience is internal or external. For example, some places what you might want is a simple blank. Others you might want the word "forever". Others you may want an empty text box with a check box that says "Until Further Notice".

In PostgreSQL, the end of time is 'infinity'. It also supports '-infinity'. The value 'infinity' is guaranteed to be later than all other timestamps.
create table infinite_time (
ts timestamp primary key
);
insert into infinite_time values
(current_timestamp),
('infinity');
select *
from infinite_time
order by ts;
2011-11-06 08:16:22.078
infinity
PostgreSQL has supported 'infinity' and '-infinity' since at least version 8.0.
You can mimic this behavior, in part at least, by using the maximum date your dbms supports. But the maximum date might not be the best choice. PostgreSQL's maximum timestamp is some time in the year 294,276, which is sure to surprise some people. (I don't like to surprise users.)
2011-11-06 08:16:21.734
294276-01-01 00:00:00
infinity
A value like this is probably more useful: '9999-12-31 11:59:59.999'.
2011-11-06 08:16:21.734
9999-12-31 11:59:59.999
infinity
That's not quite the maximum value in the year 9999, but the digits align nicely. You can wrap that value in an infinity() function and in a CREATE DOMAIN statement. If you build or maintain your database structure from source code, you can use macro expansion to expand INFINITY to a suitable value.

We sometimes pick a date, then establish a policy that the date must never appear unfiltered. The most common place to enforce that policy is in the middle tier. We just filter the results to change the "magic" end-of-time date to something more palatable.

Representing the notion of "until eternity" or "until further notice" is an iffy proposition.
Relational theory proper says that there is no such thing as null, so you're obliged to have whatever table it is split in two: one part with the rows for which the end date/end time is known, and another for the rows for which the end time is not yet known.
But (like having a null) splitting the tables in two will make a mess of your query writing too. Views can somewhat accommodate the read-only parts, but updates (or writing the INSTEAD OF on your view) will be tough no matter what, and likely to affect performance negatively no matter what at that).
Having the null represent "end time not yet known" will make updating a bit "easier", but the read queries get messy with all the CASE ... or COALESCE ... constructs you'll need.
Using the theoretically correct solution mentioned by dportas gets messy in all those cases where you want to "extract" a DATE from a DATETIME. If the DATETIME value at hand is "the end of (representable) time (billions of years from now as you say)", then this is not just a simple case of invoking the DATE extractor function on that DATETIME value, because you'd also want that DATE extractor to produce the "end of representable DATEs" for your case.
Plus, you probably do not want to show "absent end of time" as being a value 9999-12-31 in your user interface. So if you use the "real value" of the end of time in your database, you're facing a bit of work seeing to it that that value won't appear in your UI anywhere.
Sorry for not being able to say that there's a way to stay out of all messes. The only choice you really have is which mess to end up in.

Don't make a date be "special". While it's unlikely your code would be around in 9999 or even in 2^63-1, look at all the fun that using '12/31/1999' caused just a few years ago.
If you need to signal an "endless" or "infinite" time, then add a boolean/bit field to signal that state.

Web stats: Calculating/estimating unique visitors for arbitary time intervals

I am writing an application which is recording some 'basic' stats -- page views, and unique visitors. I don't like the idea of storing every single view, so have thought about storing totals with a hour/day resolution. For example, like this:
Tuesday 500 views 200 unique visitors
Wednesday 400 views 210 unique visitors
Thursday 800 views 420 unique visitors
Now, I want to be able to query this data set on chosen time periods -- ie, for a week. Calculating views is easy enough: just addition. However, adding unique visitors will not give the correct answer, since a visitor may have visited on multiple days.
So my question is how do I determine or estimate unique visitors for any time period without storing each individual hit. Is this even possible? Google Analytics reports these values -- surely they don't store every single hit and query the data set for every time period!?
I can't seem to find any useful information on the net about this. My initial instinct is that I would need to store 2 sets of values with different resolutions (ie day and half-day), and somehow interpolate these for all possible time ranges. I've been playing with the maths, but can't get anything to work. Do you think I may be on to something, or on the wrong track?
Thanks,
Brendon.

If you are OK with approximations, I think tom10 is onto something, but his notion of random subsample is not the right one or needs a clarification. If I have a visitor that comes on day1 and day2, but is sampled only on day2, that is going to introduce a bias in the estimation. What I would do is to store full information for a random subsample of users (let's say, all users whose hash(id)%100 == 1). Then you do the full calculations on the sampled data and multiply by 100. Yes tom10 said about just that, but there are two differences: he said "for example" sample based on the ID and I say that's the only way you should sample because you are interested in unique visitors. If you were interested in unique IPs or unique ZIP codes or whatever you would sample accordingly. The quality of the estimation can be assessed using the normal approximation to the binomial if your sample is big enough. Beyond this, you can try and use a model of user loyalty, like you observe that over 2 days 10% of visitors visit on both days, over three days 11% of visitors visit twice and 5% visit once and so forth up to a maximum number of day. These numbers unfortunately can depend on time of the week, season and even modeling those, loyalty changes over time as the user base matures, changes in composition and the service changes as well, so any model needs to be re-estimated. My guess is that in 99% of practical situations you'd be better served by the sampling technique.

You could store a random subsample of the data, for example, 10% of the visitor IDs, then compare these between days.
The easiest way to do this is to store a random subsample of each day for future comparisons, but then, for the current day, temporarily store all your IDs and compare them to the subsampled historical data and determine the fraction of repeats. (That is, you're comparing the subsampled data to a full dataset for a given day and not comparing two subsamples -- it's possible to compare two subsamples and get an estimate for the total but the math would be a bit trickier.)

You don't need to store every single view, just each unique session ID per hour or day depending on the resolution you need in your stats.
You can keep these log files containing session IDs sorted to count unique visitors quickly, by merging multiple hours/days. One file per hour/day, one unique session ID per line.
In *nix, a simple one-liner like this one will do the job:
$ sort -m sorted_sid_logs/2010-09-0[123]-??.log | uniq | wc -l
It counts the number of unique visitors during the first three days of September.

You can calculate the uniqueness factor (UF) on each day and use it to calculate the composite (week by example) UF.
Let's say that you counted:
100 visits and 75 unique session id's on monday (you have to store the sessions ID's at least for a day, or the period you use as unit).
200 visits and 100 unique session id's on tuesday.
If you want to estimate the UF for the period Mon+Tue you can do:
UV = UVmonday + UVtuesday = TVmonday*UFmonday + TVtuesday*UFtuesday
being:
UV = Unique Visitors
TV = Total Visits
UF = Uniqueness Factor
So...
UV = (Sum(TVi*UFi))
UF = UV / TV
TV = Sum(TVi)
I hope it helps...
This math counts two visits of the same person as two unique visitors. I think it's ok if the only way you have to identify somebody is via the session ID.

Best way to store time (hh:mm) in a database

I want to store times in a database table but only need to store the hours and minutes.
I know I could just use DATETIME and ignore the other components of the date, but what's the best way to do this without storing more info than I actually need?

You could store it as an integer of the number of minutes past midnight:
eg.
0 = 00:00
60 = 01:00
252 = 04:12
You would however need to write some code to reconstitute the time, but that shouldn't be tricky.

If you are using SQL Server 2008+, consider the TIME datatype. SQLTeam article with more usage examples.

DATETIME start DATETIME end
I implore you to use two DATETIME values instead, labelled something like event_start and event_end.
Time is a complex business
Most of the world has now adopted the denery based metric system for most measurements, rightly or wrongly. This is good overall, because at least we can all agree that a g, is a ml, is a cubic cm. At least approximately so. The metric system has many flaws, but at least it's internationally consistently flawed.
With time however, we have; 1000 milliseconds in a second, 60 seconds to a minute, 60 minutes to an hour, 12 hours for each half a day, approximately 30 days per month which vary by the month and even year in question, each country has its time offset from others, the way time is formatted in each country vary.
It's a lot to digest, but the long and short of it is impossible for such a complex scenario to have a simple solution.
Some corners can be cut, but there are those where it is wiser not to
Although the top answer here suggests that you store an integer of minutes past midnight might seem perfectly reasonable, I have learned to avoid doing so the hard way.
The reasons to implement two DATETIME values are for an increase in accuracy, resolution and feedback.
These are all very handy for when the design produces undesirable results.
Am I storing more data than required?
It might initially appear like more information is being stored than I require, but there is a good reason to take this hit.
Storing this extra information almost always ends up saving me time and effort in the long-run, because I inevitably find that when somebody is told how long something took, they'll additionally want to know when and where the event took place too.
It's a huge planet
In the past, I have been guilty of ignoring that there are other countries on this planet aside from my own. It seemed like a good idea at the time, but this has ALWAYS resulted in problems, headaches and wasted time later on down the line. ALWAYS consider all time zones.
C#
A DateTime renders nicely to a string in C#. The ToString(string Format) method is compact and easy to read.
E.g.
new TimeSpan(EventStart.Ticks - EventEnd.Ticks).ToString("h'h 'm'm 's's'")
SQL server
Also if you're reading your database seperate to your application interface, then dateTimes are pleasnat to read at a glance and performing calculations on them are straightforward.
E.g.
SELECT DATEDIFF(MINUTE, event_start, event_end)
ISO8601 date standard
If using SQLite then you don't have this, so instead use a Text field and store it in ISO8601 format eg.
"2013-01-27T12:30:00+0000"
Notes:
This uses 24 hour clock*
The time offset (or +0000) part of the ISO8601 maps directly to longitude value of a GPS coordiate (not taking into account daylight saving or countrywide).
E.g.
TimeOffset=(±Longitude.24)/360
...where ± refers to east or west direction.
It is therefore worth considering if it would be worth storing longitude, latitude and altitude along with the data. This will vary in application.
ISO8601 is an international format.
The wiki is very good for further details at http://en.wikipedia.org/wiki/ISO_8601.
The date and time is stored in international time and the offset is recorded depending on where in the world the time was stored.
In my experience there is always a need to store the full date and time, regardless of whether I think there is when I begin the project. ISO8601 is a very good, futureproof way of doing it.
Additional advice for free
It is also worth grouping events together like a chain. E.g. if recording a race, the whole event could be grouped by racer, race_circuit, circuit_checkpoints and circuit_laps.
In my experience, it is also wise to identify who stored the record. Either as a seperate table populated via trigger or as an additional column within the original table.
The more you put in, the more you get out
I completely understand the desire to be as economical with space as possible, but I would rarely do so at the expense of losing information.
A rule of thumb with databases is as the title says, a database can only tell you as much as it has data for, and it can be very costly to go back through historical data, filling in gaps.
The solution is to get it correct first time. This is certainly easier said than done, but you should now have a deeper insight of effective database design and subsequently stand a much improved chance of getting it right the first time.
The better your initial design, the less costly the repairs will be later on.
I only say all this, because if I could go back in time then it is what I'd tell myself when I got there.

Just store a regular datetime and ignore everything else. Why spend extra time writing code that loads an int, manipulates it, and converts it into a datetime, when you could just load a datetime?

since you didn't mention it bit if you are on SQL Server 2008 you can use the time datatype otherwise use minutes since midnight

SQL Server actually stores time as fractions of a day. For example, 1 whole day = value of 1. 12 hours is a value of 0.5.
If you want to store the time value without utilizing a DATETIME type, storing the time in a decimal form would suit that need, while also making conversion to a DATETIME simple.
For example:
SELECT CAST(0.5 AS DATETIME)
--1900-01-01 12:00:00.000
Storing the value as a DECIMAL(9,9) would consume 5 bytes. However, if precision to not of utmost importance, a REAL would consume only 4 bytes. In either case, aggregate calculation (i.e. mean time) can be easily calculated on numeric values, but not on Data/Time types.

I would convert them to an integer (HH*3600 + MM*60), and store it that way. Small storage size, and still easy enough to work with.

If you are using MySQL use a field type of TIME and the associated functionality that comes with TIME.
00:00:00 is standard unix time format.
If you ever have to look back and review the tables by hand, integers can be more confusing than an actual time stamp.

Instead of minutes-past-midnight we store it as 24 hours clock, as an SMALLINT.
09:12 = 912
14:15 = 1415
when converting back to "human readable form" we just insert a colon ":" two characters from the right. Left-pad with zeros if you need to. Saves the mathematics each way, and uses a few fewer bytes (compared to varchar), plus enforces that the value is numeric (rather than alphanumeric)
Pretty goofy though ... there should have been a TIME datatype in MS SQL for many a year already IMHO ...

Try smalldatetime. It may not give you what you want but it will help you in your future needs in date/time manipulations.

Are you sure you will only ever need the hours and minutes? If you want to do anything meaningful with it (like for example compute time spans between two such data points) not having information about time zones and DST may give incorrect results. Time zones do maybe not apply in your case, but DST most certainly will.

What I think you're asking for is a variable that will store minutes as a number. This can be done with the varying types of integer variable:
SELECT 9823754987598 AS MinutesInput
Then, in your program you could simply view this in the form you'd like by calculating:
long MinutesInAnHour = 60;
long MinutesInADay = MinutesInAnHour * 24;
long MinutesInAWeek = MinutesInADay * 7;
long MinutesCalc = long.Parse(rdr["MinutesInput"].toString()); //BigInt converts to long. rdr is an SqlDataReader.
long Weeks = MinutesCalc / MinutesInAWeek;
MinutesCalc -= Weeks * MinutesInAWeek;
long Days = MinutesCalc / MinutesInADay;
MinutesCalc -= Days * MinutesInADay;
long Hours = MinutesCalc / MinutesInAnHour;
MinutesCalc -= Hours * MinutesInAnHour;
long Minutes = MinutesCalc;
An issue arises where you request for efficiency to be used. But, if you're short for time then just use a nullable BigInt to store your minutes value.
A value of null means that the time hasn't been recorded yet.
Now, I will explain in the form of a round-trip to outer-space.
Unfortunately, a table column will only store a single type. Therefore, you will need to create a new table for each type as it is required.
For example:
If MinutesInput = 0..255 then use TinyInt (Convert as described above).
If MinutesInput = 256..131071 then use SmallInt (Note: SmallInt's min
value is -32,768. Therefore, negate and add 32768 when storing and
retrieving value to utilise full range before converting as above).
If MinutesInput = 131072..8589934591 then use Int (Note: Negate and add
2147483648 as necessary).
If MinutesInput = 8589934592..36893488147419103231 then use BigInt
(Note: Add and negate 9223372036854775808 as necessary).
If MinutesInput > 36893488147419103231 then I'd personally use
VARCHAR(X) increasing X as necessary since a char is a byte. I shall
have to revisit this answer at a later date to describe this in full
(or maybe a fellow stackoverflowee can finish this answer).
Since each value will undoubtedly require a unique key, the efficiency of the database will only be apparent if the range of the values stored are a good mix between very small (close to 0 minutes) and very high (Greater than 8589934591).
Until the values being stored actually reach a number greater than 36893488147419103231 then you might as well have a single BigInt column to represent your minutes, as you won't need to waste an Int on a unique identifier and another int to store the minutes value.

The saving of time in UTC format can help better as Kristen suggested.
Make sure that you are using 24 hr clock because there is no meridian AM or PM be used in UTC.
Example:
4:12 AM - 0412
10:12 AM - 1012
2:28 PM - 1428
11:56 PM - 2356
Its still preferrable to use standard four digit format.

Store the ticks as a long/bigint, which are currently measured in milliseconds. The updated value can be found by looking at the TimeSpan.TicksPerSecond value.
Most databases have a DateTime type that automatically stores the time as ticks behind the scenes, but in the case of some databases e.g. SqlLite, storing ticks can be a way to store the date.
Most languages allow the easy conversion from Ticks → TimeSpan → Ticks.
Example
In C# the code would be:
long TimeAsTicks = TimeAsTimeSpan.Ticks;
TimeAsTimeSpan = TimeSpan.FromTicks(TimeAsTicks);
Be aware though, because in the case of SqlLite, which only offers a small number of different types, which are; INT, REAL and VARCHAR It will be necessary to store the number of ticks as a string or two INT cells combined. This is, because an INT is a 32bit signed number whereas BIGINT is a 64bit signed number.
Note
My personal preference however, would be to store the date and time as an ISO8601 string.

IMHO what the best solution is depends to some extent on how you store time in the rest of the database (and the rest of your application)
Personally I have worked with SQLite and try to always use unix timestamps for storing absolute time, so when dealing with the time of day (like you ask for) I do what Glen Solsberry writes in his answer and store the number of seconds since midnight
When taking this general approach people (including me!) reading the code are less confused if I use the same standard everywhere

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight