Mongodb huge date index size

Mongodb huge date index size - database

I have a mongodb database storing documents with a publication date and an end date. For my request i need to create an index with both on those dates.
My collection size is about 17GB for a total of 6 Millions documents.
When i create my index, the index size is about 600MB ... but when i use it intensively, the size get up to 70GB ... 6 times more than the documents themselves oO
Am i doing something wrong ? Is there some special considerations with date fields ? (I only have problem with date indexes).
Note: I suspected my dates to be "too precises" to be indexed, so i rounded them to the nearest hour ... without any index size decrease.

Okay i've found the reason of this really strange bug. I am putting it here since it can happens to others.
I'm using moment.js for managing dates on node.js
I'm using mongodb driver directly (not mongoose)
(I forgot to precise it in my question), i make an update on my dates in my intensive scripts.
I update those dates with a moment object instead of a pure javascript date making mongodb try to index every bit of subpath i think which is responsible of this huge index size.

Related

What is the optimized way for queries on partial dates in GAE Text Search?

Need to get entities filtering by month instead of complete date values (E.g. Birthdays) using Google App Engine Text Search. On verifying GAE docs, I think it is not possible to query date fields by month directly.
So in order to filter them by month/date, we consider saving each date sub value like Date(DD), Month(MM) and Year(YYYY) as separate NUMBER field along with complete date field.
I verified locally that we can achieve by saving like this. But is this the correct way of saving dates by splitting each field when we want to query on date sub values?
Is there any known/unknown limit on number of fields per document apart from 10GB size limit in GAE Text Search?
Please suggest me.
Thanks,
Naresh

The only time NUMBER or DATE fields make sense is if you need to query on ranges of values. In other cases they are wasteful.
I can't tell from your question exactly what queries you want to run. Are you looking for a (single) specific day of the month (e.g., January 6 -- of any year)? Or just "anything in June (again, without regard to year)"? Or is it a date range: something like January 20 through February 19? Or July 1 through September 30?
If it's a range then NUMBER values may make sense. But if it's just a single specific month, or a single month and day-of-month combination, then you're better off storing month and day as separate ATOM fields.
Anything that looks like a number, but isn't really going to be searched via a numerical range, or done arithmetic on, isn't really a number, and is probably best stored as an ATOM. For example, phone numbers, zip codes (unless you're terribly clever and wanting to do something like "all zip codes in San Francisco look like 941xx" -- but even then if that's what you want to do, you're probably better off just storing the "941" prefix as an ATOM).

Keep PivotTable report filter after data refresh

I have a PivotTable (actually it is five PivotTables, each on its own separate sheet) that is created from a query of an outside database. Each of the PivotTables represents a day (i.e. Today, Tomorrow, Today+2, Today+3, and Today+4). For the report filter for the first two, we use a date range filter of today and tomorrow which automatically filters the data and allows it to roll over. We created custom date ranges for the other three days, but upon every external data refresh we have to go into each sheet and reselect the report filter from all to the specified time frame. This data rolls over every day so we can see the lineup for the next 96 hours out.
Is there a way to either keep the PivotTable report filter criteria (VBA and macros are both acceptable, although we are also fairly new to both)?
Or is there some super secret way to extend the report filter from just today and tomorrow to a time range (48 hours, 96 hours) instead of next month?
I need the days to be separated, so next week will not work because all the days will populate on one page.

Without seeing a real example it's hard to tell, but how about changing the query to a relative date index, i.e. something like
SELECT DATEDIFF('day', GETDATE(), report_dt) AS days_from_today FROM reporting_table
And then set your report filters on this relative date index (days_from_today = 1 for tomorrow, etc)? You can always create another Excel column in the report =TODAY() + days_from_today to get your absolute date back. (Assuming you are just dealing with one time zone for reporting purposes.)
I.e., instead of rolling filters, keep the filters on constant indices, and let the indices cover a rolling date range. I'm not sure Excel is smart enough to do the rolling filters thing.

How to select first record prior/after a given timestamp in KDB?

I am currently just pulling in all records 1min leading up to the timestamp (e.g. if the timestamp I'm interested in is 2014.04.14T09:30):
select from Prices where timestamp within 2014.04.14T09:29 2014.04.14T09:30, stock=`GOOG
However, this is clearly not very robust. Sometimes the previous record may be at 09:25am and then the query returns nothing. Sometimes the query may return hundreds of records if there have been a lot of price changes, even though all I need is the last record returned.
I know this can be done with an asof join, but want to avoid it for the time being as Prices is simply too big at present.
I am also interested in doing the same, but in finding the first record after a given timestamp.
Note also that Prices is a splayed table

Select last record before the given timestamp:
q)select from Price where stock=`GOOG,i=last i,timestamp<2014.04.14T09:30
Select first record after the given timestamp:
q)select from Price where stock=`GOOG,i=first i,timestamp>2014.04.14T09:30

Use asof or aj to get the performance kdb+ is known for. The bigger Prices is, the more reason for doing so.
I would question your logic for avoiding aj. aj and asof use the bin operator which is binary search and hence more performant than scanning the timestamp column.
Let's create your table and run the solution from the other answer:
Prices:([]stock:`g#1000000?`GOOG,9?`4;timestamp:asc 2014.04.14+1000000?0t;price:1000000?100f,size:1000000?100j)
q)\t do[1000;select from Prices where timestamp<2014.04.14T09:30,stock=`GOOG,i=last i]
10205
We can make this a lot better by reordering the constraints:
q)\t do[1000;select from Prices where stock=`GOOG,timestamp<2014.04.14T09:30,i=last i]
2030
But nothing will beat this:
q)\t do[1000;Prices asof `stock`timestamp!(`GOOG;2014.04.14D09:30)]
9
By the way, you are using datetime in your question, which is deprecated, so I've replaced it with timestamp. This has no impact on performance.

Few more things to remember while using aj:
in-memory prices - the table should be `g#sym and time sorted within sym
on-disk prices - `p#sym and time sorted within sym
Also in case of partitioned/splayed tables, using the where constraints (except the date in the date-partitioned table) can severely impact the performance.

strange appengine query result

What am I doing wrong in this query?
SELECT * FROM TreatmentPlanDetails
WHERE
accountId = 'ag5zfmRvbW9kZW50d2ViMnIRCxIIQWNjb3VudHMYtcjdAQw' AND
status = 'done' AND
category = 'chirurgia orale' AND
setDoneCalendarEventStartTimestamp >= [timestamp for 6 june 2012] AND
setDoneCalendarEventStartTimestamp <= [timestamp for 11 june 2012] AND
deleteStatus = 'notDeleted'
ORDER BY setDoneCalendarEventStartTimestamp ASC
I am not getting any record and I am sure there are records meeting the where clause conditions. To get the correct records I have to widen the timestamp interval by 1 millisecond. Is it normal? Furthermore, if I modify this query by removing the category filter, I am getting the correct results. This is definitely weird.
I also asked on google groups, but I got no answer. Anyway, for details:
https://groups.google.com/forum/?fromgroups#!searchin/google-appengine/query/google-appengine/ixPIvmhCS3g/d4OP91yTkrEJ

Let's talk specifically about creating timestamps to go into the query. What code are you using to create the timestamp record? Apparently that's important, because fuzzing with it a little bit affects the query. It may be relevant that in the datastore, timestamps are recorded as integers representing posix timestamps with microseconds, i.e. the number of microseconds since 1/1/1970 UTC (not counting leap seconds). It's also relevant that dates (i.e. without a time) are represented as midnight, i.e. the earliest time on that day. But please show us the exact code. (It may also be important to show the actual content of the record that you're attempting to retrieve.)

An aside that is not specific to your question: Entity property names count as part of your storage quota. If this is going to be a huge dataset, you might pay more $$ than you'd like for property names like setDoneCalendarEventStartTimestamp.

Because you write :
if I modify this query by removing the category filter, I am getting
the correct results
this probably means that the category was not indexed at the time you write the matching records to the data store. You have to re-write your records to the data store if you want them added to the newly created index.

Sum field by date range

I'm using solr 3.6 and I'm kinda stuck trying to perform a special query.
I'm actually using facets by date range, the face.date.gap is set to +1DAY. Of course, the facet is supposed to return the count of docs at a date range but I also need to get the sum of a special field at the same ranges used in facet. It's like I need to count how many votes I have daily monthly, weekly, whatever... it depends on the gap params.
Any ideas? Should I use the group.query or facet.query?

One suggestion I have is to treat the weeks, days separately, and index them. For ex. Today is part of 24th week. Another suggestion is not to rule out multiple searches to service one request. One to calculate all oth facets and one to return counts for given date range (based on search results from first query).

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Mongodb huge date index size - database

Related

What is the optimized way for queries on partial dates in GAE Text Search?

Keep PivotTable report filter after data refresh

How to select first record prior/after a given timestamp in KDB?

strange appengine query result

Sum field by date range

Categories

Resources