Are there data sources for high fidelity cloud cover data? - weather

Is there anything near the resolution of Himawari-8 for the entire globe?
The only one I know of is http://neo.sci.gsfc.nasa.gov/view.php?datasetId=MODAL2_D_CLD_FR&date=2016-01-01, however it has missing stripes and the granularity is only daily.

Do you mean global cloud-coverage data? Himawari-8 is a geostationary satellite, so it can observe its full-disk area continuously, every 10 minutes, but it only covers one part of the Earth.
So if you want a global cloud-coverage product from geostationary satellites, you may look for data combined from several geo-satellites, such as GOES-R (US), MSG (Europe), Himawari-8 (Japan), etc.
Previously there was the CLAUS dataset, developed by BADC in the UK, but (I think) the data only runs until 2008 (need to check).
Hope it might help.

Which kind of DBs calculate rate per minute statistics?

I have a use-case requirement where I want to design a hashtag ranking system: the 10 most popular hashtags should be selected. My idea is something like this:
[hashtag, rateofhitsperminute, rateofhitsper5minutes]
Then I would query to find the 10 most popular hashtags, i.e. the ones whose rate of hits per minute is highest.
My question is: what sort of databases can I use to provide statistics like 'rateofhitsperminute'?
What is a good way to calculate such a value and store it in a DB? Do some DBs offer these features?
First of all, "rate of hits per minute" is calculated:
[hits during period]/[length of period]
So the rate will vary depending on how long the period is. (The last minute? The last 10 minutes? Since the hits started being recorded? Since the hashtag was first used?)
So what you really want to store is the count of hits, not the rate. It is better to either:
Store the hashtags and their hit counts during a certain period (less memory/cpu required but less flexible)
OR the timestamp and hashtag of each hit (more memory/cpu required but more flexible)
Now it is a matter of selecting the time period of interest, and querying the database to find the top 10 hashtags with the most hits during that period.
If you need to display the rate, use the formula above, but notice it does not change the order of the top hashtags because the period is the same for every hashtag.
You can apply the algorithm above to almost any DB. You can even do it without using a database (just use a programming language's builtin hashmap).
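For illustration, here is a minimal Python sketch of the hashmap approach (no database at all). It assumes hits arrive as (timestamp, hashtag) pairs; the data and names are made up for this example:
from collections import Counter
from datetime import datetime, timedelta

# Each hit is a (timestamp, hashtag) pair; in a real system these would
# come from your event stream or logs.
hits = [
    (datetime(2016, 1, 1, 10, 0, 5), "java"),
    (datetime(2016, 1, 1, 10, 0, 30), "python"),
    (datetime(2016, 1, 1, 10, 0, 45), "java"),
]

def top_hashtags(hits, start, end, k=10):
    # Count hits per hashtag inside the chosen period...
    counts = Counter(tag for ts, tag in hits if start <= ts < end)
    # ...and return the k hashtags with the most hits.
    return counts.most_common(k)

period_start = datetime(2016, 1, 1, 10, 0, 0)
period_end = period_start + timedelta(minutes=1)
print(top_hashtags(hits, period_start, period_end))
# Dividing each count by the period length gives the rate, but it does not
# change the order of the top hashtags.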
If performance is a concern and there will be many different hashtags, I suggest using an OLAP database. OLAP databases are specially designed for top-k queries (over a certain time period) like this.
Having said that, here is an example of how to accomplish your use case in Solr: Solr as an Analytics Platform. Solr is not an OLAP database, but this example uses Solr like an OLAP DB and seems to be the easiest to implement and adapt to your use case:
Your Solr schema would look like:
<fields>
<field name="hashtag" type="string"/>
<field name="hit_date" type="date"/>
</fields>
An example document would be:
{
"hashtag": "java",
"hit_date": "2012-12-04T10:30:45Z"
}
A query you could use would be:
http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=hashtag&facet.mincount=1&facet.limit=10&facet.range=hit_date&facet.range.start=2012-01-01T00:00:00Z&facet.range.end=2013-01-01T00:00:00Z&facet.range.gap=%2B1DAY
Finally, here are some advanced resources related to this question:
Similar question: Implementing twitter and facebook like hashtags
What is the best way to compute trending topics or tags? An interesting idea I got from these answers is to use the derivative of the hit counts over time to calculate the "instantaneous" hit rate (see the sketch after this list).
HyperLogLog can be used to estimate the hit counts if an approximate calculation is acceptable.
Look into Sliding-Window Top-K if you want to get really academic on this topic.
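To make the derivative idea above concrete, here is a rough Python sketch; the per-minute counts are invented for illustration:
# Cumulative hit counts for one hashtag, sampled once per minute (made up).
cumulative = [0, 12, 30, 31, 90, 180]
interval_minutes = 1

# The discrete derivative (difference between consecutive samples divided by
# the sampling interval) approximates the "instantaneous" hit rate.
rates = [(b - a) / interval_minutes for a, b in zip(cumulative, cumulative[1:])]
print(rates)  # [12.0, 18.0, 1.0, 59.0, 90.0] hits per minute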
No database has rate-per-minute statistics built in, but with any modern database you could quite easily calculate rate per minute or any other derived values you need.
Your question is like asking which kind of car can drive from New York to LA. No car can drive itself or refuel itself along the way (I should be careful with this analogy because I guess cars are almost doing this now!), but you could drive any car you like from New York to LA. Some will be more comfortable, some more fuel-efficient and some faster than others, but you're going to have to do the driving and refueling.
You can use InfluxDB. It's well suited for your use case, since it was created to handle time series data (for example "hits per minute").
In your case, every time there is a hit, you could send a record containing the name of the hashtag and a timestamp.
The data is queryable, and there are already tools that can help you process or visualize it (like Grafana).
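As a rough sketch of what such a record could look like with the influxdb-python client (assuming InfluxDB 1.x; the database and measurement names here are made up):
from influxdb import InfluxDBClient

# Connect to a local InfluxDB instance and a database called "hashtags"
# (both assumed to exist already).
client = InfluxDBClient(host="localhost", port=8086, database="hashtags")

# One record per hit: the hashtag as a tag, the event time as the timestamp.
client.write_points([
    {
        "measurement": "hits",
        "tags": {"hashtag": "java"},
        "time": "2016-01-01T10:00:00Z",
        "fields": {"value": 1},
    }
])

# InfluxQL can then aggregate hits per minute, for example:
# SELECT count("value") FROM "hits" WHERE time > now() - 1h
# GROUP BY time(1m), "hashtag"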
If you are happy with a large data set you could store and calculate this information yourself.
I believe Mongo is fairly fast when it comes to index-based queries, so you could structure something like this.
Every time a tag is "hit" or accessed, you could store this information as a row:
[Tag][Timestamp]
Storing it in such a fashion allows you to run simple group, count and sort operations, which gives you your first requirement: calculating the 10 most popular tags.
With the information in this format you can then run further queries by tag and timestamp to count the hits for a specific tag between times X and Y, which gives you your hits per period (a sketch follows the lists below).
Benefits of doing it this way:
High information granularity depending on time frames supplied via query
These queries are rather fast in mongoDB or similar databases even on large data sets
Negatives of doing it this way:
You have to store many rows of data
You have to perform queries to retrieve the information you need rather than returning a single data row
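Here is a minimal sketch of the queries described above using pymongo (the collection and field names are assumptions for this example):
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
hits = client.analytics.tag_hits            # one document per hit: {"tag": ..., "ts": ...}
hits.create_index([("ts", 1), ("tag", 1)])  # speeds up the time-range queries

def top_tags(start, end, k=10):
    # Group, count and sort the hits that fall between start and end,
    # then keep the k most popular tags.
    pipeline = [
        {"$match": {"ts": {"$gte": start, "$lt": end}}},
        {"$group": {"_id": "$tag", "hits": {"$sum": 1}}},
        {"$sort": {"hits": -1}},
        {"$limit": k},
    ]
    return list(hits.aggregate(pipeline))

print(top_tags(datetime(2016, 1, 1), datetime(2016, 1, 2)))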

Databasing to feed ~9k data points to High Charts

I am working on a freelance project that captures an audio file, runs some fourier analysis, and spits out three charts (x-y plots). Each chart has about ~3000 data points, which I plan to display with High Charts in the browser.
What database techniques do you recommend for storing and accessing this much data? Should I be storing the points in an array or in multiple rows? I'm considering Mongo too. Plan is to use Rails, so I was hoping to use a single database for both data and authentication.
I haven't dealt with queries accessing this much data for a single page, and this may very well be a tiny overall amount of data. In addition this is an MVP for demonstration to investors, so making it scalable to huge levels isn't of immediate concern.
My initial thought is that using Postgres and having one large table of data points, stored per-row, will be fine, and that a bunch of doubles is not going to be too memory-intensive relative to images and such.
Realistically, I may just pull 100 evenly-spaced data points to make the chart, but the original data must still be stored.
I've done a lot of Mongo work and I can tell you what I would do if I were you.
One of the very nice properties about your data is that the x,y coordinates are of a fixed size generally. In other words it's not like you are storing comments from users, which can vary greatly in size.
With Mongo I would first make a sample document with the 3,000 points. Just a simple array of x,y points. I would see how big that document is and how my front end handled it - in other words can High Charts handle that?
I would also try to stick to the easiest conceptual model to manage, which is one document per chart, each chart having 3k points. This is a natural way to think of the data and I would start there and see if there were any performance hits. Mongo can easily store those documents, so I think the biggest pain would be in the UI with rendering the data.
Mongo would handle authentication well. I think it's a good choice for general data storage for an MVP.
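A minimal pymongo sketch of the one-document-per-chart layout described above (the collection, field names and placeholder data are assumptions):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
charts = client.audio_app.charts  # one document per chart

# Store a whole chart as a single document holding an array of [x, y] points.
points = [[i * 0.01, (i * 0.01) ** 2] for i in range(3000)]  # placeholder data
charts.insert_one({"recording_id": "demo-1", "name": "spectrum", "points": points})

# For display, fetch the document and take ~100 evenly spaced points.
doc = charts.find_one({"recording_id": "demo-1", "name": "spectrum"})
step = max(1, len(doc["points"]) // 100)
downsampled = doc["points"][::step]
print(len(downsampled))  # roughly 100 points to hand to the charting library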

What is a good relational database design for stock market data?

Suppose there are two types of messages, QUOTE and TRADE. Both have different fields. For example TRADE has only a single price. QUOTE has both a bid and ask price. I want process messages in time order to do something like the following:
if (QUOTE) {
...
}
if (TRADE) {
...
}
My problem is that the two messages are in different formats, so I can't get them into the same database table. If I can't get them into the same database table, how do I process them sequentially? Any ideas for a suitable design?
The answer depends entirely on what you're doing and on where your app plugs into the data streams.
At one extreme, you might merely be answering customer quotes that you're pulling from an API, and basically implementing a cache. In this case two tables are fine.
At the other extreme, you might be monitoring real-time quotes for a high frequency trading platform, in which case the throughput will probably rule out using a database at all (things built around lisp, such as allegrograph, might be more appropriate), except to periodically collect aggregate statistics.
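Whatever the storage, one way to handle the "process in time order" part with two tables is to merge the two time-ordered result sets on the application side. A rough Python sketch (the record shapes are assumptions, not a real schema):
import heapq

# Rows as they might come back from two separate tables, each already
# ordered by timestamp.
quotes = [
    {"type": "QUOTE", "ts": 1, "bid": 99.9, "ask": 100.1},
    {"type": "QUOTE", "ts": 4, "bid": 100.0, "ask": 100.2},
]
trades = [
    {"type": "TRADE", "ts": 2, "price": 100.0},
    {"type": "TRADE", "ts": 3, "price": 100.1},
]

# heapq.merge lazily merges the two sorted streams into one time-ordered stream.
for msg in heapq.merge(quotes, trades, key=lambda m: m["ts"]):
    if msg["type"] == "QUOTE":
        pass  # handle quote (bid/ask)
    elif msg["type"] == "TRADE":
        pass  # handle trade (price)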
The short answer is, 'not really'. For stock market and other time series data, a key-value store like Berkeley DB or Mongo is pretty good. Also, a data format like NetCDF (http://en.wikipedia.org/wiki/NetCDF) will likely serve you better in the long run. It also depends on what kind of access you want and how long a span of data you want to store.
You didn't indicate what you are doing with the data, which should inform your choice of storage more than anything. For example, a high-speed trading application will have different storage tradeoffs than a historical batch-processing system (where Hadoop + NetCDF would be great). YMMV
Kdb+/q is a very good option for tick data; it is used by major banks.
Here is the info about that.
You can install a trial version and play with it.

best way to statistically detect anomalies in data

Our webapp collects a huge amount of data about user actions, network activity, database load, etc.
All of this data is stored in warehouses and we have quite a lot of interesting views on it.
If something odd happens, chances are it shows up somewhere in the data.
However, to detect manually whether something out of the ordinary is going on, one has to continually look through this data and hunt for oddities.
My question: what is the best way to detect changes in dynamic data that can be considered 'out of the ordinary'?
Are Bayesian filters (I've seen these mentioned when reading about spam detection) the way to go?
Any pointers would be great!
EDIT:
To clarify: the data shows, for example, a daily curve of database load.
This curve typically looks similar to yesterday's curve.
Over time this curve might change slowly.
It would be nice if a warning could go off when the curve changes from one day to the next by more than some parameters.
R
Take a look at Control Charts, they provide a way to track changes in your data visually and specify when the data is "out of control" or "anomalous". They are heavily used in manufacturing to ensure quality control.
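To give a feel for the idea, here is a minimal control-chart-style check in Python (the load numbers are invented): points more than three standard deviations from the historical mean are flagged.
import statistics

# Historical database-load samples for one time-of-day slot (made-up numbers).
history = [110, 105, 98, 112, 107, 101, 109, 104, 111, 106]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Classic control-chart limits: mean +/- 3 standard deviations.
upper = mean + 3 * stdev
lower = mean - 3 * stdev

def is_anomalous(value):
    # "Out of control" if the new observation falls outside the limits.
    return value > upper or value < lower

print(is_anomalous(108))  # False: within normal variation
print(is_anomalous(190))  # True: worth a warning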
This question is impossible to answer without knowing much more about the particular data you have. For an overview of what kinds of approaches exist, see Anomaly Detection: A Survey by Chandola, Banerjee, and Kumar.
Bayesian classification might help you find some anomalies in your data, depending on the type of data and how well you train your Bayesian filter.
There is even one available as a web service at uClassify.com.
This depends so much on what the data is. Take a statistics class and learn the basics first. This isn't usually an easy or simple problem.

What is the life span of data?

Recently I’ve found myself in a database tangle where management wants the ability to remove data from the database, but still wants that data to appear in other places. Example: they want to remove all instances of the product whizbang, but they still want whizbang to appear in sales reports (if they ran one for a previous date).
Now I can add a field, say is_deleted, that will track whether that product has been deleted and thus still keep all my references, but over a period of time, I have the potential of housing a lot of dead data. (data that is never accessed again). How to handle this is not my question.
I’m curious to find out, in your experience what is the average life span of data? That is, on average how long is data alive or good for before it gets either replaced or deleted? I understand that this is relative to the type of data you are housing, but certainly all data has some sort of life span?
Data lives forever... or often it should. One common practice is to have end and/or start dates for a record. So for your whizbang, you have a start date (so that it won't appear on sales reports before its official launch) and an end date (so that it drops off reports after it has been end-of-lifed). Using the proper dates as criteria for your reporting as well as your applications, you won't see the whizbang except when you should, and the data still exists (which it should, theoretically indefinitely).
As Koistya Navin mentions, moving data to a data warehouse at a certain point is also an option, but this depends in large part on how large your 'old' data is, and how long you need to keep it readily available for access.
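A tiny sketch of the start/end-date idea using Python's built-in sqlite3 module (the table and column names are made up for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        name TEXT,
        start_date TEXT,   -- product becomes visible on reports
        end_date TEXT      -- NULL while the product is still active
    )
""")
conn.execute("INSERT INTO products VALUES ('whizbang', '2007-01-01', '2008-06-30')")
conn.execute("INSERT INTO products VALUES ('gadget',   '2008-01-01', NULL)")

# A sales report for a past date still sees whizbang, because it was active
# on that date; current views simply filter it out by the same criteria.
report_date = "2008-03-15"
rows = conn.execute(
    """
    SELECT name FROM products
    WHERE start_date <= ?
      AND (end_date IS NULL OR end_date >= ?)
    """,
    (report_date, report_date),
).fetchall()
print(rows)  # [('whizbang',), ('gadget',)]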
Many of our customers keep data online for 2 years. After that it's moved to backup disks, but it can be put online if needed.
Consider adding a column such as "expiration" or "effective date". This will allow you to mark a product as obsolete, but reports will still return that product if the time range is satisfied.
Usually it's better to move such data into a separate database (a data warehouse) and keep the working database clean. In a data warehouse your data can be kept for many years without impacting your application.
Reference: Data Warehouse at Wikipedia
I've always gone by what the ruling body is looking for. For example, the IRS wants you to keep 7 years of history, or for security reasons we keep 3 years of log information, etc. So I guess you could do two things: determine what the life span of your data is (I would say 3 years would be enough), and then add the is_deleted flag along with a date, so that you can flag some data for deletion sooner rather than later.
Yes, all data has a lifespan. And yes, it is relative to the type of data you have.
Some data has a lifespan measured in seconds (authentication tokens, for instance), while other data lives a virtual eternity (outliving the media and formats it is stored in, like, for instance, ownership records).
You will have to either be more specific as to the type of data you are envisioning, or do a census in your own organization as to the usual lifespan of stuff.
Our particular flavor varies. We have some data (a vast majority) which goes stale after 3 months (hard product limit) but can be revived at any later date.
We have other data that is effectively immortal.
In practice, most of the data we serve up is fresh and frequently requested for a few weeks, at most a month, before falling to sporadic use.
How much is "a lot of dead data"?
With processing power and data storage so cheap, I wouldn't purge old data unless there's a really good reason to. You also need to consider the legal implications. Large (and even small) companies may have incredibly long retention policies for old data, to save themselves millions down the road when they are subpoenaed for it by a judge.
I would check with whatever legal department you have and find out how long the data needs to be stored. That's the safest bet.
Also, ask yourself what the benefit of removing the old data is. Is the only benefit a tidier database? If so, I wouldn't do it. Are you going to see a 10X performance increase? If so, I'd do it. This really is a complex question though, and it's tough for us to have all the information required to give you good advice.
I have a few projects where the customer wants all the historical data (going back over 19 years). Quite a bit of the really old data is malformed and is going to be a nightmare to import into the new system. We convinced them that they won't need records going back any further than 10 years, but like you said it's all relative to the type of data you're housing.
On a side note, data storage is extremely cheap right now, and if it isn't affecting the performance of your application, I would just leave it where it is.
[...] but certainly all data has some sort of life span?
Not any kind of life span we can talk about meaningfully. A lot of data is useless as soon as it's created or recorded. Such data could be discarded immediately with no effect. On the other hand, some data has enough value that it will outlive the current system that hosts it. If Amazon were to completely replace their current infrastructure, the customer histories they have stored would still be immensely valuable.
As you said, it's relative. Each type of data has its own life span that has no relation to another type of data's life span. There's no meaningful "average life span of data".
I have the potential of housing a lot of dead data. (data that is never accessed again).
But they will: when they run those reports, they are accessing that data.
Until then you'll need to keep the data in some form. Move it to another table or use a switch like you mentioned.
uh...at the risk of oversimplifying...it sounds like using DateDeleted instead of a bit would solve your how-long-to-keep issue.
