How to store mathematical expressions/explanations in a database

I have been given the task of developing a website for maths students, with questions and their explanations. The site will have around 20,000 questions, and I need an effective way (easy storage, fast querying and fast rendering) to store those questions in the database.
Sample Question
In the first 10 overs of a cricket game, the run rate was only 3.2. What should be the run rate in the remaining 40 overs to reach the target of 282 runs?
Required run rate = (282 - 3.2 × 10) / 40 = 250 / 40 = 6.25
The question itself is a simple string and can easily be stored, but the real problem is how to store the expressions with brackets and division in the database.

You could store the expressions in LaTeX in the database.
Edit:
You can use libraries like http://www.mathjax.org/ for client-side rendering of the equations.

You have several options to store a string representation of mathematical expressions: MathML, LaTeX or ASCIIMathML.
For displaying it in a web browser I recommend MathJax.
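A minimal sketch of that approach (the schema, the example LaTeX string and the MathJax 3 CDN URL are assumptions, not something prescribed by the answers): store the question as plain text, store the explanation as a LaTeX string, and let MathJax typeset it in the browser.
import sqlite3
conn = sqlite3.connect("questions.db")
conn.execute("CREATE TABLE IF NOT EXISTS questions (id INTEGER PRIMARY KEY, body TEXT, explanation_tex TEXT)")
conn.execute(
    "INSERT INTO questions (body, explanation_tex) VALUES (?, ?)",
    ("In the first 10 overs ... target of 282 runs?",
     r"\text{Required run rate} = \frac{282 - (3.2 \times 10)}{40} = \frac{250}{40} = 6.25"),
)
conn.commit()
body, tex = conn.execute("SELECT body, explanation_tex FROM questions").fetchone()
# Emit the LaTeX inside \( ... \) delimiters and include the MathJax script so the
# browser typesets it client-side; no server-side rendering is needed.
html = f"""<p>{body}</p>
<p>\\({tex}\\)</p>
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js"></script>"""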

Related

Merging different granularity time series in influxdb

I want to store trades as well as best ask/bid data, where the latter updates much more rapidly than the former, in InfluxDB.
I want to, if possible, use a schema that allows me to query: "for each trade on market X, find the best ask/bid on market Y whose timestamp is <= the timestamp of the trade".
(I'll use any version of Influx.)
For example, trades might look like this:
Time       Price  Volume  Direction  Market
00:01.000  100    5       1          foo-bar
00:03.000  99     50      0          bar-baz
00:03.050  99     25      0          foo-bar
00:04.000  101    15      1          bar-baz
And tick data might look more like this:
Time       Ask  Bid  Market
00:00.763  100  99   bar-baz
00:01.010  101  99   foo-bar
00:01.012  101  98   bar-baz
00:01.012  101  99   foo-bar
00:01.238  100  99   bar-baz
...
00:03.021  101  98   bar-baz
I would want to be able to somehow join each trade for some market, e.g. foo-bar, with only the most recent ask/bid data point on some other market, e.g. bar-baz, and get a result like:
Time       Trade Price  Ask  Bid
00:01.000  100          100  99
00:03.050  99           101  98
Such that I could compute the difference between the trade price on market foo-bar and the most recently quoted ask or bid on market bar-baz.
Right now, I store trades in one time series and ask/bid data points in another and merge them on the client side, with logic along the lines of:
def merge(trades, quotes, data_points):
    # quotes are assumed sorted by timestamp; join() pairs a trade with a quote
    next_trade, more_trades = trades[0], trades[1:]
    # advance until quotes[0] is the most recent quote at or before the trade
    while len(quotes) > 1 and quotes[1].timestamp <= next_trade.timestamp:
        quotes = quotes[1:]
    data_point = join(next_trade, quotes[0])
    if more_trades:
        return merge(more_trades, quotes, data_points + [data_point])
    return data_points + [data_point]
The problem is that the client has to discard tons of ask/bid data points because they update so frequently, and only the most recent update before the trade is relevant.
There are tens of markets whose most recent ask/bid I might want to compare a trade with, otherwise I'd simply store the most recent ask/bid in the same series as the trades.
Is it possible to do what I want to do with Influx, or with another time series database? An alternative solution that produces lower quality results is to group the ask/bid data by some time interval, say 250ms, and take the last from each interval, to at least impose an upper bound on the amount of quotes the client has to drop before finding the one that's closest to the next trade.
NB. Just a clarification on InfluxDB terminology: you're probably storing trade and tick data in different measurements (analogous to tables). A series is a subdivision within a measurement based on tag values, e.g.
Time       Ask  Bid  Market
00:00.763  100  99   bar-baz
is one series, and
Time       Ask  Bid  Market
00:01.010  101  99   foo-bar
is another series (assuming you are storing the market name/id as a tag and not a field).
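For illustration only (the measurement name tick_data is taken from the query below, but writing Market as a tag is an assumption), two such points written in line protocol would look like:
tick_data,Market=bar-baz ask=100,bid=99 1546300800763000000
tick_data,Market=foo-bar ask=101,bid=99 1546300801010000000
Here Market is a tag (indexed, and part of the series key), while ask and bid are fields; the trailing numbers are example nanosecond timestamps.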
Answer
InfluxQL https://docs.influxdata.com/influxdb/v1.7/query_language/spec/ - I can't think of a way to achieve what you need with InfluxQL (Influx Query Language) as it does not support joins.
Perhaps what you could do on the client side, instead of requesting all tick data for a period and discarding most of it, is make a request per trade and market to get exactly the ask/bid data point that you need (the most recent with respect to the trade). Something like:
def merge(trades, market):
    points = []
    for next_trade in trades:
        # the most recent ask/bid at or before the trade, limited to the last minute
        quote = db.query("select last(ask), last(bid) from tick_data where time <= next_trade.timestamp and Market = market and time > next_trade.timestamp - 1m")
        # or, to get one result per market with a single query:
        # quote_per_market = db.query("select last(ask), last(bid) from tick_data where time <= next_trade.timestamp group by Market")
        points.append(join(next_trade, quote))
    return points
Of course you'd have the overhead of querying the database more frequently, but depending on the number of trades and your resource constraints it may be more efficient. NB. A potential pitfall here is that the ask and bid retrieved this way are not stored as a pair but independently, so although they come back as a pair they could have different timestamps. If for some reason you only have an ask or only a bid at some timestamp, you might run into this problem; however, as long as you write them in pairs and have no missing data it should be fine.
Flux https://www.influxdata.com/products/flux/ - Flux is a more sophisticated query language, part of InfluxDB 1.7 and 2, that allows you to do joins and operations across different measurements. I can't give you any examples yet, but it's worth having a look at.
Other (relational) time series DBs that you could have a look at, and that would also allow you to do joins, are CrateDB https://crate.io/ or Postgres + TimescaleDB https://www.timescale.com/products
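For example, in Postgres + TimescaleDB the "most recent quote at or before each trade" lookup can be expressed with a LATERAL subquery. A minimal sketch (table and column names are assumptions based on the example data, not an existing schema):
import psycopg2
conn = psycopg2.connect("dbname=markets")
cur = conn.cursor()
cur.execute("""
    SELECT t.time, t.price, q.ask, q.bid
    FROM trades t
    LEFT JOIN LATERAL (
        SELECT ask, bid
        FROM quotes
        WHERE quotes.market = 'bar-baz' AND quotes.time <= t.time
        ORDER BY quotes.time DESC
        LIMIT 1
    ) q ON true
    WHERE t.market = 'foo-bar'
    ORDER BY t.time
""")
rows = cur.fetchall()  # one (time, trade price, ask, bid) row per foo-bar trade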

MongoDB and Arctic

I intend to analyse multiple data sets on the same time series (daily EOD). I will need computed columns: use columns A and B to create column C (store the net result of the calculation in column C). Is this functionality available using the MongoDB / Arctic database?
I would also intend to search the data, for example: what happens when the advance decline thrust pushes over 70 while the cumulative TICK was below -100,000 in the past 'n' days?
There are two data sets: the cumulative TICK and the advance decline thrust (which uses advancers / decliners data). They would be stored in the database, and then I would want the capability to search for the above condition. Is this achievable with the MongoDB / Arctic database structure?
Just looking for some general information before I move to a DB format. Currently everything I have created is in Excel / VBA, and it has already outgrown that!
Any information greatly appreciated.
Note: I will use the same database for weekly, monthly and yearly bars as well as 1-minute, 3-minute, 5-minute and 60-minute TICK/TIME based bars - not feeding live, but updated EOD.
Yes, this can be done with Arctic. Arctic can store pandas DataFrames, and an operation like the one you mention is trivial in pandas. Arctic is just a store, so you'd read the data out of Arctic (data is stored in symbols in Arctic), perform your transform, and then write the data back. Any of the storage engines (VersionStore, TickStore, or ChunkStore) should work for this.
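A minimal sketch of that read-transform-write round trip with Arctic and pandas (the library name, symbol and column names are made up for illustration, and the data is assumed to have been stored already):
from arctic import Arctic
store = Arctic("localhost")              # Arctic sits on top of a local MongoDB
store.initialize_library("eod")          # VersionStore by default
lib = store["eod"]
df = lib.read("breadth_data").data       # read the stored DataFrame
df["C"] = df["A"] + df["B"]              # computed column: C = A + B
lib.write("breadth_data", df)            # write the result back as a new version
# Searching for a condition across the two data sets is then plain pandas, e.g.
# "thrust above 70 while cumulative TICK dipped below -100,000 in the past 5 days":
hits = df[(df["adv_decl_thrust"] > 70) & (df["cum_tick"].rolling(5).min() < -100000)]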

How to order the ngrams in Google's database (or the one hosted on AWS) by frequency

I'm looking for a way to order Google Book's Ngram's by frequency.
The original dataset is here: http://books.google.com/ngrams/datasets. Inside each file the ngrams are sorted alphabetically and then chronologically.
My computer is not powerful enough to handle 2.2 TB worth of data, so I think the only way to sort this would be "in the cloud".
The AWS-hosted version is here: http://aws.amazon.com/datasets/8172056142375670.
Is there a financially efficient way to find the 10,000 most frequent 1grams, 2grams, 3grams, 4grams, and 5grams?
To throw a wrench in it, the datasets contain data for multiple years:
As an example, here are the 30,000,000th and 30,000,001st lines from file 0
of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):
circumvallate 1978 313 215 85
circumvallate 1979 183 147 77
The first line tells us that in 1978, the word "circumvallate" (which means
"surround with a rampart or other fortification", in case you were wondering)
occurred 313 times overall, on 215 distinct pages and in 85 distinct books
from our sample.
Ideally, the frequency lists would only contain data from 1980 to the present (counts summed over those years).
Any help would be appreciated!
Cheers,
I would recommend using Pig!
Pig makes things like this very easy and straightforward. Here's a sample Pig script that does pretty much what you need:
raw = LOAD '/foo/input' USING PigStorage('\t') AS (ngram:chararray, year:int, count:int, pages:int, books:int);
filtered = FILTER raw BY year >= 1980;
grouped = GROUP filtered BY ngram;
counts = FOREACH grouped GENERATE group AS ngram, SUM(filtered.count) AS count;
sorted = ORDER counts BY count DESC;
limited = LIMIT sorted 10000;
STORE limited INTO '/foo/output' USING PigStorage('\t');
Pig on AWS Elastic MapReduce can even operate directly on S3 data, so you would probably replace /foo/input and /foo/output with S3 buckets too.

Large amount of timecourses in database

I have a rather large amount of data (~400 million data points) organized into a set of ~100,000 timecourses. This data may change every day and, for reasons of revision safety, has to be archived daily.
Obviously that is far too much data to handle efficiently, so I did some analysis on sample data. Approximately 60 to 80% of the courses do not change at all between two days, and for the rest only a very limited number of elements change. All in all I expect far fewer than 10 million data points to change.
The question is, how do I make use of this knowledge? I am aware of concepts like the delta trees used by SVN and similar techniques, but I would prefer it if the database itself were capable of handling such semantic compression. We are using Oracle 11g for storage, and the question is: is there a better way than a homebrew solution?
Clarification
I am talking about timecourses representing hourly energy currents. Such a timecourse might start in the past (say 2005), contains 8,760 elements per year and might end at any time up to 2020 (currently). Each timecourse is identified by a unique string.
The courses themselves are more or less boring:
"Course_XXX: 1.1.2005 0:00 5; 1.1.2005 1:00 5;1.1.2005 2:00 7,5;..."
My task is to make day-to-day changes in these courses visible, and to do so a snapshot has to be taken each day at a given time. My hope is that some lossless semantic compression will spare me from archiving ~20 GB per day.
Basically my source data looks like this:
Key | Value0 | ... | Value23
to archive that data I need to add an additional dimension which directly or indirectly tells me the time at which the data was loaded from the source-system, so my archive-database is
Key | LoadID | Value0 | ... | Value23
Where LoadID is more or less the time the source-DB was accessed.
Now, compression in my scenario is easy. LoadIDs are growing with each run and I can give a range, i.e.
Key | LoadID1 | LoadID2 | Value0 | ... | Value23
Where LoadID1 gives me the ID of the first load in which the 24 values were observed and LoadID2 gives me the ID of the last consecutive load in which the 24 values were observed.
In my scenario, this reduces the amount of data stored in the database to 1/30th.
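A minimal in-memory sketch of that range compression (hypothetical names; in practice this would be a MERGE/UPDATE against the Oracle archive table). Each daily load either extends the open range of an unchanged course or appends a new row:
def archive_snapshot(archive, snapshot, load_id):
    # archive: key -> list of rows [LoadID1, LoadID2, values]; snapshot: key -> 24 hourly values
    for key, values in snapshot.items():
        rows = archive.setdefault(key, [])
        last = rows[-1] if rows else None
        if last is not None and last[1] == load_id - 1 and last[2] == values:
            last[1] = load_id  # unchanged since the previous load: extend the range
        else:
            rows.append([load_id, load_id, list(values)])  # changed or new: start a new range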

What is a viable local database for Windows Phone 7 right now?

I was wondering what a viable database solution for local storage on Windows Phone 7 is right now. Using search I stumbled upon these two threads, but they are a few months old. I was wondering if there have been any new developments in databases for WP7, and I didn't find any reviews of the databases mentioned in the links below.
windows phone 7 database
Local Sql database support for Windows phone 7
My requirements are:
It should be free for commercial use
Saving/updating a record should only save the actual record and not the entire database (unlike WinPhone7 DB)
Able to query a table with ~1000 records quickly using LINQ.
Should also work in simulator
EDIT:
Just tried Sterling using a simple test app: It looks good, but I have 2 issues.
Creating 1000 records takes 30 seconds using db.Save(myPerson). Person is a simple class with 5 properties.
Then I discovered there is a db.SaveAsync<Person>(IList) method. This is fine because it doesn't block the current thread anymore.
BUT my question is: is it safe to call db.Flush() immediately and run a query on the IList that is still being saved (given that it takes up to 30 seconds to save the records in synchronous mode)? Or do I have to wait until the BackgroundWorker has finished saving?
Querying these 1000 records with LINQ and a where clause takes up to 14 seconds the first time, while they are loaded into memory.
Is there a way to speed this up?
Here are some benchmark results: (Unit tests was executed on a HTC Trophy)
-----------------------------
purging: 7,59 sec
creating 1000 records: 0,006 sec
saving 1000 records: 32,374 sec
flushing 1000 records: 0,07 sec
-----------------------------
//async
creating 1000 records: 0,04 sec
saving 1000 records: 0,004 sec
flushing 1000 records: 0 sec
-----------------------------
//get all keys
persons list count = 1000 (0,007)
-----------------------------
//get all persons with a where clause
persons list with query count = 26 (14,241)
-----------------------------
//update 1 property of 1 record + save
persons list with query count = 26 (0,003s)
db saved (0,072s)
You might want to take a look at Sterling - it should address most of your concerns and is very flexible.
http://sterling.codeplex.com/
(Full disclosure: my project)
Try Siaqodb. It is a commercial project and, unlike Sterling, it does not serialize objects and keep everything in memory for querying. Siaqodb can be queried via a LINQ provider that can efficiently pull only field values from the database without creating any objects in memory, or load/construct only the objects that were requested.
Perst is free for non-commercial use.
You might also want to try Ninja Database Pro. It looks like it has more features than Sterling.
http://www.kellermansoftware.com/p-43-ninja-database-pro.aspx
