Database schema design for stock market financial data

I'm trying to figure out the optimal structure for storing financial data with daily inserts.
There are three use cases for querying the data:
1. Querying specific symbols for their current data
2. Filtering symbols by current values (e.g. where price < 10 and dividend.amountPaid > 3)
3. Charting historical values per symbol (e.g. querying all dividend.yield values between 2010 and 2020)
I am considering MongoDB, but I don't know which structure would be optimal. Embedding all the data per symbol for a duration of 10 years is too much, so I was thinking of embedding the current data per symbol and creating references to historical documents.
How should I store this data? Or is MongoDB not a good solution?
Here's a small example of the data for one symbol.
{
    "symbol": "AAPL",
    "info": {
        "company_name": "Apple Inc.",
        "description": "some long text",
        "website": "http://apple.com",
        "logo_url": "http://apple.com"
    },
    "quotes": {
        "open": 111,
        "close": 321,
        "high": 111,
        "low": 100
    },
    "dividends": {
        "amountPaid": 0.5,
        "exDate": "2020-01-01",
        "yieldOnCost": 10,
        "growth": { "value": 111, "pct_chg": 10 }, /* some fields could hold more attributes than just k/v */
        "yield": 123
    },
    "fundamentals": {
        "num_employees": 123213213,
        "shares": 123123123123,
        ....
    }
}

What approach would you take for storing this data?
Based on the info you posted (the sample data and the use cases), I think storing the historical data as a separate collection sounds fine.
Among the most important factors that affect the database design (or data model) are the amount of data and the kind of queries - the most important queries you plan to perform on the data. Assuming the JSON data you posted (for a stock symbol) can be used to perform the first two queries, you can start with the idea of storing the historical data as a separate collection. The historical-data document for a symbol can cover a year or a range of years - that depends upon the queries, the data size, and the type of information.
MongoDB's document-based model allows a flexible schema, which can be useful for implementing future changes and requirements easily. Note that a MongoDB document can store up to 16 MB of data at most.
For reference, see MongoDB Data Model Design.
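A minimal sketch of that two-collection layout (the collection names symbols and symbol_history, and the per-day bucketing, are assumptions to illustrate the idea, not prescriptions):

// Current snapshot: one document per symbol, serving use cases 1 and 2
db.symbols.insertOne({
    symbol: "AAPL",
    info: { company_name: "Apple Inc." },
    quotes: { open: 111, close: 321, high: 111, low: 100 },
    dividends: { amountPaid: 0.5, yield: 123 }
})

// History: one small document per symbol per day, serving use case 3
db.symbol_history.insertOne({
    symbol: "AAPL",
    date: ISODate("2020-01-01"),
    quotes: { open: 111, close: 321, high: 111, low: 100 },
    dividends: { yield: 123 }
})

// A compound index turns the charting query into a single range scan
db.symbol_history.createIndex({ symbol: 1, date: 1 })
db.symbol_history.find(
    { symbol: "AAPL", date: { $gte: ISODate("2010-01-01"), $lt: ISODate("2021-01-01") } },
    { date: 1, "dividends.yield": 1 }
)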

Stock market data by itself is huge. Keep it all in one place per company, otherwise you'll have a mess sooner or later.
In your example above: logos are '.png' files etc., not .html.
Your "quotes" section will get way too big ... keep the quotes at the top level - that's the nice thing with Mongo. Each quotes document should have a date ;) associated with it ... and use a date format Mongo supports natively, not a string.

Related

Performance issue when querying time-based objects

I'm currently working on a MongoDB collection containing documents that look like the following:
{ startTime: Date, endTime: Date, source: String, metaData: {} }
My use case is to retrieve all documents included within a queried time frame, so my query looks like this:
db.myCollection.find(
    {
        $and: [
            { "source": aSource },
            { "startTime": { $lte: timeFrame.end } },
            { "endTime": { $gte: timeFrame.start } }
        ]
    }
).sort({ "startTime": 1 })
With an index defined as follows:
db.myCollection.createIndex( { "source" : 1, "startTime": 1, "endTime": 1 } );
The problem is that queries are very slow (multiple hundreds of ms on a local database) as soon as the number of documents per source increases.
Using Mongo's explain shows me that I'm using this index efficiently (only matching documents are fetched; otherwise only index access is made), so the slowness seems to come from the index scan itself, as this query needs to go over a large portion of the index.
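For reference, here is how I inspect the plan (a sketch, with placeholder values standing in for my real source and timeFrame):

db.myCollection.find({
    source: "someSource",                        // placeholder
    startTime: { $lte: ISODate("2021-06-30") },  // timeFrame.end
    endTime: { $gte: ISODate("2021-06-01") }     // timeFrame.start
}).sort({ startTime: 1 }).explain("executionStats")
// A "totalKeysExamined" much larger than "nReturned" shows the broad index scan.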
In addition to that, such an index gets huge pretty quickly and therefore seems inefficient.
Is there anything I'm missing that could make those queries faster, or is retrieving all the documents belonging to a given source the best way to go? I see that Mongo now provides some time-series features - could those help with my problem?

Get Display Name for License SKU in Microsoft Graph

I am trying to use Microsoft Graph to capture the products which we have licenses for.
While I can get the skuPartNumber, that name is not exactly display-friendly.
I have come across DisplayName as a datapoint in almost all the API calls that return an object with an id.
I was wondering if there is a DisplayName for the SKUs, and where I could go to get it via the Graph.
For reference, the call I made was to the https://graph.microsoft.com/v1.0/subscribedSkus endpoint, following the doc https://learn.microsoft.com/en-us/graph/api/subscribedsku-list?view=graph-rest-1.0
The following is what's returned (after filtering out things I don't need). As mentioned before, while I have a unique identifier which I can use via the skuPartNumber, it is not exactly PRESENTABLE.
You might notice that for some of the SKUs it is difficult to figure out what they refer to, based on the names in the image of the Licenses page posted after the output.
[
    {
        "capabilityStatus": "Enabled",
        "consumedUnits": 0,
        "id": "aca06701-ea7e-42b5-81e7-6ecaee2811ad_2b9c8e7c-319c-43a2-a2a0-48c5c6161de7",
        "skuId": "2b9c8e7c-319c-43a2-a2a0-48c5c6161de7",
        "skuPartNumber": "AAD_BASIC"
    },
    {
        "capabilityStatus": "Enabled",
        "consumedUnits": 0,
        "id": "aca06701-ea7e-42b5-81e7-6ecaee2811ad_df845ce7-05f9-4894-b5f2-11bbfbcfd2b6",
        "skuId": "df845ce7-05f9-4894-b5f2-11bbfbcfd2b6",
        "skuPartNumber": "ADALLOM_STANDALONE"
    },
    {
        "capabilityStatus": "Enabled",
        "consumedUnits": 96,
        "id": "aca06701-ea7e-42b5-81e7-6ecaee2811ad_0c266dff-15dd-4b49-8397-2bb16070ed52",
        "skuId": "0c266dff-15dd-4b49-8397-2bb16070ed52",
        "skuPartNumber": "MCOMEETADV"
    }
]
Edit:
I am aware that I can get "friendly names" of SKUs at the following link:
https://learn.microsoft.com/en-us/azure/active-directory/users-groups-roles/licensing-service-plan-reference
The problem is that it contains ONLY the 70-ish most COMMON SKUs (as of the last financial quarter), NOT ALL of them.
My organization alone has 5 SKUs not present on that page, and some of the clients for whom we are an MSP also have a few. In that context, the link really does not solve the problem, since it is neither reliable nor updated fast enough for new SKUs.
You can see a match list in Product names and service plan identifiers for licensing.
Please note that:
the table lists the most commonly used Microsoft online service products and provides their various ID values. These tables are for reference purposes and are accurate only as of the date when this article was last updated. Microsoft does not plan to update them periodically for newly added services.
Here is an extra list which may be helpful.
There is now a CSV download available of the data on the "Product names and service plan identifiers for licensing" page.
For example, the current CSV (as of the time of posting this answer) is located at https://download.microsoft.com/download/e/3/e/e3e9faf2-f28b-490a-9ada-c6089a1fc5b0/Product%20names%20and%20service%20plan%20identifiers%20for%20licensing%20v9_22_2021.csv. This can be downloaded, cached, and parsed in your application to resolve the product display name.
This is just a CSV version of the same table that is displayed on the webpage, which is not comprehensive, but it should have many of the products listed. If you find one that is missing, you can use the "Submit feedback for this page" button at the bottom of the page to create a GitHub issue. The documentation team usually responds in a few weeks.
Microsoft may provide an API for this data in the future, but it's only in their backlog. (source)
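If it helps, here is a rough sketch of resolving display names from that CSV (Node 18+ with built-in fetch; the column names 'GUID' and 'Product_Display_Name' are assumptions about the file's header row, and the naive comma split will break on quoted values containing commas, so use a real CSV parser in practice):

const csvUrl = "https://download.microsoft.com/download/e/3/e/e3e9faf2-f28b-490a-9ada-c6089a1fc5b0/Product%20names%20and%20service%20plan%20identifiers%20for%20licensing%20v9_22_2021.csv";

async function loadSkuNames() {
    const text = await (await fetch(csvUrl)).text();
    const [header, ...rows] = text.trim().split(/\r?\n/);
    const cols = header.split(",");
    const guidIdx = cols.indexOf("GUID");                  // assumed column name
    const nameIdx = cols.indexOf("Product_Display_Name");  // assumed column name
    const names = new Map();
    for (const row of rows) {
        const cells = row.split(",");  // naive; quoted fields need a CSV parser
        if (cells[guidIdx]) names.set(cells[guidIdx], cells[nameIdx]);
    }
    return names;
}

// Usage: map a skuId from /subscribedSkus to a display name
loadSkuNames().then(names => console.log(names.get("0c266dff-15dd-4b49-8397-2bb16070ed52")));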

How to print the count of array elements along with another variable in MongoDB

I have a data collection which contains records in the following format.
{
    "_id": 22,
    "title": "Hibernate in Action",
    "isbn": "193239415X",
    "pageCount": 400,
    "publishedDate": ISODate("2004-08-01T07:00:00Z"),
    "thumbnailUrl": "https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.book-thumb-images/bauer.jpg",
    "shortDescription": "\"2005 Best Java Book!\" -- Java Developer's Journal",
    "longDescription": "Hibernate practically exploded on the Java scene. Why is this open-source tool so popular Because it automates a tedious task: persisting your Java objects to a relational database. The inevitable mismatch between your object-oriented code and the relational database requires you to write code that maps one to the other. This code is often complex, tedious and costly to develop. Hibernate does the mapping for you. Not only that, Hibernate makes it easy. Positioned as a layer between your application and your database, Hibernate takes care of loading and saving of objects. Hibernate applications are cheaper, more portable, and more resilient to change. And they perform better than anything you are likely to develop yourself. Hibernate in Action carefully explains the concepts you need, then gets you going. It builds on a single example to show you how to use Hibernate in practice, how to deal with concurrency and transactions, how to efficiently retrieve objects and use caching. The authors created Hibernate and they field questions from the Hibernate community every day - they know how to make Hibernate sing. Knowledge and insight seep out of every pore of this book.",
    "status": "PUBLISH",
    "authors": ["Christian Bauer", "Gavin King"],
    "categories": ["Java"]
}
I want to print the title and the authors count where the number of authors is greater than 4.
I used the following command to extract the records which have more than 4 authors:
db.books.find({authors: {$exists: true}, $where: 'this.authors.length > 4'}, {_id: 0, title: 1});
But I was unable to print the number of authors along with the title. I tried the following command too, but it gave only the title list:
db.books.find({authors: {$exists: true}, $where: 'this.authors.length > 4'}, {_id: 0, title: 1, 'this.authors.length': 1});
Could you please help me print the number of authors along with the title?
You can use the aggregation framework's $project stage with $size to reshape your data, and then $match to apply the filtering condition:
db.collection.aggregate([
    {
        $project: {
            title: 1,
            authorsCount: { $size: "$authors" }
        }
    },
    {
        $match: {
            authorsCount: { $gt: 4 }
        }
    }
])
Mongo Playground
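If you would rather stay with find(), the same filter can be expressed with $expr instead of $where (a sketch; $ifNull guards against documents without an authors array, and the aggregation expression in the projection requires MongoDB 4.4+):

db.books.find(
    { $expr: { $gt: [ { $size: { $ifNull: ["$authors", []] } }, 4 ] } },
    { _id: 0, title: 1, authorsCount: { $size: "$authors" } }
)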

Storing hierarchical data in Google Datastore

I have data in the following structure:
{
    "id": "1",
    "title": "A Title",
    "description": "A description.",
    "listOfStrings": ["one", "two", "three"],
    "keyA": {
        "keyB": " ",
        "keyC": " ",
        "keyD": {
            "keyE": " ",
            "KeyF": " "
        }
    }
}
I want to put/get this in Google Datastore. What is the best way to store it?
Should I be using entity groups?
Requirements:
I do not need transactions.
Reading the whole structure must be performant.
I must be able to query based on keyE's content.
This link (Storing hierarchical data in Google App Engine Datastore?) mentions using entity groups for hierarchical data, but this one (How to get all the descendants of a given instance of a model in google app engine?) mentions they should only be used for transactions, which I do not need.
I'm using this library (which I do not think supports ReferenceProperty?):
http://github.com/GoogleCloudPlatform/gcloud-ruby.git
If you want to be able to query by the hierarchy (keyA->keyB->keyC), use ancestors, or just a flattened key, to avoid entity group limits. If you want to be able to query for an entity which contains a provided key, make a computed property where you store a flat list of the keys contained inside, and store the original hierarchy in a JsonProperty, for example.
In my experience, the more entities you request the slower (less performant) it becomes. Making lots of entities with small bits of data is not efficient. Instead, store the whole hierarchy as a JsonProperty and use code to walk through it.
If keyE is just a string, then add a property keyE = StringProperty(indexed=True) so you can query against it.
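As an illustration of that layout (a sketch using the Node.js client @google-cloud/datastore purely for brevity; the gcloud-ruby library from the question has analogous save/query calls, and all names here are made up):

const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();

async function saveAndQuery() {
    // One entity: the hierarchy lives in an opaque JSON blob that code walks
    // through, while keyE gets a flat, indexed copy so it stays queryable.
    await datastore.save({
        key: datastore.key(['Item', '1']),
        data: {
            title: 'A Title',
            keyE: 'some value',
            payload: JSON.stringify({ keyA: { keyB: ' ', keyD: { keyE: 'some value' } } })
        },
        excludeFromIndexes: ['payload']  // the blob itself never needs an index
    });

    // Query by the flattened keyE property
    const [items] = await datastore.runQuery(
        datastore.createQuery('Item').filter('keyE', '=', 'some value')
    );
    return items;
}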

Best databasing practice for tracing non-homogenous events

Suppose I have the following event data scheme:
event_record_unique_id: long
event_timestamp: long
session_id: long
event_id: int
event_data: data # concrete type depends on event_id
... so the contents of the data may depend on, let's say, 500 event_ids, leading to 200 different concrete data types for "data". For example:
{
    event_record_unique_id: 17126721,
    event_timestamp: 1234,
    session_id: 3452,
    event_id: 50,
    event_data: {
        user_id: 123,
        page_id: 789
    }
}
{
    event_record_unique_id: 17126723,
    event_timestamp: 1234,
    session_id: 3454,
    event_id: 51,
    event_data: {
        user_id: 124,
        button_id: 789
    }
}
{
    event_timestamp: 1234,
    session_id: 3454,
    event_id: 51,
    event_data: {
        crash_report: "text",
        device_id: "12312"
    }
}
Also:
many of the event_data attributes appear in many of the concrete event_data objects
I need to perform indexed searches on some of the event_data attributes (e.g. find me all the records where user_id=X)
there's a continuing need to keep adding event types and new attributes
the above data structure can always be trivially flattened, so that a single record can be represented equivalently as a row with N columns (attribute name/type collisions are solved by renaming attributes).
The naive RDBMS approach would involve making ~500 tables (one per concrete type of "data"). I've discounted this approach (= excessive waste of human effort in modelling). Plus, I cannot easily search all records by user_id (since user_id appears in very many tables).
Flattening the structure in an RDBMS is also quite costly (N-8 of the elements are NULL and carry no information).
MongoDB-type document database solutions appear to be a good fit; however, the space cost seems quite high if attribute names are held with each record, which is not much better than an RDBMS. However, this does allow me to index by fields in the data object.
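For example, in MongoDB a single wildcard index (available since 4.2) would cover arbitrary event_data attributes without declaring each one up front (collection and field names are illustrative):

db.events.createIndex({ "event_data.$**": 1 })

// ...which supports indexed lookups on any nested attribute:
db.events.find({ "event_data.user_id": 123 })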
For me, an ideal data representation would be a table that is optimized to allow rows with many NULL elements (e.g. by keeping an active-column bitmask per row), or a document DB in which a document collection maintains a library of document schemas used to compact the data (with each document holding a reference to its schema).
What kind of database would people recommend for the above example case?
MS SQL Server 2008 and up has Sparse Columns. Up to 30,000 can be added to a table, and they can be indexed (filtered indexes are recommended). Or so says BOL; I have not used them myself. This would result in a single very large table that might support what you need.
With that said, I don't know that it would be particularly efficient. Some math:
Assume 10 rows a second
that becomes 10*60*60*24 = 864,000 rows a day
or 315,360,000 rows a year
with a very rough over-estimate of 50 bytes a row
that is about 14 GB a year
for how many years do you have to keep the data?
and double that if it's more like 20 rows per second
So storage doesn't seem too far out of line... but I don't know; you'll want to work up some serious size-projection factors. And that's just storage - what do you want or need to do with the data? Is retrieval time for specified rows important? What about analysis and data mining? I'm a SQL guy through and through, and I think it could be done, but this is pretty much the kind of problem that Hadoop and NoSQL solutions were devised for, and it could well be worth your time to thoroughly investigate those options.
