I'm trying to wrap my head around DynamoDB's scans and queries, and how I should structure my tables.
Let's say I have buckets and marbles, and each bucket can contain many marbles. In a traditional relational database, I might set that up like this:
Buckets
id   name
---------------
B1   Blue Bucket
B2   Red Bucket

Marbles
id   name          bucketId   lots more fields...
------------------------------------------------
M1   Deep Swirls   B1
M2   Fire Red      B1
M3   Obsidian      B2
As I understand it, if I structured my data this way in DynamoDB, it could be costly in RCUs because I'd have to do scans. If I wanted to get all the marbles in bucket B1, I'd have to scan the Marbles table with a filter on bucketId = B1, which reads the full list of marbles and only then discards the ones that don't match, so I pay RCUs for every item read (if I understand the inner workings of DynamoDB correctly).
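For example, I assume that scan would look something like this with boto3 (table and attribute names are just from my example above; pagination omitted):

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
marbles = dynamodb.Table("Marbles")  # hypothetical table name

# A Scan reads every item in the table and applies the filter afterwards,
# so RCUs are consumed for all items read, not just the matching ones.
response = marbles.scan(FilterExpression=Attr("bucketId").eq("B1"))
items = response["Items"]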
This doesn't sound very performant or cost-effective. How should I structure this data?
IMPORTANT NOTE: Marbles should be able to exist on their own, i.e. part of no bucket. (bucketId = null)
You will want two tables to track this: bucket for the buckets and marble for the marbles. A bucket will contain a list of marbles with some basic information (name, color, etc.) that you would use for displaying a quick list of the collection; make sure to include the id of each marble. Then, on the actual marble representation, put all of the information for the marble, plus a bucket object that links back to its assigned bucket. It would look something like this:
Marble
{
    "id": 1,
    "name": "Deep Swirls",
    "color": "Red",
    "complexProp": {
        ...
    },
    "bucket": {
        "name": "Blue Bucket",
        "id": 1
    }
}
Bucket
{
    "id": 1,
    "name": "Blue Bucket",
    "marbles": [
        {
            "id": 1,
            "name": "Deep Swirls",
            "color": "Red"
        },
        {
            "id": 2,
            "name": "Fire Red",
            "color": "Red"
        }
    ]
}
The downside to this approach is that, whenever data that lives in both places changes, you will need to update the marble in two places (but if a marble changed color, that would be rather impressive). You will also need to write to both tables when you move a marble to a different bucket. You can omit the bucket property on the marble representation if you don't care about quickly discovering which bucket a given marble is in.
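A rough sketch of what those reads and the two-place update could look like with boto3 (table names bucket and marble as above; the key schema and everything else is an assumption for illustration):

import boto3

dynamodb = boto3.resource("dynamodb")
buckets = dynamodb.Table("bucket")
marbles = dynamodb.Table("marble")

# Quick list for display: one read returns the bucket with its embedded marble summaries.
blue_bucket = buckets.get_item(Key={"id": 1})["Item"]
for summary in blue_bucket["marbles"]:
    print(summary["id"], summary["name"])

# Full detail for a single marble: one read on the marble table.
marble = marbles.get_item(Key={"id": 1})["Item"]

# Renaming a marble means writing to both tables (the duplicated data mentioned above).
marble["name"] = "Deeper Swirls"
marbles.put_item(Item=marble)
blue_bucket["marbles"] = [
    {**m, "name": "Deeper Swirls"} if m["id"] == 1 else m
    for m in blue_bucket["marbles"]
]
buckets.put_item(Item=blue_bucket)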
Related
I'm pretty new to Azure Data Factory (ADF) and have stumbled into something I would have solved with a couple of lines of code.
Background
Main flow:
1. Lookup Activity fetching an array of IDs to process
2. ForEach Activity looping over the input array and using a Copy Activity to pull data from a REST API and store it into a database
Step #1 results in an array containing IDs:
{
    "count": 10000,
    "value": [
        {
            "id": "799128160"
        },
        {
            "id": "817379102"
        },
        {
            "id": "859061172"
        },
        ... many more ...
    ]
}
Step #2: When the lookup returns a lot of IDs, the individual REST calls take a lot of time. The REST API supports batching IDs using a comma-separated input.
The question
How can I convert the input array into a new array with comma-separated fields? This would reduce the number of Activities and the time to run.
Expecting something like this:
{
    "count": 1000,
    "value": [
        {
            "ids": "799128160,817379102,859061172,...."
        },
        {
            "ids": "n,n,n,n,n,n,n,n,n,n,n,n,...."
        },
        ... many more ...
    ]
}
EDIT 1 - 19th Dec 22
Using an Until Activity and keeping track of positions, I managed to do this in plain ADF. It would have been nice if this could have been done with some simple array manipulation in a code snippet, something like the sketch below.
The ideal response might be that we have to do the manipulation with a Data Flow.
My sample input:
First, I took a Data Flow and added a Surrogate Key transformation after the source - say the new key field is 'SrcKey'.
(Screenshot: data preview after adding the Surrogate Key)
Add an Aggregate transformation where you group by mod(SrcKey/3). This groups rows with the same remainder into the same bucket.
Add a column in the same Aggregate that collects the ids into a comma-separated string with the expression trim(toString(collect(id)),'[]').
(Screenshot: data preview of the Aggregate)
Store the output in a single file in blob storage.
(Screenshot: the resulting output)
In my application, I chose the Document Model, but I still have some questions.
Here is my example document:
{
    "catalogs": {
        "cat-id1": {
            "name": "catalog-1",
            "createdAt": 123,
            "products": {
                "pro-id1": {
                    "name": "product-1",
                    "createdAt": 321,
                    "ingredients": {}
                },
                "pro-id2": {
                    "name": "product-2",
                    "createdAt": 654,
                    "ingredients": {}
                }
            }
        },
        "cat-id2": {
            "name": "catalog-2",
            "createdAt": 456,
            "products": {
                "pro-id3": {
                    "name": "product-3",
                    "createdAt": 322,
                    "ingredients": {}
                },
                "pro-id4": {
                    "name": "product-4",
                    "createdAt": 655,
                    "ingredients": {}
                }
            }
        }
    }
}
But the ingredients in a product refer to another document:
{
    "ingredients": {
        "ing-id1": {},
        "ing-id2": {}
    }
}
The Document Model has several benefits:
Easy to evolve the schema in application code, e.g. if (!user.first_name) user.first_name = user.name.split(' ')[0]
No need to join; you can easily fetch all the data at once.
I also know that:
On updates to a document, the entire document usually needs to be rewritten.
For these reasons, it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document.
Main idea is: Which data model leads to simpler application code?
My questions are:
What document size should I aim for?
My application already has a relational DB; should I combine the Document Model with the relational DB to reduce complexity?
Since you already have a relational database in use, I don't see a real benefit to using a document-based DB as well.
Your database schema seems simple enough to use a relational DB. If, on the other hand, catalog entries were very different from each other, you might consider a document-based model, but that does not seem to be the case.
Therefore, my advice is to stick with a relational model.
I would design the model like this:
A table for each entity (catalog, product, ingredient), where each entry has a unique id
A relation table for each n:m relationship (catalogProduct, productIngredient) that contains only the ids of the entities in the relationship
An example:
The ingredients ing1, ing2 and ing3 are stored in the table ingredient.
The products prod1 and prod2 are stored in product.
ing1 and ing2 are needed for prod1
ing2 and ing3 for prod2
In productIngredient, each entry stores the ID of an ingredient and the ID of the product it is used in:
prod1 : ing1
prod1 : ing2
prod2 : ing2
prod2 : ing3
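A minimal sketch of that schema, created here with Python's built-in sqlite3 purely for illustration (the column names are assumptions; the same DDL idea applies to any relational DB):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- one table per entity
    CREATE TABLE catalog    (id INTEGER PRIMARY KEY, name TEXT, createdAt INTEGER);
    CREATE TABLE product    (id INTEGER PRIMARY KEY, name TEXT, createdAt INTEGER);
    CREATE TABLE ingredient (id INTEGER PRIMARY KEY, name TEXT);

    -- one relation table per n:m relationship, holding only the ids
    CREATE TABLE catalogProduct    (catalogId INTEGER REFERENCES catalog(id),
                                    productId INTEGER REFERENCES product(id));
    CREATE TABLE productIngredient (productId INTEGER REFERENCES product(id),
                                    ingredientId INTEGER REFERENCES ingredient(id));
""")

# the example above: ing1/ing2 belong to prod1, ing2/ing3 to prod2
conn.executemany("INSERT INTO ingredient (id, name) VALUES (?, ?)",
                 [(1, "ing1"), (2, "ing2"), (3, "ing3")])
conn.executemany("INSERT INTO product (id, name) VALUES (?, ?)",
                 [(1, "prod1"), (2, "prod2")])
conn.executemany("INSERT INTO productIngredient (productId, ingredientId) VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2), (2, 3)])

# all ingredients of prod1
print(conn.execute("""SELECT i.name FROM ingredient i
                      JOIN productIngredient pi ON pi.ingredientId = i.id
                      WHERE pi.productId = 1""").fetchall())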
I have a problem where I need to convert a JSON payload into SQL tables while maintaining the relationships established in the payload. This is so that I can later query the tables and recreate the JSON payload structure.
For example:
{
    "batchId": "batch1",
    "payees": [
        {
            "payeeId": "payee1",
            "payments": [
                {
                    "paymentId": "paymentId1",
                    "amount": 200,
                    "currency": "USD"
                },
                {
                    "paymentId": "paymentId2",
                    "amount": 200,
                    "currency": "YEN"
                },
                {
                    "paymentId": "paymentId3",
                    "amount": 200,
                    "currency": "EURO"
                }
            ]
        }
    ]
}
For the above payload, I have a batch with payments grouped by payees. At its core it all boils down to a batch and its payments. But within that you can have groupings; in the example above, it's grouped by payees.
One thing to note is that the payload may not necessarily always follow the above structure. Instead of grouping by payees, it could be by something else, like currency for example. Or even no grouping at all: just a root-level batch and an array of payments.
I want to know if there are conventions/rules I can follow for representing such data in relational tables. Thanks.
edit:
I am primarily looking to use Postgres and have looked into the jsonb feature that it provides for storing json data. However, I'm still struggling to figure out how/where (in terms of which table) to best store the grouping info.
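To make what I'm after concrete, flattening the example payload into rows per table could look something like this in Python (the row shapes are just one possible layout I'm considering, not necessarily the right one):

# Flatten the example payload into batch / payee / payment rows,
# keeping foreign keys so the nesting can be rebuilt later.
payload = {
    "batchId": "batch1",
    "payees": [
        {"payeeId": "payee1", "payments": [
            {"paymentId": "paymentId1", "amount": 200, "currency": "USD"},
            {"paymentId": "paymentId2", "amount": 200, "currency": "YEN"},
        ]},
    ],
}

batch_rows = [{"batch_id": payload["batchId"]}]
payee_rows, payment_rows = [], []
for payee in payload.get("payees", []):
    payee_rows.append({"payee_id": payee["payeeId"], "batch_id": payload["batchId"]})
    for payment in payee["payments"]:
        payment_rows.append({
            "payment_id": payment["paymentId"],
            "batch_id": payload["batchId"],
            "payee_id": payee["payeeId"],   # the "grouping" survives as a nullable FK
            "amount": payment["amount"],
            "currency": payment["currency"],
        })

print(batch_rows, payee_rows, payment_rows, sep="\n")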
I'm figuring out the optimal structure to store financial data with daily inserts.
There are 3 use cases for querying the data:
Querying specific symbols for current data
Finding symbols by current values (e.g. where price < 10 and dividend.amountPaid > 3)
Charting historical values per symbol (e.g. query all dividend.yield between 2010 and 2020)
I am considering MongoDB, but I don't know which structure would be optimal. Embedding all the data per symbol for a duration of 10 years is too much, so I was thinking of embedding the current data per symbol, and creating references to historical documents.
How should I store this data? Is MongoDB not a good solution?
Here's a small example of the data for one symbol.
{
    "symbol": "AAPL",
    "info": {
        "company_name": "Apple Inc.",
        "description": "some long text",
        "website": "http://apple.com",
        "logo_url": "http://apple.com"
    },
    "quotes": {
        "open": 111,
        "close": 321,
        "high": 111,
        "low": 100
    },
    "dividends": {
        "amountPaid": 0.5,
        "exDate": "2020-01-01",
        "yieldOnCost": 10,
        "growth": { "value": 111, "pct_chg": 10 }, /* some fields could be more attributes than just k/v */
        "yield": 123
    },
    "fundamentals": {
        "num_employees": 123213213,
        "shares": 123123123123,
        ....
    }
}
What approach would you take for storing this data?
Based upon the info (the sample data and the use cases) you posted, I think storing the historical data as a separate collection sounds fine.
Some of the important factors that affect the database design (or data model) are the amount of data and the kind of queries - the most important queries you plan to perform on the data. Assuming that the JSON data you posted (for a stock symbol) can be used to perform the first two queries, you can start with the idea of storing the historical data as a separate collection. The historical data document for a symbol can cover a year or a range of years - it depends upon the queries, the data size, and the type of information.
MongoDB's document-based model allows a flexible schema, which can be useful for implementing future changes and requirements easily. Note that a MongoDB document can store at most 16 MB of data.
For reference, see MongoDB Data Model Design.
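A minimal sketch of that split with pymongo (the symbols and history collection names, database name, and field paths are assumptions based on the sample data):

from datetime import datetime
from pymongo import MongoClient

db = MongoClient()["marketdata"]  # assumed database name

# Current data: one document per symbol (the JSON from the question).
current = db.symbols.find_one({"symbol": "AAPL"})

# Screening by current values (use case 2).
cheap_payers = db.symbols.find(
    {"quotes.close": {"$lt": 10}, "dividends.amountPaid": {"$gt": 3}}
)

# Historical data: separate collection, one document per symbol and period (or per day).
db.history.insert_one({
    "symbol": "AAPL",
    "date": datetime(2020, 1, 2),
    "dividends": {"yield": 1.2},
})

# Charting historical values per symbol (use case 3).
points = db.history.find(
    {"symbol": "AAPL", "date": {"$gte": datetime(2010, 1, 1), "$lt": datetime(2021, 1, 1)}}
).sort("date", 1)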
Stock market data by itself is huge. Keep it all in one place per company, otherwise you'll end up with a mess sooner or later.
In your example above: logo URLs should point to image files ('.png' etc.), not to an HTML page.
Your "quotes" section will become way too big; keep quotes at the top level as their own documents (that's the nice thing with Mongo). Each quote document should have a date ;) associated with it, stored in a date format Mongo supports natively, not as a string.
I'm running a model with Ludwig.
Dataset is Adult Census:
Features
workclass has almost 70% of instances as Private, so the Unknown (?) values can be imputed with this value.
native_country: 90% of the instances are United States, which can be used to impute the Unknown (?) values. The same cannot be said about the occupation column, as its values are more spread out.
capital_gain has 72% zero values for the <=50K class and 19% zero values for the >50K class.
capital_loss has 73% zero values for the <=50K class and 21% zero values for the >50K class.
When I define the model what is the best way to do it for the above cases?
{
    "name": "workclass",
    "type": "category",
    "preprocessing": {
        "missing_value_strategy": "fill_with_mean"
    }
},
{
    "name": "native_country",
    "type": "category",
    "preprocessing": {
        "missing_value_strategy": "fill_with_mean"
    }
},
{
    "name": "capital_gain",
    "type": "numerical",
    "preprocessing": {
        "missing_value_strategy": "fill_with_mean"
    }
},
{
    "name": "capital_loss",
    "type": "numerical",
    "preprocessing": {
        "missing_value_strategy": "fill_with_mean"
    }
},
Questions:
1) For category features, how do I define: if you find ?, replace it with X?
2) For numerical features, how do I define: if you find 0, replace it with the mean?
Ludwig currently only considers values that are actually missing in the CSV file (like two consecutive commas) for its replacement strategies. In your case I would suggest doing some minimal preprocessing of your dataset, replacing the zeros and ? with missing values (or directly with the replacement you want), depending on the type of feature. You can easily do it in pandas with something like:
df.loc[df.my_column == <value>, 'my_column'] = <new_value>
The alternative is to perform the replacement already in your code (for instance replacing 0s with averages) so that Ludwig doesn't have to do it and you have full control of the replacement strategy.
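A sketch of that preprocessing in pandas (column names from the Adult Census features described above; the file name is an assumption, and whether to fill each column with the mode or the mean is up to you):

import numpy as np
import pandas as pd

df = pd.read_csv("adult.csv")  # assumed file name

# Category features: turn "?" into a real missing value, then fill with the mode.
for col in ["workclass", "native_country"]:
    df[col] = df[col].replace("?", np.nan)
    df[col] = df[col].fillna(df[col].mode()[0])   # e.g. "Private", "United-States"

# Numerical features: treat 0 as missing and fill with the mean of the non-zero values.
for col in ["capital_gain", "capital_loss"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].mean())

df.to_csv("adult_preprocessed.csv", index=False)  # then point Ludwig at this file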