How to print the count of array elements along with another variable in MongoDB

I have a data collection which contains records in the following format.
{
  "_id": 22,
  "title": "Hibernate in Action",
  "isbn": "193239415X",
  "pageCount": 400,
  "publishedDate": ISODate("2004-08-01T07:00:00Z"),
  "thumbnailUrl": "https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.book-thumb-images/bauer.jpg",
  "shortDescription": "\"2005 Best Java Book!\" -- Java Developer's Journal",
  "longDescription": "Hibernate practically exploded on the Java scene. Why is this open-source tool so popular Because it automates a tedious task: persisting your Java objects to a relational database. The inevitable mismatch between your object-oriented code and the relational database requires you to write code that maps one to the other. This code is often complex, tedious and costly to develop. Hibernate does the mapping for you. Not only that, Hibernate makes it easy. Positioned as a layer between your application and your database, Hibernate takes care of loading and saving of objects. Hibernate applications are cheaper, more portable, and more resilient to change. And they perform better than anything you are likely to develop yourself. Hibernate in Action carefully explains the concepts you need, then gets you going. It builds on a single example to show you how to use Hibernate in practice, how to deal with concurrency and transactions, how to efficiently retrieve objects and use caching. The authors created Hibernate and they field questions from the Hibernate community every day - they know how to make Hibernate sing. Knowledge and insight seep out of every pore of this book.",
  "status": "PUBLISH",
  "authors": ["Christian Bauer", "Gavin King"],
  "categories": ["Java"]
}
I want to print the title and the authors count for the records where the number of authors is greater than 4.
I used the following command to extract the records which have more than 4 authors:
db.books.find({authors:{$exists:true},$where:'this.authors.length>4'},{_id:0,title:1});
But I am unable to print the number of authors along with the title. I tried the following command too, but it returned only the title list:
db.books.find({authors:{$exists:true},$where:'this.authors.length>4'},{_id:0,title:1,'this.authors.length':1});
Could you please help me print the number of authors here along with the title?

You can use the aggregation framework's $project stage with $size to reshape your data, and then $match to apply the filtering condition:
db.collection.aggregate([
  {
    $project: {
      title: 1,
      authorsCount: { $size: "$authors" }
    }
  },
  {
    $match: {
      authorsCount: { $gt: 4 }
    }
  }
])
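Note that $size raises an error for documents where the field is missing or is not an array, so you may want to keep the $exists guard from your original query. A minimal sketch, assuming your collection is named books as in the question:
db.books.aggregate([
  // keep only documents that actually have an authors array
  { $match: { authors: { $exists: true, $type: "array" } } },
  // reshape: drop _id, keep title, compute the array length
  { $project: { _id: 0, title: 1, authorsCount: { $size: "$authors" } } },
  // apply the "more than 4 authors" condition
  { $match: { authorsCount: { $gt: 4 } } }
])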

Related

Database schema design for stock market financial data

I'm trying to figure out the optimal structure to store financial data with daily inserts.
There are 3 use cases for querying the data:
Querying specific symbols for current data
Finding symbols by current values (e.g. where price < 10 and dividend.amountPaid > 3)
Charting historical values per symbol (e.g. querying all dividend.yield values between 2010 and 2020)
I am considering MongoDB, but I don't know which structure would be optimal. Embedding all the data per symbol for a duration of 10 years is too much, so I was thinking of embedding only the current data per symbol and creating references to historical documents.
How should I store this data? Is MongoDB not a good solution?
Here's a small example of the data for one symbol.
{
  "symbol": "AAPL",
  "info": {
    "company_name": "Apple Inc.",
    "description": "some long text",
    "website": "http://apple.com",
    "logo_url": "http://apple.com"
  },
  "quotes": {
    "open": 111,
    "close": 321,
    "high": 111,
    "low": 100
  },
  "dividends": {
    "amountPaid": 0.5,
    "exDate": "2020-01-01",
    "yieldOnCost": 10,
    "growth": { value: 111, pct_chg: 10 }, /* some fields could be more attributes than just k/v */
    "yield": 123
  },
  "fundamentals": {
    "num_employees": 123213213,
    "shares": 123123123123,
    ....
  }
}
What approach would you take for storing this data?
Based upon the info you posted (the sample data and the use cases), I think storing the historical data as a separate collection sounds fine.
Some of the most important factors that affect the database design (or data model) are the amount of data and the kind of queries - the most important queries you plan to perform on the data. Assuming that the JSON data you posted (for a stock symbol) can be used to perform the first two queries, you can start with the idea of storing the historical data as a separate collection. The historical data document for a symbol can cover a year or a range of years - it depends upon the queries, the data size, and the type of information.
MongoDB's document-based model allows a flexible schema, which can be useful for implementing future changes and requirements easily. Note that a MongoDB document can store up to 16 MB of data at most.
For reference, see MongoDB Data Model Design.
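As a rough illustration of this split, the current data could live in one collection and the history in another. This is only a sketch; the collection names symbols and symbol_history and the bucketing by year are assumptions, not part of the answer above:
// current snapshot, one document per symbol
db.symbols.insertOne({
  symbol: "AAPL",
  quotes: { open: 111, close: 321, high: 111, low: 100 },
  dividends: { amountPaid: 0.5, yield: 123 }
})

// history bucketed by year, one document per symbol per year
db.symbol_history.insertOne({
  symbol: "AAPL",
  year: 2020,
  dailyQuotes: [ { date: ISODate("2020-01-02T00:00:00Z"), close: 321 } ]
})

// use case 3: chart historical values for a range of years
db.symbol_history.find({ symbol: "AAPL", year: { $gte: 2010, $lte: 2020 } })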
Stock market data by itself is huge. Keep it all in one place per company; otherwise you will have a mess sooner or later.
In your example above: logo URLs normally point to image files ('.png' etc.), not to an HTML page.
Your embedded "quotes" section will get way too big. Keep quotes at the top level, as their own documents - that's the nice thing with Mongo. Each quote document should have a date associated with it, stored using one of the date formats Mongo supports, not a string.

Schema for chat application using mongodb

I am trying to build a schema for a chat application in MongoDB. I have 2 types of user models - Producer and Consumer. Producers and Consumers can have conversations with each other. My ultimate goal is to fetch all the conversations for any producer and consumer and show them in a list, just like all the messaging apps (e.g. Facebook) do.
Here is the schema I have come up with:
Producer: {
  _id: 123,
  name: "Sam"
}
Consumer: {
  _id: 456,
  name: "Mark"
}
Conversation: {
  _id: 321,
  producerId: 123,
  consumerId: 456,
  lastMessageId: 1111,
  lastMessageDate: "7/7/2018"
}
Message: {
  _id: 1111,
  conversationId: 321,
  body: "Hi"
}
Now I want to fetch, say, all the conversations of Sam. I want to show them in a list just like Facebook does, grouping them by each Consumer and sorting according to time.
I think I need to do the following queries for this:
1) Get all Conversations where producerId is 123, sorted by lastMessageDate.
I can then show the list of all Conversations.
2) If I want to know all the messages in a conversation, I query Message and get all messages where conversationId is 321.
Now, for each new message, I also need to update the conversation with the new messageId and date every time. Is this the right way to proceed, and is it optimal considering the number of queries involved? Is there a better way I can proceed with this? Any help would be highly appreciated.
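In mongo shell terms, that plan might look roughly like this (the collection names conversations and messages are assumed, as are the update values):
// 1) all of Sam's conversations, newest activity first
db.conversations.find({ producerId: 123 }).sort({ lastMessageDate: -1 })

// 2) all messages in one conversation
db.messages.find({ conversationId: 321 })

// on every new message, denormalize the latest message onto the conversation
db.conversations.updateOne(
  { _id: 321 },
  { $set: { lastMessageId: 1112, lastMessageDate: new Date() } }
)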
Design:
I wouldn't say it's bad. For the case you've described, it's actually pretty good. Such denormalization of the last message date and ID is great, especially if you plan a view with a list of all conversations - you get the last message date in the same query. Maybe go even one step further and add the last message text, if it's applicable in this view.
You can read more on the pros and cons of denormalization (and schema modeling in general) on the MongoDB blog (parts 1, 2 and 3). It's not that fresh but not outdated.
Also, if such multi-document updates worry you because of possible inconsistencies, MongoDB v4 has you covered with transactions.
Querying:
On one hand, you can issue multiple queries, and that's not bad at all (especially when a few of them are easily cacheable, like the producer or consumer data). On the other hand, you can use aggregations to fetch all these things at once if needed.
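If you go the aggregation route, a sketch of fetching one producer's conversations together with the last message body might look like this (the collection names conversations and messages and the schema above are assumed):
db.conversations.aggregate([
  // all of this producer's conversations, newest first
  { $match: { producerId: 123 } },
  { $sort: { lastMessageDate: -1 } },
  // pull in the denormalized last message document
  { $lookup: {
      from: "messages",
      localField: "lastMessageId",
      foreignField: "_id",
      as: "lastMessage"
  } },
  { $unwind: "$lastMessage" }
])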

Solr request: SQL-like JOIN, GROUP BY, SUM(), WHERE SUM()

I'm new to Solr and I have the following problem:
I have those documents:
category:contract:
{
  "contract_id_s": "contract-ENG-00001",
  "title_s": "contract title",
  "ref_easy_s": "REFAAA",
  "commitment_id_s": "ENG-00001"
}
category:commitment:
{
  "commitment_id_s": "ENG-00001",
  "title_s": "commitment title",
  "status_s": "Validated",
  "date_changed_status_s": "2015-09-30",
  "date_status_initiated_s": "2015-09-27",
  "date_status_confirmed_s": "2015-09-28",
  "date_status_validated_s": "2015-09-30"
}
category:commitment AND sub_category_s:commitment_project:
{
  "id": "ENG-00001_AAA",
  "commitment_id_s": "ENG-00001",
  "project_id_s": "AAA",
  "project_name_s": "project name",
  "project_amount_asked_s": "2000",
  "project_amount_validated_s": "2100"
},
{
  "id": "ENG-00001_AAA2",
  "commitment_id_s": "ENG-00001",
  "project_id_s": "AAA",
  "project_name_s": "project name",
  "project_amount_asked_s": "1000",
  "project_amount_validated_s": "1200"
}
For each commitment, there could be a contract.
For each commitment, there could be some payments.
Here is what I want to do:
- By default, only select commitments that have at least:
  - one sub_category_s:commitment_project with a project_amount_validated_s value;
  - one contract.
- If filtering on amounts, only select, within this list, the commitments whose SUM of project_amount_validated_s is > amount_min AND < amount_max.
What is the best practice in terms of performance?
- Requesting the ids of the commitments, then requesting the details for them?
- Is there a way to JOIN the contract information in this request?
- Or is the best practice to request each document one by one?
The problem is that I don't want to request useless data (performance, bandwidth).
There are some tools available to you in the form of:
Solr's Block Join Query Parser (which allows for simple parent/child queries).
Solr Facets (which allow for aggregations (e.g. sum of payments) ... with recent support for faceting on parent/child fields).
The Solr Expand Component (which recently allows parent information to be expanded from a child block join query).
However, I'm not certain you can do everything you're hoping for in one query (using these pieces). And even if you can, stitching them together doesn't come close to the simplicity of the SELECT...JOIN...GROUP BY...HAVING SQL query you're hoping to replicate. (Unless you want to try out the Solr 6 developer snapshot with parallel SQL support.)
BUT if this is your only use case, AND Solr is not your primary datastore, I'd strongly recommend modeling your Solr data to fit your use case.
E.g. start simple, denormalize, and only include the fields in your data model that are needed for search:
Only one type of record: commitment
Fields
commitment_id_s
title_s
status_s
date_changed_status_s
date_status_initiated_s
date_status_confirmed_s
date_status_validated_s
total_payments_asked (numeric sum of project_amount_asked from DB)
total_payments_validated (numeric sum of project_amount_validated from DB)
project_names (multiValued list of searchable project names)
contract_names (multiValued list of searchable contract names)
Then your query just needs a filter:
total_payments_validated:[<amount_min> TO <amount_max>]
on top of your default criteria.
Once your search has identified the commitment IDs matching the Solr query, then go back and query the source database for any additional information needed for display (project details, contract details, dates, etc...)
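To make this concrete, a denormalized commitment document following the model above might look like this (a sketch; the totals are sums of the sample project amounts, and the range values in the query are invented):
{
  "commitment_id_s": "ENG-00001",
  "title_s": "commitment title",
  "status_s": "Validated",
  "total_payments_asked": 3000,
  "total_payments_validated": 3300,
  "project_names": ["project name"],
  "contract_names": ["contract title"]
}
A request applying both the default criteria (at least one contract) and an amount range could then be:
q=*:*&fq=contract_names:[* TO *]&fq=total_payments_validated:[2000 TO 5000]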
OK, I've found a solution by using {!join}.
For instance, in PHP:
[
  'q' => "{!join from=id to=service_id score=none}uri:\\$serviceUri* AND -deleted:true",
  'fq' => "{!cache=false}category:monthly_volume AND type:\"$type\" AND timestamp:[$strDateStart TO $strDateEnd]",
  'alt' => 'json',
  'max-results' => 1000,
  'sort' => 'timestamp ASC',
  'statsFields' => 'stats.field=value&stats.facet=timestamp',
]
Or with a URL request:
http://localhost:8983/solr/fluks-admin/select?q={!join+from=id+to=sector_id+score=none}{!join+from=uri+to=service+score=none}uri:/test-en/service-en*+AND+-deleted:true&fq={!cache=false}category:indicator+AND+timestamp:[201608+TO+201610]+AND+type:("-3"+OR+2+OR+3)+AND+-deleted:true&wt=json&indent=true&json.facet={sum_timestamp:{terms:{limit:-1, field:timestamp, facet:{sum_type:{terms:{limit:-1, field:type, facet:{sum_vol_value:"sum(vol_value)"}}}}}}}

Storing hierarchical data in Google Datastore

I have data in the following structure:
{
  "id": "1",
  "title": "A Title",
  "description": "A description.",
  "listOfStrings": ["one", "two", "three"],
  "keyA": {
    "keyB": " ",
    "keyC": " ",
    "keyD": {
      "keyE": " ",
      "KeyF": " "
    }
  }
}
I want to put/get this in Google Datastore. What is the best way to store this?
Should I be using entity groups?
Requirements:
I do not need transactions.
Must be performant to read the whole structure.
Must be able to query based on keyE's content.
This link (Storing hierarchical data in Google App Engine Datastore?) mentions using entity groups for hierarchical data, but this link (How to get all the descendants of a given instance of a model in google app engine?) mentions they should only be used for transactions, which I do not need.
I'm using this library (which I do not think supports ReferenceProperty?):
http://github.com/GoogleCloudPlatform/gcloud-ruby.git
If you want to be able to query by the hierarchy (keyA->keyB->keyC), use ancestors, or just a key that encodes that path, to avoid entity group limits. If you want to be able to query for entities which contain a provided key, make a computed property in which you store a flat list of the keys stored inside, and store the original hierarchy in a JsonProperty, for example.
In my experience, the more entities you request, the slower (less performant) it becomes. Making lots of entities with small bits of data is not efficient. Instead, store the whole hierarchy as a JsonProperty and use code to walk through it.
If keyE is just a string, then add a property keyE = StringProperty(indexed=True) so you can query against it.
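Putting those suggestions together, the stored entity might look something like this (a sketch; the property name hierarchy_json is invented for illustration):
{
  "id": "1",
  "title": "A Title",
  "description": "A description.",
  "listOfStrings": ["one", "two", "three"],
  "hierarchy_json": "{\"keyA\":{\"keyB\":\" \",\"keyC\":\" \",\"keyD\":{\"keyE\":\" \",\"KeyF\":\" \"}}}",
  "keyE": " "
}
Here hierarchy_json is the JsonProperty blob to walk through in code, and keyE is the separately indexed string property used for queries.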

Lack of rich querying functionality in NoSQL?

Every time I contemplate using NoSQL for a solution, I always get hung up on the lack of rich querying functionality. It may very well be my lack of understanding of NoSQL. It also might be due to the fact that I'm very comfortable with SQL. From my understanding, NoSQL really lends itself well to simple-schema scenarios (so it's probably not going to work well for a relational database where you have 50+ tables). Yet even for trivial scenarios I always seem to want rich query functionality. Let's take a recipe database as a trivial example.
While the schema is, no doubt, trivial, you would definitely want rich querying ability. You would probably want to search by the following (and more):
Title
Tag
Category
id
Likes
User who created recipe
create date
rating
dietary restrictions
You would also want to combine these criteria into any combination you wanted. While I know most NoSQL solutions have secondary indexes, doesn't this type of querying requirement severely limit how many solutions NoSQL is relevant for? I usually need this rich querying ability. Another good example would be a bug-tracking application.
I don't think you want to kick off a map-reduce job every time someone wants to search the database (I think this would be analogous to doing table scans most of the time in a traditional relational model). So I would assume there would be a lot of queries where you would have to loop through each entity and look for the criteria you wanted to search for (which would probably be slow). I understand you can run nightly map-reduce jobs to either analyze the data or maybe normalize it into a typical relational database structure for reports.
Now I can see it being useful for scenarios where you would most likely have to read all the data anyway. Think of a web server log, or maybe an IoT type of app where you're collecting massive amounts of data (like sensor readings) and doing nightly analysis.
So is my understanding of NoSQL off, or is there a limit to the number of scenarios that it works well with?
I think the issue you are encountering is that you are approaching noSQL with the same design mindset that you would use with SQL. You mentioned "rich querying" several times. To me, that points towards design flaws (using only reference ids/trying to define relationships). A significant concept in noSQL is that data can be repeated (and often should be). Your recipe example is actually a great use case for noSQL. Here's how I would approach it using 3 of the models you mention (for simplicity's sake):
Recipe = {
  _id: "a001",
  name: "Burger",
  ingredients: [
    {
      _id: "b001",
      name: "Beef"
    },
    {
      _id: "b002",
      name: "Cheese"
    }
  ],
  createdBy: {
    _id: "c001",
    firstName: "John",
    lastName: "Doe"
  }
}
Person = {
  _id: "c001",
  firstName: "John",
  lastName: "Doe",
  email: "jd@email.com",
  preferences: {
    emailNotifications: true
  }
}
Ingredient = {
  _id: "b001",
  name: "Beef",
  brand: "Agri-co",
  shelfLife: "3 days",
  calories: 300
};
The reason I designed it this way is expressly for the purpose of its existence (assuming it's something like allrecipes.com). When searching/filtering recipes, you can filter by the author, but their email preferences are irrelevant. Similarly, the shelf life and brand of the ingredient are irrelevant. The schema is designed for the specific use case, not just because your data needs to be saved. Now here are a few of your mentioned queries (mongo; note the dot notation needed to match the name field inside the ingredients subdocuments):
db.recipes.find({ name: "Burger" });
db.recipes.find({ "ingredients.name": { $nin: ["Cheese", "Milk"] } }); // dietary restrictions
Your rich querying concerns have now been reduced to single queries in a single collection.
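Several of the question's criteria combine the same way in a single filter (a sketch; rating and createdBy follow the models above, and the threshold values are invented):
db.recipes.find({
  "createdBy._id": "c001",                  // user who created the recipe
  rating: { $gte: 90 },                     // minimum rating
  "ingredients.name": { $nin: ["Cheese"] }  // dietary restriction
}).sort({ rating: -1 });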
The downside of this design is slower write speed. You need more logic on the backend, with the potential for more programmer error. Writes are also slower than in SQL because you have to touch the various models to gather the relevant information. That being said, how often is a recipe viewed vs. how often is it written/edited? (This was my comment on reading trumping writing.) The other major downside is the necessity of foresight. The relationship between an ingredient and a recipe doesn't change form, but the information your application requires might, and editing a noSQL model tends to be more difficult than editing a SQL table.
Here's one other contrived example using the same models to emphasize my point about purposeful design. Assume your new site is about famous chefs instead of a recipe database:
Person = {
  _id: "c001",
  firstName: "Paula",
  lastName: "Deen",
  recipeCount: 15,
  commonIngredients: [
    {
      _id: "b001",
      name: "Butter",
      count: 15
    },
    {
      _id: "b002",
      name: "Salted Butter",
      count: 15
    }
  ],
  favoriteRecipes: [
    {
      _id: "a001",
      name: "Fried Butter",
      calories: "3000"
    }
  ]
};
Recipe = {
  _id: "a001",
  name: "Fried Butter",
  ingredients: [
    {
      _id: "b001",
      name: "Butter"
    }
  ],
  directions: "Fry butter. Eat.",
  calories: "3000",
  rating: 99,
  createdBy: {
    _id: "c001",
    firstName: "Paula",
    lastName: "Deen"
  }
};
Ingredient = {
  _id: "b001",
  name: "Butter",
  brand: "Butterfields",
  shelfLife: "1 month"
};
Both of these designs use the same information, but each is modeled for the specific reason you bothered gathering that information. Now you have the requisite information for a chef list page and typical sorting/filtering, and you can navigate from there to a recipe page and have that info available.
Design for the use case, not to model relationships.