Solr request: SQL-like JOIN, GROUP BY, SUM(), WHERE SUM() - solr

I'm new to Solr and I have the following problem:
I have these documents:
category:contract:
{
"contract_id_s": "contract-ENG-00001",
"title_s": "contract title",
"ref_easy_s": "REFAAA",
"commitment_id_s": "ENG-00001",
},
category:commitment:
{
"commitment_id_s": "ENG-00001",
"title_s": "commitment title",
"status_s": "Validated",
"date_changed_status_s": "2015-09-30",
"date_status_initiated_s": "2015-09-27",
"date_status_confirmed_s": "2015-09-28",
"date_status_validated_s": "2015-09-30",
},
category:commitment AND sub_category_s:commitment_project:
{
"id": "ENG-00001_AAA",
"commitment_id_s": "ENG-00001",
"project_id_s": "AAA",
"project_name_s": "project name",
"project_amount_asked_s": "2000",
"project_amount_validated_s": "2100"
},
{
"id": "ENG-00001_AAA2",
"commitment_id_s": "ENG-00001",
"project_id_s": "AAA",
"project_name_s": "project name",
"project_amount_asked_s": "1000",
"project_amount_validated_s": "1200"
},
For each commitment, there can be a contract.
For each commitment, there can be some payments.
Here is what I want to do:
- By default, only select commitments that have at least:
. one sub_category_s:commitment_project document with a project_amount_validated_s value.
. one contract.
- If filtered on amounts, only select from this list the commitments whose SUM of project_amount_validated_s is > amount_min AND < amount_max.
What is the best practice in terms of performance?
- Requesting the IDs of the commitments, then requesting their details?
- Is there a way to JOIN the contract information in this request?
- Or is the best practice to request each document one by one?
The problem is that I don't want to request useless data (performance, bandwidth).

There are some tools available to you:
Solr's Block Join Query Parser (which allows for simple parent/child queries).
Solr Facets (which allow for aggregations, e.g. a sum of payments, with recent support for faceting on parent/child fields).
The Solr Expand Component (which recently gained the ability to expand parent information from a child block join query).
However, I'm not certain you can do everything you're hoping for in one query using these pieces. And even if you can, stitching them together doesn't come close to the simplicity of the SELECT...JOIN...GROUP BY...HAVING SQL query you're hoping to replicate (unless you want to try out the Solr 6 developer snapshot with parallel SQL support).
BUT If this is your only use-case, AND Solr is not your primary datastore, I'd strongly recommend modeling your Solr data to fit your use-case.
E.g. Start simple, denormalize, and only include the fields in your datamodel needed for search:
Only one type of record: commitment
Fields
commitment_id_s
title_s
status_s
date_changed_status_s
date_status_initiated_s
date_status_confirmed_s
date_status_validated_s
total_payments_asked (numeric sum of project_amount_asked from DB)
total_payments_validated (numeric sum of project_amount_validated from DB)
project_names (multiValued list of searchable project names)
contract_names (multiValued list of searchable contract names)
Then your query just needs a filter:
total_payments_validated:[<amount_min> TO <amount_max>]
to enforce your default criteria.
Once your search has identified the commitment IDs matching the Solr query, go back and query the source database for any additional information needed for display (project details, contract details, dates, etc.).
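A minimal sketch of the denormalization step described above, in plain JavaScript. It assumes the per-project rows have already been read from the source database into objects shaped like the question's commitment_project documents. The helper names buildCommitmentDocs and amountRangeFilter are hypothetical. The curly braces in the range make both bounds exclusive, which matches the question's > and < (square brackets would make them inclusive).

```javascript
// Hypothetical helper: collapse commitment_project rows into one
// denormalized document per commitment, with numeric totals.
function buildCommitmentDocs(projectRows) {
  const byCommitment = new Map();
  for (const row of projectRows) {
    const id = row.commitment_id_s;
    if (!byCommitment.has(id)) {
      byCommitment.set(id, {
        commitment_id_s: id,
        total_payments_asked: 0,
        total_payments_validated: 0,
        project_names: [],
      });
    }
    const doc = byCommitment.get(id);
    // Amounts are stored as strings in the question, so convert them.
    doc.total_payments_asked += Number(row.project_amount_asked_s) || 0;
    doc.total_payments_validated += Number(row.project_amount_validated_s) || 0;
    if (!doc.project_names.includes(row.project_name_s)) {
      doc.project_names.push(row.project_name_s);
    }
  }
  return [...byCommitment.values()];
}

// Hypothetical helper: Solr range filter with exclusive bounds.
function amountRangeFilter(min, max) {
  return `total_payments_validated:{${min} TO ${max}}`;
}

// Rows taken from the question's sample data.
const rows = [
  { commitment_id_s: "ENG-00001", project_name_s: "project name",
    project_amount_asked_s: "2000", project_amount_validated_s: "2100" },
  { commitment_id_s: "ENG-00001", project_name_s: "project name",
    project_amount_asked_s: "1000", project_amount_validated_s: "1200" },
];
const docs = buildCommitmentDocs(rows);
console.log(docs[0].total_payments_validated); // 3300
console.log(amountRangeFilter(1000, 5000));
```

Computing the totals at indexing time is what lets the search side stay a plain range filter, with no join or facet needed.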

OK, I've found a solution using the {!join} query parser.
For instance, in PHP:
[
'q' => "{!join from=id to=service_id score=none}uri:\\$serviceUri* AND -deleted:true",
'fq' => "{!cache=false}category:monthly_volume AND type:\"$type\" AND timestamp:[$strDateStart TO $strDateEnd]",
'alt' => 'json',
'max-results' => 1000,
'sort' => 'timestamp ASC',
'statsFields' => 'stats.field=value&stats.facet=timestamp',
]
Or with URL request:
http://localhost:8983/solr/fluks-admin/select?q={!join+from=id+to=sector_id+score=none}{!join+from=uri+to=service+score=none}uri:/test-en/service-en*+AND+-deleted:true&fq={!cache=false}category:indicator+AND+timestamp:[201608+TO+201610]+AND+type:("-3"+OR+2+OR+3)+AND+-deleted:true&wt=json&indent=true&json.facet={sum_timestamp:{terms:{limit:-1, field:timestamp, facet:{sum_type:{terms:{limit:-1, field:type, facet:{sum_vol_value:"sum(vol_value)"}}}}}}}
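The URL above is hard to read and easy to break when typed by hand. As a sketch only, the same request could be assembled in JavaScript, letting URLSearchParams handle the encoding. The helper name buildSelectUrl is hypothetical; the core name, field names and query strings are copied from the URL above, with the fq shortened for readability.

```javascript
// Nested JSON Facet definition mirroring the json.facet value in the URL:
// bucket by timestamp, then by type, and sum vol_value in each bucket.
const facet = {
  sum_timestamp: {
    terms: {
      limit: -1,
      field: "timestamp",
      facet: {
        sum_type: {
          terms: {
            limit: -1,
            field: "type",
            facet: { sum_vol_value: "sum(vol_value)" },
          },
        },
      },
    },
  },
};

// Hypothetical helper: build the /select URL with properly encoded parameters.
function buildSelectUrl(baseUrl, { q, fq, facet }) {
  const params = new URLSearchParams({
    q,
    fq,
    wt: "json",
    "json.facet": JSON.stringify(facet),
  });
  return `${baseUrl}/select?${params.toString()}`;
}

const url = buildSelectUrl("http://localhost:8983/solr/fluks-admin", {
  q: "{!join from=id to=sector_id score=none}{!join from=uri to=service score=none}uri:/test-en/service-en* AND -deleted:true",
  fq: "{!cache=false}category:indicator AND timestamp:[201608 TO 201610]",
  facet,
});
console.log(url);
```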

Related

Performance issue when querying time-based objects

I'm currently working on a MongoDB collection containing documents that look like the following:
{ startTime : Date, endTime: Date, source: String, metaData: {}}
My use case is to retrieve all documents that fall within a queried time frame, so my query looks like this:
db.myCollection.find(
{
$and: [
{"source": aSource},
{"startTime" : {$lte: timeFrame.end}},
{"endTime" : {$gte: timeFrame.start}}
]
}
).sort({ "startTime" : 1 })
With an index defined as follows:
db.myCollection.createIndex( { "source" : 1, "startTime": 1, "endTime": 1 } );
The problem is that queries become very slow (several hundred ms on a local database) as soon as the number of documents per source increases.
Using explain shows that I'm using this index efficiently (only the returned documents are scanned; otherwise only the index is accessed), so the slowness seems to come from the index scan itself, as this query needs to go over a large portion of the index.
In addition, such an index gets huge pretty quickly and therefore seems inefficient.
Is there anything I'm missing that could make these queries faster, or am I condemned to retrieve all the documents belonging to a given source? I see that MongoDB now provides some time-series features; could they help with my problem?

Database schema design for stock market financial data

I'm figuring out the optimal structure to store financial data with daily inserts.
There are 3 use cases for querying the data:
Querying specific symbols for current data
Finding symbols by current values (e.g. where price < 10 and dividend.amountPaid > 3)
Charting historical values per symbol (e.g. query all dividend.yield between 2010 and 2020)
I am considering MongoDB, but I don't know which structure would be optimal. Embedding all the data per symbol for a duration of 10 years is too much, so I was thinking of embedding the current data per symbol, and creating references to historical documents.
How should I store this data? Is MongoDB not a good solution?
Here's a small example of the data for one symbol.
{
"symbol": "AAPL",
"info": {
"company_name": "Apple Inc.",
"description": "some long text",
"website": "http://apple.com",
"logo_url": "http://apple.com"
},
"quotes": {
"open": 111,
"close": 321,
"high": 111,
"low": 100
},
"dividends": {
"amountPaid": 0.5,
"exDate": "2020-01-01",
"yieldOnCost": 10,
"growth": { value: 111, pct_chg: 10 } /* some fields could be more attributes than just k/v */
"yield": 123
},
"fundamentals": {
"num_employees": 123213213,
"shares": 123123123123,
....
}
}
What approach would you take for storing this data?
Based on the information you posted (the sample data and the use cases), I think storing the historical data as a separate collection sounds fine.
Some of the important factors that affect the database design (or data model) are the amount of data and the kind of queries, especially the most important queries you plan to run on the data. Assuming that the JSON data you posted (for a stock symbol) can be used to serve the first two queries, you can start from the idea of storing the historical data in a separate collection. A historical data document for a symbol can cover a year or a range of years, depending on the queries, the data size, and the type of information.
MongoDB's document-based model allows a flexible schema, which can be useful for implementing future changes and requirements easily. Note that a MongoDB document can store at most 16 MB of data.
For reference, see MongoDB Data Model Design.
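As a rough illustration of the "current data plus separate historical collection" idea, here is a sketch in plain JavaScript. The helper name splitSnapshot, the updatedAt and dividendYield fields, and the choice of one history document per symbol per day are all assumptions made for the example; the answer above leaves the granularity open.

```javascript
// Hypothetical helper: split one daily snapshot into
// (a) the document kept in a "current" collection, one per symbol, and
// (b) a small dated document appended to a "history" collection.
function splitSnapshot(snapshot, date) {
  const { symbol, info, quotes, dividends, fundamentals } = snapshot;
  const current = {
    _id: symbol, // one current document per symbol
    info,
    quotes,
    dividends,
    fundamentals,
    updatedAt: date,
  };
  const history = {
    symbol,
    date, // a real Date, not a string, so range queries work
    quotes,
    dividendYield: dividends ? dividends.yield : null,
  };
  return { current, history };
}

const snapshot = {
  symbol: "AAPL",
  info: { company_name: "Apple Inc." },
  quotes: { open: 111, close: 321, high: 111, low: 100 },
  dividends: { amountPaid: 0.5, yield: 123 },
  fundamentals: { num_employees: 123213213 },
};
const { current, history } = splitSnapshot(snapshot, new Date("2020-01-01T00:00:00Z"));
console.log(current._id, history.date.toISOString(), history.dividendYield);
```

With this split, the first two use cases only touch the current documents, while charting reads the history collection by symbol and date range, where a compound index on { symbol: 1, date: 1 } would be a natural fit.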
Stock market data by itself is huge. Keep it all in one place per company, otherwise you will end up with a mess sooner or later.
In your example above: logos are '.png' files etc., not .html.
Your "quotes" section will be way too big; keep it all on the top level, that's the nice thing about Mongo. Each quotes document should have a date associated with it, stored as one of Mongo's date types rather than as a string.

Get Display Name for License SKU in Microsoft Graph

I am trying to use Microsoft Graph to capture the products which we have licenses for.
While I can get the skuPartNumber, that name is not exactly display-friendly.
I have come across DisplayName as a data point in almost all the API calls that return an object with an id.
I was wondering if there is a DisplayName for the SKUs, and where I could get it via Graph.
For reference, the call I made was on the https://graph.microsoft.com/v1.0/subscribedSkus endpoint following the doc https://learn.microsoft.com/en-us/graph/api/subscribedsku-list?view=graph-rest-1.0
The following is what's returned (after filtering out things I don't need), and as mentioned before, while I have a unique identifier which I can use via the skuPartNumber, that is not exactly PRESENTABLE.
You might notice that for some of the SKUs, it is difficult to figure out what they refer to based on the names in the image of the Licenses page posted after the output.
[
{
"capabilityStatus": "Enabled",
"consumedUnits": 0,
"id": "aca06701-ea7e-42b5-81e7-6ecaee2811ad_2b9c8e7c-319c-43a2-a2a0-48c5c6161de7",
"skuId": "2b9c8e7c-319c-43a2-a2a0-48c5c6161de7",
"skuPartNumber": "AAD_BASIC"
},
{
"capabilityStatus": "Enabled",
"consumedUnits": 0,
"id": "aca06701-ea7e-42b5-81e7-6ecaee2811ad_df845ce7-05f9-4894-b5f2-11bbfbcfd2b6",
"skuId": "df845ce7-05f9-4894-b5f2-11bbfbcfd2b6",
"skuPartNumber": "ADALLOM_STANDALONE"
},
{
"capabilityStatus": "Enabled",
"consumedUnits": 96,
"id": "aca06701-ea7e-42b5-81e7-6ecaee2811ad_0c266dff-15dd-4b49-8397-2bb16070ed52",
"skuId": "0c266dff-15dd-4b49-8397-2bb16070ed52",
"skuPartNumber": "MCOMEETADV"
}
]
Edit:
I am aware that I can get "friendly names" of SKUs in the following link
https://learn.microsoft.com/en-us/azure/active-directory/users-groups-roles/licensing-service-plan-reference
The problem is that it contains ONLY the 70ish most COMMON SKUs (in the last financial quarter), NOT ALL.
My organization alone has 5 SKUs not present on that page, and some of the clients for whom we are an MSP also have a few. In that context, the link really does not solve the problem, since it is neither reliable nor updated fast enough for new SKUs.
You can find a matching list in "Product names and service plan identifiers for licensing".
Please note that:
the table lists the most commonly used Microsoft online service
products and provides their various ID values. These tables are for
reference purposes and are accurate only as of the date when this
article was last updated. Microsoft does not plan to update them for
newly added services periodically.
Here is an extra list which may be helpful.
There is a CSV download available of the data on the "Product names and service plan identifiers for licensing" page now.
For example, the current CSV (as of the time of posting this answer) is located at https://download.microsoft.com/download/e/3/e/e3e9faf2-f28b-490a-9ada-c6089a1fc5b0/Product%20names%20and%20service%20plan%20identifiers%20for%20licensing%20v9_22_2021.csv. This can be downloaded, cached and parsed in your application to resolve the product display name.
This is just a CSV format of the same table that is displayed on the webpage, which is not comprehensive, but it should have many of the products listed. If you find one that is missing, you can use the "Submit feedback for this page" button on the bottom of the page to create a GitHub issue. The documentation team usually responds in a few weeks.
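A sketch of the "download, cache and parse" approach in JavaScript. It assumes the CSV has header columns named Product_Display_Name, String_Id and GUID (check the header row of the file you actually download), and it falls back to skuPartNumber for SKUs the table does not list. The function names are hypothetical, and the sample row uses a placeholder display name rather than real catalog data.

```javascript
// Minimal CSV parser that understands quoted fields and doubled quotes.
function parseCsv(text) {
  const rows = [];
  let row = [], field = "", inQuotes = false;
  text = text.replace(/^\uFEFF/, ""); // drop a UTF-8 BOM if present
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') { field += '"'; i++; }
      else if (ch === '"') inQuotes = false;
      else field += ch;
    } else if (ch === '"') inQuotes = true;
    else if (ch === ",") { row.push(field); field = ""; }
    else if (ch === "\n") { row.push(field); rows.push(row); row = []; field = ""; }
    else if (ch !== "\r") field += ch;
  }
  if (field !== "" || row.length > 0) { row.push(field); rows.push(row); }
  return rows;
}

// Build a lookup from SKU GUID to display name (column names are assumed).
function buildSkuNameMap(csvText) {
  const [header, ...rows] = parseCsv(csvText);
  const guidIdx = header.indexOf("GUID");
  const nameIdx = header.indexOf("Product_Display_Name");
  const map = new Map();
  for (const r of rows) {
    if (r[guidIdx]) map.set(r[guidIdx].toLowerCase(), r[nameIdx]);
  }
  return map;
}

// Resolve a subscribedSkus entry, falling back to skuPartNumber.
function displayNameFor(sku, map) {
  return map.get(sku.skuId.toLowerCase()) ?? sku.skuPartNumber;
}

// Placeholder data: the display name below is illustrative only.
const sampleCsv = [
  "Product_Display_Name,String_Id,GUID",
  '"Example Product, Basic",AAD_BASIC,2b9c8e7c-319c-43a2-a2a0-48c5c6161de7',
].join("\n");
const skuNames = buildSkuNameMap(sampleCsv);
console.log(displayNameFor(
  { skuId: "2b9c8e7c-319c-43a2-a2a0-48c5c6161de7", skuPartNumber: "AAD_BASIC" },
  skuNames
));
```

The fallback matters here because, as noted above, the table is not comprehensive.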
Microsoft may provide an API for this data in the future, but it's only in their backlog. (source)

How to print the count of array elements along with another variable in MongoDB

I have a data collection which contains records in the following format.
{
"_id": 22,
"title": "Hibernate in Action",
"isbn": "193239415X",
"pageCount": 400,
"publishedDate": ISODate("2004-08-01T07:00:00Z"),
"thumbnailUrl": "https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.book-thumb-images/bauer.jpg",
"shortDescription": "\"2005 Best Java Book!\" -- Java Developer's Journal",
"longDescription": "Hibernate practically exploded on the Java scene. Why is this open-source tool so popular Because it automates a tedious task: persisting your Java objects to a relational database. The inevitable mismatch between your object-oriented code and the relational database requires you to write code that maps one to the other. This code is often complex, tedious and costly to develop. Hibernate does the mapping for you. Not only that, Hibernate makes it easy. Positioned as a layer between your application and your database, Hibernate takes care of loading and saving of objects. Hibernate applications are cheaper, more portable, and more resilient to change. And they perform better than anything you are likely to develop yourself. Hibernate in Action carefully explains the concepts you need, then gets you going. It builds on a single example to show you how to use Hibernate in practice, how to deal with concurrency and transactions, how to efficiently retrieve objects and use caching. The authors created Hibernate and they field questions from the Hibernate community every day - they know how to make Hibernate sing. Knowledge and insight seep out of every pore of this book.",
"status": "PUBLISH",
"authors": ["Christian Bauer", "Gavin King"],
"categories": ["Java"]
}
I want to print the title and the author count where the number of authors is greater than 4.
I used the following command to extract the records that have more than 4 authors.
db.books.find({authors:{$exists:true},$where:'this.authors.length>4'},{_id:0,title:1});
But I am unable to print the number of authors along with the title. I also tried the following command, but it returned only the list of titles.
db.books.find({authors:{$exists:true},$where:'this.authors.length>4'},{_id:0,title:1,'this.authors.length':1});
Could you please help me print the number of authors along with the title?
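You can use the aggregation framework's $project with $size to reshape your data and then $match to apply the filtering condition. A leading $match keeps only documents where authors is an array, because $size raises an error otherwise:
db.books.aggregate([
  // Keep only documents whose authors field is an array.
  { $match: { authors: { $type: "array" } } },
  // Output the title together with the number of authors.
  {
    $project: {
      _id: 0,
      title: 1,
      authorsCount: { $size: "$authors" }
    }
  },
  // Keep only books with more than 4 authors.
  {
    $match: {
      authorsCount: { $gt: 4 }
    }
  }
])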
Mongo Playground

indexing large array in mongoDB

According to the MongoDB documentation, creating a multikey index on large arrays is not recommended, so what is the alternative?
I want to notify my app users whenever one of their contacts also starts using the app, so I have to upload and manage the contact list of each user.
We are using MongoDB with a replica set of one primary and two secondary machines.
Can MongoDB handle multikey indexing for arrays with hundreds of values?
Hundreds of contacts for hundreds of thousands of users can be very hard to manage.
The multikey solution looks like this:
{
customerId: "id1",
contacts: ["aaa", "aab", "aac", .... "zzz"]
}
index: createIndex({ contacts: 1 }).
Another solution is to save each contact in its own document, together with all the app users related to it:
{
phone: "aaa",
contacts: ["id1", "id2", "id3"]
},
{
phone: "aab",
contacts: ["id1"]
},
{
phone: "aac",
contacts: ["id1"]
},
......
{
phone: "zzz",
contacts: ["id1"]
}
index: createIndex( { phone: 1 } )
Both have poor write performance when uploading the contact list:
the first has to compute a huge index, and the second has to update lots of documents concurrently.
Is there a better way to do it?
I'm using a replica set with two secondary machines; could a shard key help?
Thanks
To index a field that holds an array value, MongoDB creates an index key for each element in the array. These multikey indexes support efficient queries
against array fields.
If I were you, my data model would look like this:
{
customerId: "id1",
contacts: ["_idx", "_idy", "_idw", .... "_idz"]
}
Then create your index on contacts. MongoDB creates an index on _id by default. You will also have to create documents for the non-app users; just add a field such as "app_user": true/false to tell them apart.
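To make the multikey behaviour concrete, here is an in-memory sketch in JavaScript of what such an index conceptually holds: one entry per array element, pointing back to the owning document. This is only an illustration, not how MongoDB implements its indexes, and buildMultikeyIndex is a hypothetical name.

```javascript
// In-memory illustration of a multikey index on "contacts":
// one index entry per distinct array element, listing the owning documents.
function buildMultikeyIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    for (const value of new Set(doc[field] ?? [])) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc.customerId);
    }
  }
  return index;
}

const users = [
  { customerId: "id1", contacts: ["aaa", "aab", "aac"] },
  { customerId: "id2", contacts: ["aaa", "zzz"] },
];
const contactsIndex = buildMultikeyIndex(users, "contacts");

// Who should be notified when the owner of "aaa" starts using the app?
console.log(contactsIndex.get("aaa")); // [ 'id1', 'id2' ]
```

In MongoDB itself, the equivalent lookup would be a plain equality match on the array field, for example find({ contacts: "aaa" }), which the index on contacts can serve. The write cost the question worries about comes from the number of entries: every uploaded contact adds one index key.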
For index performance, you can build the index in the background without any issues, and for replica sets, this is how it's done.
Sharding won't help you as things stand, because you have only one primary node in your cluster and therefore nothing to shard across. Sharding needs at least two sets of primary Mongo instances, so in your case you could add a fourth server, form two replica sets of one primary and one secondary each, and turn your system into two replicated shards.
Once this is achieved, it will balance the load between the two shards, even though a hundred documents isn't really much for MongoDB to deal with.
On the other hand, if you go for sharding, you will need more setup for the config servers if you're using MongoDB 3.4 or higher.
