Azure Search on blobs: retrieve the blob itself after search - azure-cognitive-search

How would I retrieve the actual blob (a PDF) itself via Azure Search? I know there are some Lookup possibilities, but these only give me a text representation of the content. I want a PDF download/stream of the underlying document. How would I go about this? I'm using the .NET SDK.
Do I need to index the filename (to retrieve it via blob storage directly)? And if so, how?

The answer is: just add a field called metadata_storage_path to your index. With the .NET SDK:
new Field() { Name = "metadata_storage_path", Type = "Edm.String", IsSearchable = false, IsFilterable = false, IsSortable = false, IsFacetable = false }
A list of metadata properties you can add to your index is here:
https://learn.microsoft.com/azure/search/search-howto-indexing-azure-blob-storage#how-azure-search-indexes-blobs
Alternatively, by default Azure Search populates the key field in your index with the URL-safe base64 encoded value of the metadata_storage_path property. You can decode the encoded blob path value (for details, see the base64Decode mapping function) and retrieve the blob.
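As a minimal sketch of that decode-and-download step, assuming the unpadded URL-safe base64 encoding described above and the Azure.Storage.Blobs package (the searchResult and credential names are illustrative, not part of the question's code):
using System;
using System.IO;
using System.Text;
using Azure.Storage.Blobs;

static string DecodeBlobPath(string key)
{
    // Convert URL-safe base64 back to standard base64, restoring the padding.
    string s = key.Replace('-', '+').Replace('_', '/');
    switch (s.Length % 4)
    {
        case 2: s += "=="; break;
        case 3: s += "="; break;
    }
    return Encoding.UTF8.GetString(Convert.FromBase64String(s));
}

// Decode the index key back into the blob URL, then stream the PDF down.
var blobUrl = DecodeBlobPath(searchResult.Key);          // key of a search hit (illustrative)
var blob = new BlobClient(new Uri(blobUrl), credential); // credential is assumed
using (var stream = File.OpenWrite("document.pdf"))
{
    blob.DownloadTo(stream);
}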

Related

When should I use _id in MongoDB?

MongoDB gives every document a field called "_id". I see people using it everywhere as a primary key, and using it in queries to find documents by their _id.
This field defaults to an auto-generated ObjectId; an example is:
db.tasks.findOne()
{
  _id: ObjectID("ADF9"),
  description: "Write lesson plan",
  due_date: ISODate("2014-04-01"),
  owner: ObjectID("AAF1")  // Reference to another document
}
But in JavaScript, a leading underscore on an object field is a convention for private, and as MongoDB uses JSON (specifically, BSON), should I be using these _ids for querying, finding, and describing relationships between documents? It doesn't seem right.
I saw that MongoDB has a way to generate UUIDs: https://docs.mongodb.com/manual/reference/method/UUID
Should I forget that _id property and create my own indexed id property with a UUID?
Use UUIDs for user-generated content, e.g. to name image uploads. UUIDs can be exposed to the user in a URL or when the user inspects an image on the client side. For everything that stays on the server and is not exposed to the user, there is no need to generate a UUID, and using the auto-generated _id is preferred.
A simple example of using a UUID would be:
const uuid = require('uuid');

exports.nameFile = async (req, res, next) => {
  // Derive the extension from the uploaded file's mimetype
  // (assumes a multer-style req.file is available).
  const extension = req.file.mimetype.split('/')[1];
  req.body.photo = `${uuid.v4()}.${extension}`;
  next();
};
How MongoDB names its things should not dictate how you name your things. If data sent by a third party conflicts with the conventions you agreed to follow, you have to transform that data into the format you want as soon as it arrives in your application.
An example based on your case:
function findTaskById(id) {
  var result = db.tasks.findOne({ "_id": id });
  var task = {
    id: result._id,
    description: result.description,
    something: result.something
  };
  return task;
}
This way you isolate the use of Mongo's _id to the layer of your application that is responsible for interacting with the database. In all other places where you need a task, you can use task.id.

Read JSON from a REST API as-is with Azure Data Factory

I'm trying to get Azure Data Factory to read my REST API and put it in SQL Server. The source is a REST API and the sink is a SQL Server table.
I tried to do something like:
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"$": "json"
},
"collectionReference": "$.tickets"
}
The source looks like:
{ "tickets": [ {... }, {...} ] }
Because of the poor mapping capabilities, I'm choosing this path; I'll then split the data with a query. Preferably I'd like to store each object inside tickets as a row containing the JSON of that object.
In short, how can I get the JSON output from the RestSource into a single text/nvarchar(max) column in a SqlSink?
I managed to solve the same issue by modifying the mapping manually.
ADF tries to parse the JSON anyway, but from Advanced mode you can edit the JSON paths. For example, this is the original schema parsed automatically by ADF:
https://imgur.com/Y7QhcDI
Once opened in Advanced mode, it shows the full paths, adding indexes of the elements, something like $tickets[0][] etc.
Try deleting all the other columns and keeping only the highest-level one, $tickets (in my case it was $value: https://i.stack.imgur.com/WnAzC.jpg). As a result, the entire JSON will be written into the destination column.
If there are pagination rules in place, each page will be written as a single row.
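For reference, here is a sketch of roughly what the edited translator could end up as, assuming the collection sits at $.tickets and the destination column is named ticketJson (both names are illustrative; take the exact path from your own Advanced-mode view):
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "path": "$['tickets']" },
            "sink": { "name": "ticketJson" }
        }
    ]
}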

Ordering the Solr search results based on indexed fields

I have to order the search results from Solr based on some fields that are already indexed.
My current API request, without sorting, looks like this:
http://127.0.0.1:8000/api/v1/search/facets/?page=1&gender=Male&age__gte=19
It gives the search results in the indexed order, but I have to reorder these results based on the field 'last_login', which is an already-indexed DateTimeField.
Here is my viewset:
class ProfileSearchView(FacetMixin, HaystackViewSet):
    index_models = [Profile]
    serializer_class = ProfileSearchSerializer
    pagination_class = PageNumberPagination
    facet_serializer_class = ProfileFacetSerializer
    filter_backends = [HaystackFilter]
    facet_filter_backends = [HaystackFilter, HaystackFacetFilter]

    def get_queryset(self, index_models=None):
        if not index_models:
            index_models = []
        queryset = super(ProfileSearchView, self).get_queryset(index_models)
        queryset = queryset.order_by('-created_at')
        return queryset
Here I have changed the default search order to use the 'created_at' value. But for the next request I have to order based on the 'last_login' value. I added a new parameter to my request like this:
http://127.0.0.1:8000/api/v1/search/facets/?page=1&gender=Male&age__gte=19&sort='last_login'
but it gives me an error:
SolrError: Solr responded with an error (HTTP 400): [Reason: undefined field sort]
How can I make this ordering work? Please help me with a solution.
The URL you provided, http://127.0.0.1:8000/api/v1/search/facets/, is not a direct Solr URL; it must be your middleware. Since you have tried the query directly against Solr and it works, the problem must be somewhere in the middleware layer.
Try to print, monitor, or check logs to see what URL the middleware actually generates and compare it to the valid URL you know works. A direct Solr request sorts with a parameter like sort=last_login desc; the "undefined field sort" error suggests your middleware is passing sort through as a search field instead.

How to upload a PDF and update a field within one request in Solr

All:
I am new to Solr and SolrJ. What I want to do right now is upload a PDF file to Solr and set a customized field such as last_modified at the same time.
But I keep encountering an error like "multiple values encountered for non multiValued field last_modified". I use SolrJ to upload the PDF and set the last_modified field like this:
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.setParam("literal.last_modified", "2011-05-19T09:00:00Z");
I guess the error occurs because, when Solr extracts the PDF, it also uses some metadata as the last_modified field value, so my customized last_modified value leads to a multi-value error. But I wonder how I can replace the metadata with my customized data?
Thanks
/update/extract is defined in solrconfig.xml for your core. You can see the configuration there and modify it to match your particular scenario. The Reference Guide lists the options.
In your particular scenario, something looks strange. The parameter that seems to be relevant is literalsOverride, but it is true by default. Perhaps you are setting it to false somewhere.
You can also try explicitly mapping Tika's last-modified field to a different name, e.g. with a parameter like fmap.last_modified=tika_last_modified.
I would enable a catch-all dynamic field (dynamicField *) with stored="true" and see what is being captured. Then you can play with the parameters until you are happy. You don't have to restart Solr; just reload the core from the Admin UI.
I have faced a similar issue, where I needed to fetch one dynamic field value, do some operation, and then update it. I used the code below to achieve this.
First check whether that field already exists. The code below may help you:
Map<String, String> partialUpdate = new HashMap<String, String>();
if(alreadyPresent)
{
partialUpdate.put("set", value);
}else
{
partialUpdate.put("add", value);
}
doc.addField("projectId", projectId); // unique id for solrdoc
doc.addField(keys[0], partialUpdate);
docs.add(doc);
solrServer.add(docs);
solrServer.commit();

AND operator doesn't appear to work

I must be missing something. My search includes the string long+beach, so I expect the results to contain only entries where (in this case) the location field contains both long and beach.
/indexes/person2/docs?$count=true&$top=100&search=long+beach&searchFields=location
plus
api-version=2014-07-31-Preview
The image (which I don't have the reputation to post) shows "miami beach, florida" among other results.
I tried setting &searchMode=any with no change.
The field definition looks like this:
new { Name = "location",
Type = "Edm.String",
Key = false,
Searchable = true,
Filterable = true,
Sortable = true,
Facetable = true,
Retrievable = true,
Suggestions = true },
my bad?
Most likely this is an escaping issue. "+" is treated as a space character in most cases by code that handles URL encoding, so "+" shouldn't show up literally in the URL but encoded as %2B. More generally, you should escape the search string before adding it to the search= parameter. Most languages have a built-in URL-encoding primitive that will do this for you (e.g. Uri.EscapeDataString() in .NET).
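As a minimal sketch of that escaping step in .NET, assuming the REST endpoint shape from the question (the service URL and api-key value are illustrative placeholders):
using System;
using System.Net.Http;
using System.Threading.Tasks;

class EscapedSearch
{
    static async Task Main()
    {
        string query = "long+beach"; // raw search text containing a literal '+'
        // EscapeDataString encodes '+' as %2B so it is not decoded as a space.
        string escaped = Uri.EscapeDataString(query);
        string url = "https://myservice.search.windows.net/indexes/person2/docs" +
                     "?$count=true&$top=100&searchFields=location" +
                     "&search=" + escaped +
                     "&api-version=2014-07-31-Preview";

        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("api-key", "<query-key>"); // placeholder
        Console.WriteLine(await client.GetStringAsync(url));
    }
}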
I was having the same issue, and %2B worked for me as well. I think I was thrown off by this page of the Azure Search documentation, which describes the simple query syntax that can be used to refine the search:
https://msdn.microsoft.com/en-us/library/azure/dn798920.aspx
It seems that the simple query syntax is what you would put in the search input box, not what you would send to the API.
Here's an example from my project:
search=Aberdeen%2Bdistrict (2 results)
search=Aberdeen+district (156 results)
