How to use Azure Cognitive Search to connect Meta and Raw data? - azure-cognitive-search

We want to store our measurement data in an Azure Datalake. The dataset consists of the raw data and the metadata. These two datasets are in different files. For the search we want to use Cognitive Search.
How do you link the metadata with the raw data in Cognitive Search, so that the search results (which are based on the metadata) also link to the associated raw data?
Thanks.
Many greetings
Michael

Use the same index for both datasets. If your raw data contains dataID, colA, colB, and colC, and your metadata contains dataID, colD, and colE, you can create an index that encompasses both: dataID, colA, colB, colC, colD, and colE.
To get the data into your index, one option is an indexer. You can easily set one up in the Azure portal by going to your Cognitive Search resource and clicking the "Import Data" button. Specify "ADLS Gen 2" (Azure Datalake storage) as the data source and the index you've created as the index to pull into.
You can create two different indexers, one for your raw data and one for your metadata, that map the various fields correctly, but both can write to the same index. (An indexer always contains a data source from where it pulls data and an index to which it pushes data, so with two indexers, you could have two separate data sources for raw data vs. metadata and the same index for both.) Finally, querying that one index will give you joint search results containing both the metadata and raw data.
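To make the "two indexers, one index" idea concrete, here is a minimal sketch in plain Python. It does not call Azure at all; it only simulates merging records from two sources into one document per key, assuming both indexers map dataID to the index's document key. The function name merge_into_index and the sample values are hypothetical.

```python
from typing import Any, Dict, Iterable, Mapping


def merge_into_index(
    index: Dict[str, Dict[str, Any]],
    records: Iterable[Mapping[str, Any]],
    key: str = "dataID",
) -> Dict[str, Dict[str, Any]]:
    """Merge each record into the document that shares its key.

    Hypothetical stand-in for an indexer run: fields from the new record
    are added to (or overwrite) the existing document with the same key.
    """
    for record in records:
        doc = index.setdefault(record[key], {})
        doc.update(record)
    return index


# Illustrative data only: one raw-data row and one metadata row.
raw_rows = [{"dataID": "m-001", "colA": 1.5, "colB": 2.5, "colC": 3.5}]
meta_rows = [{"dataID": "m-001", "colD": "sensor-7", "colE": "2021-03-01"}]

index: Dict[str, Dict[str, Any]] = {}
merge_into_index(index, raw_rows)   # "indexer" for the raw data
merge_into_index(index, meta_rows)  # "indexer" for the metadata

print(index["m-001"])
```

The point of the sketch is that the shared key is what joins the two sources; if the raw data and metadata files do not carry the same dataID, the documents would stay separate.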
Some links that might be helpful:
How to create indexers
Create index - REST reference
Create indexer - REST reference

Related

Apache Solr Querying by search term from multiple tables and in all columns

I am new to Apache Solr and have so far worked with a single table, importing it into Solr and querying it.
Now I want to do the following:
Query across multiple tables: if I search for a word, it should return all occurrences in multiple tables.
Search in all fields of a table: I also want to query by a word across all fields of a single table.
Do I need to create single document by importing data from multiple tables using joins in data-config.xml? And then querying over it?
Any leads and guidance is welcome.
TIA.
Do I need to create single document by importing data from multiple tables using joins in data-config.xml? And then querying over it?
Yes. Solr uses a document model (rather than a relational model) and the general approach is to index a single document with the fields that you need for searching.
From the Apache Solr guide:
Solr’s basic unit of information is a document, which is a set of data
that describes something. A recipe document would contain the
ingredients, the instructions, the preparation time, the cooking time,
the tools needed, and so on. A document about a person, for example,
might contain the person’s name, biography, favorite color, and shoe
size. A document about a book could contain the title, author, year of
publication, number of pages, and so on.
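As a rough illustration of the document model, the sketch below flattens rows from two hypothetical tables (customers and orders) into one document per customer and adds a catch-all field so a single term can match any column. It is plain Python and does not talk to Solr; in Solr itself a catch-all field is typically populated with copyField. All table, field, and function names here are assumptions made for the example.

```python
from typing import Any, Dict, List


def build_documents(
    customers: List[Dict[str, Any]], orders: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Denormalize two tables into one document per customer."""
    docs = []
    for customer in customers:
        doc = dict(customer)
        # Pull the related rows from the second table into the same document.
        doc["order_items"] = [
            o["item"] for o in orders if o["customer_id"] == customer["id"]
        ]
        # Catch-all field: every searchable value concatenated into one string.
        parts = [str(v) for k, v in customer.items() if k != "id"]
        doc["text_all"] = " ".join(parts + doc["order_items"])
        docs.append(doc)
    return docs


def search(docs: List[Dict[str, Any]], word: str) -> List[Any]:
    """Return the ids of documents whose catch-all field contains the word."""
    needle = word.lower()
    return [d["id"] for d in docs if needle in d["text_all"].lower().split()]


customers = [
    {"id": 1, "name": "Alice", "city": "Berlin"},
    {"id": 2, "name": "Bob", "city": "Paris"},
]
orders = [
    {"customer_id": 1, "item": "laptop"},
    {"customer_id": 2, "item": "berlin guidebook"},
]

docs = build_documents(customers, orders)
print(search(docs, "berlin"))
print(search(docs, "laptop"))
```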

How to cope with case-sensitive column names in big data file formats and external tables?

Background
I'm using Azure data factory v2 to load data from on-prem databases (for example SQL Server) to Azure data lake gen2. Since I'm going to load thousands of tables, I've created a dynamic ADF pipeline that loads the data as-is in the source based on parameters for schema, table name, modified date (for identifying increments) and so on. This obviously means I can't specify any type of schema or mapping manually in ADF. This is fine since I want the data lake to hold a persistent copy of the source data in the same structure. The data is loaded into ORC files.
Based on these ORC files I want to create external tables in Snowflake with virtual columns. I have already created normal tables in Snowflake with the same column names and data types as in the source tables, which I'm going to use in a later stage. I want to use the information schema for these tables to dynamically create the DDL statement for the external tables.
The issue
Since column names are always UPPER case in Snowflake, and it's case-sensitive in many ways, Snowflake is unable to parse the ORC file with the dynamically generated DDL statement as the definition of the virtual columns no longer corresponds to the source column name casing. For example it will generate one virtual column as -> ID NUMBER AS(value:ID::NUMBER)
This will return NULL as the column is named "Id" with a lower case D in the source database, and therefore also in the ORC file in the data lake.
This feels like a major drawback with Snowflake. Is there any reasonable way around this issue? The only options I can think of are to:
1. Load the information schema from the source database to Snowflake separately and use that data to build a correct virtual column definition with correct cased column names.
2. Load the records in their entirety into some variant column in Snowflake, converted to UPPER or LOWER.
Both options add a lot of complexity or even mess up the data. Is there any straightforward way to return only the column names from an ORC file? Ultimately I would need to be able to use something like Snowflake's DESCRIBE TABLE on the file in the data lake.
Unless you set the parameter QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE you can declare your column in the casing you want:
CREATE TABLE "MyTable" ("Id" NUMBER);
If your dynamic SQL carefully uses "Id" and not just Id you will be fine.
Found an even better way to achieve this, so I'm answering my own question.
With the query below, we can get the path/column names directly from the ORC file(s) in the stage, with a hint of the data type from the source. It filters out columns that contain only NULL values. We will most likely create some type of data type ranking table for the final data type determination for the virtual columns we're aiming to define dynamically for the external tables.
SELECT f.path as "ColumnName"
, TYPEOF(f.value) as "DataType"
, COUNT(1) as NbrOfRecords
FROM (
SELECT $1 as "value" FROM @<db>.<schema>.<stg>/<directory>/ (FILE_FORMAT => '<fileformat>')
),
lateral flatten(value, recursive=>true) f
WHERE TYPEOF(f.value) != 'NULL_VALUE'
GROUP BY f.path, TYPEOF(f.value)
ORDER BY 1
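Once the real column names are known, the virtual column definitions can be generated with the source casing preserved in the path expression. The following is only a sketch of that string-building step: the helper names virtual_column and build_column_list are hypothetical, the input list is made up, and it assumes the source names are valid unquoted identifiers once upper-cased. It produces column definitions only, not a complete external table statement.

```python
from typing import Iterable, Tuple


def virtual_column(source_name: str, data_type: str) -> str:
    """Build one virtual column definition for an external table.

    The target column is left unquoted (so it resolves to upper case),
    while the path element is double-quoted to keep the source casing.
    """
    escaped = source_name.replace('"', '""')
    return f'{source_name.upper()} {data_type} AS (value:"{escaped}"::{data_type})'


def build_column_list(columns: Iterable[Tuple[str, str]]) -> str:
    """Join the definitions for all (source_name, data_type) pairs."""
    return ",\n".join(virtual_column(name, dtype) for name, dtype in columns)


# Illustrative input, e.g. taken from the result of the query above.
discovered = [("Id", "NUMBER"), ("ModifiedDate", "TIMESTAMP_NTZ")]
print(build_column_list(discovered))
```

For the "Id" example from the question, this yields a definition that reads the element as "Id" rather than ID.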

NetSuite - UNION ALL equivalent in saved search?

I'm in the process of writing a SuiteTalk integration, and I've hit an interesting data transformation issue. In the target system, we have a sort of notes table which has a category column and then the notes column. Data going into that table from NetSuite could be several different fields on a single entity in NetSuite terms, but several records of different categories in our terms.
If you take the example of a Sales Order, you might have two text fields that we need to bring across as notes. For each of those fields I need to create a row, with both the notes field in the same column but separate rows. This would allow me to add a dynamic column that give the category for each of those fields.
So instead of
SO number    notes 1       notes 2
SO1234567    some text1    some text2
You’d get
SO Number    Category      Text
SO1234567    category 1    some text1
SO1234567    category 2    some text2
The two problems I’m really trying to solve here are:
Where can I store the category name? It can’t be the field name in NetSuite. It needs to be configurable per customer as the number of notes fields in each record type might vary across implementations. This is currently my main blocker.
Performance – I could create a saved search for each type of note, and bring one row across each time, but that’s not really an acceptable performance hit if I can do it all in one call.
I use Saved Searches in NetSuite to provide a configurable way of filtering the data to import into the target system.
If I were writing a SQL query, I would use the UNION clause, with the first column being a dynamic column denoting the category and the second column being the actual data field from NetSuite. My ideal would be if I could somehow do a similar thing either as a single saved search, or as one saved search per entity, without having to create any additional fields within NetSuite itself, so that from the SuiteTalk side I can just query the search and pull in the data.
As a temporary kludge, I now have multiple saved searches in NetSuite, one per category, and within the ID of the saved search I expect the category name and an indicator of the record type. I then have a parent search which gives me the searches for that record type - it's very clunky, and ultimately results in far too many round trips for me to be satisfied.
Any idea if something like this is at all possible?? Or if not, is there a way of solving this without hard-coding the category values in the front end? Even if I can bring back multiple recordsets in one call, that would be a performance enhancement.
I've asked the same question on the NetSuite forums but to no avail.
Thanks
At first read it sounds like you are trying to query a set of fields from entities. The fields may be custom fields or built in fields. Can you not just query the entities where your saved search has all the potential category columns and then transform the received data into categories?
Otherwise, please provide more specifics in NetSuite terms about what you are trying to do.
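The client-side transformation suggested above could look roughly like this. It is a sketch in plain Python, not SuiteTalk code: it assumes one search row arrives as a dictionary, and that the field-to-category mapping lives in per-customer configuration on the integration side. The function name unpivot_notes and the mapping are hypothetical.

```python
from typing import Any, Dict, List


def unpivot_notes(
    row: Dict[str, Any],
    category_by_field: Dict[str, str],
    key_field: str = "SO Number",
) -> List[Dict[str, Any]]:
    """Turn one wide row into one output row per configured notes field."""
    result = []
    for field, category in category_by_field.items():
        text = row.get(field)
        if text:  # skip notes fields that are empty or missing on this record
            result.append(
                {key_field: row[key_field], "Category": category, "Text": text}
            )
    return result


# Hypothetical per-customer configuration: search column -> category name.
category_by_field = {"notes 1": "category 1", "notes 2": "category 2"}

row = {"SO Number": "SO1234567", "notes 1": "some text1", "notes 2": "some text2"}
for note in unpivot_notes(row, category_by_field):
    print(note)
```

With this approach a single saved search per record type returns all notes columns in one call, and the category names never need to exist inside NetSuite.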

Fetch data from 20 related tables (through id), combine them to a json File and leverage spring batch for this

I have a Person database in SQL Server with about 20 tables such as address, license, and relatives. All the tables have an id parameter that is unique per person. There are millions of records in these tables. I need to combine these records for each person using their common id parameter, and convert them to a JSON file with some column name changes. This JSON file then gets pushed to Kafka through a producer. If I can get an example with the Kafka producer as item writer, fine, but the real problem is understanding the strategy and specifics of how to use the Spring Batch item reader, processor, and item writer to create the composite JSON file. This is my first Spring Batch application, so I am relatively new to this.
I am hoping for the suggestions on the implementation strategy using a composite reader or processor to use person id as the cursor, and query each table using the id for each table , convert the resulting records to json and aggregate it to a composite, relational json file with root table PersonData that feeds to kafka cluster.
Basically I have one data source, same database for the reader. I plan to use Person table to fetch id and other records unique for the person, and use id as the where clause for 19 other tables. convert each resultset from the table to json, and composite the json object at the end and write to kafka.
We had such a requirement in a project and solved it with the following approach.
In a split flow that runs in parallel, we had a step for every table that loaded the data of the table into a file, sorted by the common id (this is optional, but testing is easier if you have the data in files).
Then we implemented our own "MergeReader".
This MergeReader had FlatFileItemReaders for every file/table (let's call them dataReaders). All these FlatFileItemReaders were wrapped with a SingleItemPeekableItemReader.
The logic for the read method of the MergeReader is as follows:
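public MyContainerPerId read() throws Exception {
    // peek through all "dataReaders" to find the lowest current key;
    // assumption: searchLowestKey() returns null once every reader is exhausted
    Integer lowestId = searchLowestKey();
    if (lowestId == null) {
        return null; // tells Spring Batch that there is no more input
    }
    // you need a container to store the items that belong together
    MyContainerPerId container = new MyContainerPerId();
    // MyItem is a placeholder for whatever type the dataReaders return
    for (PeekableItemReader<MyItem> dataReader : dataReaders) {
        // I assume that more than one entry in a table can belong to
        // the same person id; peek() returns null at the end of a file
        while (dataReader.peek() != null
                && lowestId.equals(dataReader.peek().getId())) {
            container.add(dataReader.read());
        }
    }
    // the container contains all entries from all tables
    // belonging to the same person id
    return container;
}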
If you need restart capability, you have to implement ItemStream in a way that keeps track of the current read position for every dataReader.
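The merge logic is independent of Spring Batch, so it can be tried out in isolation. The sketch below re-creates it in plain Python under the same assumption that every source is sorted by id; PeekableReader and read_group are hypothetical names, and lists stand in for the per-table files.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


class PeekableReader:
    """Wraps a sorted iterable and allows looking at the next item."""

    def __init__(self, items: Iterable[Dict[str, Any]]) -> None:
        self._it = iter(items)
        self._next = next(self._it, None)

    def peek(self) -> Optional[Dict[str, Any]]:
        return self._next

    def read(self) -> Optional[Dict[str, Any]]:
        item = self._next
        self._next = next(self._it, None)
        return item


def read_group(
    readers: List[PeekableReader],
) -> Optional[Tuple[int, List[Dict[str, Any]]]]:
    """Collect all items that share the lowest id currently at any reader."""
    heads = [r.peek() for r in readers if r.peek() is not None]
    if not heads:
        return None  # every reader is exhausted
    lowest = min(h["id"] for h in heads)
    group = []
    for reader in readers:
        # A table may hold several rows for the same person id.
        while reader.peek() is not None and reader.peek()["id"] == lowest:
            group.append(reader.read())
    return lowest, group


address = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
licenses = [{"id": 1, "type": "car"}, {"id": 1, "type": "boat"}, {"id": 3, "type": "car"}]
readers = [PeekableReader(address), PeekableReader(licenses)]

groups = []
while (group := read_group(readers)) is not None:
    groups.append(group)
print(groups)
```

Because each reader is only ever advanced forward, memory use stays proportional to the size of one person's records rather than to the size of the tables.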
I used the Driving Query Based ItemReaders usage pattern described here to solve this issue.
Reader: just a default implementation of JdbcCursorItemReader with SQL to fetch the unique relational id (e.g. select id from person -)
Processor: uses this long id as the input; a DAO that I implemented with Spring's JdbcTemplate fetches data through queries against each of the tables for a specific id (e.g. select * from license where id=) and maps the results, in list format, to a Person POJO, which is then converted to a JSON object (using Jackson) and then to a string
Writer: either writes the file out with the JSON string or publishes the JSON string to a topic in the Kafka case
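The reader/processor/writer split of the driving query pattern can be sketched outside Spring Batch as well. The example below uses Python's built-in sqlite3 and json modules with an in-memory database as a stand-in for SQL Server; the table contents, the CHILD_TABLES tuple, and the function names are assumptions for illustration, and the "writer" simply collects strings instead of publishing to Kafka.

```python
import json
import sqlite3
from typing import Iterator, List

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE license (id INTEGER, type TEXT);
    CREATE TABLE address (id INTEGER, city TEXT);
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Ben');
    INSERT INTO license VALUES (1, 'car'), (1, 'boat');
    INSERT INTO address VALUES (2, 'Rome');
    """
)

# Hypothetical list of dependent tables; table names are fixed, not user input.
CHILD_TABLES = ("license", "address")


def read_ids(db: sqlite3.Connection) -> Iterator[int]:
    """Reader: the driving query returns only the person ids."""
    for row in db.execute("SELECT id FROM person ORDER BY id"):
        yield row["id"]


def process(db: sqlite3.Connection, person_id: int) -> str:
    """Processor: fetch every table for one id and build a composite JSON string."""
    person = db.execute("SELECT * FROM person WHERE id = ?", (person_id,)).fetchone()
    doc = dict(person)
    for table in CHILD_TABLES:
        rows = db.execute(
            f"SELECT * FROM {table} WHERE id = ? ORDER BY rowid", (person_id,)
        ).fetchall()
        doc[table] = [{k: r[k] for k in r.keys() if k != "id"} for r in rows]
    return json.dumps(doc)


# Writer: here the messages are only collected in a list.
messages: List[str] = [process(conn, pid) for pid in read_ids(conn)]
for message in messages:
    print(message)
```

The trade-off of this pattern is one query per table per person, which is simple to reason about but issues many small queries when there are millions of ids.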
We went through a similar exercise migrating 100 million+ rows from multiple tables as JSON so that we could post them to a message bus.
The idea is to create a view, de-normalize the data, and read from that view using JdbcPagingItemReader. Reading from one source has less overhead.
When you de-normalize the data, make sure you do not get multiple rows per master table record.
Example - SQL server -
create or alter view viewName as
select master.col1 , master.col2,
(select dep1.col1,
dep1.col2
from dependent1 dep1
where dep1.col3 = master.col3 for json path
) as dep1
from master master;
The above gives you the dependent table data as a JSON string, with one row per master table record. Once you retrieve the data, you can use GSON or Jackson to convert it to a POJO.
We tried to avoid JdbcCursorItemReader: it reads through a single open database cursor rather than in pages, and depending on the JDBC driver and fetch size the whole result set may end up in memory.

How to design a database schema for a search engine?

I'm writing a small search engine in C with curl, libxml2, and mysql. The basic plan is to grab pages with curl, parse them with libxml2, then iterate over the DOM and find all the links. Then traverse each of those, and repeat, all while updating a SQL database that maintains the relationship between URLs.
My question is: how can I best represent the relationship between URLs?
Why not have a table of base URLs (e.g. www.google.com/) and a table of connections, with these example columns:
starting page id (from url table)
ending page id (from url table)
the trailing directory of the urls as strings in two more columns
This will allow you to join on certain urls and pick out information you want.
Your solution seems like it would be better suited to a non relational datastore, such as a column store.
Most search engine indices aren't stored in relational databases, but in memory, so as to minimize retrieval time.
Add two fields to table - 'id' and 'parent_id'.
id - unique identifier for URL
parent_id - link between URLs
If you want to have a single entry for each URL then you should create another table that maps the relationships.
You then look up the URL in the URL table to see if it exists. If not, create it.
The relationship table would have
SourceUrlId,
UrlId
Where the SourceUrlId is the page and the UrlId is the url it points to. That way you can have multiple relationships for the same URL and you won't need to have a new entry in the Url table for every link to that url. Will also mean only 1 copy of any other info you are storing.
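A minimal sketch of the two-table design follows. It uses Python's built-in sqlite3 module as a stand-in for the C/MySQL setup in the question; the table and column names (url, link, source_url_id, url_id) and the helper functions are assumptions chosen to mirror the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE url (
        id      INTEGER PRIMARY KEY,
        address TEXT NOT NULL UNIQUE
    );
    CREATE TABLE link (
        source_url_id INTEGER NOT NULL REFERENCES url(id),
        url_id        INTEGER NOT NULL REFERENCES url(id),
        PRIMARY KEY (source_url_id, url_id)
    );
    """
)


def get_or_create_url(db: sqlite3.Connection, address: str) -> int:
    """Look the URL up; insert it first if it is not stored yet."""
    db.execute("INSERT OR IGNORE INTO url (address) VALUES (?)", (address,))
    row = db.execute("SELECT id FROM url WHERE address = ?", (address,)).fetchone()
    return row[0]


def add_link(db: sqlite3.Connection, source: str, target: str) -> None:
    """Record that the page at `source` links to `target`."""
    db.execute(
        "INSERT OR IGNORE INTO link (source_url_id, url_id) VALUES (?, ?)",
        (get_or_create_url(db, source), get_or_create_url(db, target)),
    )


add_link(conn, "http://a.example/", "http://c.example/")
add_link(conn, "http://b.example/", "http://c.example/")
add_link(conn, "http://a.example/", "http://c.example/")  # duplicate, ignored

url_count = conn.execute("SELECT COUNT(*) FROM url").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
inbound = conn.execute(
    "SELECT COUNT(*) FROM link JOIN url ON url.id = link.url_id "
    "WHERE url.address = ?",
    ("http://c.example/",),
).fetchone()[0]
print(url_count, link_count, inbound)
```

Each URL is stored once, and counting rows in the relationship table per target gives the number of inbound links, which is the kind of figure a ranking step would start from.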
Why are you interested in representing pages graph?
If you want to compute the ranking, then it's better to have a more succinct and efficient representation (e.g., a matrix form if you want to compute something similar to PageRank).
