I have a table with 20 million rows of phone numbers, which is synced to Azure Cognitive Search. I need to search before creating a new row. Let's say I have a record with the number "1234567890"; if a request comes in with "1234567899" or "1234557890", it is only slightly different from the record already in the database. Is there any way to detect this in MSSQL, in Cognitive Search, or with any other Azure tool?
I tried Cognitive Search's fuzzy search, but it is not accurate.
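For reference, the fuzzy query I tried looks roughly like this through the REST API (a minimal sketch; service name, index name, and API key are placeholders):

```python
# Sketch of a fuzzy search against Azure Cognitive Search using the full
# Lucene syntax. Service name, index name, and api-key are placeholders.
import requests

SERVICE = "my-search-service"   # placeholder
INDEX = "phone-numbers"         # placeholder
API_KEY = "<query-api-key>"     # placeholder

url = f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs"
params = {
    "api-version": "2016-09-01",
    "queryType": "full",        # enable the full Lucene query syntax
    "search": "1234567890~2",   # ~2 allows at most two character edits
}
resp = requests.get(url, params=params, headers={"api-key": API_KEY})
for doc in resp.json()["value"]:
    print(doc)
```

Note that the Lucene `~` operator allows at most two character edits, so near-misses beyond that distance will not match at all.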
I'm building an assets search engine.
The data I need indexed for each asset is scattered across multiple tables in the SQL database.
Also, there are many events in the application that trigger updates to the asset's indexed fields (Draft, Rejected, Produced, ...).
I'm considering creating a new denormalized table in the SQL database that would exist solely for the Azure Search index.
It would be an exact copy of the Azure Search index fields.
The application would be responsible for filling and updating the SQL table, through various event handlers.
I could then use a scheduled Azure SQL indexer to automatically import the data into the Azure Search index.
PROS:
We are used to dealing with SQL table operations, so the application code remains standard; there is no need to learn the Azure Search API.
Both the transactional and the search model are updated in the same SQL transaction (atomic). The indexer then updates the index in an eventually consistent manner and handles the retry logic.
Built-in support for change detection with the SQL Integrated Change Tracking policy.
CONS:
Indexed data storage is duplicated in the SQL database.
Delayed index updates (minimum 5 minutes).
Do you see any other pros and cons?
[EDIT 2017-01-26]
There is another big PRO for our usage.
During development, we regularly add/rename/remove fields in the Azure index. In its current state, some schema modifications to an Azure index are not possible; we have to drop and re-create the index.
With a dedicated table containing the data, we simply issue a Reset to our indexer endpoint and the new index gets re-populated automatically.
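For concreteness, a minimal sketch of this wiring through the Azure Search REST API (service name, admin key, connection string, and object names are placeholders): the data source declares the SQL Integrated Change Tracking policy, and the indexer runs on the 5-minute minimum schedule.

```python
# Sketch: data source over the denormalized table with integrated change
# tracking, plus a scheduled indexer. All names and secrets are placeholders.
import requests

SERVICE = "my-search-service"
HEADERS = {"api-key": "<admin-api-key>", "Content-Type": "application/json"}
BASE = f"https://{SERVICE}.search.windows.net"
API = {"api-version": "2016-09-01"}

datasource = {
    "name": "assets-ds",
    "type": "azuresql",
    "credentials": {"connectionString": "<sql-connection-string>"},
    "container": {"name": "dbo.AssetSearch"},  # the denormalized table
    "dataChangeDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.SqlIntegratedChangeTrackingPolicy"
    },
}
requests.post(f"{BASE}/datasources", params=API, headers=HEADERS, json=datasource)

indexer = {
    "name": "assets-indexer",
    "dataSourceName": "assets-ds",
    "targetIndexName": "assets-index",
    "schedule": {"interval": "PT5M"},  # 5 minutes is the minimum interval
}
requests.post(f"{BASE}/indexers", params=API, headers=HEADERS, json=indexer)
```

The Reset mentioned above is then just a POST to /indexers/assets-indexer/reset.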
I am creating a log crawler to combine logs from all our Azure applications. Some of them store logs in SLAB format and some simply use the Azure Diagnostics tracer. Since version 2.6, the Azure Diagnostics tracer actually creates two timestamp columns in the Azure table "WADLogsTable". Microsoft's explanation for this behavior is the following:
"https://azure.microsoft.com/en-us/documentation/articles/vs-azure-tools-diagnostics-for-cloud-services-and-virtual-machines/"
TIMESTAMP is PreciseTimeStamp rounded down to the upload frequency boundary. So, if your upload frequency is 5 minutes and the event time 00:17:12, TIMESTAMP will be 00:15:00.
Timestamp is the timestamp at which the entity was created in the Azure table.
Sadly, Azure Search currently only supports case-insensitive column mapping, so when I create a simple datasource, index, and indexer, I get an exception about multiple columns with the same name (Timestamp) existing in the datasource.
I tried not using Timestamp and using PreciseTimeStamp instead, but then I get a different exception:
"Specified cast is not valid.Couldn't store <8/18/2016 12:10:00 AM> in Timestamp Column. Expected type is DateTimeOffset."
I assume this is because the current Azure Table datasource insists on tracking the Timestamp column behind the scenes for change detection.
The behavior is the same if I programmatically create all the objects, or use the "Import Data" functionality on the portal.
Does anyone have any other strategy or approach to overcome this issue?
We are happily indexing our SLAB tables, by the way; it's just WAD that is failing now.
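For context, a minimal sketch of the programmatic setup described above (service name, keys, and index name are placeholders):

```python
# Sketch of the Azure Table datasource and indexer setup that reproduces the
# duplicate-Timestamp error. All names and secrets are placeholders.
import requests

SERVICE = "my-search-service"
HEADERS = {"api-key": "<admin-api-key>", "Content-Type": "application/json"}
BASE = f"https://{SERVICE}.search.windows.net"
API = {"api-version": "2016-09-01"}

datasource = {
    "name": "wad-logs-ds",
    "type": "azuretable",
    "credentials": {"connectionString": "<storage-connection-string>"},
    "container": {"name": "WADLogsTable"},
}
requests.post(f"{BASE}/datasources", params=API, headers=HEADERS, json=datasource)

indexer = {
    "name": "wad-logs-indexer",
    "dataSourceName": "wad-logs-ds",
    "targetIndexName": "wad-logs-index",
    # Mapping PreciseTimeStamp to a differently named index field was the
    # attempt that produced the cast exception quoted above.
    "fieldMappings": [
        {"sourceFieldName": "PreciseTimeStamp", "targetFieldName": "eventTime"}
    ],
}
requests.post(f"{BASE}/indexers", params=API, headers=HEADERS, json=indexer)
```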
Azure's documentation suggests that we leverage blobs to index documents like MS Word, PDF, etc. We have an Azure SQL database with thousands of documents stored in a table's nvarchar(MAX) field. The contents of each record are plain English text; the application converted the PDF / MS Word files to plain text and stored them in the database.
My question is: would it be possible to index the "documents" stored in the database the same way Azure does against blobs? I know how to create an Azure SQL indexer, but I'd like to make sure that the search performed underneath against blobs behaves the same for documents stored in a database table.
Thanks in advance!
This is not currently possible - document extraction can only be done on blobs stored in Azure storage.
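That said, since your application has already extracted the text, a regular Azure SQL indexer over the nvarchar(MAX) column gives you the same query-time search behavior; only the extraction ("document cracking") step is blob-specific. A minimal sketch of the target index (index and field names are assumptions; service name and key are placeholders):

```python
# Sketch of an index whose "content" field holds the already-extracted text.
import requests

SERVICE = "my-search-service"
HEADERS = {"api-key": "<admin-api-key>", "Content-Type": "application/json"}
BASE = f"https://{SERVICE}.search.windows.net"
API = {"api-version": "2016-09-01"}

index = {
    "name": "documents-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        # The nvarchar(MAX) column maps to a searchable string field; it is
        # analyzed (tokenized) the same way as text extracted from a blob.
        {"name": "content", "type": "Edm.String", "searchable": True},
    ],
}
requests.post(f"{BASE}/indexes", params=API, headers=HEADERS, json=index)
```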
I need to build a chat application. I also need to store all chat transcripts in some storage: a SQL database, Table storage, or another Azure mechanism.
It should store about 50 MB of characters per day. Each chunk of text should be attached to a specific customer.
My question:
What is the best way to store this amount of text in Azure?
Thanks.
I would store them in Azure Tables, using conversationId as the PartitionKey and messageId as the RowKey. That way, you can easily aggregate your statistics based on those two and quickly retrieve conversations.
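A minimal sketch of that layout using the current azure-data-tables Python package (connection string, table name, and IDs are placeholders):

```python
# Sketch: one entity per chat message, partitioned by conversation.
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists("transcripts")

table.upsert_entity({
    "PartitionKey": "conversation-42",   # conversationId groups a transcript
    "RowKey": "000017",                  # messageId orders messages
    "customerId": "customer-7",
    "text": "Hello, how can I help you?",
})

# Retrieving a whole conversation is a single-partition query, which is fast.
messages = table.query_entities("PartitionKey eq 'conversation-42'")
```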
We have a repository of tables: around 200 tables, each of which can contain thousands of rows. All tables originally come from Excel sheets.
Each table has a different schema. All data is text or numbers.
We would like to create an application that allows efficient free-text search over all tables (we define which columns are searched in each table); speed is important.
The main dilemma is which DB technology we should choose.
We created a mock-up by importing all the tables into MS SQL Server and creating a full-text index over them. The search is done using the CONTAINS keyword. This solution works well for a small number of tables, but it doesn't scale.
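For reference, the mock-up's search path boils down to a query like this (a sketch via pyodbc; server, table, and column names are assumptions):

```python
# Sketch: a CONTAINS query over a full-text indexed column via pyodbc.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=tables_repo;Trusted_Connection=yes"
)
cursor = conn.cursor()

# CONTAINS requires a full-text index on the searched column(s).
cursor.execute(
    "SELECT TOP 50 * FROM dbo.Table042 WHERE CONTAINS(Description, ?)",
    '"pump" OR "valve"',
)
for row in cursor.fetchall():
    print(row)
```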
We thought about a NoSQL solution, but we don't yet have any experience in it.
Our limitation (which, unfortunately, I cannot affect): Windows servers only. But we can install whatever we want on them.
Thank you.
Check out Elasticsearch! It's a search server based on Apache Lucene with a clean REST and JSON-based API. Although it's usually used as a search index on top of a primary database, it can also be used stand-alone. So you may want to write a routine that copies a few of your tables' data into it and try it out.
http://www.elasticsearch.org/
http://en.wikipedia.org/wiki/ElasticSearch
A comparison of Elasticsearch and Apache Solr (another Lucene-based search server):
https://docs.google.com/present/view?id=dc6zhtt5_1frfxwfff&pli=1
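A quick stand-alone trial could look like this, using the official Python client against a local node (index and field names are assumptions):

```python
# Sketch: index one row from a table and run a free-text match query.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One Elasticsearch index per source table keeps the differing schemas separate.
es.index(index="table-042", document={"part": "ball valve", "diameter_mm": 25})

result = es.search(index="table-042", query={"match": {"part": "valve"}})
for hit in result["hits"]["hits"]:
    print(hit["_source"])
```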