Data Structure which allows efficient searching of objects - database

I have a very large database of objects (read: an array of key/value pairs, like [{}, {}, {}] in JSON-like notation), and I need to be able to search for any value of any key within that set of pairs and find the object which contains it (I'll be using fuzzy searching or similar string comparison algorithms). One approach I can think of would be to create an enormous master object with, for each value inside an object, a key referencing the original object:
DB = [
  {
    "a": 45,
    "b": "Hello World"
  },
  {
    "a": 32,
    "b": "Testing..."
  }
]
// ... generation code ...
search = {
  45: {the 0th object},
  "Hello World": {the 0th object},
  32: {the 1st object},
  "Testing...": {the 1st object}
}
This solution at least reduces the problem to a large number of comparisons, but are there better approaches? Please note that I have very little formal Computer Science training, so I may be missing some major detail that simplifies this problem or proves it impossible.
P.S. Is this too broad? If so, I'll gladly delete it.

Your combined index is more suitable for a full-text search, but doesn't indicate in which property of an object the value is found. An alternative technique that provides more context is to build an index per property.
This should be faster both in preparation and during lookup for property-specific searches (e.g. a == 32): for n objects and p properties, a binary search (used in both inserts and lookups) requires log(np) comparisons on a combined index but only log(n) on a single-property index.
In either case, you need to watch out for multiple occurrences of the same value. You can store an array of offsets as the value of each index entry, rather than just a single value.
For example:
search = {
  "a": {
    45: [0],
    32: [1]
  },
  "b": {
    "Hello World": [0],
    "Testing...": [1]
  }
}
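A minimal sketch of the generation step, assuming plain JavaScript objects (the buildIndex helper name is mine; the offsets are positions in the DB array):
// Build a per-property index: property -> value -> array of object offsets.
function buildIndex(db) {
  const index = {};
  db.forEach((obj, offset) => {
    for (const [prop, value] of Object.entries(obj)) {
      index[prop] = index[prop] || {};
      index[prop][value] = index[prop][value] || [];
      index[prop][value].push(offset); // duplicate values accumulate offsets here
    }
  });
  return index;
}
search = buildIndex(DB);
// search["a"][32] -> [1], search["b"]["Hello World"] -> [0]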

Related

Azure Data Factory - converting lookup result array

I'm pretty new to Azure Data Factory (ADF) and have stumbled into something I would have solved with a couple of lines of code.
Background
Main flow:
Lookup Activity fetching an array of IDs to process
ForEach Activity looping over the input array and using a Copy Activity to pull data from a REST API and store it in a database
Step #1 results in an array containing IDs:
{
  "count": 10000,
  "value": [
    {
      "id": "799128160"
    },
    {
      "id": "817379102"
    },
    {
      "id": "859061172"
    },
    ... many more...
Step #2: When the lookup returns a lot of IDs, individual REST calls take a lot of time. The REST API supports batching IDs using a comma-separated input.
The question
How can I convert the array from the input into a new array with comma-separated fields? This will reduce the number of Activities and the time to run.
Expecting something like this:
{
  "count": 1000,
  "value": [
    {
      "ids": "799128160,817379102,859061172,...."
    },
    {
      "ids": "n,n,n,n,n,n,n,n,n,n,n,n,...."
    }
    ... many more...
EDIT 1 - 19th Dec 22
Using an Until Activity and keeping track of positions, I managed to do this in plain ADF. It would have been nice to do it with some simple array manipulation in a code snippet.
The ideal approach might be to do the manipulation with a Dataflow.
First, I took a Dataflow and added a Surrogate Key transformation after the source to generate a key - say the new key field is 'SrcKey'.
Next, add an Aggregate transformation where you group by mod(SrcKey, 3). This puts rows whose keys share the same remainder into the same bucket.
In the same Aggregate, add a column that collects the ids into a comma-separated string with the expression trim(toString(collect(id)), '[]').
Finally, store the output in a single file in blob storage.
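Putting the two settings together, the Aggregate transformation looks roughly like this (a sketch; batchKey and ids are illustrative column names):
Group by:    batchKey = mod(SrcKey, 3)
Aggregates:  ids = trim(toString(collect(id)), '[]')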

How to search for a case insensitive substring in an array of objects

In ArangoDB I am playing around with a test collection that is the IMDB dataset downloaded from their site as CSV. A movies document is structured as follows:
movies:
{
  _key: 123456,
  name: "Movie title",
  ...,
  releases: [
    {
      title: "Local title",
      region: "US",
      language: "en",
      ...
    },
    {
      title: "Other title",
      region: "GB",
      language: "??",
      ...
    }
  ]
}
I have created an index on the movies.releases[*].title field.
I am interested in querying that field, not only by equality, but also with case-insensitive and substring matching.
The problem is that the only kind of query that uses the index is something like this:
FOR doc IN movies
  FILTER 'search' IN doc.releases[*].title
  RETURN doc
With this I can only match the whole string in a case sensitive way: how can I look for a substring in a case insensitive way?
I cannot use a full-text index, since ArangoDB does not support it in arrays, and I cannot use LOWER() and CONTAINS() since it is an array.
Any ideas?
Thanks!
It's possible to nest your search, giving you the power to search within the array without the constraints imposed by the '[*]' notation.
Here is an example that searches inside each releases array, looking for a case-insensitive match, and returns the document if it gets any hits.
The FILTER statement there will only return the movie if at least one of the releases has a match.
FOR doc IN movies
  LET matches = (
    FOR release IN doc.releases
      FILTER CONTAINS(LOWER(release.title), LOWER('title'))
      RETURN release
  )
  FILTER LENGTH(matches) > 0
  RETURN doc
It's straightforward to change 'title' there to a bind parameter.
Note:
To put less pressure on the query, the goal of the matches variable is simply to have a LENGTH greater than 0 when there is a release with your keyword in it.
The query above has the line RETURN release, which may return a large amount of data that you will never read; an alternative is to replace that line with RETURN true, as that is all that is needed to make matches a non-empty array with a LENGTH greater than 0.
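Combining both suggestions (RETURN true plus a bind parameter, here @search), the query becomes:
FOR doc IN movies
  LET matches = (
    FOR release IN doc.releases
      FILTER CONTAINS(LOWER(release.title), LOWER(@search))
      RETURN true
  )
  FILTER LENGTH(matches) > 0
  RETURN doc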

Manipulating Output from an Array of Nested Hashes in Ruby

I've been pulling data from an API in JSON, and am currently stumbling over an elementary problem.
The data is on companies, like Google and Facebook, and is in an array of hashes, like so:
[
  {"id"=>"1", "properties"=>{"name"=>"Google", "stock_symbol"=>GOOG, "primary_role"=>"company"}},
  {"id"=>"2", "properties"=>{"name"=>"Facebook", "stock_symbol"=>FB, "primary_role"=>"company"}}
]
Below are the operations I'd like to try:
For each company, print out the name, ID, and the stock symbol (i.e. "Google - 1 - GOOG" and "Facebook - 2 - FB")
Remove "primary role" key/value from Google and Facebook
Assign a new "industry" key/value for Google and Facebook
Any ideas?
I'm a beginner in Ruby, and I'm running into issues with some methods (e.g. undefined method errors) for arrays and hashes, as this looks to be an array OF hashes.
Thank you!
Ruby provides a couple of tools to help us comprehend arrays, hashes, and nested mixtures of both.
Assuming your data looks like this (I've added quotes around GOOG and FB):
data = [
  {"id"=>"1", "properties"=>{"name"=>"Google", "stock_symbol"=>"GOOG", "primary_role"=>"company"}},
  {"id"=>"2", "properties"=>{"name"=>"Facebook", "stock_symbol"=>"FB", "primary_role"=>"company"}}
]
You can iterate over the array using each, e.g.:
data.each do |result|
  puts result["id"]
end
Digging into a hash and printing the result can be done in a couple of ways:
data.each do |result|
  # method 1
  puts result["properties"]["name"]
  # method 2
  puts result.dig("properties", "name")
end
Method #1 uses the hash[key] syntax, and because the first value is itself another hash, the lookups can be chained to get the result you're after. The drawback of this approach is that if one of your results is missing the properties key, you'll get an error.
Method #2 uses dig, which accepts the nested keys as arguments (in order). It'll dig down into the nested hashes and pull out the value, but if any step is missing, it returns nil, which can be a bit safer when you're handling data from an external source.
Removing elements from a hash
Your second question is a little more involved. You've got two options:
Remove the primary_role keys from the nested hashes, or
Create a new object which contains all the data except the primary_role keys.
I'd generally go for the latter, and recommend reading up on immutability and immutable data structures.
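A minimal sketch of option 2, building a new array and leaving data untouched (merge and reject both return new hashes):
cleaned = data.map do |company|
  company.merge(
    "properties" => company["properties"].reject { |key, _| key == "primary_role" }
  )
end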
However, to achieve [1] you can do an in-place delete of the key:
data.each do |company|
  company["properties"].delete("primary_role")
end
Adding elements to a hash
You assign new hash values simply with hash[key] = value, so you can set the industry with something like:
data.each do |company|
  company["properties"]["industry"] = "Advertising/Privacy Invasion"
end
which would leave you with something like:
[
  {
    "id"=>"1",
    "properties"=>{
      "name"=>"Google",
      "stock_symbol"=>"GOOG",
      "industry"=>"Advertising/Privacy Invasion"
    }
  },
  {
    "id"=>"2",
    "properties"=>{
      "name"=>"Facebook",
      "stock_symbol"=>"FB",
      "industry"=>"Advertising/Privacy Invasion"
    }
  }
]
To achieve the first operation, you can iterate through the array of companies and access the relevant information for each company. Here's an example in Ruby:
companies = [
  {"id"=>"1", "properties"=>{"name"=>"Google", "stock_symbol"=>"GOOG", "primary_role"=>"company"}},
  {"id"=>"2", "properties"=>{"name"=>"Facebook", "stock_symbol"=>"FB", "primary_role"=>"company"}}
]
companies.each do |company|
  name = company['properties']['name']
  id = company['id']
  stock_symbol = company['properties']['stock_symbol']
  puts "#{name} - #{id} - #{stock_symbol}"
end
This will print out the name, ID, and stock symbol for each company.
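With the sample data above, that prints:
Google - 1 - GOOG
Facebook - 2 - FB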
To remove the "primary_role" key/value, you can use the delete method on the properties hash. For example:
companies.each do |company|
  company['properties'].delete('primary_role')
end
To add a new "industry" key/value, you can use the []= operator to add a new key/value pair to the properties hash. For example:
companies.each do |company|
  company['properties']['industry'] = 'Technology'
end
This will add a new key/value pair with the key "industry" and the value "Technology" to the properties hash for each company.

postgresql Json path capabilities

In the documentation, some of the PostgreSQL JSON functions take a JSON path argument.
For example, the jsonb_set function:
jsonb_set(target jsonb, path text[], new_value jsonb[, create_missing boolean])
I can't find any specification of this type of argument.
Can it be used, for example, to retrieve an array element based on its attribute's value?
The path is akin to a path on a filesystem: each element drills further down into the tree (in the order you specified). Once you have extracted a particular JSONB value via a path, you can chain other JSONB operations onto it if needed. Using functions/operators with JSONB paths is mostly useful when there are nested JSONB objects, but they handle simple JSONB arrays too.
For example:
SELECT '{"a": 42, "b": {"c": [1, 2, 3]}}'::JSONB #> '{b, c}' -> 1;
...should return 2.
The path {b, c} first gets b's value, which is {"c": [1, 2, 3]}.
Next, it drills down to get c's value, which is [1, 2, 3].
Then the -> operation is chained onto that, fetching the value at the specified index of the array (zero-based, so 0 is the first element, 1 is the second, etc.). If you use ->, it returns a value with a JSONB data type, whereas ->> returns a value with a TEXT data type.
But you could have also written it like this:
SELECT '{"a": 42, "b": {"c": [1, 2, 3]}}'::JSONB #> '{b, c, 1}';
...and simply included both keys and array indexes in the same path.
For arrays, the following two should be equivalent, except the first uses a path, and the second expects an array and gets the value at the specified index:
SELECT '[1, 2, 3]'::JSONB #> '{1}';
SELECT '[1, 2, 3]'::JSONB -> 1;
Notice that a path is always written as a PostgreSQL array literal, where each successive element is the next leaf in the tree you want to drill down to. You supply keys for JSONB objects and indexes for JSONB arrays. If these were file paths, the JSONB keys would be like folders and the array indexes like files.
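Tying this back to jsonb_set from the question, the same path syntax selects the element to replace:
SELECT jsonb_set('{"a": 42, "b": {"c": [1, 2, 3]}}'::JSONB, '{b, c, 1}', '20'::JSONB);
...should return {"a": 42, "b": {"c": [1, 20, 3]}}.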

ordered fixed-length arrays in mongoDB

I'm new to MongoDB and trying to design the way I store my data so I can do the kinds of queries I want to. Say I have a document that looks like
{
  "foo": ["foo1", "foo2", "foo3"],
  "bar": "baz"
}
Where the array "foo" is always of length 3, and the order of the items is meaningful. I would like to be able to make a query that searches for all documents where "foo2" == something. Essentially I want to treat "foo" like any old array and be able to index into it in a search, so something like "foo"[1] == something.
Does MongoDB support this? Would it be more correct to store my data like
{
  "foo": {
    "foo1": "val1",
    "foo2": "val2",
    "foo3": "val3"
  },
  "bar": "baz"
}
instead? Thanks.
The schema you have asked about is fine.
To insert at a specific index of the array:
Use the $position operator.
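For example ($position works inside $push together with $each; the users collection name follows the query below):
db.users.updateOne(
  { "bar": "baz" },
  { $push: { "foo": { $each: ["newFoo"], $position: 1 } } }
)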
To query at a specific index location:
Use the syntax key.index. As in:
db.users.find({"foo.1":"foo2"})
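Array indexes in these queries are zero-based, so "foo.1" matches documents whose second element of foo is "foo2" - like the sample document above.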
