I have a products table in my database, and a table with features of these products. The features table has three columns: id, type and value. Id is a foreign key referencing products.
An example of the data in my tables:
Table Products:
ID | Description
01 | Computer A
02 | Car
03 | Computer B
Table Features:
ID | Type      | Value
01 | Processor | Phenom X3
01 | Memory    | 2GB
01 | HDD       | 500GB
02 | Color     | Blue
02 | Mark      | Ford
03 | Processor | Phenom X3
03 | Memory    | 3GB
I want the best way to index this so that, for example, when someone searches for “computer”, the faceting shows:
Phenom X3(2)
Memory 2GB(1)
Memory 3GB(1)
HDD 500GB(1)
And so on, depending on the query string. If I make a query with the string “processor”, it should list Phenom X3 (1) only if those products (with “processor” in the description) have a feature like Processor: Phenom X3. There are a lot of product types, so we can’t create static columns for all features and pass them to Solr…
I hope my question is clear, thanks in advance!
Use the DataImportHandler to index the data: http://wiki.apache.org/solr/DataImportHandler
You can define the products table as the main entity and features as a sub-entity, so that each product with its features is indexed as a single document.
For indexing:
Define the description field with indexed="true".
As you want to facet on type and value, you can define a new field type_value of type string and concatenate the type and value fields in dataconfig.xml (see the sketch below).
type_value will be a multivalued field.
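A minimal dataconfig.xml sketch of that setup (the connection details and the MySQL-style CONCAT are assumptions for illustration, not part of the original answer):

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/shop" user="user" password="pass"/>
  <document>
    <entity name="product" query="SELECT id, description FROM products">
      <field column="id" name="id"/>
      <field column="description" name="description"/>
      <!-- sub-entity: one row per feature, concatenated into a single multivalued facet value -->
      <entity name="feature"
              query="SELECT CONCAT(type, ' ', value) AS type_value FROM features WHERE id = '${product.id}'">
        <field column="type_value" name="type_value"/>
      </entity>
    </entity>
  </document>
</dataConfig>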
For searching:
Make the product description field searchable, e.g. q=description:computer.
You can configure this in solrconfig.xml with appropriate field boosts.
Facet on the combined field with facet.field=type_value (an example request follows).
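For instance, a facet request for the “computer” case above might look like this (parameter values assume the field names defined above):
q=description:computer
facet=true
facet.field=type_value
facet.mincount=1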
I hope this gives a fair idea.
My dataset is of the form of instances of series data, each with associated metadata. Similar to a CD collection where each CD track has metadata (artist, album, length, etc.) and a series of audio data. Or imagine a road condition survey dataset - each time a survey is conducted the metadata of location, date, time, operator, etc. is recorded, as well as some physical series data of the road condition for each unit length of road. The collection of surveys ({metadata, data} pairs) is the dataset.
I'd like to take advantage of pandas to help import, store, search and analyse that dataset. pandas does not have built-in support for this type of dataset, but many have tried to add it.
The typical solutions are:
Adding metadata to a pandas DataFrame, but this is the wrong way around: I want a collection of metadata records, each with associated data, not data with associated metadata.
Casting the data to a valid field type in a DataFrame and storing it as one of the metadata fields, but the casting process sacrifices a good deal of integrity.
Using multiple indices to create a 3D DataFrame, but this imposes design details on your choice of index, which limits experimentation.
This sort of dataset is very common, and I see a lot of people trying to bend pandas to accommodate it. I wonder what the right approach is, or even if pandas is the right tool.
I now have a working solution, but since I haven't seen this method documented I wonder if there be dragons ahead.
My "database" is a pandas DataFrame that looks something like this:
| | Description | Time | Length | data_uuid |
| 0 | My first record | 2017-03-09 11:00:00 | 502 | f7ee-11e6-b702 |
| 1 | My second record | 2017-03-10 11:00:00 | 551 | f7ee-11e6-a996 |
That is, my metadata are rows of a DataFrame, which gives me all the power of pandas, but my data is given a uuid on importation. The data for each metadata record is actually a separate DataFrame, serialised to a file whose name is the uuid.
That way, an illustrative example of looking up a record and pulling out the data looks like this:
# Show all metadata records whose Length is at least 550
display(df_database[df_database['Length'] >= 550.0])
# Take the first matching record and load its series data from the pickle named by its uuid
idx = df_database[df_database['Length'] >= 550.0].index[0]
df_data = pd.read_pickle(DATA_DIR + str(df_database.at[idx, 'data_uuid']))
display(df_data)
With suitable importation, storage and lookup functions, this seems to give me the power (with associated cumbersomeness!) of pandas without pulling too many restrictive tricks.
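For what it's worth, a minimal sketch of what those importation and lookup helpers could look like (names such as store_record, load_data and DATA_DIR are illustrative, not taken from the original):

import uuid
import pandas as pd

DATA_DIR = './data/'  # directory holding one pickled DataFrame per series (assumption)

def store_record(df_database, metadata, df_data):
    # Serialise the series data under a fresh uuid and append a metadata row
    data_uuid = str(uuid.uuid1())
    df_data.to_pickle(DATA_DIR + data_uuid)
    row = {**metadata, 'data_uuid': data_uuid}
    return pd.concat([df_database, pd.DataFrame([row])], ignore_index=True)

def load_data(df_database, idx):
    # Load the series DataFrame belonging to the metadata row at index idx
    return pd.read_pickle(DATA_DIR + str(df_database.at[idx, 'data_uuid']))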
I am building an eCommerce site and I have about 10 different custom post types. For each custom post type, I have about 20 custom fields (with a single meta_value assigned to each meta_key).
What I was thinking about doing is grouping all 20 custom fields into a single custom field so that I'll have 20 meta_values (array) assigned to a single meta_key. This would essentially reduce the number of rows inside the database table from 20 rows down to a single row for each custom post type.
For example, let's say I have post type "book".
Currently my post_meta table for this post type might look like this:
meta_key   | meta_value
---------- | ----------
chapters   | 5
paragraphs | 10
author     | john_smith
price      | 19
If I convert it to this:
meta_key    | meta_value
----------- | ----------
book_detail | [chapters:5, paragraphs:10, author:john_smith, price:19]
Would doing something like this optimize WordPress query speed or would I just be wasting time?
The tables should be kept as normalized as possible. Storing CSV or any other delimited values is not recommended.
Also, if you need to query a specific aspect of a post, such as only chapters, normalized data comes in handy: you just need a simple predicate:
where meta_key = 'chapters';
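To illustrate the difference, here is a sketch against the standard wp_postmeta table (assuming the combined value would be stored as a PHP-serialized array, which is what update_post_meta does with array values):

-- Normalized: the meta_key index can be used directly
SELECT post_id FROM wp_postmeta
WHERE meta_key = 'chapters' AND meta_value = '5';

-- Serialized into one row: you are reduced to pattern matching inside the blob,
-- which has to inspect every value stored under that key and breaks if the format changes
SELECT post_id FROM wp_postmeta
WHERE meta_key = 'book_detail' AND meta_value LIKE '%"chapters";i:5;%';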
I am building an OLAP cube in MS SQL Server BI Studio. I have two main tables that contain my measures and dimensions.
One table contains
Date | Keywords | Measure1
where date-keyword is the composite key.
The other table looks like
Date | Keyword | Product | Measure2 | Measure3
where date-keyword-product is the composite key.
My problem is that there can be a one-to-many relationship between date-keyword rows in the first table and date-keyword rows in the second table (as the second table has data broken down by product).
I want to be able to make queries that look something like this when filtered for a given Keyword:
                         | Measure1 | Measure2 | Measure3
Tuesday, January 01 2013 |       23 |       19 |       18
Bike                     |       23 |          |
Car                      |       23 |       16 |       13
Motorcycle               |       23 |          |
Caravan                  |       23 |        2 |        4
Van                      |       23 |        1 |        1
I've created dimensions for the Date and ProductType but I'm having problems creating the dimension for the Keywords. I can create a Keyword dimension that affects the measures from the second table but not the first.
Can anyone point me to any good tutorials for doing this sort of thing?
Turns out the first table had one row with all null values (a weird side effect of uploading an Excel file straight into the MS SQL Server database). Because the value that the cube was trying to apply the dimension to was null in this one row, the whole cube build and deploy failed with no useful error messages! Grr
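For anyone hitting the same thing, a quick way to spot and remove such a row (the table name KeywordFact is purely illustrative; the columns follow the first table above):

-- Find rows where every column is NULL (the artifact left by the Excel import)
SELECT * FROM KeywordFact
WHERE [Date] IS NULL AND Keywords IS NULL AND Measure1 IS NULL;

-- Remove them before reprocessing the cube
DELETE FROM KeywordFact
WHERE [Date] IS NULL AND Keywords IS NULL AND Measure1 IS NULL;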
As is well known, Cassandra is great with low-cardinality indexes and not so good with high-cardinality ones. My column family contains a field storing a URL value.
Naturally, searching for this specific value in a big dataset can be slow.
As a solution, I've come up with the idea of taking the first characters of the URL and storing them in a separate column, e.g. test.com/abcd would be stored as the pair (ab, test.com/abcd).
That way, when a search for a specific URL value needs to be done, I can narrow it down by a factor of 26*26 by filtering on "ab" first and only then looking up the exact URL in the resulting set.
Does it look like a working solution to reduce URL cardinality in Cassandra?
If you need this to be really fast, you probably want to consider having a separate table with the value that you are searching for as the column key. Key prefix searches are usually faster than column searches in BigTable implementations.
A problem with that approach is that a sequential scan will still have to follow the low-cardinality index lookup in order to finally arrive at the one specific URL queried.
As Chris Shain mentioned, you can build a separate column family to serve as an inverted index:
Column Family 'people'
ssn | name | url
----- | ------ | ---
1234 | foo | http://example.com/1234
5678 | bar | http://hello.com/world
Column Family 'urls'
url | ssn
------------------------ | ------
http://example.com/1234 | 1234
http://hello.com/world | 5678
The downside is that you need to maintain the integrity of your manual index yourself.
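In CQL terms, a minimal sketch of that inverted index could look like this (table and column names follow the example above):

CREATE TABLE people (
  ssn  text PRIMARY KEY,
  name text,
  url  text
);

CREATE TABLE urls (
  url text PRIMARY KEY,
  ssn text
);

-- Resolve the URL through the inverted index, then fetch the full row
SELECT ssn FROM urls WHERE url = 'http://example.com/1234';
SELECT * FROM people WHERE ssn = '1234';

Both tables have to be written together by the application, which is exactly the integrity burden mentioned above.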
Using Solr 3.3
Key | Store      | Item Name | Description       | Category         | Price
--- | ---------- | --------- | ----------------- | ---------------- | ------
1   | Store Name | Xbox 360  | Nice game machine | Electronic Games | 199.99
2   | Store Name | Xbox 360  | Nice game machine | Electronic Games | 199.99
3   | Store Name | Xbox 360  | Nice game machine | Electronic Games | 249.99
I have data similar to the table above loaded into Solr. Item Name, Description, Category and Price are searchable.
Expected result
Facet Field
Category
Electronic(1)
Games(1)
**Store Name**
XBox 360 Nice game machine priced from 199.99 - 249.99
What query parameters can I send to Solr to receive the results above? Basically I want to group by Store, Item Name, Description and min/max Price.
I also want to keep paging consistent with the main group (Store Name). The paging should be based on the Store Name group, so if 20 stores were found I should be able to page through them correctly.
Please suggest
If using Solr 4.0, the new "Grouping" (which replaces FieldCollapsing) fixes this issue when you add the parameter "group.facet=true".
So to group your fields you would have to add the following parameters to your search request:
group=true // Enables grouping
group.facet=true // Facet counts to be number of groups instead of documents
group.field=Store // Groups results by the field "Store"
group.ngroups=true // Tells Solr to return the number of groups found
The number of groups found is what you would show to the user and use for paging, instead of the normal total count, which would be the total number of documents in the index.
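Putting it together, a request could look something like this (the Store and Category field names come from the example above; adjust them to your schema):
q=xbox                 // Assumes a default search field is configured
group=true
group.field=Store
group.facet=true
group.ngroups=true
facet=true
facet.field=Category
rows=10                // With grouping enabled, rows is the number of groups per page
start=0                // Offset into the list of groups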
Have you looked into field collapsing? It is new in Solr 3.3.
http://wiki.apache.org/solr/FieldCollapsing
What I did was create another field that combined the required fields into a single stored field. Problem solved: now I just group on that field and get the correct count.
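As a sketch, assuming that combined field is called store_item and is populated at index time from Store, Item Name and Description, the request then only needs:
group=true
group.field=store_item
group.ngroups=true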