I am following the documentation at https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-fraud-detection-notebook.html to create my SageMaker feature group.
In section 4 there is the step of setting up record identifiers.
In my case a record has 2 identifiers; the 2 identifiers together uniquely identify a record in the Feature Store.
How can I achieve this? Is this supported by SageMaker?
If creating such a group is supported, how does one query it?
Thanks!
Just to close the loop for the community after researching this topic: as #durga_sury mentions, only one record identifier is possible.
If you want 2, the current approach is to combine them into a single composite identifier and store that in the Feature Store.
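For anyone who lands here later, here is a minimal sketch of that workaround using the SageMaker Python SDK and boto3. The column names, feature group name, S3 URI, and IAM role below are hypothetical placeholders, and the underscore-joined key is just one way to build the composite identifier, not an official pattern from the notebook:

```python
import time

import boto3
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

df = pd.DataFrame(
    {
        "customer_id": ["c1", "c2"],      # first natural identifier (placeholder)
        "account_id": ["a9", "a7"],       # second natural identifier (placeholder)
        "balance": [100.0, 250.0],
        "event_time": [time.time()] * 2,
    }
)

# Combine the two natural keys into a single composite record identifier.
df["record_id"] = df["customer_id"] + "_" + df["account_id"]

# The SDK expects pandas "string" dtype rather than "object" for string features.
for col in ["customer_id", "account_id", "record_id"]:
    df[col] = df[col].astype("string")

fg = FeatureGroup(name="my-composite-key-fg", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)
fg.create(
    s3_uri="s3://my-bucket/feature-store",                           # placeholder
    record_identifier_name="record_id",                              # the composite column
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/MyFeatureStoreRole",    # placeholder
    enable_online_store=True,
)
# Wait for the feature group status to become "Created" before ingesting.
fg.ingest(data_frame=df, max_workers=1, wait=True)

# Querying: rebuild the same composite key and look it up in the online store.
runtime = boto3.client("sagemaker-featurestore-runtime")
record = runtime.get_record(
    FeatureGroupName="my-composite-key-fg",
    RecordIdentifierValueAsString="c1_a9",
)
print(record["Record"])
```

The same concatenation has to be applied consistently at ingest time and at query time, otherwise get_record will not find the row.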
Is there a way in Snowflake to describe/catalog a schema so it is easier to search? For example, a column named "s_id" really represents "system identifier". Is there a way to specify that s_id means "system identifier", so a user can search for "system identifier" and get s_id back as the column name?
Just about every object in Snowflake allows a comment to be added. You could leverage the comment field to describe the column's intent, which can then be searched in information_schema.columns. I've seen many Snowflake customers use the comments as a full-blown data dictionary. My only suggestion is to not use any apostrophes in your comments, as they screw up your get_ddl() results.
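A minimal sketch of that approach, using the snowflake-connector-python package (my choice of client here, not something implied by the answer); the connection parameters, table, and column names are placeholders:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    user="MY_USER",
    password="...",
    account="MY_ACCOUNT",
    warehouse="MY_WH",
    database="MY_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Attach a descriptive comment to the cryptic column (placeholder table/column).
cur.execute("COMMENT ON COLUMN orders.s_id IS 'system identifier'")

# Later, search the comments like a lightweight data dictionary.
cur.execute(
    """
    SELECT table_schema, table_name, column_name, comment
    FROM my_db.information_schema.columns
    WHERE comment ILIKE %s
    """,
    ("%system identifier%",),
)
for schema, table, column, comment in cur.fetchall():
    print(f"{schema}.{table}.{column}: {comment}")

cur.close()
conn.close()
```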
Adding comments to tables and fields may serve the purpose, but in a complex enterprise environment with many groups involved it will be hard to sustain this approach for long, unless you have single ownership of a data-modeling tool that promotes such changes to Snowflake. Managing this information via comments is error-prone, which is why Snowflake has partners that help with these kinds of requirements.
If you go to their partner website (link), you will find a lot of companies that have good integration with SF and one of them is Alation (link). You can try it for free for 30 days and see if that serves your purpose of data cataloging or not.
Coming from an RDBMS background and trying to wrap my head around Elasticsearch data storage patterns...
Currently in SQL Server, we have a star schema data mart, RecordData. Rows are organized by user ID, a geographic location that pertains to the rest of the searchable record, and a title and description (which are free-text search fields).
I would like to move this over to ElasticSearch, and have read about creating a separate index per user. If I understand this correctly, with this suggestion, I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for user indices that will be simple for Kibana analysis?
One issue I have with this recommendation is, how would you organize multiple web applications on the ES server? You wouldn't want to have all those user indices all over the place?
Is it so bad to have one index per application, and type per SQL Server table?
Since in SQL Server we have other tables for user configuration, keyed by user IDs, I take it that I could then create new ES types in the user indices for configuration. Is this a recommended pattern? I would rather not have two database systems for this web application.
Suggestions welcome, thank you.
I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data, where the totality of the information resides in each document, unlike with a star schema. If you can live with denormalized data, that is fine, but I assume that since you already have a star schema, denormalization is not an option, because you don't want to go and update millions of documents each time a location name changes, for example (if I understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think about how to put star-schema-like data into a system like Elasticsearch. There are a few options in the documentation; the main ones I focused on were:
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html . With nested objects the entire information is kept in a single document, meaning one location and its related users would live in a single document. That may make it less than optimal because the document will be huge and, again, a change in the location name will require updating the entire document. So this is better, but still not optimal.
Parent - Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html . In this case the location and the user records are kept as separate document types (within the same index), similarly to separate tables in a relational database. This seems to be the right modeling for what we need. The only major issue with this option is the fact that Kibana 4 does not provide ways to manipulate/aggregate documents based on a parent/child relationship as of this writing. So if your main driver for using Elasticsearch is Kibana (this was mine), that kind of eliminates the option. If you want to benefit from Elasticsearch's speed as an engine, this seems to be the desired option for your use case. A rough mapping sketch for both options follows below.
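Here is a sketch of what those two mappings could look like for the pre-5.x Elasticsearch versions this answer is about (mapping types plus the _parent field), using the official Python client. The index, type, and field names are made up for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

# Option 1: nested objects - users are embedded inside each location document.
nested_mapping = {
    "mappings": {
        "location": {
            "properties": {
                "name": {"type": "string"},
                "users": {
                    "type": "nested",  # each user becomes a hidden sub-document
                    "properties": {
                        "user_id": {"type": "string", "index": "not_analyzed"},
                        "title": {"type": "string"},
                        "description": {"type": "string"},
                    },
                },
            }
        }
    }
}
es.indices.create(index="records_nested", body=nested_mapping)

# Option 2: parent/child - locations and records are separate document types
# in the same index, linked via _parent, so renaming a location touches one doc.
parent_child_mapping = {
    "mappings": {
        "location": {"properties": {"name": {"type": "string"}}},
        "record": {
            "_parent": {"type": "location"},
            "properties": {
                "user_id": {"type": "string", "index": "not_analyzed"},
                "title": {"type": "string"},
                "description": {"type": "string"},
            },
        },
    }
}
es.indices.create(index="records_pc", body=parent_child_mapping)

# Indexing a child document requires naming its parent explicitly.
es.index(index="records_pc", doc_type="record", id=1, parent="loc-42",
         body={"user_id": "u1", "title": "quarterly report"})
```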
In my opinion, once you get the data modeling right, all of your questions will be easier to answer.
Regarding the organization of the servers themselves, the way we organize it is by having a separate cluster of 3 Elasticsearch nodes behind a load balancer (all of it hosted in a cloud) and then having all your web applications connect to that cluster using the Elasticsearch API.
Hope that helps.
What are the types of planners that we use in Netezza, and what is their use?
E.g. FACTERAL_PLANNER.
Can anyone explain, or provide a link where I can learn about this?
I just found an unofficial blog post that mentions:
the Fact Relationship (factrel) planner,
the Snowflake planner,
the Star planner.
Here's a link to one document entitled "Determining fact relations with fact relation planner and Snowflake planner", but the IBM website won't let me view it even after logging in with my customer ID. You may have more luck. The title suggests that the factrel planner uses one algorithm (absolute table size) to guess which are your fact tables, while the Snowflake planner uses a different method (size relative to other tables).
I can't find anything relevant about the snowflake planner or the star planner in any of the official documents or training documents I have. Searches for snowflake or star always return things about schemas rather than planners.
I've been looking around for a while but haven't come across any official name (not even an unofficial one) for a document that summarizes all the tables in a database structure, including the field names and a brief description of each field's purpose within its table.
Does such a document actually have an official or commonly used name? I made one to help myself understand a company's current database design while designing the new one, and also made one for the new database to help others understand the new design. If this kind of document has an official name, I could look at a reference example to ensure mine meets the expectations for it and maybe make it more informative/professional.
The document is formally called a Data Dictionary.
A platform-specific example, in the context of IBM System i, is Data Description Specifications (DDS), which allow the developer to describe data attributes in file descriptions that are external to the application program that processes the data.
Solr provides an easy way to search documents based on keywords, but I was wondering if it had the ability to return the keywords themselves?
For example, I may want to search for all documents created by Joe Blogs last week and then get a feel for the contents of those documents from the keywords inside them. Or do I have to work out the keywords myself and save them in a field?
Assuming by keywords you mean the tokens that Solr generates when parsing a particular field, you may want to review the documentation and examples for the Term Vector Component.
Before implementing it though, just check the Analysis screen of the Solr (4+) Admin WebUI, as it has a section that shows the terms/tokens a particular field actually generates.
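If the term vectors turn out to be what you want, here is a small sketch of querying them over HTTP with the requests library (my choice, not something the component requires). It assumes the stock /tvrh request handler from the example solrconfig.xml, that the field was indexed with termVectors="true", and hypothetical field and collection names; adjust for your schema:

```python
import requests

SOLR_URL = "http://localhost:8983/solr/mycollection/tvrh"  # placeholder core name

params = {
    "q": 'author:"Joe Blogs" AND created:[NOW-7DAY TO NOW]',  # hypothetical fields
    "fl": "id",
    "tv.fl": "description",  # field(s) to return term vectors for
    "tv.tf": "true",         # include term frequencies
    "tv.df": "true",         # include document frequencies
    "wt": "json",
}

resp = requests.get(SOLR_URL, params=params)
resp.raise_for_status()
data = resp.json()

# Term vectors come back as a nested named list keyed by document id; the exact
# shape depends on the json.nl setting, so here we just dump the section.
print(data.get("termVectors"))
```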
If these are not quite the keywords that you are trying to produce, you may need to have a separate field that generates those keywords, possibly by using UpdateRequestProcessor in the indexing pipeline.
Finally, if you are trying to get a feel to do some sort of clustering, you may want to look at the Carrot2, which already does this and integrates with Solr.
What you are asking for is known as "topic modeling". Solr does not have out-of-the-box support for this. However, there are other tools that you can integrate to achieve it.
Apache Mahout supports the LDA algorithm, which can be used to model topics. There are several examples of integrating Solr with Mahout; here is one such.
Apache UIMA (Unstructured Information Management Architecture). I won't bother typing it all up; instead, here is a brilliant presentation.