Treating a categorical variable with many levels while building a model [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I keep running into the problem of whether to include a categorical variable that seems to have a real impact on the prediction.
I want to know whether we should include a categorical variable with around 43 levels when building a model.
I want to build a model for a binary classification problem. I have already tried LabelEncoder, OneHotEncoder, etc. from scikit-learn.
But nothing has worked, and I don't know how to handle this categorical feature.

We can use categorical variables in predictions. If you have around 43 levels, as you mentioned, you can merge similar levels into a single category. Which levels to merge can be a business decision, or you can look at how each level of the variable relates to the output variable and merge levels that behave alike. This brings the number of levels down from 43 to something smaller. Then create the dummy variables on the merged categories.
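The merging step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a recommended pipeline: the function names, the toy data, and the cut points 0.25 and 0.75 are all hypothetical, and the positive rates should be computed on training data only so the target does not leak into the features.

```python
from collections import defaultdict

def positive_rate_by_level(levels, targets):
    """Return {level: share of rows whose binary target equals 1}."""
    counts = defaultdict(lambda: [0, 0])  # level -> [row count, positive count]
    for level, target in zip(levels, targets):
        counts[level][0] += 1
        counts[level][1] += target
    return {level: pos / n for level, (n, pos) in counts.items()}

def group_levels(rates, cut_points):
    """Map each level to a group label according to the bin its rate falls in."""
    mapping = {}
    for level, rate in rates.items():
        group = sum(rate >= cut for cut in cut_points)  # index of the bin
        mapping[level] = f"group_{group}"
    return mapping

def one_hot(values):
    """Create one dummy column per distinct value, columns in sorted order."""
    columns = sorted(set(values))
    rows = [[int(value == column) for column in columns] for value in values]
    return columns, rows

# Hypothetical toy data: four levels instead of 43, binary target.
levels = ["A", "A", "B", "B", "C", "C", "D", "D"]
targets = [1, 1, 1, 0, 0, 0, 0, 1]

rates = positive_rate_by_level(levels, targets)
mapping = group_levels(rates, cut_points=[0.25, 0.75])  # assumed thresholds
grouped = [mapping[level] for level in levels]
columns, dummies = one_hot(grouped)
```

Here levels B and D share the same positive rate and end up in the same group, so the model sees three dummy columns instead of four.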

Another way to do this is to use ANOVA (Analysis of Variance) to see how much the categories of that variable differ from one another. If some categories are not significantly different, you can merge them into one category.
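As a rough sketch of the ANOVA idea, the one-way F statistic can be computed by hand. The function name and the toy numbers below are assumptions for illustration. The code returns only the F statistic; turning it into a p-value needs the F distribution, which a statistics library such as SciPy (`scipy.stats.f_oneway`) provides. With a 0/1 target, the group means are simply the positive rates per level.

```python
from collections import defaultdict

def one_way_anova_f(levels, values):
    """Return the one-way ANOVA F statistic for values grouped by level."""
    groups = defaultdict(list)
    for level, value in zip(levels, values):
        groups[level].append(value)

    n = len(values)   # total number of observations
    k = len(groups)   # number of levels
    grand_mean = sum(values) / n

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical data: levels A and B look alike, level C does not.
levels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
values = [1, 2, 3, 2, 3, 4, 8, 9, 10]

f_all = one_way_anova_f(levels, values)         # all three levels
f_a_b = one_way_anova_f(levels[:6], values[:6])  # only A versus B
```

A large F for all levels together and a small F for A versus B would suggest that A and B are candidates for merging, while C should stay separate.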


What is the best way to store a group of coordinates, for quick lookup based on proximity? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I need to store a group of coordinates in a database. I also need to query the database, given a coordinate that may or may not be in the database, and get a list of coordinates within a certain proximity.
What's the best way to go about doing this? I've heard that the world map can be broken into hexagons, with each coordinate assigned to a hexagon, but I don't need to store the entire world map at this point (also what if two points are close to each other but in different hexagons?)
The app is similar to a food delivery app, so accuracy within a couple miles is important.
Depending on the database server you are using, there are built-in geography types.
For instance, SQL Server has the Point type: https://learn.microsoft.com/en-us/sql/t-sql/spatial-geography/point-geography-data-type?view=sql-server-ver15
Point ( Lat, Long, SRID )
You can then use this guide: https://learn.microsoft.com/en-us/sql/relational-databases/spatial/query-spatial-data-for-nearest-neighbor?view=sql-server-ver15 to create the proper indexes and query for the nearest neighbors (however many you require).
MySQL also supports spatial data, as can be seen here: https://dev.mysql.com/doc/refman/8.0/en/spatial-types.html
Most database flavors do, so just search for spatial data support in the database of your choice.
Important: don't try to recreate them yourself in a custom way. The built-in solution is optimized for these queries, and a custom implementation built on other types will almost certainly be slower.
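To make clear what "within a certain proximity" means, here is a small sketch of the distance check such a query performs. It is a brute-force scan with no index, which is exactly the kind of custom solution to avoid in production; it only illustrates the semantics. The function names, the Earth radius constant, and the sample coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, approximate

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def within_radius(center, points, radius_miles):
    """Return the points whose distance from center is at most radius_miles."""
    return [p for p in points if haversine_miles(*center, *p) <= radius_miles]

# Hypothetical stored coordinates and a query point that is not stored itself.
stored = [(40.01, -74.0), (41.0, -74.0)]
nearby = within_radius((40.0, -74.0), stored, radius_miles=2)
```

Because the check uses actual distance rather than grid cells, two points that are close together are found regardless of which cell each would fall into. A spatial index in the database answers the same question without scanning every row.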

How to generate unique ids in C [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I am developing an app that needs to generate an id for each new user. I want to use the smallest number of characters that still allows 100 billion different ids. How should I do that, and how do I avoid giving two users the same id? Should I check whether the id already exists? Should I use a random id generator, or assign ids in order, like 001, 002, and so on?
This depends entirely on what kind of functionality you expect from this id. Do you intend for these ids to correlate with persisted data, such as rows in a database? If so, it might be more prudent to let the database handle the unique id generation for you. Otherwise, using sequential values such as 1, 2, 3, etc. would probably be ideal. Note that in C an unsigned long is only guaranteed to hold values up to 4,294,967,295 (about 4 billion), so for 100 billion ids use a 64-bit type such as unsigned long long or uint64_t.
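The "smallest number of characters" part of the question is plain arithmetic: an alphabet of b symbols gives b^n distinct strings of length n. The sketch below, written in Python for brevity rather than C, computes the minimum length and renders a sequential counter as a fixed-width string. The 62-symbol alphabet (digits plus upper- and lower-case letters) and the function names are assumptions for illustration.

```python
import string

# Assumed alphabet: 10 digits + 26 upper-case + 26 lower-case = 62 symbols.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def min_length(id_count, alphabet_size):
    """Smallest n such that alphabet_size ** n >= id_count."""
    length = 1
    while alphabet_size ** length < id_count:
        length += 1
    return length

def encode(counter, width, alphabet=ALPHABET):
    """Render a non-negative counter as a fixed-width string in the given alphabet."""
    base = len(alphabet)
    if not 0 <= counter < base ** width:
        raise ValueError("counter does not fit in the requested width")
    chars = []
    for _ in range(width):
        counter, digit = divmod(counter, base)
        chars.append(alphabet[digit])
    return "".join(reversed(chars))

TARGET = 100_000_000_000  # 100 billion ids
width = min_length(TARGET, len(ALPHABET))
first_id = encode(0, width)
second_id = encode(1, width)
```

With these assumptions, 7 characters are enough for 62 symbols, 8 for a 36-symbol alphabet, and 11 for digits only. Because each id is derived from a sequential counter, two users cannot receive the same id as long as the counter itself is incremented atomically, for example by the database.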
The question is very broad.

How is sorting done in cassandra? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I am a newbie in Cassandra.
Does Cassandra use a specific sorting algorithm, like bubble sort, binary sort, etc.? If not, how does it sort for an ORDER BY clause?
It doesn't, or at least it shouldn't. In Cassandra you build your data model around your use cases. So if you want to retrieve sorted data, you have to store it sorted: rows within a partition are kept in the order of their clustering columns, and ORDER BY can only follow that order or its reverse. If you want the same data sorted in different ways, you store the same data multiple times, sorted differently. There is a lot more to read about how Cassandra works, and I think every user of Cassandra should read it.
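The "store it sorted" idea can be pictured with a toy model. This is not how Cassandra is implemented; it is a hypothetical in-memory class that keeps each partition ordered at write time, so that reads return rows in order without sorting. All names and sample data are assumptions for illustration.

```python
import bisect
from collections import defaultdict

class ToyTable:
    """Toy model: each partition keeps its rows ordered by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        # The ordering work happens at write time: the row is placed at its
        # sorted position. (Unlike a real table, duplicates are not overwritten.)
        bisect.insort(self._partitions[partition_key], (clustering_key, value))

    def read(self, partition_key, descending=False):
        # Reads just walk the stored order, forwards or backwards.
        rows = self._partitions.get(partition_key, [])
        return list(reversed(rows)) if descending else list(rows)

# The same data stored twice, once per desired sort order.
scores_by_time = ToyTable()
scores_by_points = ToyTable()
for time, points in [(3, 50), (1, 90), (2, 70)]:
    scores_by_time.insert("player1", time, points)
    scores_by_points.insert("player1", points, time)

oldest_first = scores_by_time.read("player1")
highest_first = scores_by_points.read("player1", descending=True)
```

Each table answers exactly one ordering question cheaply, which is why the same data is written to both.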
Links related to your question:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useSimplePrimaryKeyConcept.html
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompoundPrimaryKeyConcept.html
https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlCreateIndex.html
Getting started with Cassandra:
https://academy.datastax.com/courses (the first two courses are a must. You need to register, but they are 100% free)

OWL / Protege - Defining a concept characteristic involving counts [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm working on an ontology and I'm having an issue regarding the best approach for defining some concepts. To make my question easier to express, I'll take an example.
Let's suppose that, while defining the concept of Football, I want to say that it requires 2 teams. I have 2 approaches:
1. Define a hasTeam object property and a Team class, and make Football a subclass of:
   hasTeam exactly 2 Team
2. Define a teamCount data property and make Football a subclass of:
   teamCount value 2
What are the advantages of each, and which might be the better approach when defining an ontology?
The first solution allows you to specify which teams are involved in Football (a football match, I assume), while the second does not. The second is just a restriction over the integer data range, saying that the only admissible value for your property is 2.
I would go for the first solution. The second one basically reduces the data property to a marker: since there is only one possible value, the presence of the property says no more than that the individual belongs to the class, so less information can be modeled.
But it really depends on the rest of your requirements.

Bullet Points and Databases [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I received a list from a customer that uses bullet points and sub-bullet points. What is the best way to store these in a Postgres database? An example would be great.
Thanks!
Structure of it is something similar to this:
Defect1
    possible instance of defect1
    another possible instance of defect1
Defect2
    possible instance of defect2
    another possible instance of defect2...
For indented lists you're basically talking about a tree structure. There are many ways to store hierarchies. For a comparison, see the answer to "Design Relational Database - Use hierarchical datamodels or avoid them?"
Depending on how you want to use the data, i.e., if you're just going to spit it back out as it came in, you may be able to skip the hierarchy aspect in this particular use case and just store each line in sequence with an indentation field. It won't do nearly what can be done with a tree, but it may be all that's needed in your particular case.
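Both options can be sketched by turning the indented text into rows before inserting them. This is a minimal illustration, assuming the list arrives as text indented with spaces; the function name and the column names (id, parent_id, position, depth, text) are hypothetical. The parent_id column gives the tree (an adjacency list, typically a self-referencing foreign key), while position plus depth is the simpler "store each line in sequence with an indentation field" variant.

```python
def parse_outline(text, indent_width=4):
    """Turn an indented outline into rows suitable for a single table."""
    rows = []
    stack = []  # stack[d] holds the id of the most recent row at depth d
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        stripped = line.lstrip(" ")
        depth = (len(line) - len(stripped)) // indent_width
        depth = min(depth, len(stack))  # tolerate a skipped indentation level
        row_id = len(rows) + 1
        parent_id = stack[depth - 1] if depth > 0 else None
        del stack[depth:]       # drop deeper levels that are now closed
        stack.append(row_id)
        rows.append({
            "id": row_id,
            "parent_id": parent_id,  # tree variant
            "position": row_id,      # sequence variant: original line order
            "depth": depth,          # sequence variant: indentation level
            "text": stripped,
        })
    return rows

outline = """\
Defect1
    possible instance of defect1
    another possible instance of defect1
Defect2
    possible instance of defect2
"""
rows = parse_outline(outline)
```

If the list only ever needs to be printed back as it came in, ordering by position and indenting by depth is enough; parent_id matters once you need queries such as "all instances of Defect1".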
