How to get the list of users based on a reputation threshold?

I tried to get a list of users with reputation greater than 5 using a simple Stack Exchange Data Explorer query:
select id
from users
where reputation > 5
I got only 50,000 rows, but I expected millions. Is there a row limit? Is there any way to get them all?

This is a cross-site duplicate of "Why can't I pull in all the SO users from Data Explorer?" on Meta Stack Exchange.
The Data Explorer (SEDE) limits results to 50,000 rows.
Either refine your query or download and use the Data Dump instead. That's what the Data Dump is for.
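One way to refine the query while staying inside SEDE is to page through the table in id order, running the query repeatedly and feeding the last id you received back in. A sketch using SEDE's ##parameter## syntax (start LastId at 0):
select top 50000 id
from users
where reputation > 5
  and id > ##LastId:int##
order by id;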
Alternatively, you can access the Data Dump via Google's BigQuery, which also has an API.
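For example, assuming the public bigquery-public-data.stackoverflow dataset (a sketch; check the current dataset and table names in the BigQuery console), the full result set comes back from a single query with no 50K cap:
SELECT id
FROM `bigquery-public-data.stackoverflow.users`
WHERE reputation > 5;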

Related

How to access snowflake query profile overview statistics via SQL?

In Snowflake SnowSight UI, in the Query Profile view, there is a section called Profile Overview where you can see the breakdown of the total execution time. It contains statistics like Processing, Local Disk I/O, Remote Disk I/O, Synchronization etc.
Full list here
https://docs.snowflake.com/en/user-guide/ui-snowsight-activity.html#profile-overview
I want to access those statistics programmatically instead of having to navigate to that section for each query I want to analyze. The only system view I know of that provides query statistics is QUERY_HISTORY; however, it doesn't contain those stats.
https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html
The question is: can I get those stats from any of the system views? If so, where and how?
It is possible to programmatically access the query profile using GET_QUERY_OPERATOR_STATS.
Returns statistics about individual query operators within a query. You can run this function for any query that was executed in the past 14 days.
For example, you can use this information to determine which operators are consuming the most resources. As another example, you can use this function to identify joins that have more output rows than input rows, which can be a sign of an “exploding” join (e.g. an unintended Cartesian product).
These statistics are also available in the query profile tab in Snowsight. The GET_QUERY_OPERATOR_STATS() function makes the same information available via a programmatic interface.
GET_QUERY_OPERATOR_STATS is a table function; it returns rows with statistics about each query operator in the query:
set query_id = '<query_id>';
select *
from table(get_query_operator_stats($query_id));
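For example, to see which operators dominate the runtime, you can unpack the function's semi-structured output columns. A sketch based on the operator_statistics and execution_time_breakdown columns described in Snowflake's documentation for the function:
select operator_id,
       operator_type,
       operator_statistics:output_rows::number as output_rows,
       execution_time_breakdown:overall_percentage::float as pct_of_total_time
from table(get_query_operator_stats($query_id))
order by pct_of_total_time desc;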
2023 update: GET_QUERY_OPERATOR_STATS()
See Lukasz's answer at https://stackoverflow.com/a/74824120/132438 and the docs: https://docs.snowflake.com/en/sql-reference/functions/get_query_operator_stats.html
Bad news: There's no programmatic way to get this.
Good news: This is a frequent request, so we might eventually have news.
In the internal tracker I left a note to update this answer once there is progress we can report.
You can do it via https://github.com/Snowflake-Labs/sfsnowsightextensions#get-sfqueryprofile. Doing it at scale (scraping-style) will likely yield ~60%-80% success rate. Please don't abuse it.
It was inspired by a clever customer who did that to get what is now offered by https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html
It is completely unsupported, as stated on the repo homepage.
Just FYI, there is an upcoming feature called GET_QUERY_STATS (currently in private preview, https://docs.snowflake.com/en/LIMITEDACCESS/get_query_stats.html) that will do just this and obviate the need for Get-SFQueryProfile once it ships.

Google Search API Wildcard

I have a Python project running on Google App Engine, with a set of data currently stored in the Datastore. On the user side, I fetch the records from my API and show them to the user in a Google Visualization table with client-side search. Because of the limitations, I can only fetch 1,000 records per query. I want my users to be able to search across all the records I have. I could fetch them with multiple queries before showing them, but fetching 1,000 records already takes 5-6 seconds, so the whole process could exceed the 30-second timeout, and I don't think putting around 20,000 records in one table is a good idea.
So I decided to put my records into the Google Search API and wrote a script to sync the important data between the Datastore and a Search API index. When performing a search, though, I couldn't find anything like a wildcard character. For example, let's say a user field stores the string "Ilhan". When a user searches for "Ilha", that record does not show up; I want records containing "Ilhan" to show up even when the query is only partially typed. So basically, the SQL equivalent of my search would be something like "select * from users where user like '%ilh%'".
I wonder if there is a way to do that, or is this not how the Search API works?
I set up similar functionality purely within the Datastore: I have a repeated computed property that contains all the search substrings that can be formed for a given object.
def all_substrings(strings, min_length=3):
    # Hypothetical definition (not shown in the original answer): every
    # distinct substring of length >= min_length across the given strings.
    return list({s[i:j] for s in strings if s for i in range(len(s))
                 for j in range(i + min_length, len(s) + 1)})

class User(ndb.Model):
    # ... other fields
    search_strings = ndb.ComputedProperty(
        lambda self: [i.lower() for i in all_substrings(strings=[
            self.email,
            self.first_name,
            self.last_name])],
        repeated=True)
Your search query would then look like this:
User.query(User.search_strings == search_text.strip().lower()).fetch_page(20)
If you don't need the other features of the Google Search API, and if the number of substrings per entity won't put you at risk of hitting the 900-property limit, then I'd recommend doing this instead, as it's pretty simple and straightforward.
As for it taking 5-6 seconds to fetch 1,000 records: do you need to fetch that many? Why not fetch only 100, or even 20, and use a query cursor so the user pulls the next page only if they need it?

Google app engine - help in query optimization

I have run into a scenario while running a query in App Engine that is increasing my cost considerably.
I am writing the query below to fetch book names:
Iterable<Entity> entities =
    datastore.prepare(query).asIterable(DEFAULT_FETCH_OPTIONS);
After that, I run a loop to match each name against the name the user requested. This causes data reads for every book in the Datastore, and as the book details grow day by day, it further impacts the cost, since the entire list is read each time.
Is there an alternative that fetches only the book detail requested by the user, so that I don't have to read the complete Datastore? Will SQL or filters help? I would appreciate it if someone could provide the query.
You have two options:
If you match the title exactly, make it an indexed field and use a filter to fetch only books with exactly the same title (see the sketch after this list).
If you search within titles too:
a. You can use Search API to index all titles and use it to find the books your users are looking for.
b. A less optimal but quick solution is to create a projection query that reads only the book titles.
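The Datastore doesn't speak SQL directly, but in SQL terms the queries look like the sketch below (Book and title are placeholder names; GQL, the Datastore's query language, is close to this form):
select * from Book where title = 'the requested title';  -- option 1: exact-match filter
select title from Book;                                  -- option 2b: projection query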

Using search server with cakephp

I am trying to implement a customized search in my application. The table structure is given below.
main table:
teacher
sub tables:
skills
skill_values
cities
city_values
The search is triggered by location, which is stored in the table city_values with reference fields user_id and city_id. The name of the city and its latitude and longitude are found in the table cities.
Searching also includes skills; the table relations are similar to those for cities. The users table and the skill_values table are related through the user_id field in skill_values, and the tables skills and skill_values are related through the skill_id field in skill_values.
We need to find the location of the user who performs the search and filter the results to within a 20-mile radius. There are a few other filters as well.
My problem is that I need to filter these results without a page reload, so I am using AJAX; but if the number of records increases, my AJAX request will take a long time to get a response.
Is it a good idea to use an open-source search server like Sphinx or Solr for fetching results from the server?
I am using CakePHP for development, and my application is hosted on a cloud server.
... but if the number of records increases, my AJAX request will take a long time to get a response.
Regardless of the search technology, there should be a pagination mechanism of some kind.
You should therefore be able to set the limit or maximum number of results returned per page.
When a user performs a search query, you can use JavaScript to request the first page of results.
You can then simply increment the page number and request the second, third, fourth page, and so on.
This should mean that the top N results always appear in roughly the same amount of time.
It's then up to you to decide whether to request each page of search results sequentially (i.e. as the callback for each successful response), or to wait for some kind of user input (i.e. clicking a 'more' link or scrolling to the end of the results).
The timeline/newsfeed pages on Twitter or Facebook are a good example of this technique.
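Whichever backend serves the AJAX request, each page then maps to a small, bounded query. A generic SQL sketch (the table and column names are placeholders; in CakePHP, the limit and page options of find(), or the paginator, generate the equivalent):
select id, name
from teachers        -- placeholder table
order by id
limit 20 offset 40;  -- page 3, with 20 results per page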

Paging of Query Result Set

Greetings Overflowers,
I'm wondering if there is a way to query some kind of database and fetch only a certain window of the full result set, without having to actually go through all of it.
For example, if I query my database and I want only results number 100 to 200, would the database fetch all the results (say 0 to 1000) that match my query and later filter them to exclude anything outside my specified window?
Actually, I'm working on a full-text search problem (not really relational DB stuff).
So how about Google and other search engines: do they get the full result set and then filter, or do they have direct access to only the needed window?
Thank you all!
Your question is probably best answered in two parts.
For a (traditional, relational) database, the query you execute contains a number of "where" clauses, which cause the database engine to limit the number of results it returns. So if you specify a where clause that limits results to a range of primary-key values,
select * from table where id > 99 and id < 201;
you'll get what you're asking for.
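That relies on the primary-key values being contiguous. Most engines also offer an explicit paging clause that does not; a generic sketch (my_table is a placeholder, and the exact syntax varies by engine, e.g. LIMIT/OFFSET in MySQL and PostgreSQL):
select *
from my_table
order by id
offset 100 rows fetch next 100 rows only;  -- results 101 to 200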
For a search engine, the query you make is always paginated: using various techniques, the results are pre-split into pages, a few of which are cached, and the other pages are generated on demand. So if you want results 100 to 200, you only ever fetch the pages that are needed.
The option of filtering after the fact is not very efficient, because a large data source never wants to load all of its data into memory and slice it; you only want to load what's needed.
