How to extract unique values of columns and rows in Google Sheets?

I am trying to extract the unique names of every company listed on the following Fortune 500 index since 1955. Many names are repeated since they stayed in the index for years at a time. The only function I can find in Google Sheets is UNIQUE, but it only works on single columns, not entire datasets. Any suggestions?

Use FLATTEN() together with the UNIQUE() function:
=UNIQUE(FLATTEN(B2:U))
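If the open-ended range contains blank cells, FLATTEN() carries them through as empty entries. A FILTER() wrapper, a common refinement that is not part of the original answer, drops them:
=UNIQUE(FILTER(FLATTEN(B2:U), FLATTEN(B2:U)<>""))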

Related

databases and frontend: load balancing for analyzing data

I have a scraper that collects news articles throughout the day from different sources.
I want to display data like 'most common words in the last 30 days (in source X)' on my page.
For now I have saved the articles to my database, storing the timestamp the article was released and a string of the content.
With a few records this works fine, but I don't understand how to balance the load so that the front end has the most flexibility without too much data to count.
I thought I could run a script that takes all the articles from one day and creates a new table containing each word with its count. Two problems came up here:
1 - How do I create a table for this? Since every article has a different length and a different set of words, I would need a table with as many fields as there are words in the longest article. I could say I will only save the first 20, but I don't really like the idea.
2 - If the script takes all the articles from one day and calculates the word counts, I have a minimum resolution of one day, so I won't be able to differentiate any further. I chose to run the script once per day to reduce the data that I need to send to the front end on demand.
Don't create a table with a separate column for each of the first 20 words. Please. I beg you. Just don't.
Two possible approaches.
Use a fulltext search feature in your DBMS. You didn't tell us which one you use, so it's hard to be more specific.
Preprocess: Create a table with columns article_id, word_number, and word. This table will have a large number of rows, one for each word in each article. But that's OK. SQL databases are made for handling vast tables of simple rows.
The unique key on the table contains two columns: article_id and word_number. A non-unique key for searching should contain word, article_id, word_number.
When you receive an incoming article, assign it an article_id number. Then break it up into words and insert each word into the table.
When you search for a word do SELECT article_id FROM words WHERE word=?. Fast. And you can use SQL set manipulation to do more complex searches.
When you remove an article from your archive, DELETE the rows with that article_id value.
To get frequencies do SELECT COUNT(*) frequency, word FROM words GROUP BY word ORDER BY 1 DESC LIMIT 50.
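Putting those pieces together, a minimal MySQL-flavored sketch (the words table matches the answer; the articles table and its released column are assumptions based on the question):

CREATE TABLE words (
    article_id  INT         NOT NULL,
    word_number INT         NOT NULL,
    word        VARCHAR(64) NOT NULL,
    PRIMARY KEY (article_id, word_number),
    KEY word_lookup (word, article_id, word_number)
);

-- Which articles contain a given word?
SELECT article_id FROM words WHERE word = 'economy';

-- Top 50 words over the last 30 days.
SELECT COUNT(*) AS frequency, w.word
FROM words w
JOIN articles a ON a.article_id = w.article_id
WHERE a.released >= NOW() - INTERVAL 30 DAY
GROUP BY w.word
ORDER BY frequency DESC
LIMIT 50;

This keeps full per-article resolution, so the one-day granularity problem from the question disappears: any date range can be aggregated on demand.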

Alternative to VLookUp that can check if two factors are true

I am wondering whether there is any alternative to VLOOKUP() that can check two criteria before returning a value. I want to search for an identifier that is only unique for a given date.
I.e. the key exists multiple times in the dataset but only once per date, so the date and the key combined form a primary key.
Note: I want to do this without adding a column to the dataset.
This is a simplified example. I want a formula that will return 304 if I look up using 02/03/20 and 89076.
My current solution is to add another column that concatenates column A and column B and then do a VLOOKUP on that column, but I am looking for a solution that does not require the extra column.
Using Excel 2010
In Excel 2010, if you're looking not for a number but for actual text, try the following:
=INDEX(D:D,MATCH(1,INDEX((A:A=DATEVALUE("02/03/20"))*(B:B=89076),),0))
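If the value being returned is numeric, a SUMPRODUCT() variant, a standard two-criteria pattern offered here as a sketch rather than as part of the original answer, also avoids the helper column. Use bounded ranges, since SUMPRODUCT over whole columns is slow, and note that it sums all matches, which is safe here because date plus key is unique:
=SUMPRODUCT((A2:A1000=DATEVALUE("02/03/20"))*(B2:B1000=89076)*D2:D1000)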

How to find the number of duplicate documents in Solr based on an indexed field

I have a few near-duplicate documents stored in Solr. The schema has an autogenerated UUID as the unique key, so duplicates can get into the index. I need to get counts of duplicated documents based on a field (or fields) in the schema.
I am trying to get quick numbers without writing a client program that walks the full result set, ideally on the Solr console itself.
I tried facets but could not get the total counts. The query below gives the duplicates for each value of 'idfield', but the results have to be iterated to the last page and summed up (over a couple of million entries).
q=*:*&facet=true&facet.mincount=2&facet.field=idfield
A JSON facet query can be used to find the number of unique values, as explained in this blog post:
http://yonik.com/solr-count-distinct/
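A request along those lines, sketched here assuming the deduplication field is called idfield, returns the distinct-value count directly, so the duplicate count is numFound minus the reported unique count:
q=*:*&rows=0&json.facet={ uniqueIds : "unique(idfield)" }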
Alternatively, it can be done with the collapse filter by taking a difference:
q=*:*&fq={!collapse=true field=idfield} - get its numFound and subtract it from the numFound of the MatchAllDocs query (*:*).
You can also use facet.mincount=2 to get duplicate documents by faceting on the unique id field. Ex: /solr/core/select?q=*:*&facet=on&facet.field=uniqueidfield&facet.mincount=2&facet.missing=true
You can also add facet.limit=-1&rows=0 to list every id value that appears on more than one document.

How could I go about getting distinct field counts in Azure Search

I have an index with around 35 million documents. When a user issues a query with any combination of search words and filters, I need to get a count of unique values on another field. The purpose is to answer the question "How many unique (field x) are there with a given query?".
I'm pretty sure that Azure Search doesn't have any capability to do this, so I thought I would try another query where I select just the field I want to count distinct values of, but I think this would be very time-consuming with such a large index. I'm also under the impression that I can skip at most 100,000 records, which would make this impossible if a query returned more than 100k results.
Any ideas on how to go about this?
Thanks!
Azure Search doesn't directly support distinct count of values today. In order to support it in a single query combined with $filter, it would either have to be supported as a new facet type, or maybe with a combination of $count and $filter where the field being counted is the key field (note that $count and $filter can't be combined today).
Feel free to add distinct count to the Azure Search feedback forum to help prioritize the feature.
Original Answer
If you wanted a count of documents per unique value, you could use facets. For example, if you're searching for shoes under $100 dollars and you want to know, out of the hits, how many shoes of each color there are, you would do this:
GET /indexes/products/docs?search=shoes&$filter=price+lt+100&facet=color&api-version=2015-02-28
The response will contain an @search.facets property with buckets for each unique value along with its count.
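For illustration, the facet buckets in the response have roughly this shape (the field name and values are made up):

{
  "@search.facets": {
    "color": [
      { "value": "black", "count": 120 },
      { "value": "brown", "count": 85 }
    ]
  },
  "value": []
}

where value holds the matching documents themselves.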

Datastore Index Creation Fails Without Explanation

I'm trying to create a compound index with a single Number field and a list-of-Strings field. When I view the status of the index, it just shows an exclamation mark with no explanation. I assume this is because Datastore concluded that it is an exploding index, based on this FAQ page: https://cloud.google.com/appengine/articles/index_building#FAQs.
Is there any way to confirm what the actual failure reason is? Is it possible to split the list field into multiple fields based on some size limit and create multiple indexes for each chunk?
You get the exploding indexes problem when you have an index on multiple list/repeated properties. In this case a single entity would generate all combinations of the property values (i.e. an index on (A, B) where A has N entries and B has M entries will generate N*M index entries).
In this case you shouldn't get the exploding index problem since you aren't combining two repeated fields.
There are some other obscure ways in which an index build can fail. I would recommend filing a production ticket so that someone can look into your specific indexes.
I believe it was the 1000-item limit per entity for indexes on list properties. I partitioned the property into groups of 999, e.g. property1, property2, etc. as needed, and was then able to create an index for each chunked property successfully.
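In index.yaml terms, the chunked layout might look like the following sketch (the kind and property names are hypothetical):

indexes:
- kind: Company
  properties:
  - name: rank      # the single Number field
  - name: names_1   # first chunk of up to 999 list values
- kind: Company
  properties:
  - name: rank
  - name: names_2   # second chunk, repeated as needed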
