How can I filter capital letters in a data set - arrays

I have a column with more than 150k rows, and each cell contains text. Some cells contain sentences written in capital letters, and I want to fix that. How can I filter those cells so that I know how many there are? For example, I want to detect cells like the one below, which contain capitalised sentences, so that I can fix them:
Each year, They carefully curate the finest gifts to fill our Baskits
from new and wellloved brands, to unique products made exclusively for
us. They specialize in helping busy professionals give thoughtful,
impactful gifts for business development, colleague/employee
recognition, holiday gifts and more. Here are the top three things to
know about us: 1) They ARE CANADA'S LEADING GIFT DELIVERY SERVICE 30+
years of experience and 20, 000 customers 2) They MAKE THOUGHTFUL
GIFTING QUICK AND EASY Online and Mobile Webstore (Open 24/7) Call
Centre Gift Specialists Two Retail Stores (Downtown + North Toronto)
3) They HAVE DELIVERY OPTIONS TO SUIT YOUR NEEDS Delivery Across North
America sameday (and Saturday) Delivery in the GTA

try in an empty column, e.g. B2 (the formula reads A2:A, so it cannot sit in column A itself):
=INDEX(REGEXMATCH(A2:A; "(.*)[A-Z]{2}(.*)"))
or if you want a count:
=SUMPRODUCT(1*REGEXMATCH(A2:A; "(.*)[A-Z]{2}(.*)"))

Since it looks like you're in Google Sheets, just do a REGEXMATCH() for two capital letters in a row (as a review flag):
=BYROW(A2:A, LAMBDA(x, REGEXMATCH(x, "(.*)[A-Z]{2}(.*)")))
The BYROW() makes it a one-liner for the entire column. Ditch that if needed.
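Outside Sheets, the same review flag can be sketched in Python. This is only a minimal illustration, assuming the cells are already loaded into a list of strings; the names `has_caps_run` and `cells` are made up for the example, and `[A-Z]` only covers unaccented ASCII capitals.

```python
import re

# Two consecutive ASCII capitals, the same idea as "[A-Z]{2}" in the formulas above.
CAPS_RUN = re.compile(r"[A-Z]{2}")


def has_caps_run(text: str) -> bool:
    """Return True when the text contains at least two capital letters in a row."""
    return CAPS_RUN.search(text) is not None


# Hypothetical sample cells standing in for column A.
cells = [
    "Each year, They carefully curate the finest gifts",
    "1) They ARE CANADA'S LEADING GIFT DELIVERY SERVICE",
    "Delivery in the GTA",
]

flags = [has_caps_run(cell) for cell in cells]  # per-row flag, like the INDEX/BYROW version
flagged_count = sum(flags)                      # total, like the SUMPRODUCT version
```

Acronyms such as GTA are flagged as well, so the result is better treated as a list of cells to review than as a definitive list of shouted sentences.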

Related

LOOKUP pickle in Google sheets (adding different payments from different clients and splitting them by month)

I have an issue with the use of LOOKUP formulas for this example:
I am trying to determine a way to add different payments from different clients by month.
This is the minimum example and the information in blue is the desired result (which I am trying to automate).
Can someone shine a light as to which formulas to use, and how to accomplish that given I am separating the payments made from each client by month?
use:
=INDEX(IFNA(VLOOKUP(B4:B6&C3:G3, {C10:C&TEXT(B10:B, "mmmm"), D10:D}, 2, )))
or if you want zeros:
=INDEX(IFNA(VLOOKUP(B4:B6&C3:G3, {C10:C&TEXT(B10:B, "mmmm"), D10:D}, 2, ))*1)
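The trick in these formulas is the combined key: client name joined to the month name. A rough Python sketch of the same idea follows, assuming a payment log of (date, client, amount) rows; the sample data and all variable names are invented for illustration.

```python
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Hypothetical payment log standing in for B10:D (date, client, amount).
payments = [
    (date(2023, 1, 15), "Client A", 100.0),
    (date(2023, 2, 3), "Client B", 250.0),
    (date(2023, 2, 20), "Client A", 75.0),
]

clients = ["Client A", "Client B"]          # row labels, like B4:B6
months = ["January", "February", "March"]   # column labels, like C3:G3

# Build the lookup table keyed by client + month name.
# setdefault keeps the first amount per key, just as VLOOKUP returns the first match.
lookup = {}
for paid_on, client, amount in payments:
    lookup.setdefault(client + MONTH_NAMES[paid_on.month - 1], amount)

# Fill the grid; a missing key becomes "" (the role IFNA plays in the formula).
grid = [[lookup.get(client + month, "") for month in months] for client in clients]
```

If a client can pay more than once in the same month and those amounts should be added together, the table would need to accumulate values instead of keeping the first one.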

How to solve Google sheets formula issue

Hoping someone here can help with a small issue.
I'm writing an array formula which uses a VLOOKUP to pull prices. Identical items can be either on the sales floor or in various overstock locations, and since they gradually increase in value, only the price attached to the version that's on the sales floor should be valid. VLOOKUP unfortunately pulls the price for the first match it finds.
I've cannibalized the following code from various sources online, and it does work at pulling the correct prices, however, it appends "Sales floor" to every price.
Is there something I'm missing, or a way to further refine it?
={"Variant Price";if(Checkbox!B1=TRUE,Arrayformula(IFERROR(VLOOKUP(A2:A&"Sales Floor",Inventory!A2:K&Inventory!N2:N,Column(J2:J),false))),"")}
The variant price and checkbox portions are simply there to add a title and to control whether the formula is activated; this sheet has a large number of formulas that don't need to run constantly.
try:
={"Variant Price"; IF(Checkbox!B1=TRUE,
ARRAYFORMULA(SUBSTITUTE(IFERROR(VLOOKUP(A2:A&"Sales Floor",
Inventory!A2:K&Inventory!N2:N,Column(J2:J), 0)), "Sales Floor", )))}
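The label shows up in the price because `Inventory!A2:K&Inventory!N2:N` joins column N onto every column of A:K, including the price column that VLOOKUP returns. A small Python sketch of that effect, using made-up inventory rows and a simplified three-column layout (the real sheet has more columns):

```python
# Hypothetical inventory rows: (item, price, location).
inventory = [
    ("Widget", "12.50", "Overstock"),
    ("Widget", "14.00", "Sales Floor"),
]

# Mimic Inventory!A2:K & Inventory!N2:N: the location is appended to EVERY cell in the row,
# so both the lookup key and the price carry the label.
keyed = [(item + location, price + location) for item, price, location in inventory]


def lookup_price(item, location="Sales Floor"):
    """Return the first price whose key equals item + location, with the label stripped."""
    for key, labelled_price in keyed:
        if key == item + location:
            # Same role as SUBSTITUTE(..., "Sales Floor", ) in the formula above.
            return labelled_price.replace(location, "")
    return ""  # same role as IFERROR


raw = dict(keyed)["WidgetSales Floor"]  # what the lookup returns before cleaning
clean = lookup_price("Widget")          # what remains after the substitution
```

Like SUBSTITUTE, the cleaned value is still text; it would need converting to a number before any arithmetic.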

How to skip rows with the same values

I have the following problem: I have a dataset with over 1 million entries (shown below) that includes the variables company (the name of the company, a string), reviews (the number of reviews a company received) and company1 (a numeric code assigned to each company name). I want to calculate the average number of reviews a company in the dataset receives. But if I just do sum reviews, it will count the reviews of company 3 twice, the reviews of company 5 23 times, etc. (as often as they are listed in the data). How do I avoid this and count each company only once?
Your image is not readable (by me on a laptop). The Stata tag wiki gives detailed advice on how to give data examples and the command dataex bundled with recent versions of Stata is easily used for SE.
The flavour of your request is easier to follow. Here is an analogue. With the Grunfeld data we can calculate a mean investment for each year.
webuse grunfeld, clear
egen mean = mean(invest), by(year)
Now we might want to know how many years had mean invest above 200 (in the units used)?
su mean if mean > 200
or
count if mean > 200
returns the number of observations (not years). If you try it, the result is 30. In the Grunfeld data, there are 10 companies each measured for each year, so dividing by 10 is an easy answer. For more complicated datasets, it would be better to tag each year just once, and then look only at tagged observations:
egen tag = tag(year)
count if tag & mean > 200
It would be more common to tag panels, not years, but the principle is the same. See the help for egen.
collapse and contract offer other routes, with or without using frames.
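The tag-once idea is not specific to Stata. As a rough sketch in Python, assuming rows of (company, reviews) where each company repeats its own total on every row, the first row of each company can be tagged and the mean taken over tagged rows only. The data and names below are invented for illustration.

```python
# Hypothetical rows: (company, reviews); repeated companies repeat the same total.
rows = [
    ("Acme", 40),
    ("Bolt", 10),
    ("Bolt", 10),
    ("Cogs", 25),
    ("Cogs", 25),
    ("Cogs", 25),
]

seen = set()
tagged = []  # True only on the first row of each company, the same idea as egen tag()
for company, _ in rows:
    tagged.append(company not in seen)
    seen.add(company)

# Keep the review count from tagged rows only, so each company contributes once.
once = [reviews for (_, reviews), tag in zip(rows, tagged) if tag]
mean_reviews = sum(once) / len(once)

# For comparison: the naive mean lets repeated rows weigh in several times.
naive_mean = sum(reviews for _, reviews in rows) / len(rows)
```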

User search pricing calculation

I'm building a search engine that provides a list of cab drivers. We have some requirements:
The user is searching for the cheapest cab driver to bring him from place A to place B. He can go from any place to any place.
The default formula would be distance * price per mile.
But there are also special prices, e.g. AMSTERDAM to THE HAGUE would always be 100 EUR.
The price per mile is season-based: winter and summer have different prices.
Faceted search based on attributes, such as Champagne/Luxury/Male/Female driver, etc.
The user wants to sort by cheapest ride, but also by distance.
What would be the best approach to fit all these requirements? I've tried Solr but have not found a good solution for putting the price model in there. Any ideas?
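One possible reading of the pricing rules, as a minimal Python sketch rather than a search-engine design: a fixed route price wins when one exists, otherwise the price is distance times the driver's per-mile price times a seasonal factor. Every rate, name and factor below is invented, and treating the fixed price as the same for all drivers is an assumption.

```python
from dataclasses import dataclass

# All values below are invented for illustration.
SEASON_RATE_FACTOR = {"winter": 1.5, "summer": 1.0}
FIXED_ROUTE_PRICES = {("AMSTERDAM", "THE HAGUE"): 100.0}  # the fixed fare from the question


@dataclass(frozen=True)
class Driver:
    name: str
    price_per_mile: float
    attributes: frozenset


def ride_price(driver, origin, destination, miles, season):
    """Fixed route price if one exists, otherwise miles * seasonal price per mile."""
    fixed = FIXED_ROUTE_PRICES.get((origin.upper(), destination.upper()))
    if fixed is not None:
        return fixed
    return round(miles * driver.price_per_mile * SEASON_RATE_FACTOR[season], 2)


def search(drivers, origin, destination, miles, season, required=frozenset()):
    """Keep drivers that have all required attributes (facets), cheapest first."""
    matches = [d for d in drivers if required <= d.attributes]
    priced = [(ride_price(d, origin, destination, miles, season), d.name) for d in matches]
    return sorted(priced)
```

Sorting by distance is left out here because the question does not say which distance is meant (trip length or the driver's distance from the pickup point).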

How many address fields would you use for a UK database?

Address records are probably used in most databases, but I've seen a number of slightly different sets of fields used to store them. The number of fields seems to vary from 3 to 7, and sometimes all fields are simply labelled address1..addressN, while other times they are given specific meanings (town, city, etc.).
This is UK specific, though I'm open to comments about the rest of the world too. Here you need the first line of the address (actually just the number) and the post code to identify the address - everything else is mostly an added bonus.
I'm currently favouring:
Address 1
Address 2
Address 3
Town
County
Post Code
We could add Country if we ever needed it (unlikely).
What do you think? Is this too little, too much?
The Post Office suggests (http://www.postoffice.co.uk/portal/po/content1?catId=19100182&mediaId=19100267) 7 lines:
Addressee's Name
Company/Organisation
Building Name
Number of building and name of thoroughfare
Locality Name
Post Town
Post Code
They then say you do not need to include a County name provided the Post Town and Postcode are used.
The BSI have BS 7666 - that covers all addressing. I recommend you look there.
The 2000 version recommends
An address shall be based upon a logical data model comprising the following entities:
addressable object, with sub-types:
primary addressable object;
secondary addressable object;
street;
locality;
town;
administrative area, a.k.a. district;
county;
postcode.
See: http://landregistry.data.gov.uk/def/common/BS7666Address
I don't know whether this is minimal (I doubt it) but the heading on my cheque book says something pretty close to:
Lloyds TSB
Isle of Man Offshore Centre
Peveril Buildings
Peveril Square
Douglas
Isle of Man
IM99 0XX
United Kingdom
This causes fits when I try to enter it into the US banking system.
If I were you, I'd call Royal Mail and ask them... or look on their website for postcode lookup as a best practice.
There are different types of addresses, and each type has a slightly different structure. Forward sorting offices have a different postal address structure than a residential home with a street number. What if the house has a name instead of a number? There are so many factors to consider.
Since I moved to Canada I had to do something similar and it's far more complicated than a straightforward residential address which generally has:
Street Number if applicable
Street Number Suffix if applicable
House Name
Street Name
Street Type
Street Direction if applicable
Unit Number for flats, townhouses or other types of building/location
Minor Municipality (Village)
Major Municipality (Major Town/City)
County
PostCode
Country if you include Scotland, Wales, Northern Ireland (and now I noticed Eire)
Then you get businesses that have their own Delivery Route, PO Boxes, Forward Sortation Offices...
It gets complicated in a real hurry.
Best bet - give Royal Mail a call and they should be able to give you information on their standard address templates.
EDIT: Your 3 field method isn't a bad one...particularly. However, data sanitization may be a significant issue using the field setup you have and you may need a fairly complex strategy for making sure that the address entered is valid. It's far easier to sanitize single dedicated fields to make sure input is correct than it is to parse various address tokens out of combined fields.
Another simpler way to gain this info is to go on the Royal Mail website and check their postcode lookup page.
On their main postcode lookup, they use 4 fields and I guess they have some form of validation on the street name/type field. They separate the house number and name and I guess they only allow major municipality. I'm assuming the county/country are assumed. If you break out their advanced search, they give you two extra fields for flat number and business name.
Given that some fields are combined on their site, you have to assume that there's some amount of validation to make sure that data entered can be gainfully used.
Premises elements
Sub Building Name
Building Name
Building Number
Organisation Name
Department Name
PO Box Number
Thoroughfare elements
Dependent Thoroughfare Name
Dependent Thoroughfare Descriptor
Thoroughfare Name
Thoroughfare Descriptor
Locality elements
Double Dependent Locality
Dependent Locality
Post Town
Postcode element
Postcode
This answer may be a few years late, but it's aimed at those like myself looking for guidance on how to correctly format postal addresses for both storing in a database (or the likes of it) and for printing purposes.
Taken from Royal Mail Doc, link below - conveniently titled the 'Programmers Guide'
Technical specification for users of PAF
Page 27 - 42 was most helpful for me.
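The element groups listed above could be held as one record per address. Here is a minimal Python sketch, assuming every element except the postcode is optional; the class name `PafAddress` and the line-building order are illustrative only and are not Royal Mail's official formatting rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PafAddress:
    # Premises elements
    sub_building_name: Optional[str] = None
    building_name: Optional[str] = None
    building_number: Optional[str] = None
    organisation_name: Optional[str] = None
    department_name: Optional[str] = None
    po_box_number: Optional[str] = None
    # Thoroughfare elements
    dependent_thoroughfare_name: Optional[str] = None
    dependent_thoroughfare_descriptor: Optional[str] = None
    thoroughfare_name: Optional[str] = None
    thoroughfare_descriptor: Optional[str] = None
    # Locality elements
    double_dependent_locality: Optional[str] = None
    dependent_locality: Optional[str] = None
    post_town: Optional[str] = None
    # Postcode element
    postcode: str = ""

    def lines(self) -> List[str]:
        """Non-empty elements in a simple top-to-bottom order (a simplification)."""
        def join(*parts: Optional[str]) -> str:
            return " ".join(p for p in parts if p)

        dependent = join(self.dependent_thoroughfare_name,
                         self.dependent_thoroughfare_descriptor)
        main = join(self.thoroughfare_name, self.thoroughfare_descriptor)
        # Assumption: the building number goes on the first thoroughfare line present.
        if dependent:
            street_lines = [join(self.building_number, dependent), main]
        else:
            street_lines = [join(self.building_number, main)]

        candidates = [
            self.organisation_name,
            self.department_name,
            join("PO Box", self.po_box_number) if self.po_box_number else "",
            self.sub_building_name,
            self.building_name,
            *street_lines,
            self.double_dependent_locality,
            self.dependent_locality,
            self.post_town,
            self.postcode,
        ]
        return [line for line in candidates if line]
```

Keeping each element in its own field makes validation simpler, and printing an address becomes a matter of joining whichever elements are present.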
It's very likely that a "UK" database will be opened up to Eire as well, and in some lines of business there will be legal differences, generally between Scotland / NI / the Channel Islands and England and Wales.
In short, I would add country to the list. Otherwise it's fine (no fewer certainly), though of course any address is traceable from a building reference, a post code and a country alone.
Where we live in France it's just 3 lines:
myname
village/location name
5 digit postcode followed by post town name in uppercase
Even from the UK, that's all that is required.
