Appengine Logging: How to query a request-path prefix

In the Google App Engine Stackdriver logging control panel, I used to be able to query logs by path prefix with regex queries like "path:/abc/def/.*". Now this filter no longer shows any results.
The documentation from Hell for this interface from Hell has no examples that I find useful, and simple text searches have too many false positives. Did anyone figure out how to make request-path regex or prefix queries in that panel, like the following?
Request-path: /abc/def/.*
Request-path: /abc/def/.*.jpg
Update: I figured out the first part: they changed it to "path:/abc/def", without regex. However, this seems crippled: it only appears to search back 7 days into the past?

7 days is the maximum log age in the (default) Basic Tier. From the Quota Policy:
Retention of log entries: 30 days (Premium Tier), 7 days (Basic Tier)
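As for the path filtering itself: in the newer advanced-filter / Logs Explorer query box, the request path of an App Engine request log lives in protoPayload.resource, so filters along these lines should cover both of your cases (a hedged sketch; the =~ regex operator depends on the Logging query language supporting it in your console). A substring/prefix-style match:

    resource.type="gae_app"
    protoPayload.resource:"/abc/def/"

and a regex match for the .jpg case:

    resource.type="gae_app"
    protoPayload.resource=~"^/abc/def/.*\.jpg$"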

Related

Salesforce Activities - Task and Events - Reporting

When building a report on Activities (Tasks and Events), I filtered on a Subject of Introduction.
I know that I have hundreds, but when I create a Report, it is only returning 8.
I removed all other fields from the report, made sure the filters were All Activities and Date: All Time, and set all other filters to show me all data, not just my data.
Any advice on what I might be missing in my report?
Thanks,
Jason
I created the report using Lightning: I searched Activities, then selected Tasks and Events.
Expecting to see more than 8 items with the Subject of Introduction.
You may be a victim of archived activities. Can you see these Tasks/Events on the record itself just like that, or are they "below the fold" so that you need to click "view all"?
https://help.salesforce.com/s/articleView?id=sf.activities_archived.htm&type=5
Events that ended more than 365 days ago,
Closed tasks due more than 365 days ago,
Closed tasks created more than 365 days ago (if they have no due date)
If that's the case - you can contact SF support to increase the limit: https://help.salesforce.com/s/articleView?id=000385669&type=1
If you're after a one-off export, API access may be faster than the support route; check out https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/sforce_api_guidelines_archive.htm and https://help.salesforce.com/s/articleView?id=000385173&type=1.
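As a rough, unverified sketch of what the API route looks like: the REST API's queryAll resource (unlike query) includes archived Task and Event records, so a request along these lines should return the archived rows too; the instance URL, API version and field list here are just placeholders:

    GET https://yourInstance.my.salesforce.com/services/data/v58.0/queryAll/?q=
        SELECT Id, Subject, ActivityDate FROM Task WHERE Subject = 'Introduction'
    Authorization: Bearer <access token>

(The SOQL would need to be URL-encoded in the actual request.)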

Crawling and scraping random websites

I want to build a webcrawler that goes randomly around the internet and puts broken (HTTP status code 4xx) image links into a database.
So far I have successfully built a scraper using the node packages request and cheerio. I understand the limitation is websites that create content dynamically, so I'm thinking of switching to puppeteer. Making this as fast as possible would be nice, but is not necessary, as the server should run indefinitely.
My biggest question: Where do I start to crawl?
I want the crawler to recursively find random webpages that likely have content and might have broken links. Can someone suggest a smart approach to this problem?
List of Domains
In general, the following services provide lists of domain names:
Alexa Top 1 Million: top-1m.csv.zip (free)
A CSV file containing 1 million rows with the most visited websites according to Alexa's algorithms (see the sketch after this list for turning it into seed URLs).
Verisign: Top-Level Domain Zone File Information (free IIRC)
You can ask Verisign directly via the linked page to give you their list of .com and .net domains. You have to fill out a form to request the data. If I recall correctly, the list is given free of charge for research purposes (maybe also for other reasons), but it might take several weeks until you get the approval.
whoisxmlapi.com: All Registered Domains (requires payment)
The company sells all kinds of lists containing information regarding domain names, registrars, IPs, etc.
premiumdrops.com: Domain Zone lists (requires payment)
Similar to the previous one, you can get lists of domains for different TLDs.
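As a minimal sketch of turning such a list into crawl seeds (assuming the unzipped top-1m.csv sits next to the script and each line has the form rank,domain):

    // load-seeds.js - turn a "rank,domain" CSV into a list of seed URLs
    const fs = require('fs');

    function loadSeeds(path, limit = 1000) {
      return fs.readFileSync(path, 'utf8')
        .split('\n')
        .filter(line => line.trim().length > 0)
        .slice(0, limit)
        .map(line => 'http://' + line.split(',')[1].trim());
    }

    module.exports = { loadSeeds };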
Crawling Approach
In general, I would assume that the older a website, the more likely it might be that it contains broken images (but that is already a bold assumption in itself). So, you could try to crawl older websites first if you use a list that contains the date when the domain was registered. In addition, you can speed up the crawling process by using multiple instances of puppeteer.
To give you a rough idea of the crawling speed: if your server can crawl 5 websites per second (which requires 10-20 parallel browser instances, assuming 2-4 seconds per page), you would need roughly two days for 1 million pages (1,000,000 / 5 / 60 / 60 / 24 = 2.3).
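As a rough sketch of the core puppeteer check (the loadSeeds helper is the one from the snippet above; saveBrokenLink stands in for whatever database insert you end up using):

    // crawl.js - visit each seed page and record image responses with a 4xx status
    const puppeteer = require('puppeteer');
    const { loadSeeds } = require('./load-seeds');

    function saveBrokenLink(pageUrl, imageUrl, status) {
      // placeholder: replace with your own database insert
      console.log(`${status} ${imageUrl} (found on ${pageUrl})`);
    }

    async function crawl() {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // Inspect every network response; keep only images that came back with 4xx.
      page.on('response', (response) => {
        const isImage = response.request().resourceType() === 'image';
        if (isImage && response.status() >= 400 && response.status() < 500) {
          saveBrokenLink(page.url(), response.url(), response.status());
        }
      });

      for (const url of loadSeeds('top-1m.csv', 100)) {
        try {
          await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
        } catch (err) {
          // unreachable or slow sites: skip and keep crawling
        }
      }

      await browser.close();
    }

    crawl();

To scale this up you would run several pages (or browser instances) in parallel, as mentioned above.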
I don't know if that's what you're looking for, but this website renders a new random website whenever you click the New Random Website button; it might be useful if you could scrape it with puppeteer.
I recently had this question myself and was able to solve it with the help of this post. To clarify what other people have said previously, you can get lists of websites from various sources. Thomas Dondorf's suggestion to use Verisign's TLD zone file information is currently outdated, as I learned when I tried contacting them. Instead, you should look at ICANN's CZDNS. This website allows you to access TLD file information (by request) for any name, not just .com and .net, allowing you to potentially crawl more websites. In terms of crawling, as you said, Puppeteer would be a great choice.

0% of the site has been indexed in Drupal - how to solve this issue?

In Drupal 7.12 my search yields no results and the search index is at 0%. On my local machine it works, but on the staging server it does not. The search page only shows the standard "no results" hints: check if your spelling is correct; remove quotes around phrases to search for each word individually (bike shed will often show more results than "bike shed"); consider loosening your query with OR (bike OR shed will often show more results than bike shed).
I got a solution: install and configure the spambot module, then run cron manually.
You should just need to run cron (/admin/config/system/cron), either manually or via the "Run cron every" setting. Every time cron runs, it indexes a certain amount of the site.
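If Drush is available on the staging server, the same thing can be done from the shell (a sketch; the cron page linked above works just as well):

    # run Drupal's cron; repeat (or wait for scheduled runs) until the index reaches 100%
    drush cron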

Google SDK Report doesn't get info

I have an app in GAE. It belongs to a domain. This app gets info from the Reports API.
If I ask for information from 4 days ago, I get the information for Docs and Gmail. If I ask for information from 3 days ago, I only get the Gmail information. If I ask for information from two days ago (1), it gives me no information.
But if I request the info for date (1) later on (e.g., ten days later), the API gives me all the info for that date.
Does the API have a delay before info is available?
Thanks.
See https://support.google.com/a/answer/6000239?hl=en&ref_topic=4639149
Short answer is that yes, there is a delay in when full report data is available. Some reports have lower latency than others and it is possible that you'll only get a subset of the expected report data back.

Webapps: Storing and searching through user submitted blocks of text

Background:
I'm building a poetry site with user submitted content. The relevant user actions for my questions are that users can:
a. Go to fancysitename.com/view to see all poems so far
b. Go to fancysitename.com/submit to submit your own poem.
c. Go to fancysitename.com/apoemid to view a particular poem you've bookmarked before.
d. Go to fancysitename.com/search to enter a word to search for in all the poems.
All the poems are stored as text fields in a database and referenced by a poem id. So the "apoemid" in step c will be the primary key of the tuple and I'll just pull up the text after getting the key from the url.
Question:
The poems exist nowhere except in a database. My webapp is literally 4 html files. Will this approach affect my search engine rankings?
Is there a more efficient way to do 'd' than doing a SELECT * on the DB and manually parsing the text on the server? Each poem will be at most 10 lines long, so I would imagine a full-text search engine like Lucene is probably overkill.
Caveat
I'm running this on the google app engine for now, so my database customization options are pretty limited. So while I'd certainly be interested in hearing about the ideal way to do this, this is a pet side project so my budget is limited :(
Thanks!
Edit: Apparently I don't google so well at 7am. I've since found a solution for question 2 here so please disregard question 2.
App Engine currently doesn't support full-text indexing, but it does have a better-than-nothing SearchableModel.
Some details of SearchableModel can be found here:
http://groups.google.com/group/google-appengine/browse_thread/thread/f64eacbd31629668/8dac5499bd58a6b7?lnk=gst&q=searchablemodel
Regarding search engine ranking: yes, having all your poems in the datastore can affect your ranking. This is generally overcome through the use of a sitemap. Here is an article about how Stack Overflow uses a sitemap to help its search ranking.
http://www.codinghorror.com/blog/archives/001174.html
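As a minimal sketch of what such a sitemap would contain (using the example URLs from the question; in practice you would generate one url entry per poem id from the database):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://fancysitename.com/view</loc></url>
      <url><loc>http://fancysitename.com/apoemid</loc></url>
    </urlset>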
In most database engines, you can accomplish this kind of searching. For example, MySQL has full-text searching. I am not sure how App Engine works, but you could always have a stored procedure do this search.
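Purely for illustration (this won't apply to App Engine's datastore, and the table/column names are made up), MySQL full-text search looks roughly like this:

    -- requires a FULLTEXT index on the searched column
    ALTER TABLE poems ADD FULLTEXT INDEX poem_text_idx (poem_text);

    SELECT poem_id, poem_text
    FROM poems
    WHERE MATCH(poem_text) AGAINST ('bicycle' IN NATURAL LANGUAGE MODE);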
Where you store your data will not affect your site's ranking, only how you serve it up (on what URLs, etc). There's absolutely no way for an arbitrary search spider to tell where you store your data, and no reason for it to care, either.
Regardless of the length of your text, you will need full-text searching if you want to search inside a string. As Sam points out, SearchableModel ought to work just fine for that.
