How can I analyse Instagram pictures? - dataset

I have the task of crawling and analysing a set of Instagram users' posts, including texts and pictures. I don't need to know the users' identity.
In particular, I have to use the images to train a machine learning classifier/regressor, so I need to temporarily store the pictures to extract visual features.
Reading the Instagram API policy, I am a bit confused:
11. Comply with any requirements or restrictions imposed on usage of Instagram user photos and videos ("User Content") by their respective owners. You are solely responsible for making use of User Content in compliance with owners' requirements or restrictions.
14. Only store or cache User Content for the period necessary to provide your app's service.
17. Don't apply computer vision technology to User Content, without our prior permission.
11 => What are the common owners' restrictions? I mean, when people post a picture on Instagram, do they set any restrictions, or are the pictures publicly available? And are there cases of pictures with restrictions?
14 => I am not building or selling any app. I need to store the pictures only for the time needed for the analysis. According to this point, I could store the pictures on a drive and then remove them when finished. Did I get the point right, or am I wrong?
17 => I guess that this point is related to image manipulation, not analysis.
I hope that someone can clarify what I can do, considering my purpose. As an example, at this page you can see a project in which a set of Instagram pictures was crawled, labelled and analysed. The downloadable file doesn't contain the pictures, just their Instagram IDs.
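For what it's worth, here is a minimal sketch of the "store only for the duration of the analysis" workflow I have in mind, in Node/TypeScript. The image URL, the temporary path and extractFeatures() are placeholders, not Instagram API specifics, and none of this answers the policy question itself:

```typescript
// Sketch only: download a picture to a temporary file, extract features, delete the file,
// and keep nothing but the media ID and the feature vector.
import { writeFile, unlink } from "fs/promises";

// Hypothetical feature extractor (e.g. a wrapper around your CV model).
async function extractFeatures(path: string): Promise<number[]> {
  // ... run your visual feature extraction here ...
  return [];
}

async function processImage(mediaId: string, imageUrl: string) {
  const tmpPath = `/tmp/${mediaId}.jpg`;                          // temporary cache only
  const res = await fetch(imageUrl);                              // download the picture
  await writeFile(tmpPath, Buffer.from(await res.arrayBuffer()));
  try {
    const features = await extractFeatures(tmpPath);              // extract visual features
    return { id: mediaId, features };                             // persist only ID + features
  } finally {
    await unlink(tmpPath);                                        // remove the picture when finished
  }
}
```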

Related

Crawling and scraping random websites

I want to build a web crawler that goes randomly around the internet and puts broken (HTTP status code 4xx) image links into a database.
So far I have successfully built a scraper using the Node packages request and cheerio. I understand the limitation is websites that dynamically create content, so I'm thinking of switching to Puppeteer. Making this as fast as possible would be nice, but it is not necessary, as the server should run indefinitely.
My biggest question: Where do I start to crawl?
I want the crawler to recursively find random webpages that likely have content and might have broken links. Can someone suggest a smart approach to this problem?
List of Domains
In general, the following services provide lists of domain names:
Alexa Top 1 Million: top-1m.csv.zip (free)
A CSV file containing 1 million rows with the most visited websites according to Alexa's algorithms (see the sketch after this list for turning it into start URLs).
Verisign: Top-Level Domain Zone File Information (free IIRC)
You can ask Verisign directly via the linked page to give you their list of .com and .net domains. You have to fill out a form to request the data. If I recall correctly, the list is given free of charge for research purposes (maybe also for other reasons), but it might take several weeks until you get the approval.
whoisxmlapi.com: All Registered Domains (requires payment)
The company sells all kinds of lists containing information regarding domain names, registrars, IPs, etc.
premiumdrops.com: Domain Zone lists (requires payment)
Similar to the previous one, you can get lists of different domain TLDs.
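As a starting point, here is a rough sketch (Node/TypeScript) of turning the unzipped Alexa top-1m.csv into a list of start URLs; the file path and the limit are assumptions, and the rows are expected to look like "1,google.com":

```typescript
import { readFileSync } from "fs";

// Read the (unzipped) top-1m.csv and return the first `limit` domains as crawlable URLs.
function loadStartUrls(csvPath = "top-1m.csv", limit = 10_000): string[] {
  return readFileSync(csvPath, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .slice(0, limit)                                  // only the most popular entries
    .map((line) => `http://${line.split(",")[1]}`);   // drop the rank, keep the domain
}
```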
Crawling Approach
In general, I would assume that the older a website, the more likely it is to contain broken images (but that is already a bold assumption in itself). So you could try to crawl older websites first if you use a list that contains the date when the domain was registered. In addition, you can speed up the crawling process by using multiple instances of Puppeteer.
To give you a rough idea of the crawling speed: if your server can crawl 5 websites per second (which requires 10-20 parallel browser instances, assuming 2-4 seconds per page), you would need roughly two days for 1 million pages (1,000,000 / 5 / 60 / 60 / 24 ≈ 2.3 days).
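To make the idea concrete, here is a hedged sketch of one crawl step with Puppeteer: open a page, collect the rendered <img> URLs, and report those that answer with a 4xx status. saveBrokenLink() is a placeholder for your database insert, and you would run several of these workers in parallel:

```typescript
import puppeteer from "puppeteer";

async function checkPageImages(
  pageUrl: string,
  saveBrokenLink: (page: string, img: string, status: number) => Promise<void>
) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  try {
    await page.goto(pageUrl, { waitUntil: "networkidle2", timeout: 30_000 });
    // Collect every image source actually rendered on the page (covers dynamic content).
    const imageUrls = await page.$$eval("img", (imgs) =>
      imgs.map((img) => (img as HTMLImageElement).src).filter(Boolean)
    );
    for (const imgUrl of imageUrls) {
      const res = await fetch(imgUrl, { method: "HEAD" }).catch(() => null);
      if (res && res.status >= 400 && res.status < 500) {
        await saveBrokenLink(pageUrl, imgUrl, res.status);   // broken image -> database
      }
    }
  } finally {
    await browser.close();
  }
}
```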
I don't know if that's what you're looking for, but this website renders a new random website whenever you click the New Random Website button; it might be useful if you scrape it with Puppeteer.
I recently had this question myself and was able to solve it with the help of this post. To clarify what other people have said previously, you can get lists of websites from various sources. Thomas Dondorf's suggestion to use Verisign's TLD zone file information is currently outdated, as I learned when I tried contacting them. Instead, you should look at ICANN's CZDNS. This website allows you to access TLD zone file information (by request) for any TLD, not just .com and .net, allowing you to potentially crawl more websites. In terms of crawling, as you said, Puppeteer would be a great choice.

Access Database for Links

I am trying to create a database that can be used in my office. What I am trying to do is create a form where a user can input a link such as "www.stackoverflow.com" and it pulls up information about that link. For example, when we are reviewing documents and we see the link "www.stackoverflow.com", we have to create alternate text that a screen reader would read as "Stack Overflow Website" instead.
We deal with so many links across so many documents that it can be hard for my team to be consistent document to document, person to person. Having this database would allow someone to enter the URL and have it pull up the alt text we have for it in multiple languages. I am learning and trying to figure out databases, but I am not very knowledgeable about how to put it all together within Access.
I would appreciate any kind of help or pushes in the right direction.
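In case it helps frame the problem, here is a rough sketch of the data model this implies, written as TypeScript types purely for illustration; in Access this would be two related tables (the table, field and function names below are made up):

```typescript
// One row per link, and one alt-text row per link per language.
interface LinkRecord {
  linkId: number;          // primary key (e.g. an Access AutoNumber)
  url: string;             // e.g. "www.stackoverflow.com"
}

interface AltTextRecord {
  linkId: number;          // foreign key to LinkRecord
  language: string;        // e.g. "en", "fr", "es"
  altText: string;         // e.g. "Stack Overflow Website"
}

// The form then becomes a lookup: given a URL and a language, return the alt text.
function findAltText(
  links: LinkRecord[],
  alts: AltTextRecord[],
  url: string,
  language: string
): string | undefined {
  const link = links.find((l) => l.url === url);
  return link
    ? alts.find((a) => a.linkId === link.linkId && a.language === language)?.altText
    : undefined;
}
```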

How to add first name and email before uploading a video?

Hi guys, I'm brand new and not a developer, but I need a way for users to upload their video when they visit my site, with an option for them to add their first name and email, so that when the video is uploaded the database keeps all the data together.
Ideally I want this to be as easy as possible for the user, and the video would just go to our YouTube channel, though any video platform will work. Any advice would be great!
Please provide more information, like what platform you are using.
There's more than one way to skin a cat.
The simple way to achieve this with web technologies (PHP, Node, Java) is to maintain the basic user information in the session and use it whenever necessary.
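For example, a minimal sketch of that session approach in Node/TypeScript with Express; the route names and fields are made up, and the actual YouTube/database step is left as a placeholder:

```typescript
import express from "express";
import session from "express-session";
import multer from "multer";

const app = express();
const upload = multer({ dest: "uploads/" });     // temporary local storage for the video file

app.use(express.urlencoded({ extended: true }));
app.use(session({ secret: "change-me", resave: false, saveUninitialized: false }));

// Step 1: the user submits first name + email; keep them in the session.
app.post("/details", (req, res) => {
  (req.session as any).firstName = req.body.firstName;
  (req.session as any).email = req.body.email;
  res.redirect("/upload-form");
});

// Step 2: the video upload; the session ties the file to the user's details.
app.post("/upload", upload.single("video"), (req, res) => {
  const { firstName, email } = req.session as any;
  // saveToDatabaseOrYouTube({ firstName, email, videoPath: req.file?.path });  // placeholder
  res.send("Thanks, your video and details were received together.");
});

app.listen(3000);
```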
You need to get some knowledge about the system you are using. You particularly need:
access to the server
to know the server type
access to the database
to know the database type
where the relevant files are
After you have gathered all this information, you at least know what you do not know. The next step is to gather information about how you can implement the feature you need. Look at it like a puzzle with many small pieces. If you are patient enough, in the end you will solve the puzzle.

Storing Images in SQL Server database

I'm trying to create a sample ASP.NET MVC application with a ViewModel and onion architecture - very simple online shop.
As you would expect, this shop has products, and each product should have one very small image; when the user clicks on a product, he is redirected to a details page, where of course he should see a bigger image of the product.
At first I thought, since it's a simple application, I would store (internet) links to the pictures in the database. But then I thought: what happens when an image is erased from the internet? My product would no longer have an image.
So I should store those pictures in the database somehow. I have heard about something called FILESTREAM that is supposedly the right way, but I found no material to help me understand what it is.
I hope someone can help me.
There are several options. You could save the picture in the database using a varbinary.
Read here how to read it using MVC.
When you opt for a solution where you split database and file storage, which is perfectly possible, you should consider that it could mean extra maintenance for cross-checking deleted records, etc.
If you choose the last option, the information in the article will mostly suit your needs too.
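Purely to illustrate the varbinary option mentioned above (the question's stack is ASP.NET MVC, where the same parameterised statement works with a SqlDbType.VarBinary parameter), here is a sketch using the Node mssql package; the table, column and connection settings are made up:

```typescript
import sql from "mssql";
import { readFile } from "fs/promises";

// Store a product image in a varbinary(max) column (illustrative schema only).
async function saveProductImage(productId: number, imagePath: string) {
  const image = await readFile(imagePath);                        // the picture as a byte buffer
  const pool = await sql.connect({
    server: "localhost",
    database: "Shop",                                             // placeholder connection settings
    user: "app",
    password: "secret",
    options: { trustServerCertificate: true },
  });
  try {
    await pool
      .request()
      .input("id", sql.Int, productId)
      .input("img", sql.VarBinary(sql.MAX), image)                // the varbinary(max) column
      .query("UPDATE Products SET Image = @img WHERE ProductId = @id");
  } finally {
    await pool.close();
  }
}
```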

Multi language implementation ideas for user content (session, cookie, url, subdomain, sub dir, etc?)

I am lost on multi-language implementation. How should I handle it? Sessions, cookies, files... other ways?
Overview
The website is a user-content website, like a social network. We will have system content controlled by us and user content translated by users. The supported languages will be system controlled; to start, there will be the top 20 supported languages. There are two user types (non-user and logged-in user). Both user types have pages, as not all pages are behind a log-in; non-users can still view many public pages or profile pages that are public.
Requirement
I want to access a public page in French (as an example) directly, without having to hit the site in English and then change the language to French. (optional)
For user content -> if I want to translate a piece of English content to Italian, I am looking to translate only that one piece of content (for example, a status update), not the entire page. So the page is in English, but I can input Italian for that one piece of content without converting the entire page into Italian (see the sketch after this list).
Search for content based on language from one place: if I am reading reviews, I want to load only German reviews from the menu without changing other page content.
If I want to view all wall posts that are in German, can I do it straight from my profile by changing the language, or do I have to log out of that language session and log in with a new session for the new language, if this is session based?
I want to be able to change the language on any page, for any content, without having the user log in or log out.
I need to perform analytics for internal purposes based on language (for example, the number of wall posts by people in network X who posted content in Chinese), so I will need to track language per piece of content.
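To make the per-content part concrete, here is a small sketch (TypeScript, with made-up type and field names) of what "track language per piece of content" could look like: every user-generated item carries its own language tag, so one Italian status update can sit on an otherwise English page, reviews can be filtered to German only, and the analytics fall out of the same field:

```typescript
interface UserContent {
  id: number;
  authorId: number;
  kind: "status" | "review" | "wallPost";
  language: string;          // ISO code of this one piece of content, e.g. "de", "it", "zh"
  body: string;
}

// "Load only German reviews" without touching the rest of the page.
function filterByLanguage(items: UserContent[], kind: UserContent["kind"], language: string) {
  return items.filter((item) => item.kind === kind && item.language === language);
}

// Analytics: number of items per language (e.g. wall posts in Chinese).
function countByLanguage(items: UserContent[]): Record<string, number> {
  return items.reduce<Record<string, number>>((acc, item) => {
    acc[item.language] = (acc[item.language] ?? 0) + 1;
    return acc;
  }, {});
}
```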
Other
I am still not sure if the content will be database or file driven, but first I am looking into how I can best handle multiple languages for scalability while keeping it user friendly.
Suggestions?
This was probably not answered because of this section of the faq: "Your questions should be reasonably scoped. If you can imagine an entire book that answers your question, you’re asking too much."
https://stackoverflow.com/faq
Translation engine + search engine + user engine + analytics engine? Try learning and implementing one at a time. I'm going to answer just in case someone sees this and is still interested, but I'm not an expert in this area either, so I'll list what I did and what I think.
1st, create the language engine. A simple "Language" drop-down menu somewhere should be enough for now (for visitors), with the database, cookies, session and code done correctly. Create it as you like; what you listed is rather complex but perfectly achievable.
2nd, add the user engine, including the database, log-in/log-out forms, code and everything else needed, and put the two together. Each user needs a column in the "user" table with their preferred language. Slightly modify the language engine to support users. This should be easy to implement now.
3rd (and still new for me), create the search engine.
4th, implement the analytics engine. I'd recommend using an external one, since it's much easier and more complete.
But, as stated, this is just my opinion.
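As a sketch of what that "language engine" resolution could look like in practice (Node/TypeScript with Express; the cookie name, supported list and default are assumptions): an explicit choice in the URL beats the cookie set by the drop-down, which beats the logged-in user's preferred language, which beats the browser's Accept-Language header, which beats the site default:

```typescript
import express from "express";
import cookieParser from "cookie-parser";

const SUPPORTED = ["en", "fr", "de", "it", "zh"];   // the ~20 system-controlled languages
const DEFAULT_LANG = "en";

const app = express();
app.use(cookieParser());

app.use((req, res, next) => {
  const candidates = [
    req.query.lang as string | undefined,             // 1. explicit choice on this page
    req.cookies.lang,                                  // 2. previously chosen in the drop-down
    (req as any).user?.preferredLanguage,              // 3. logged-in user's profile column
    req.acceptsLanguages(...SUPPORTED) || undefined,   // 4. browser preference
  ];
  const lang = candidates.find((c) => c && SUPPORTED.includes(c)) ?? DEFAULT_LANG;
  res.cookie("lang", lang);                            // remember it for the next request
  res.locals.lang = lang;                              // system content renders in this language
  next();
});
```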
