Create a database from a property advert website - database

I need to keep track of real estate selling prices, and here in Greece there is a property advert website that features thousands of ads in a systematic way like this:
http://1drv.ms/1gwJhRe
Every advertisement has a title, then the region and a small description and finally on the right the area and selling price.
I understand that they have a database where everything is stored, but this isn't accessible directly.
Trying to make a spreadsheet by extracting data one by one will take years.
How could I write a program that produces a text file with tabulated data (region, area, selling price, ...) from each and every one of the ads?
I am willing to study hard in order to learn what it takes for this.

For this, you will need to do something called web scraping. Check out Scrapy.org
A crawler is a program that reads the HTML document and converts it into the format you need.
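
As a minimal sketch of what such a program does once it has a page's HTML, the following uses only Python's standard library. The markup, class names, and sample values are assumptions made up for illustration; the real site's HTML will differ, and downloading the pages (which Scrapy would handle) is left out.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: the real site's tags and class names will differ.
SAMPLE_HTML = """
<div class="ad">
  <span class="title">Apartment 85 sq.m.</span>
  <span class="region">Athens - Kypseli</span>
  <span class="area">85</span>
  <span class="price">120000</span>
</div>
<div class="ad">
  <span class="title">Detached house</span>
  <span class="region">Thessaloniki - Kalamaria</span>
  <span class="area">140</span>
  <span class="price">310000</span>
</div>
"""

FIELDS = ("title", "region", "area", "price")


class AdParser(HTMLParser):
    """Collects one dict per <div class="ad"> block."""

    def __init__(self):
        super().__init__()
        self.ads = []
        self._field = None  # field whose text we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "ad":
            self.ads.append({})  # a new advert starts here
        elif tag == "span" and css_class in FIELDS and self.ads:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            current = self.ads[-1]
            current[self._field] = current.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def ads_to_tsv(html):
    """Return the adverts found in html as tab-separated text with a header row."""
    parser = AdParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.ads)
    return out.getvalue()


print(ads_to_tsv(SAMPLE_HTML))
```

The returned text can be written to a file and opened directly in a spreadsheet.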

How to handle lots of unchanging backing data that is kind of unrelated to my application

Background
I'm creating a layered .NET Core application to handle tracking campaigns for a board game. Because of this there is a lot of data that comes from the game itself, for example:
Characters
Weapons
Equipment
Missions
Objectives that belong to a mission
Rewards that belong to objectives
Etc
The application is not to manipulate this data. This data is typically printed on cards that come with the board game so it won't change. The only changes it may have are when I manually add new characters or something due to a new expansion being released.
As far as the app is concerned, these are similar to how you might have a lookup table of States in the US. The app needs to list them so you can select them, entities in the domain hold references to them, but their actual data is irrelevant to the application itself. It's just lookup data.
Except there is a lot of this data and some of it is related. For example an objective belongs to a specific mission and a reward belongs to a specific objective.
The Problem
If my application was being designed to manage this data there would be no problem. However this is not the case. It is designed to manage "Campaigns", which are 2-5 players sitting down to play a game with these cards. It is managing "instances" of this data that have additional properties.
For example a new campaign is created and a row is added to the Campaign table. Now a mission must be added to it.
I can't just add a reference to the Mission data because I also need to store the outcome of the mission specific to this campaign. So I create a CampaignMission entity that references the mission data, the campaign id, and has a column for the mission outcome.
But that Mission data had related Objective data. The data just holds things like objective name, description, rewards etc, but in the campaign I also need to store the outcome of this objective specific to this CampaignMission. So again I create a CampaignObjective that references the Objective data, the CampaignMission, and has a column for the objective outcome.
Before you know it I am doing this for everything. CampaignCharacter, CampaignWeapon, CampaignReward. I feel like I'm just replicating the structure of the game data, relationships included.
Where the game data has relationships, my Campaign entities feel like they're mirroring the relationships to the point where, from the same object, you can access the same piece of game data by following two separate paths, the original game data relationship or the Campaign entity "replica" relationship.
For example if I want the name of the first reward for the first objective in the first campaign mission, you can access it in two ways:
Campaign.CampaignMissions[0].Mission.Objectives[0].Rewards[0].Name
Campaign.CampaignMissions[0].CampaignObjectives[0].CampaignRewards[0].Reward.Name
Both of these point to the same piece of game data. I really feel like there should only be one path:
Campaign.Missions[0].Objectives[0].Rewards[0].Name
Where I'm Stuck
I'm not sure if this is normal but it all just feels wrong. Almost as though the game data shouldn't even be part of the application. I mean the game data could be hosted on some 3rd party API and it wouldn't make any difference to my actual application. It's just data I need to read but I feel it's impacting my app structure in ways it shouldn't be.
My application doesn't really need to know the difference between Mission game data and a Mission in a campaign. All it needs to care about is that a campaign can have missions, and those missions have a name etc and an outcome. It doesn't feel like the Mission game data itself needs to be an entity in my domain.
What I've Tried
I tried keeping single entities in my domain and keeping them separate in my database. So for example a Mission in the domain would include both the game data fields like mission name, the mission outcome and a list of domain Objectives.
When a domain Mission for a campaign is requested from the data layer, the entry is retrieved from the CampaignMission table, along with its game data from the Mission table, then flattened via AutoMapper and returned to the domain as a single Mission entity containing everything.
This just caused a bit of a nightmare with Entity Framework and handling the mappings back and forth between data layer and domain because the CampaignMission in the database also had CampaignObjectives which linked to Objectives that also had to be flattened etc, and I had to keep track of the primary keys for all of these throughout my domain so everything could be unflattened and mapped back again when I want to persist something. It just didn't make sense, in terms of tracking primary keys/identity, for a single domain entity to be represented by entries in multiple tables.
What I'm Now Considering
I'm considering just moving all of the game data into a totally separate project, completely unrelated to my application. My application could then query that project as though it were some third party API, get any data it needs, and I can keep it all out of my solution.
Since the game data would no longer have IDs in my application, when I add a mission to a campaign it would simply have a column for "name" which would hold the mission name. When I want to use that mission I would grab it from the db and map it to a domain entity, so at this point it contains the campaign-specific data such as mission outcome, and also the name. Then I'd query the game data project using the mission name and map all the returned data back on to the entity as well, leaving me with a complete entity.
This is essentially replicating the behaviour of what I already tried but removing the need to track identity for the game data by simply using a name that I can query. It removes the concept of backing game data from my domain and leaves me with a single entity, Mission.
The Question
I've wasted a lot of time on this so far and I'm sure it must be a common problem in similar types of applications. I was wondering if anyone had a better solution for dealing with this kind of situation before I go ahead and try completely separating the data.
I have to admit, typing out the "What I'm Now Considering" section has clarified a few things for myself but I would still love to hear if there is a better way.
Thank you in advance if anyone reads all of this.
Here's what you should be doing. First, add the game data entities to the DbContext as DbQuery<T>:
public DbQuery<Mission> Missions { get; set; }
This will allow you to query the data, but will not allow changes. Then, since the game data is static, you might want to load it once into a singleton, which you can then inject where you need it.
In either case, on the actual campaign data that's being persisted, you should only store the id of the game data concept. For example, MissionId, not CampaignMission.Mission. When you need the actual Mission info, just look it up based on the MissionId, either directly from your DbQuery<Mission> property on your context or from your singleton class.
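
The shape of that design can be sketched independently of Entity Framework. The following is a Python illustration rather than C#, and every class name, field, and sample value in it is hypothetical: the campaign record stores only the mission id and the outcome, and the static game data lives in a read-only catalogue that is built once and shared.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Mission:
    """Static game data, as printed on the cards (never modified by the app)."""
    id: int
    name: str


@dataclass
class CampaignMission:
    """Campaign-specific record: holds only the id of the game data it refers to."""
    campaign_id: int
    mission_id: int
    outcome: str


class GameData:
    """Read-only catalogue, built once and shared (the 'singleton' in the answer)."""

    def __init__(self, missions):
        # MappingProxyType gives a read-only view, so callers cannot change the data.
        self._missions = MappingProxyType({m.id: m for m in missions})

    def mission(self, mission_id):
        return self._missions[mission_id]


# Hypothetical sample data standing in for the real cards.
GAME_DATA = GameData([
    Mission(1, "Hypothetical Mission A"),
    Mission(2, "Hypothetical Mission B"),
])

played = CampaignMission(campaign_id=10, mission_id=2, outcome="won")
print(GAME_DATA.mission(played.mission_id).name, played.outcome)
```

With this layout there is only one path from a campaign record to a piece of game data: through the stored id.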

Set up better workflow in Excel

Okay. I was looking at my physio's spreadsheets for his small, private business today. He uses Excel to keep track of his clients' appointments, fees, attendances, medical reports etc. At the moment he has a single sheet where he adds every client's appointment to the list as he goes - there's 3 years of details, one row for every appointment! It's huge and pretty hard to navigate and make sense of when he's extracting information such as fees paid/unpaid, total visits, etc.
I'm a novice to sub-intermediate at Excel, but getting better. What I'm wondering is whether it is possible to set up a "front page" where he can enter a day's details in a single spreadsheet, press an export cell, and then have Excel pass the relevant data to individual sheets for each client. The data on that front page would use the client's name as a string to find the relevant sheet and drop the information in.
I'm not asking how to do it as such, but rather wondering if this is possible at all!
Thanks
=COUNTIF(Sheet1!RangeToLookThrough, ValueYoureLookingFor)
Put this where you want.
If you want a daily dashboard, you could replicate this to do something like:
=SUMPRODUCT(--(Sheet1!B2:B4000<>""),--(MONTH(Sheet1!B2:B4000)=9))
Feel free to alter this to capture yesterday or whatever date you're looking for. I just used an old report to mess with and give you an example.

Dataset - Where can I find a dataset about different car instances about their original prices/retail value?

I am looking for a data set for statistical analysis that contains individual cars with their respective information such as brand, model, year of production, and maybe some other key values as well (but that's just a bonus), AND the price of the car when it was brand new, straight out of the factory.
Some extra background
I have a data set with about 120,000 used cars and I would like to do some statistical analysis of how their value has decreased over time, etc.
I have found some webpages that contain the value I am looking for, but in those cases it seems that I would have to crawl the pages to get the values out automatically. That would also be against the sites' terms: it would be stealing their data, and stealing is bad.
I have worked through some Stack Overflow topics, but the links in those topics don't contain any information about the price when the car models were new.
Good datasets or APIs, without the price when new:
http://www3.epa.gov/otaq/tcldata.htm
http://developer.edmunds.com/api-documentation/overview/index.html
http://www.carqueryapi.com/
I used https://www.redbook.com.au/ for MSRP.
Take into account that it is an Australian site, so you need to do your own currency conversion.
You also need to decide whether to include VAT, and sometimes you need to adjust the price for inflation.
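
A rough sketch of those three adjustments is below. The function name, the order of the steps, and all the numbers are assumptions for illustration only: it strips the tax from a tax-inclusive list price, converts with an exchange rate you supply, and scales by the ratio of two price-index values in the target currency.

```python
def original_price_today(price_aud, aud_to_target, tax_rate, cpi_then, cpi_now,
                         tax_included=True):
    """Convert an Australian list price into today's money in another currency.

    price_aud      list price in Australian dollars
    aud_to_target  exchange rate (target currency units per AUD) for the model year
    tax_rate       sales tax / VAT rate contained in the list price, e.g. 0.10
    cpi_then       price index of the target currency in the model year
    cpi_now        price index of the target currency today
    """
    net = price_aud / (1 + tax_rate) if tax_included else price_aud
    converted = net * aud_to_target
    return converted * (cpi_now / cpi_then)


# Hypothetical numbers, not real market data.
value = original_price_today(
    price_aud=33000, aud_to_target=0.5, tax_rate=0.10, cpi_then=80, cpi_now=100
)
print(round(value, 2))
```

Whether to convert first and then adjust for inflation, or the other way round, depends on which exchange rate and which price index you have available, so treat the order here as one possible choice.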

Defining Related Fields for Presentation/Reporting from SSAS

I might be going down the wrong track here, but what I am after is some guidance on what to do for related data in SSAS.
What I am after is this: when I create a cube, I want end users to be able to use it for reporting. My dimension may have a product code, but what is the appropriate way of bringing in other data like the Product Name, Description, Created Date, etc - information that you may not want to actually drill into, but that is related to the axis you're looking at?
I would need to show some of this information if I was reporting from it, but I can't see a way, and most of the YouTube videos, etc seem to go over the real basics on AdventureWorks and that's it - so I'm not sure how this works in the real world.
So, in the end they can go into Excel/whatever, and see:
Product code | Name | Total Profit
Rather than just the code axis, and the profit.
It depends on how you want to see it. We typically call these types of fields "member properties". You create them by setting the AttributeHierarchyEnabled property to False. You use them in Excel as described here (works in Excel 2007 and Excel 2010):
http://blog.davyknuysen.be/2009/08/03/olap-reporting-with-excel-2007-use-member-properties/

Is there a public geolocation system that I can query for names of local cities?

I'm looking to attempt to simplify address entry into a system where the city textbox has autosuggest initially populated by the user's geolocation. In the past it has seemed that autosuggesting the city name is prohibitively costly without knowing the province/state/country first but it doesn't make sense to require the user to enter the address backwards as we don't think about address information this way. On the other hand, not autosuggesting the city name means we end up with all sorts of weird and wonderful entries for mis-spelled cities from around the world.
I was wondering if there's a service that I can query that would automatically respond with the most appropriate city names according to not only what the user enters in the textbox, but the location of that user based on the country and political boundary they fall within?
For instance, if I am in Canada [as I am] and I enter 'Mi' then I'd be presented with all cities within Canada starting with 'Mi' until it was determined that the information I was entering wasn't Canadian at which point, it would use the next most likely configured country based on our usage pattern - i.e. it would check the U.S. next, followed by Mexico and then other less likely destinations. I can write all this myself if I had the database but I don't know where I can find one and my suspicion is that it would be less scalable than querying a pre-existing service on the web.
Looks as though MaxMind offers a free database that you could download in CSV:
There's an online demo to test it a bit if you'd like, but no way to query it through a web service.
IPInfoDB also has their database available for download - they have an XML API, but it only supports looking up the city/country for a particular IP. You're trying to do something a little broader than that: looking for every city in a particular country, with the country selected based on IP. I wouldn't expect there to be a web service for that; it's a pretty specific requirement.
Edited to add: You could use the IPInfoDB API to look up the country though, and then generate the autocomplete suggestions from a local country/city database. That way the IP geolocation wouldn't need to be done locally. There are various places where you can get a list of cities in a particular country. For example, here are some comprehensive lists maintained by the National Geospatial-Intelligence Agency.
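
A minimal sketch of the suggestion step is below, assuming the user's country has already been determined (for example from the IP lookup) and that a local city list has been loaded. The function name and the tiny sample list are hypothetical; a real list would come from one of the downloadable databases.

```python
# Hypothetical, tiny sample; a real database would hold every city per country.
CITIES = {
    "CA": ["Milton", "Mississauga", "Mirabel", "Toronto"],
    "US": ["Miami", "Milwaukee", "Minneapolis"],
    "MX": ["Mexico City", "Merida"],
}


def suggest_cities(prefix, country_order, cities=CITIES, limit=5):
    """Return up to `limit` (city, country) pairs whose name starts with `prefix`.

    Countries are searched in the order given, so the user's own country comes
    first and the next most likely countries only fill the remaining slots.
    """
    prefix = prefix.casefold()
    suggestions = []
    for country in country_order:
        for city in sorted(cities.get(country, [])):
            if city.casefold().startswith(prefix):
                suggestions.append((city, country))
                if len(suggestions) == limit:
                    return suggestions
    return suggestions


# A Canadian user typing "Mi": Canadian cities first, then the U.S., then Mexico.
print(suggest_cities("Mi", ["CA", "US", "MX"]))
# Once the input stops matching any Canadian city, only other countries remain.
print(suggest_cities("Mia", ["CA", "US", "MX"]))
```

For a full city database a linear scan like this would be slow; a sorted list with binary search or a prefix tree would be a more scalable choice.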
