Sanoj32/Major-Project

Major-Project

A central repo for code, shared resources, workflows and roadmaps. Check the Projects section for project management plans.

Roadmap

  1. Project proposal (In development)
  2. Scrape data from all sites mentioned below. (In development)
  3. Clean the data and remove unnecessary variables and noise. (In development)
  4. Perform EDA and find the relationships between variables. (Pending)
  5. Build a model to predict an objective price for a house based on the parameters. (Pending)
  6. Deploy the model using Flask (API), HTML, CSS, JS and PHP/Laravel (backend). (Pending)
  7. Documentation (Pending)

INFORMATION AND RESOURCES

Folder hierarchy

- major-project
  - csv_files
    - raw (Raw unclean data after scraping)
    - links (Links to be scraped)
    - clean (Cleaned data)
- jupyter_notebooks (Notebooks to clean data and develop the model)
- scraping_files

Tutorial resources

Important programming concepts

  • Creating and importing a Python module (file) and using its functions
  • Namespace of functions in a Python module
  • How functions from other Python modules are called
  • Python data types: lists, tuples, dictionaries and sets
  • for loops and the map function
  • Method call vs. function call, i.e. how obj.method() works
  • Lambda (anonymous) functions and closures
  • Python virtual environments
  • Basics of object-oriented Python: class constructors, how a method gets access to the object it is called on, and how it manipulates that object
  • What is Anaconda?
  • How pip works
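
Several of the concepts above (constructors, obj.method(), closures, lambda and map) can be seen together in one short sketch. The names PriceList and make_scaler are made up for illustration and are not part of the project code.

```python
# Illustrative names only: PriceList and make_scaler are not project code.

class PriceList:
    def __init__(self, prices):
        # The constructor receives the new object as `self` and stores data on it.
        self.prices = list(prices)

    def total(self):
        # `self` is the object the method was called on.
        return sum(self.prices)


def make_scaler(factor):
    # Closure: the returned lambda remembers `factor` after make_scaler returns.
    return lambda value: value * factor


listing_prices = PriceList([100, 250, 400])

# obj.method() is shorthand for Class.method(obj).
same_total = listing_prices.total() == PriceList.total(listing_prices)

double = make_scaler(2)

# map applies the function to every item; the for loop does the same by hand.
doubled_with_map = list(map(double, listing_prices.prices))
doubled_with_loop = []
for price in listing_prices.prices:
    doubled_with_loop.append(double(price))
```

Both doubled lists hold the same values, which shows that map is a compact form of the explicit loop.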

Important Mathematics and Statistics Concepts (Add more)

  • Correlation between variables and multicollinearity
  • Linear, multiple and logistic regression
  • Squared error of a regression line
  • p-value, level of significance, null and alternative hypotheses
  • Contour plot
  • Fitting a line using the least squares method
  • Vectors, partial derivatives, the gradient of a function and gradient descent
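
As a rough illustration of the last two items, the sketch below fits a line to a handful of made-up (area, price) pairs twice: once with the closed-form least squares formulas and once with gradient descent on the mean squared error. The data, learning rate and step count are assumptions chosen only so the example converges.

```python
def fit_line_least_squares(xs, ys):
    # Closed-form slope and intercept that minimise the squared error.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_line_gradient_descent(xs, ys, learning_rate=0.01, steps=5000):
    # Repeatedly step against the gradient of the mean squared error.
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))  # d(MSE)/d(slope)
        grad_intercept = 2 / n * sum(errors)                         # d(MSE)/d(intercept)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up sample data, not scraped values.
areas = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [3.1, 4.9, 7.2, 8.8, 11.0]

ls_slope, ls_intercept = fit_line_least_squares(areas, prices)
gd_slope, gd_intercept = fit_line_gradient_descent(areas, prices)
```

With these settings both methods should arrive at practically the same line, which is a useful sanity check when learning gradient descent.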

Important diagrams

  • Scatter plot
  • Heatmap - To find multicollinearity
  • Box plot
  • Histograms
  • Violin plot

Important Machine Learning Concepts

Algorithms (listed from likely best to worst)

  1. Ridge regression - Because the data seems to have multicollinearity
  2. Linear regression - Related: Wald test
  3. Logistic regression ??
  4. Convolutional neural networks
  5. Random forest
  6. ID3
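
To make the reasoning behind item 1 concrete, here is a minimal sketch of ridge regression for two nearly collinear features, written in plain Python so it runs without extra libraries. It solves (XᵀX + αI)w = Xᵀy for a two-column X with Cramer's rule. The data is invented and already centred, so no intercept is fitted; in practice a library implementation such as scikit-learn's Ridge would likely be used instead.

```python
def ridge_two_features(x1, x2, y, alpha):
    # Entries of X^T X + alpha * I and of X^T y for a two-column X.
    a = sum(v * v for v in x1) + alpha
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + alpha
    r1 = sum(u * t for u, t in zip(x1, y))
    r2 = sum(v * t for v, t in zip(x2, y))
    # Solve the 2x2 system [[a, b], [b, d]] w = [r1, r2] with Cramer's rule.
    det = a * d - b * b
    w1 = (r1 * d - b * r2) / det
    w2 = (a * r2 - b * r1) / det
    return w1, w2


# Invented, centred data: x2 is almost a copy of x1 (multicollinearity).
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.1, -0.9, 0.1, 0.9, 2.0]
y = [-3.9, -2.1, 0.2, 1.8, 4.0]

ols_weights = ridge_two_features(x1, x2, y, alpha=0.0)    # ordinary least squares
ridge_weights = ridge_two_features(x1, x2, y, alpha=1.0)  # penalised fit
```

On this data the unpenalised fit splits the weight unevenly between the two near-duplicate features, while the penalised fit shrinks the weights and balances them, which is the behaviour that makes ridge regression attractive when predictors are strongly correlated.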

Use Weka as a testing tool; external libraries may also be used to check the efficiency of the model.

Miscellaneous

Libraries (Must read official docs)

Go through the basic contents of the docs at least once.

Websites to scrape

| S.N. | Name | Data Amount | Library required | Status | Remarks |
|------|------|-------------|------------------|--------|---------|
| 1 | 99aana | 3K | BS4 | Completed | |
| 2 | Nepal Homes | 1K | Selenium & BS4 | Links fetched | Data to be fetched using BS4 |
| 3 | Basobaas | 2K | Selenium & BS4 | Links fetched | Data to be fetched using BS4 |
| 4 | 1Ropani | 600 | Selenium & BS4 | Completed | |
| 5 | Hamrobazar | 3K | | Halted | Presents a captcha to check for bots |
| 6 | Gharbazar | 340 | Selenium | Links fetched | Low amount of data. Scan for house keyword in title. |
| 7 | Gharghaderi | 300 | Selenium | Halted | Low amount of data |
| 8 | Housing Nepal | less than 300 | BS4 | Halted | Low amount of data |
| 9 | Real Estate In Nepal | less than 300 | BS4 | Halted | Low amount of data. Restriction for bots |
| 10 | Nepal Home Search | 140 | BS4 | Halted | Low amount of data |
| 11 | Nepal Realestates | less than 300 | BS4 | Halted | Low amount of data |
| 12 | The Realtors | 300 | Selenium & BS4 | Halted | Low amount of data. Scan title for house keyword |
| 13 | GharJagga Nepal | 330 | Selenium | Halted | Low data and has infinite scrolling |
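
Some remarks above suggest scanning listing titles for the house keyword. A minimal sketch of such a filter, using only the standard library, could look like the following. The function name is_house_listing and the sample titles are assumptions for illustration; real titles would come from the scraped pages.

```python
import re

# Match "house" as a whole word, ignoring case, so "Warehouse" is not a hit.
HOUSE_PATTERN = re.compile(r"\bhouse\b", re.IGNORECASE)


def is_house_listing(title):
    # Hypothetical helper: True when the title mentions a house.
    return HOUSE_PATTERN.search(title) is not None


# Made-up titles standing in for scraped data.
sample_titles = [
    "House for sale in Lalitpur",
    "Land on sale at Budhanilkantha",
    "2.5 storey HOUSE near ring road",
    "Warehouse space for rent",
]

house_titles = [title for title in sample_titles if is_house_listing(title)]
```

The word-boundary match is a design choice: a plain substring test would also accept titles such as "Warehouse", which are unlikely to be residential listings.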

Parameters (subject to change)

  1. Price
  2. Location in District
  3. Number of rooms
  4. Number of floors
  5. Area
  6. Time posted
  7. Road size and road type *
  8. Room size *
  9. Bedrooms *
  10. Bathrooms *
  11. Car parking *
  12. Kitchen *
  13. Living room *
  14. Garage
  15. Furnished? *
  16. Guestroom *

Naming conventions

  • Files/modules and variables = snake_case
  • Directories = all lowercase, preferably without underscores
  • Classes = PascalCase
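
The naming conventions above can be illustrated with a small sketch that also reuses a few of the listed parameters. Everything here is hypothetical: the module name listing_utils.py, the HouseListing class, its fields and the helper function are not taken from the project code.

```python
# Hypothetical module file: listing_utils.py (file name in snake_case).
from dataclasses import dataclass


@dataclass
class HouseListing:  # class name in PascalCase
    # Field names (variables) in snake_case; the selection of fields is an assumption.
    price: float
    district: str
    number_of_rooms: int
    number_of_floors: int
    area: float


def price_per_unit_area(listing):  # function name in snake_case
    # Simple derived value; the units depend on how price and area are recorded.
    return listing.price / listing.area


sample_listing = HouseListing(
    price=20000000.0,  # made-up values for illustration only
    district="Kathmandu",
    number_of_rooms=8,
    number_of_floors=2,
    area=4.0,
)
unit_price = price_per_unit_area(sample_listing)
```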