1. Academic Honesty: This individual project is an open-book, take-home project. Although discussion is encouraged, each student must complete the project through his or her own effort. Students may read and search related materials from both offline and online sources. Other authors’ intellectual contributions (e.g., language, code, figures, thoughts, ideas, expressions) must be properly cited if they appear in your project report. Students are expected to abide by the academic honor code and must not copy from the work of others. This includes published or unpublished articles, Wiki documents, etc. Plagiarism will result in failure and, very likely, will have more severe consequences.

2. Goal: The goal of this project is for students to demonstrate they are capable of taking a problem description and its associated dataset and producing a report explaining statistical analyses that solve the stated problem. There is not necessarily a uniquely correct answer. A successful report should produce not only correct analysis methods, but also explanations that would be comprehensible to someone with only a basic knowledge of statistics.

3. Grading: The grading is based on the overall quality of the report. The following aspects are considered important for a high quality report: (a) the methodology used should be suitable; (b) the implementation and results should be correct and clear; (c) the explanation should be comprehensible; (d) the mathematics involved should be rigorous; (e) the presentation should be precise and concise; (f) the report format should be correct.

4. Deadline and Submission: 9pm on Monday, December 14th, 2020. Submission is through Gradescope, and late submissions will be penalized. You can write the report in any word processing software (LaTeX, MS Word, Pages, R Markdown, etc.), but the submission must consist of a single PDF file. DO NOT SUBMIT A HANDWRITTEN REPORT.

5. Implementation and Coding: Students can implement any suitable and reasonable method to solve the problem. For each method, students should give a clear description of its implementation. The standard of clarity is that someone else can replicate the method based on your description. Implementing methods beyond the scope of the lectures is neither encouraged nor penalized. Students can code in any programming language, although R is the most recommended one. Do not include code in the report.

For Question 2, follow the instructions to analyze the Boston housing data, available from the MASS package:

> install.packages("MASS")

> library(MASS)

> data(Boston)

The accompanying R file (Final_Project_Q2.R) contains the code that replicates the two figures in the assignment sheet. There you can see how to draw a scatterplot and a scatterplot matrix. Specifically, you’ll find how to draw a scatterplot between a numeric variable and a binary variable.
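For orientation only, here is a minimal base-R sketch of the two plot types. It is not the content of Final_Project_Q2.R; the variables chosen (medv, lstat, rm, crim, chas) and the use of jitter are assumptions made for illustration.

```r
library(MASS)   # MASS is distributed with standard R installations
data(Boston)

# Scatterplot matrix for a few variables (this subset is an arbitrary choice).
vars <- c("medv", "lstat", "rm", "crim")
pairs(Boston[, vars], main = "Boston housing: selected variables")

# Numeric versus binary: medv against chas (coded 0/1).
# A small horizontal jitter separates points that would otherwise overlap.
plot(jitter(Boston$chas, amount = 0.05), Boston$medv,
     xaxt = "n", xlab = "chas", ylab = "medv")
axis(1, at = c(0, 1), labels = c("0", "1"))
```

A side-by-side boxplot, boxplot(medv ~ chas, data = Boston), is another common way to show a numeric variable against a binary one.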

Question 1 is more free-form. You can use linear regression and regularized methods; you can also fit a polynomial regression model (beyond a linear relationship) and use data-driven methods to decide the order of the polynomial. Whatever method(s) you choose, explain your motivation and your understanding of the problem and data, visualize the data if necessary, describe your model and method in enough detail, and finally interpret your conclusions.
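As one possible illustration of a regularized method, the sketch below fits ridge regression with lm.ridge from the MASS package and picks the penalty by generalized cross-validation. The Boston data, the response medv, and the lambda grid are stand-ins assumed for the example; they are not the Question 1 dataset or a required approach.

```r
library(MASS)
data(Boston)

# Grid of ridge penalties to try (the range is an arbitrary assumption).
lambdas <- seq(0, 50, by = 0.5)

# Ridge regression of medv on all other variables, one fit per lambda.
ridge_fit <- lm.ridge(medv ~ ., data = Boston, lambda = lambdas)

# lm.ridge reports a generalized cross-validation score for each lambda;
# the smallest score gives a data-driven choice of penalty.
best_lambda <- ridge_fit$lambda[which.min(ridge_fit$GCV)]
print(best_lambda)
```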

What is a polynomial model? See the MPG versus Horsepower example in Week7b lecture notes.
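In short, a polynomial model of order d regresses the response on x, x^2, ..., x^d. The sketch below shows one data-driven way to choose d, using 5-fold cross-validation. It uses medv versus lstat from the Boston data as a stand-in (an assumption; the lecture example uses MPG versus Horsepower), and the number of folds and the candidate orders 1 to 6 are also arbitrary choices.

```r
library(MASS)
data(Boston)

set.seed(1)                      # reproducible fold assignment
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Boston)))
degrees <- 1:6                   # candidate polynomial orders

# For each order, average the held-out mean squared error over the k folds.
cv_mse <- sapply(degrees, function(d) {
  fold_mse <- sapply(1:k, function(j) {
    train <- Boston[folds != j, ]
    test  <- Boston[folds == j, ]
    fit   <- lm(medv ~ poly(lstat, d), data = train)
    mean((test$medv - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)
})

best_degree <- degrees[which.min(cv_mse)]
print(data.frame(degree = degrees, cv_mse = cv_mse))
```

The order with the smallest cross-validated error is a natural candidate, although a lower order with nearly the same error is often easier to explain in a report.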
