Project Due Date: Monday, May 18, 11:59 PM
This project may be done independently or in pairs. We will have higher standards for those working in pairs, but either way we expect it to be a substantial project on which you devote significant effort. It's difficult to quantify "significant effort" and there's no detailed grading rubric. Part of the purpose of the proposal is for the staff to provide feedback on whether the project appears to be of the appropriate scope.
Late Policy: Both the project proposal and the final complete project are subject to the late policy, and they are counted separately. All assignments and projects are due at 11:59pm on the due date. Each assignment and project may be turned in up to 24 hours late for a 10% penalty and up to 48 hours late for a 30% penalty. No assignments or projects will be accepted more than 48 hours late. Students have four free late days they may use to turn in work late with no penalty: four 24-hour periods, no pro-rating. This late policy is enforced without exception.
Honor Code: Under the Honor Code at Stanford, you are expected to submit your own original work for assignments, projects, and exams. On many occasions when working on assignments or projects (but never exams!) it is useful to ask others -- the instructor, the TAs, or other students -- for hints, or to talk generally about aspects of the assignment. Such activity is both acceptable and encouraged, but you must indicate on all submitted work any assistance that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding, writing up, and being able to explain all work that you submit. The course staff will pursue aggressively all suspected cases of Honor Code violations, and they will be handled through official University channels.
Datasets: We've identified the following sources of data that we recommend using for your project. You are free to use other datasets if you prefer; see discussion below.
- Poverty Statistics
- Download link: Poverty Data
- Source: World Bank Data
- Description: For countries with an active poverty monitoring program, the World Bank -- in collaboration with national institutions, other development agencies, and civil society -- regularly conducts analytical work to assess the extent and causes of poverty and inequality, examine the impact of growth and public policy, and review household survey data and measurement methods. Data here includes poverty and inequality measures generated from analytical reports, from national poverty monitoring programs, and from the World Bank's Development Research Group which has been producing internationally comparable and global poverty estimates and lines since 1990.
- Consumer Complaints
- Download link: Consumer Complaints (zipped csv file)
- Source: Consumer Complaint Database
- Description: A database of complaints the Consumer Financial Protection Bureau has received about financial products and services.
- USA's Consumer Price Index
- Download link: historicalcpi.xls
- Source: United States Department of Agriculture Economic Research Service
- Description: The Consumer Price Index (CPI) for food is a component of the all-items CPI. The CPI measures the average change over time in the prices paid by urban consumers for a representative market basket of consumer goods and services. While the all-items CPI measures the price changes for all consumer goods and services, including food, the CPI for food measures the changes in the retail prices of food items only.
- Indicators on Women and Men
- Download links:
- Source: United Nations Statistics Division (UNSD)
- Description: The Indicators on Women and Men provides the latest statistics and indicators on women and men in six specific fields of concern: population, women and men in families, health, education, work, and political decision making. The statistics and indicators refer to the latest year for which sex-disaggregated data are available. The data have been compiled from official national sources as well as international sources.
- Startups: Funding and Acquisitions
- Download link: Data on Startup Companies, Investments, and Acquisitions (zipped folder with many csv files included)
- Source: Crunchbase
- Description: Crunchbase data contains crowdsourced information on a large number of startups including who invested in them and how much. Data includes Companies across the world that have raised money, Investors (individual and institutional) that have invested in those companies, Funding rounds of investment, and records of all acquisitions of these startups. Other information about the companies (e.g., category, location) is also included.
- Crime and Socioeconomic Indicators
- Download links:
- Source: City of Chicago, census.gov (via Big Data for Social Good Challenge)
- Description:
- Crimes - Reported incidents of crime (except murders) in the city of Chicago from 2001 to present, minus the most recent seven days.
- Small Area Income and Poverty Estimates - The files in the data directory contain estimates of poverty and income for 2013. There is one data file for each state and for the US, with data for all the 2013 statistics. Additionally, there is one file that includes data for the US and each state and county.
- Census Data - A selection of six socioeconomic indicators of public health significance, and a hardship index.
- New York City
- Download links:
- Source: data.ny.gov
- Description: New York City Open Data contains data on a wide variety of NYC aspects (e.g., education, safety, recreation, and many more). New York City Restaurant Inspection Results captures restaurant inspections, violations, grades, and adjudication information in NYC.
- Walmart
- Download links:
- stores.csv, features.csv, train.csv (download all button)
- Source: Kaggle Competition
- Description: Historical sales data for 45 Walmart stores in different regions. (Example area to explore: the effect on sales of weather, temperature, fuel consumption, the holiday season, and other factors.)
- World Health
- Download link: Indicators - we recommend any of the datasets in the Health section
- Source: World Bank Data
- Description: Worldwide health data covering factors such as fertility rates, HIV, immunization, population, life expectancy, birth rates, death rates, and many more.
- 100+ Interesting Data Sets for Statistics
- 19 Free Public Data Sets For Your First Data Science Projects
Finally, the course datasets page has links to even more sources of data. Specifically, we want to highlight that several Coronavirus datasets were added to the page recently.
If you use any dataset other than one from the list above, you will need to include a description and pointer in your project proposal. Be warned: In data analysis projects, it's common for more than 90% of the overall effort to be in obtaining the data and getting it ready for analysis, with less than 10% going into the analysis itself. We're trying to alleviate this imbalance and the attendant frustration by providing a menu of datasets we know are not difficult to work with.
Data formats: Most of the data listed above is provided in .xlsx (Excel) or .csv format. We've created a set of instructions for converting data from .xlsx to .csv format (for use in Google Sheets or Python programs), and from .csv to .db format (for use with SQL): Data Format Conversion.
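As a rough illustration of the .csv to .db step mentioned above, here is a minimal sketch using only Python's standard csv and sqlite3 modules. It is not the course's Data Format Conversion procedure; the function name csv_to_sqlite and its parameters are hypothetical, and it assumes the file is UTF-8 encoded with a single header row.

```python
import csv
import sqlite3


def csv_to_sqlite(csv_path, db_path, table_name):
    """Load a CSV file with a header row into a SQLite table; return the row count.

    Hypothetical helper for illustration. Columns are created without a declared
    type, so every value is stored as text exactly as it appears in the file.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Keep only rows whose field count matches the header.
        rows = [row for row in reader if len(row) == len(header)]

    # Quote identifiers so column names containing spaces still work.
    columns = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
    placeholders = ", ".join("?" for _ in header)
    table = '"{}"'.format(table_name.replace('"', '""'))

    conn = sqlite3.connect(db_path)
    try:
        # Replace any existing table of the same name.
        conn.execute("DROP TABLE IF EXISTS {}".format(table))
        conn.execute("CREATE TABLE {} ({})".format(table, columns))
        conn.executemany(
            "INSERT INTO {} VALUES ({})".format(table, placeholders), rows
        )
        conn.commit()
    finally:
        conn.close()
    return len(rows)
```

A call such as csv_to_sqlite("poverty.csv", "poverty.db", "poverty") (file names made up for this example) would produce a database file that can be queried with SQL. Because values are stored as text, numeric comparisons in queries may need an explicit CAST.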
Tools and techniques: You are welcome to use any of the tools and techniques learned in class and practiced on the assignments, or you may use other tools and techniques that you're familiar with. Do be aware that the teaching staff may not be able to provide a great deal of support if you choose to use tools or techniques we're not well-versed in.
The project is intentionally a bit vague and open-ended; we're looking for you to show initiative and inventiveness. Try to find something in the data that other students are not likely to find!
Project Proposal
Due Date: Monday, Apr 27 11:59 PM - Staff will provide feedback by Friday, May 1 if proposal is turned in on time.
The main purpose of the proposal is for us to give feedback on whether the scope of the project is in the range of what we're expecting, whether your plans are crisp enough, and in cases where you plan to use a different dataset than one from the list above, whether it looks suitable and promising. On average we expect proposals to be about half-a-page long, though we know the lengths will vary. Please create a document containing the following two parts.
- Dataset
- State what data you plan to use -- either which one of the datasets we've suggested, or another dataset of your choosing.
- Describe the data. As part of this, please include the total size of the dataset (e.g. number of rows) and a small sample of the data.
- Include a link to the source of the data, and discuss any difficulties you anticipate getting the data ready for analysis.
- Goals
- Formulate a specific set of questions you want to answer, points you want to make, or issues you wish to explore through the data. Be as concrete as possible.
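For the Dataset part of the proposal, the row count and a small sample can be pulled from a .csv file with a few lines of Python. This is only a sketch under the assumption that the file is UTF-8 encoded and has a header row; the function name describe_csv and the sample_size parameter are hypothetical.

```python
import csv
from itertools import islice


def describe_csv(csv_path, sample_size=5):
    """Return (column names, number of data rows, first few rows) for a CSV file.

    Hypothetical helper for illustration; assumes the first row is a header.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Take the first few data rows as the sample.
        sample = list(islice(reader, sample_size))
        # Count the remaining rows without loading them all into memory.
        row_count = len(sample) + sum(1 for _ in reader)
    return header, row_count, sample
```

The three returned values correspond to what the proposal asks for: the columns describe the data, the row count gives its total size, and the sample rows can be pasted into the document.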
What To Turn In
Your proposal should be in a pdf document named project1_proposal.pdf. Include clearly at the top of the document the name(s) and SUID(s) for the student or student-pair submitting the proposal, then include the two parts of the proposal specified above. Upload the pdf document to Gradescope under "Project 1 Proposal: Personal Data Analysis". For projects being done in pairs, only one partner needs to submit to Gradescope, and should add their partner's name to the submission under 'GROUP'. See the group submission video for details.
Complete Project
Due Date: Monday, May 18, 11:59 PM
Use techniques and tools such as (but not limited to) those covered in class to manipulate, analyze, and possibly visualize the data in order to achieve your objectives. It is likely you will end up developing a data processing pipeline, where in each step you transform or otherwise manipulate some or all of your data to get it into a form that's suitable for the next step. In the final step your data should be in the best form to answer your questions or otherwise achieve your objectives.
In many cases the early steps in a pipeline are more about preparing the data -- correcting mistakes, filling in missing values, creating consistent representations, mapping corresponding values -- while the later steps are more focused on summarization and analysis. If you use one of the recommended datasets, your preparation steps may be minimal.
Jupyter notebooks can be a convenient method for constructing and maintaining data processing pipelines, which may include Python and/or SQL processing, but we are not requiring Jupyter for the project. If you plan to include spreadsheet manipulations then you will need to work outside of Jupyter regardless.
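To make the pipeline idea concrete, here is a minimal sketch in plain Python in which each step takes the output of the previous one: three preparation steps (consistent representations, type conversion, filling in missing values) followed by one summarization step. Everything in it is an assumption for illustration: the records in RAW_ROWS are made-up values, not taken from any of the datasets above, and the function names and the choice to fill missing values with a per-country mean are just one possible design.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration only: inconsistent names and one missing value.
RAW_ROWS = [
    {"country": "United States", "year": "2012", "rate": "15.0"},
    {"country": "USA", "year": "2013", "rate": "14.0"},
    {"country": "Mexico ", "year": "2012", "rate": ""},
    {"country": "mexico", "year": "2013", "rate": "52.0"},
]

# Hypothetical mapping from name variants to one canonical spelling.
CANONICAL_NAMES = {
    "usa": "United States",
    "united states": "United States",
    "mexico": "Mexico",
}


def standardize(rows):
    """Preparation: map name variants to one consistent representation."""
    cleaned = []
    for row in rows:
        key = row["country"].strip().lower()
        name = CANONICAL_NAMES.get(key, row["country"].strip())
        cleaned.append({**row, "country": name})
    return cleaned


def parse_numbers(rows):
    """Preparation: convert text fields to numbers, with None for missing values."""
    return [
        {
            "country": r["country"],
            "year": int(r["year"]),
            "rate": float(r["rate"]) if r["rate"].strip() else None,
        }
        for r in rows
    ]


def fill_missing(rows):
    """Preparation: fill a missing rate with the mean of that country's known rates."""
    known = defaultdict(list)
    for r in rows:
        if r["rate"] is not None:
            known[r["country"]].append(r["rate"])
    filled = []
    for r in rows:
        rate = r["rate"]
        if rate is None and known[r["country"]]:
            rate = mean(known[r["country"]])
        filled.append({**r, "rate": rate})
    return filled


def summarize(rows):
    """Analysis: average rate per country, skipping rows that are still missing."""
    by_country = defaultdict(list)
    for r in rows:
        if r["rate"] is not None:
            by_country[r["country"]].append(r["rate"])
    return {country: mean(values) for country, values in by_country.items()}


def run_pipeline(rows):
    """Apply the preparation steps in order, then summarize."""
    for step in (standardize, parse_numbers, fill_missing):
        rows = step(rows)
    return summarize(rows)
```

Keeping each step as a separate function makes it easier to describe the pipeline in the Data processing section of the writeup and to rerun a single step when an earlier one changes. Whatever rule is used to fill in missing values should be stated explicitly, since it can affect the conclusions.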
What To Turn In
You will be turning in a single PDF writeup to Gradescope.
The writeup should include parts 1 and 2 from the project proposal, discuss in reasonable detail how you went about your analysis, and finally (and most importantly) discuss the conclusions drawn from your data-driven study. On average we expect the writeups to be about 5-7 pages long, though we know the lengths will vary. Data visualizations can be pasted into the writeup, but it is likely you will need to include other artifacts such as spreadsheets, scripts, or Jupyter notebooks to document your analysis. At the end of your writeup, include a section titled Description of Files Used that lists all the artifacts that you used to generate the analysis and visualizations, with a clear description of what each one contains. For example:
- poverty_data_processing.py - This Python script performs the initial data cleaning and processing
- poverty_analysis.ipynb - This Jupyter notebook performs the main data analyses, using both SQL queries and Python data manipulation
- poverty_visualizations.xlsx - This spreadsheet performs additional data manipulations and contains the final visualizations
Here is a guideline for the sections in the main writeup:
- Include clearly at the top of the document the name(s) and SUID(s) for the student or student-pair submitting the project.
- Dataset: as in project proposal (possibly modified based on feedback)
- Goals: as in project proposal (possibly modified based on feedback)
- Data processing: Description of steps that were taken from raw data to final results
- Visualizations: when relevant
- Conclusions: resolution of questions, issues, or points from part 2, based on your study
- Description of Files Used
Upload the pdf document to Gradescope under "Project 1 Writeup: Personal Data Analysis". For projects being done in pairs, only one partner needs to submit to Gradescope, and should add their partner's name to the submission under 'GROUP'. See the group submission video for details.
NOTE: We do not require you to submit your code for this assignment. However, we may request the entire code or spreadsheets if necessary.