3  Main Project

If you have successfully completed the Pre-project Exercise, you’re ready to start the Main Project.

The deliverable is a 35-slide deck examining pay differences between women and men using the March 2024 CPS. We strongly recommend pacing your project work to match our coverage of the related course topics. Here is the mapping:

Project-course alignment: how the 35 project slides map to the six course modules

  • Data Fundamentals → Slides 1-7
  • Beginning to Learn → Slides 8-12
  • Models for Exploration → Slides 13-20
  • Making Inferences → Slides 21-26
  • Sources of Bias → Through Slide 20 (optional check)
  • Regression Fundamentals → Slides 27-35

Note

If you fall behind on the course content, the project gets harder fast. Stay on schedule. We won’t have any sympathy for those who start late and come up short in the end.

To help you stay on track, we offer an optional Project Progress Check, which covers the first 20 slides. As a nudge to take us up on the option, we will award up to 5 bonus points depending on the correctness of your submission. Even if you whiff on the progress check, you will be in a much better position to succeed on the final project than if you skip the check entirely.

3.1 First, download the Main Project file pack and sort its contents

You should already have downloaded and sorted the contents of the Pre-project Exercise file pack. Now do the same with the Main Project file pack, which contains:

  • project.Rmd — R Markdown template you will complete and knit into a slide deck
  • cpsmar_e.R — R script that creates the CPS data extract for your analysis from the raw CPS files
  • check_setup_project.R — R script that checks your setup for the Main Project

Here’s where each file goes:

File                     Goes in
project.Rmd              Project/ (root)
cpsmar_e.R               Project/r/
check_setup_project.R    Project/r/

Run check_setup_project.R to confirm everything’s in the right place.
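As a rough illustration of what such a check involves, the sketch below confirms that each file sits where the table above says it should. The helper name `files_in_place` and the `root` argument are hypothetical, introduced only for this example; the real check_setup_project.R may test more than file locations.

```r
# Hypothetical helper: report whether each Main Project file is where it belongs.
# `root` is the path to your Project/ folder (an assumption about your layout).
files_in_place <- function(root = ".") {
  expected <- c(
    "project.Rmd",
    file.path("r", "cpsmar_e.R"),
    file.path("r", "check_setup_project.R")
  )
  found <- file.exists(file.path(root, expected))
  names(found) <- expected
  found
}

# Example: run from the Project/ folder and list anything that is missing.
status <- files_in_place(".")
missing_files <- names(status)[!status]
if (length(missing_files) > 0) {
  message("Missing: ", paste(missing_files, collapse = ", "))
}
```

A sketch like this only tells you that the files exist; it says nothing about whether their contents are intact.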

3.2 General instructions

Before you jump into your project work, take note of the following general instructions that apply to the entire project. Overlooking any of them will prove costly.

  • Slide format

    An acceptable submission has 35 slides on 35 pages of PDF output, rendered in landscape mode, precisely matching the format of the Final Project reference deck. If your deck doesn’t comply exactly with the reference deck, it will receive a score of 0.

  • Write-up line limits

    Each slide requiring text content has a specific line limit. We will not read beyond the line limit.

    What counts as a “line”

    “Lines of text” means rendered, countable lines on your final PDF — not sentences, and not lines as they appear in RStudio’s editor or output preview. A long sentence that wraps into three visual lines on the rendered slide counts as three lines. Submissions that exceed the line limit are heavily penalized. Always check your knitted PDF and count the actual visible lines on the slide before you submit.

  • HTML code for table formatting

    Code chunks associated with table construction are wrapped in the <div class="table-..."> and </div> HTML tags, which govern table formatting. Do not edit or remove them.

  • The YAML block

    As we explained in the Pre-project Exercise, the YAML block at the top of the template contains important formatting information. You must edit the author: field to insert your name, but do not change anything else in the YAML.

    If your YAML becomes corrupted, here is the definitive code block to replace your corrupted version:

    ---
    title: "BUSN 5000 Project"
    subtitle: "Exploring Pay Differences between Women and Men"
    author: "First Name Last Name"
    date: |
        | Summer 2026
        | (updated `r format(Sys.time(), '%d %b %y')`)
    output:
      ioslides_presentation:
        css: css/project_slidedeck.css
        widescreen: true
    ---

  • The setup chunk

    As in the Pre-project Exercise template, the setup chunk sits directly below the YAML and is complete as is. Do not touch it. As you should know from the Pre-project Exercise, the setup chunk loads the required R packages and sets global options for the rest of the document.

  • Echo settings

    The setup chunk sets echo = TRUE, which will display code chunks by default. Note that we have overridden this global setting on certain slides by setting echo = FALSE in those chunks, where displaying the code would be redundant or otherwise not valuable. Do not change these individual echo settings.

  • Get the units right

    In your write-ups, make sure you use percent and percentage point correctly. They are not interchangeable. For example, a change in the gender wage gap from 25% to 20% is a 5 percentage point decrease — not a 5% decrease. A 5% decrease in the gender wage gap from a 25% baseline would amount to a change of 1.25 percentage points to 23.75%. We will definitely penalize you for this sort of error, so don’t make it.
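The arithmetic behind the percent versus percentage point distinction can be checked in a few lines of R. This is only a worked version of the example above; the object names are made up for illustration.

```r
# Gender wage gap, in percent (values from the example above).
gap_before <- 25
gap_after  <- 20

# Percentage point change: the simple difference between the two rates.
pp_change <- gap_after - gap_before                         # -5 percentage points

# Percent change: the difference relative to the starting rate.
pct_change <- (gap_after - gap_before) / gap_before * 100   # -20 percent

# A 5% decrease from the 25% baseline, for comparison.
gap_after_5pct <- gap_before * (1 - 0.05)                   # 23.75
pp_change_5pct <- gap_after_5pct - gap_before               # -1.25 percentage points
```

So a fall from 25% to 20% is a 5 percentage point decrease and, equivalently, a 20% decrease.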

    Workflow tip

    You can run individual chunks without knitting the entire document, so work the project code chunks one at a time. After you complete a chunk, confirm it runs cleanly by clicking the green play button at the top-right of the chunk. If it does, set eval = TRUE on that chunk so its output appears when you knit the whole document.

3.3 Slide-by-slide instructions

There are 35 slides in a successfully rendered deck. Some are just section dividers, with no point values and requiring no contribution from you. For those that require your input, whether code completions or text responses, we provide specific instructions for entering it correctly, along with the corresponding point values in parentheses. Here are slides 1-35:

  1. BUSN 5000 Project

    Title slide. It’s auto-generated by the YAML with your name as entered on the author: line.

  2. Academic Honesty Statement (1 point)

    Type your first and last name on the “Signature:” line. You may consult Terry Analytics Lab staff, the TA, or the instructor for assistance — but your deliverable must represent your work and be completed and submitted by you.

  3. Introduction

    Section divider.

  4. Overview (2 points)

    Provide a brief overview of the project. Include:

    • What you are trying to learn about
    • The data you are using to learn
    • A brief summary of your findings

    Limit your overview to 6 lines of text.

    Pro tip

    To quote Yoda: “Do or do not. There is no try.” When you explain what your project is about or summarize what you learned, don’t say “I try / seek / look / aim / attempt (or am trying / seeking / …)”. Instead, say “I show / document / demonstrate / report / find …”. Never say “I hope…”

    You might write this slide last. It’s easier to summarize your findings after you’ve completed the analysis.

  5. Data

    Section divider.

  6. March 2024 CPS (4 points)

    Using the ASEC documentation (cpsmar24_documentation.pdf), provide a brief overview of the March 2024 CPS. Include:

    • A description of the standard monthly CPS
    • The additional information collected in the ASEC
    • The approximate number of households surveyed in March 2024

    Limit your summary to 4 lines of text.

    Pro tip

    You can ask AI, but you should verify the response by consulting the CPS documentation and formulate the overview in your own words.

  7. March 2024 CPS Extract (4 points)

    Complete the read_data code chunk to read the extract you created with cpsmar_e.R.

    Then, in the write-up section, explain the actions you applied to the March 2024 survey in cpsmar_e.R to create the data extract cpsmar_e.csv. Include:

    • The variables you selected from the person file
    • The variables you selected from the household file
    • The restriction(s) you applied to the data extract
    • The number of observations and variables in the data extract

    Limit your explanation to 6 lines of text.

    Pro tips

    Refer to each variable by its plain English meaning, not the name used in the script. Refer to the key tidyverse “verbs” in the script, like mutate and rename.
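For the read_data step, a minimal sketch of reading a CSV extract and recovering the counts you need for the write-up might look like the following. It uses a small temporary file as a stand-in for cpsmar_e.csv and base `read.csv()`; both are assumptions made so the example runs anywhere, and the template may expect a different reader or path.

```r
# Stand-in for cpsmar_e.csv so the sketch runs anywhere; in the project you
# would point to the file that cpsmar_e.R wrote (its folder is not assumed here).
extract_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(age = c(30, 45, 61), earnings = c(52000, 0, 78000), female = c(1, 0, 1)),
  extract_path,
  row.names = FALSE
)

# Read the extract; base read.csv() is used here, though the template may
# expect readr::read_csv() instead.
cpsmar_e_demo <- read.csv(extract_path)

# Observations (rows) and variables (columns) to report in the write-up.
n_obs  <- nrow(cpsmar_e_demo)
n_vars <- ncol(cpsmar_e_demo)
```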

  8. Analysis sample (2 points)

    Complete the btl1 code chunk to create your analysis sample (cpsmar_a) with the following restrictions and additions:

    • Restrict to individuals who are 23 to 62 years old (inclusive)
    • Restrict to individuals who have positive earnings
    • Create a character-valued gender variable

    Underneath the code chunk, document the number of observations in your analysis sample.

    Limit your documentation to 2 lines.

    Pro tips

    Refer to your notes or the slides for examples of filter and mutate. Consult the Environment tab in the Northeast pane of RStudio for information about the cpsmar_a data frame.
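The three steps for the analysis sample can be sketched with `filter` and `mutate` on toy data. The column names `age`, `earnings`, and `female`, and the labels "Women" and "Men", are assumptions for illustration and may not match the names in your extract.

```r
library(dplyr)

# Toy stand-in for the extract; the column names age, earnings, and female
# are assumptions, not the names used in cpsmar_e.R.
toy_extract <- tibble(
  age      = c(22, 23, 40, 62, 63, 35),
  earnings = c(30000, 41000, 0, 58000, 60000, 47000),
  female   = c(1, 0, 1, 1, 0, 0)
)

toy_sample <- toy_extract %>%
  # between() is inclusive at both ends, so ages 23 and 62 are kept.
  filter(between(age, 23, 62), earnings > 0) %>%
  # Character-valued gender variable built from the 0/1 indicator.
  mutate(gender = if_else(female == 1, "Women", "Men"))

nrow(toy_sample)  # number of observations left after the restrictions
```

Only three of the six toy rows survive: two are outside the age range and one has zero earnings.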

  9. Baseline earnings distributions

    Section divider.

  10. Plotting earnings distributions

    Complete the btl2 code chunk to:

    • Create the figure1 ggplot object (the earnings distribution by gender plot)
    • Calculate average earnings for each gender using the earnings_fvm object
    • Pull the average earnings for women and men into single values (avg_earnings_f, avg_earnings_m) using filter and pull
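The second and third steps above (averaging by gender, then extracting single values) follow a common dplyr pattern, sketched here on made-up data. The `toy_` object names and the `avg_earnings` column name are hypothetical; the plot itself is not sketched.

```r
library(dplyr)

# Toy analysis sample; the earnings values are made up for illustration.
toy_sample <- tibble(
  gender   = c("Women", "Women", "Men", "Men"),
  earnings = c(40000, 60000, 55000, 75000)
)

# One row per gender with its average earnings.
toy_earnings_fvm <- toy_sample %>%
  group_by(gender) %>%
  summarise(avg_earnings = mean(earnings))

# filter() keeps the row for one gender; pull() extracts the column as a plain number.
toy_avg_f <- toy_earnings_fvm %>% filter(gender == "Women") %>% pull(avg_earnings)
toy_avg_m <- toy_earnings_fvm %>% filter(gender == "Men") %>% pull(avg_earnings)
```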

  11. Distribution of earnings by gender (8 points)

    Complete the btl2.5 code chunk to display Figure 1 by writing the name of the Figure 1 object.

  12. Baseline comparisons (4 points)

    Summarize the main empirical facts associated with Figure 1. Include:

    • A description of the most important fact communicated by the figure
    • The average earnings of men and women
    • The dollar difference and percentage difference in average earnings between men and women in the sample

    Limit your summary to 5 lines of text on the slide.

    Pro tips

    Use inline R syntax with the avg_earnings_f and avg_earnings_m objects to insert the respective averages in your write-up rather than typing the numbers manually.
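In R Markdown text, an inline expression has the form `r expression`, and whatever the expression returns is printed in the sentence. The sketch below shows the kind of calculation you might place inside such expressions. The averages are hypothetical, and measuring the percentage difference relative to men's average is an assumption; state the base you use.

```r
# Hypothetical averages; in the project these come from avg_earnings_f and avg_earnings_m.
avg_f_demo <- 50000
avg_m_demo <- 65000

# Dollar difference in average earnings (men minus women).
dollar_diff <- avg_m_demo - avg_f_demo

# Percentage difference, here measured relative to men's average
# (the choice of base is an assumption; state the base you use in your write-up).
pct_diff <- dollar_diff / avg_m_demo * 100

# Formatting for readable inline output.
format(avg_m_demo, big.mark = ",")  # "65,000"
round(pct_diff, 1)                  # 23.1
```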

  13. The career gender gap

    Section divider.

  14. Wages and hours differences (8 points)

    Complete the mi1 code chunk to create and display Table 1 (wages and hours by gender).

  15. Documenting the differences (2 points)

    Summarize the wage and hours differences presented in Table 1.

    Limit your documentation to 3 lines of text on the slide.

  16. Plotting career log wage profiles

    Complete the mi2 code chunk to estimate log wage profiles for women and men and create Figure 2.

  17. Career log wage profiles (8 points)

    Complete the mi2.5 chunk to display Figure 2.

  18. Estimating wage differences over a career

    Complete the mi3 code chunk to create Table 2. This involves:

    • Creating males and females objects using filter and rename
    • Merging the two into the diff_fvm object using inner_join
    • Calculating the difference between average log wages using mutate
    • Grouping by age_group
    • Using kable() to organize the results into the table2 object
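The split, join, and difference pattern in these steps can be sketched as follows. The sketch assumes the average log wages have already been computed by age group and gender; the age-group labels, column names, and `toy_` object names are made up for illustration, and `select` is added here only to keep the toy tables tidy.

```r
library(dplyr)

# Toy average log wages by age group and gender (made-up numbers).
toy_profiles <- tibble(
  age_group = rep(c("23-32", "33-42", "43-52"), times = 2),
  gender    = rep(c("Women", "Men"), each = 3),
  avg_lwage = c(3.00, 3.20, 3.25, 3.10, 3.40, 3.50)
)

# Split by gender and rename the wage column so the two stay distinct after the join.
toy_females <- toy_profiles %>%
  filter(gender == "Women") %>%
  select(age_group, avg_lwage) %>%
  rename(lwage_f = avg_lwage)

toy_males <- toy_profiles %>%
  filter(gender == "Men") %>%
  select(age_group, avg_lwage) %>%
  rename(lwage_m = avg_lwage)

# inner_join() keeps age groups present in both tables; mutate() adds the gap.
toy_diff_fvm <- inner_join(toy_females, toy_males, by = "age_group") %>%
  mutate(diff = lwage_f - lwage_m)

# kable() turns the data frame into a formatted table for the slide.
toy_table2 <- knitr::kable(toy_diff_fvm, digits = 2)
```

Because the wages are in logs, each value in `diff` is a log difference, which approximates the proportional gap at that age group.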

  19. Evolution of the gender wage gap (8 points)

    Complete the mi3.5 code chunk to display Table 2.

  20. Discussing the gender wage gap evolution (2 points)

    Summarize the results presented in Figure 2 and Table 2.

    Limit your summary to 3 lines of text on the slide.

  21. Explaining the gender wage gap

    Section divider.

  22. Fitting the log wage profiles

    Complete the reg1 code chunk to create Figure 3 by fitting the career profiles with a quadratic in age.
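A quadratic in age means regressing the log wage on age and age squared. The sketch below fits one with base `lm()` on a made-up profile; the template may instead build the fit inside the plotting code, so treat this as an illustration of the specification rather than the required approach.

```r
# Toy log wage profile that is exactly quadratic in age (made-up coefficients).
toy_profile <- data.frame(age = 23:62)
toy_profile$lwage <- 1 + 0.08 * toy_profile$age - 0.001 * toy_profile$age^2

# I(age^2) tells lm() to treat age squared as its own regressor.
quad_fit <- lm(lwage ~ age + I(age^2), data = toy_profile)
coef(quad_fit)

# Fitted values trace the smooth profile that would be drawn over the raw averages.
toy_profile$lwage_fit <- fitted(quad_fit)

# Age at which the fitted profile peaks: -b1 / (2 * b2).
b <- coef(quad_fit)
peak_age <- unname(-b["age"] / (2 * b["I(age^2)"]))
```

With these made-up coefficients the fitted profile peaks at age 40, rising before and flattening or falling after.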

  23. Log wage profiles with quadratic fits (8 points)

    Complete the reg1.5 chunk to display Figure 3.

  24. Gender differences in education (8 points)

    Complete the ed_vars code chunk to create and display Table 3 (educational attainment by gender).

  25. Gender differences in demographics (8 points)

    Complete the demo_vars code chunk to create and display Table 4 (demographic characteristics by gender).

  26. Documenting differences in characteristics (4 points)

    • Summarize the educational attainment differences presented in Table 3
    • Summarize the differences in demographic characteristics presented in Table 4

    Limit your documentation to 6 lines of text on the slide.

  27. Controlling for education and demographic characteristics

    Complete the reg2a code chunk to create:

    • The singles subset of the analysis data
    • The models object containing five regression models that incrementally add controls for education and demographic characteristics, with the fifth restricting to unmarried workers without children under 6 (“Only Singles”)

    In the regression analysis, you’ll distinguish between personal and household characteristics among the demographic variables.

    Pro tips
    • Moving from top to bottom in the models object, the list of controls grows as indicated by the name of the model — with the exception of the final model “Only Singles”, which estimates the same relationship as the “Add Person” model but on the new subset.
    • To decide whether a variable belongs in the “Add Person” or “Add Household” model, ask yourself: “Can I define this variable for a particular individual irrespective of other individuals who may or may not exist in their life?” If yes → Person. If no (it depends on someone else, like a spouse or a child) → Household.
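The structure of the models object can be sketched as a named list of `lm()` fits whose formulas grow from one entry to the next. Everything in the toy data is made up: the variable names, the choice of `veteran` as a personal characteristic, and the names "Baseline" and "Add Education" are assumptions, as is including age in the first model.

```r
# Toy data with made-up variables; none of these names come from the template.
set.seed(5000)
n <- 200
toy <- data.frame(
  female  = rbinom(n, 1, 0.5),
  age     = sample(23:62, n, replace = TRUE),
  college = rbinom(n, 1, 0.4),   # education
  veteran = rbinom(n, 1, 0.2),   # personal characteristic
  married = rbinom(n, 1, 0.5),   # household characteristic
  kids_u6 = rbinom(n, 1, 0.3)    # household characteristic
)
toy$lwage <- 2.5 - 0.15 * toy$female + 0.01 * toy$age +
  0.3 * toy$college + rnorm(n, sd = 0.4)

# Unmarried workers without children under 6.
toy_singles <- subset(toy, married == 0 & kids_u6 == 0)

toy_models <- list(
  "Baseline"      = lm(lwage ~ female + age, data = toy),
  "Add Education" = lm(lwage ~ female + age + college, data = toy),
  "Add Person"    = lm(lwage ~ female + age + college + veteran, data = toy),
  "Add Household" = lm(lwage ~ female + age + college + veteran + married + kids_u6,
                       data = toy),
  # Same specification as "Add Person", estimated on the singles subset.
  "Only Singles"  = lm(lwage ~ female + age + college + veteran, data = toy_singles)
)

# Female coefficient in each model, labeled by model name.
sapply(toy_models, function(m) unname(coef(m)["female"]))
```

Keeping the fits in a named list lets the names double as column headers when the models are tabulated.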

  28. Reporting the results

    Complete the reg2b code chunk to create Table 5. This involves:

    • Constructing the coefficient map object to display only the gender and age coefficient estimates
    • Constructing the goodness-of-fit object to display sample size and \(R^2\) values
    • Constructing a rows object to distinguish the regression specification associated with each column

    Pro tip

    Make sure you report robust standard errors and say so in a table note.
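To see what "robust" changes, the sketch below computes one common variant (HC1) by hand in base R and sets it beside the classical standard errors. This is an illustration on made-up data, not the template's method: in the project the table function most likely handles this through an option, and it may use a different variant.

```r
# Toy heteroskedastic data (made up): the noise grows with age.
set.seed(42)
n <- 300
toy <- data.frame(female = rbinom(n, 1, 0.5), age = sample(23:62, n, replace = TRUE))
toy$lwage <- 2.5 - 0.15 * toy$female + 0.01 * toy$age + rnorm(n, sd = 0.01 * toy$age)

fit <- lm(lwage ~ female + age, data = toy)

# HC1 sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n / (n - k).
X <- model.matrix(fit)
e <- residuals(fit)
k <- ncol(X)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # equals X' diag(e^2) X
vcov_hc1 <- bread %*% meat %*% bread * n / (n - k)

robust_se    <- sqrt(diag(vcov_hc1))
classical_se <- sqrt(diag(vcov(fit)))
cbind(classical_se, robust_se)
```

The coefficient estimates are identical either way; only the standard errors, and therefore the inference, differ.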

  29. Explaining the gender wage gap (8 points)

    Complete the reg2.5 chunk to display Table 5.

  30. Documenting the findings (4 points)

    Summarize how the estimated average gender wage gap changes as you add education, personal, and household characteristics to the regression — and then how it changes when the sample is restricted to singles.

    Limit your documentation to 6 lines of text on the slide.

    Pro tips

    Start your write-up with the baseline model and describe subsequent results relative to the baseline. Focus on the Female coefficient estimate and its standard error. Remember how the coefficient estimate is correctly interpreted.
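When the outcome is a log wage, the coefficient on a 0/1 indicator is a log difference rather than a percent. The short calculation below, using a hypothetical coefficient, shows the approximate and exact conversions; check your course notes for the reading you are expected to use.

```r
# Hypothetical Female coefficient from a log wage regression.
b_female <- -0.20

# Rough reading: about 20 log points, often quoted loosely as "about 20 percent".
approx_pct <- 100 * b_female

# Exact conversion from a log difference to a percent difference.
exact_pct <- 100 * (exp(b_female) - 1)
round(exact_pct, 1)  # -18.1
```

The two readings are close for small coefficients and drift apart as the coefficient grows in magnitude.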

  31. Conclusion

    Section divider.

  32. Summary (4 points)

    Briefly summarize the objective of the project and its main findings. Note:

    • The sample on which your analysis is based
    • The overall gender wage gap
    • How it evolves over a career
    • How it varies when controlling for education and demographic characteristics

    Limit your documentation to 7 lines of text on the slide.

  33. Appendix

    Section divider.

  34. Data documentation

    Complete the var_doc code chunk to create a table of the main variables used in this project with their definitions.

  35. List of main variables with definitions (4 points)

    Once you complete the var_doc chunk on the previous slide, set eval = TRUE on this slide’s chunk to render the table of main variables with their definitions.

3.4 Common pitfalls

If you completed the Pre-project Exercise successfully, you have already bypassed the most common mistakes. If you nevertheless are having trouble completing the Main Project, Common Errors provides a comprehensive listing of problems we have seen students encounter, along with explanations and solutions. Some typical issues with the Main Project are:

  • Variable name confusion — defining cpsmar_a but referencing cpsmar_e (or vice versa) in a later chunk. See Common Errors.
  • Forgetting to run cpsmar_e.R before knitting. See Common Errors.
  • Touching the setup chunk’s include = FALSE — same scaffolding error as the pre-project. See Common Errors.
  • Editing the YAML beyond the author line. See Common Errors.

3.5 What happens next

Whether you are at the Progress Check stage or at the end, the next step is to prepare to submit your work to Gradescope. Submission walks you through the entire process.