Organization

Standardized practices for file organization and storage save time and ensure consistency, enhancing the overall quality of research outputs. A simple and flexible folder structure not only promotes long-term data stability but also supports seamless project growth, adaptability, and researcher transitions. Such an approach reduces the complexity of project management and aligns effectively with version control systems, enhancing collaborative efforts and preserving institutional knowledge.

1 Summary

  • Keep raw data in a dedicated folder (raw/) and never modify it directly.

  • Use a simple, descriptive folder structure that reflects project logic and is easy for new users to understand.

  • Support version control by organizing code and outputs in a way that works with Git, i.e. by creating separate repository folders for each user.

  • Design for flexibility, allowing new data, methods, or collaborators without major reorganization.

  • Name files/folders to be human- and machine-readable: use clear names, avoid spaces and special characters, and include numeric prefixes for order.

2 Directory structure

Regardless of the specific method deployed, your data project organization should have the following qualities:

  • Raw data are kept in a distinct folder and never modified or overwritten. Always keep an unaltered version of original data files, “warts and all.” Avoid making any changes directly to this file; instead, perform corrections using a scripted language and save the modified data as separate output files. Consider making raw datasets “read-only” so they cannot be accidentally modified.
  • Simple: The folder structure should be easy to navigate and understand, even for someone new to the project. It should mirror the logical flow of the project and use clear, descriptive names that reflect the contents and purpose of each folder.
  • Flexible: The structure should be adaptable to evolving project needs, allowing for the addition of new data, methods, or collaborators without disrupting the existing organization. It should support different types of data and workflows, making it easy to integrate new elements as the project evolves.

These qualities also facilitate version control practices. There is additional guidance on organization for version control here.

Below is an example of a directory structure that would be compatible with version control implementation on RFF’s L:/ drive. This version illustrates a personal repository folder model of code version control, which operates best on the L:/ drive.

project_name/
├── data/
│   ├── raw/
│   ├── intermediate/
│   ├── clean/ (optional)
├── results*/
├── repos/
│   ├── edgar/
│   │   ├── scripts/
│   │   │   ├── processing/
│   │   │   ├── analysis/
│   │   │   ├── tools/
│   │   ├── results**/
│   │   ├── docs/
│   ├── caudle/
│   │   ├── scripts/
│   │   │   ├── processing/
│   │   │   ├── analysis/
│   │   │   ├── tools/
│   │   ├── results**/
│   │   ├── docs/

2.1 Data folder

  • The data folder contains all project data sets.
  • Raw data is preserved in its own subfolder.
  • The intermediate folder contains datasets created during data cleaning and processing.
  • If practical, a clean folder can contain cleaned output datasets, but note that it’s often unclear when datasets are truly “clean” until late project stages.
Note

For projects with small datasets, this folder can be version controlled, with larger files ignored using the .gitignore file. In this case, this folder would be a subfolder of individuals’ repository folders.

2.2 Results folder

  • The results folder contains analysis results, and model outputs. For example, it can be used to store tables, figures, and model estimates.
  • *For internal RFF projects, the results folder can be stored in the main directory and not within repositories.
  • **For projects with external collaborators, it may be useful to store the results folder within repositories. These files are generally smaller than 100 MB and can be stored in a main repository using GitHub; however, some formats (such as SVG) can be quite large. The .gitignore file can be configured to ignore certain file types (such as SVG files, when both PNG and SVG file formats are generated).

2.3 Repos folder

  • This directory structure is based on personal repositories. The repos, or repositories, directory contains version-controlled files. To allow individuals to work on version controlled files without interfering with others’ versions, each researcher should have their own folder that’s linked to the GitHub remote repository. Team members can then clone the project’s Git repository into their respective folders and work exclusively within their own copies. Changes can be synced and reconciled using GitHub, preventing simultaneous edits of the same file and ensuring effective version control.
  • Individual folders (e.g. Edgar, Caudle) should have mirrored (the same) directory structures.

2.4 Scripts folders

  • The scripts folder contains all code files (e.g. .R, .py, .do) for the project. These scripts provide explicit instructions for processing data and performing analyses. It should be possible to reproduce the entirety of the project’s processed data sets and results using only these scripts and the raw data.
  • The scripts folder can be parsed into subfolders containing scripts that process raw data (processing), analyze processed data (analysis), and tools. The tools folder (sometimes called util, modules, or helpers) contains scripts with distinct functions that can be “called” (referenced) in the main processing scripts. This is especially useful if functions are used multiple times or are lengthy. Separately storing functions that may be used in multiple source code scripts is an important practice in creating quality software.

2.5 Docs folder

The documents folder should contain any version-controlled shared documents (e.g. LaTeX, Markdown, Overleaf).

2.6 Other project files

This template does not include specific folders for meeting notes, literature reviews, presentations, project management, etc., because those types of files are not the focus of this guidance.

We recommend storing these types of files in the project-level Microsoft Teams folder. See the RFF Communication Norms and Guidance for more information.

3 Other organization practices

3.1 Subfolders

Organizing files into subfolders can help manage complexity and improve workflow. Subfolders are particularly useful when a single folder grows too large, making it hard to locate specific scripts, data, or results. By creating logical groupings you can keep related files together and streamline collaboration. Examples of logical groupings for subfolder names are by

  • data source (e.g., usda),
  • variable (precipitation),
  • processing step (merge), or
  • results category (e.g., regressions or projections).

Ideally, project folders should be organized so that each file’s complete path is informative about the role of the folder’s contents in the project. Some examples are:

  • A script for cleaning Corelogic transactions data: scripts/corelogic/01_cleaning/01_clean_trans_data.R
  • Raw Corelogic transactions data from Los Angeles County (FIPS code 06037): raw_data/corelogic/transactions/06037.dta
  • A regression table for the outcome variable property values, with heterogeneity results by region: results/regressions/property_value/table_region.tex

4 Naming folders, files and scripts

When creating folders:

  • Avoid ambiguous/overlapping categories and generic catch-all folders (e.g. “temp” or “new”).
  • Avoid creating or storing copies of the same file in different folders.

When creating data or script files, make them:

  • Human readable: Create brief but meaningful names that can be interpreted by colleagues.

    • Make names descriptive: they should convey information about file/folder content. For example, if you’re generating output visualizations of the same metric, instead of county_means_a and county_means_b, use county_means_map and county_means_boxplot.
    • Avoid storing separate versions of files (e.g. county_means_map_v2), and instead rely on version control tools to save and document changes.
    • If you must create different versions of files, make sure to document the distinction in a README file.
    • If you use abbreviations or acronyms, make sure they are defined in documentation such as a project-level README file.
  • Machine readable:

    • Use only ASCII characters (letters, numbers, and underscores).
    • Do not include spaces or special characters (/  : * ? <> &).
    • Use hyphens or underscores instead of spaces (the “snake_case” method).
    • Files and folders should be easy to search and filter based on name using structured file names. The ability to sort and read files by name is useful and helps organization but requires specific conventions. See examples in the table below, keeping in mind that:
      • It is not recommended to use dates in script file names to denote when a script was created or modified. Instead, leverage version control to save and document different script versions.
      • Make sure to “left pad” numbers with zeros. For example, use 01 instead of 1. This is to allow default sorting to still apply if and when the file name prefixes enter the double digits.
Practice for structured file names Description Example
ID-based names Structure file paths and names in a way that makes them easy to access programmatically—e.g., enabling batch loading or iteration across identifiers like years/dates, counties, or scenarios. For example, storing county-level data as shown would allow data to be read into memory by simply looping over FIPS codes as they appear in the directory file names. 53019_data.csv
53033_data.csv
53061_data.csv
Chronological order Use ISO 8601 format for date-based files: YYYY-MM-DD. This ensures dates sort correctly by default. 2021_01_01_precipitation_mm.csv
2021_01_02_precipitation_mm.csv
2021_01_01_temperature_statistics_f.csv
2021_01_02_temperature_statistics_f.csv
Logical processing order For scripts or folders that must run in sequence, use numeric prefixes to indicate the intended order of execution. 01_clean_raw_data.R
02_merge_clean_data.R
03_descriptive_statistics.R
04_regressions.R