Workflows
Now that we know some basic Git principles, how do we make a workflow for a given project? First, let’s start by talking about some key principles for any workflow, then we can get into some of the specific considerations for a project team at RFF.
1 Key Principles
- Users should commit changes frequently, with descriptive messages.
- Frequent commits make it so that it is easier to find discrete check-points in the repository, rather than having everything committed at the same time.
- Try to have commits provide incremental improvements. For example, a commit might include the addition of a single function, or a 2-line fix to an existing script.
- If you find yourself coding for hours between commits, consider trying to break up your task into discrete sub-tasks.
- Pull often to ensure that you are incorporating the changes of others.
- A local clone of a repository should generally ONLY have a single user.
- This is why in the Directory Structure section of the Data Management guidance, we recommend that each user works from a repository clone stored in their own individual subfolder within the L:/ drive project folder.
- This is why in the Directory Structure section of the Data Management guidance, we recommend that each user works from a repository clone stored in their own individual subfolder within the L:/ drive project folder.
- Take care with encoding file paths within a repository. You want other people to be able to use the repository without having to move files around or change all the file paths in the repository. You could either:
Use relative file paths to point to other files that are located in the repository.
Use absolute file paths (i.e. on a shared network drive like the L drive) and consider making a variable for the path directory, especially if used multiple times. That way, if a collaborator has their data at a different path on their system, they only need to change a single path.
data_path = “L:/ProjectName/Data” data_file_1 = joinpath(data_path, “data_file_1.csv”) data_file_2 = joinpath(data_path, “data_file_2.csv”)
If you use relative file paths pointing to files outside the repository, specify in the project README how the external files and the repository should be positioned relative to each other.
2 Workflow Planning
When a project team is deciding to use Git, it is important to think through the following considerations. We recommend answering these questions at the beginning of a version-controlled project.
2.1 Where should each team member store their local repositories?
Here are two examples of how a project / team might choose to store their repositories.
- Generally Recommended: Each team member keeps their own working version of a repository on a shared drive, i.e. the L drive, which could (but probably shouldn’t) be accessed by other team members., i.e. inside of a larger project directory
- Pros: individual users’ repositories are available on all lab computers; repos are stored within the project folder, allowing easy access to shared project files (e.g. raw or processed data files)
- Cons: files could be accidentally changed by others if they are working in the incorrect version of the repository
- Each team member keeps their own working version of a repository somewhere private, either on the
C:/
drive of their personal computer, or on the RFF servers, with the intention of no other team members having access to it. This is suitable for teams that have many repositories where it would get very confusing to store all the repositories in the same location for the whole team.- Pros: it is “clean”, with no accidental file changes from other users.
- Cons: It would require care to be taken for access (file paths) to non-version-controlled data files.
2.2 Where should the GitHub repository be hosted?
- Do you work on a team with their own GitHub Organization (i.e.
E4ST
)?- If so, verify with the other members of the team that it would be acceptable to use the organization to host the repository
- If your team doesn’t already have one, does it make sense to create an organization for your team? Are there likely to be multiple repositories that would make sense to group under the same organization? If so, follow the steps in the GitHub Documentation for creating organizations.
- Does this repository belong in the RFF Organization?
- Generally, if the repository is associated with a publication, the code should be forked or transferred to the organization.
- We suggest creating the repository as a personal repository and changing ownership or forking.
- See the Data Management section on code publication for additional guidance.
- If it doesn’t make sense to host the repository in an organization, we recommend having one of project members host it with their GitHub account.
2.3 Should our repository be public or private?
Public repositories are visible to anyone, whereas private repositories allows for controlled access. For public repositories, it is still possible to control who is able to contribute to them, while private reporitories allow you to control who can even see them.
Generally, select for your repository to be private for repositories that will contain sensitive/proprietary information, and public if the project requires it to be public. Availability of data/code is important to ensure reproducability so making a repository public is a good thing in most cases. If you have sensitive data but would like for the repository to remain public, consider storing code and documentation in the repository, and sensitive data in a different way. See the Sensitive and Proprietary data section It is easy to change from private to public later on, so when in doubt choose private.
If you have chosen to make your project public, you will need to also select which license to use. See the section on software licenses to help make this decision.
2.4 Should our team use branches?
To learn about branches, what they are for, and how to use them, see the Branching section of the Tutorial. There are a few aspects to consider with this question.
- How many people will be making commits?
- If only one person will be making commits, it may save time and effort to maintain a single branch, without creating additional branches.
- With more people making commits, using multiple branches may help avoid unnecessary merge conflicts.
- Is it important to have a fully functional version at all times?
- Branches can be very good for preserving a functional version of the code, while allowing for risk-free experimental features to be developed.
- Is it important to have code review for this project?
- Branches facilitate streamlined code review, where reviewers only review sections that have changed (via Pull Requests).
Here are some examples of when a branch might be useful:
- Major Changes/Experimental Feature: Say you have an idea on how to improve your code, and it requires significantly re-vamping the code. It would make sense to create a branch for this experimental feature, so that you can develop the feature without worrying about making changes to the main branch. Then the whole team can use the Pull Request to evaluate whether or not they’d like to merge into the main branch.
- New Team Member: You have an existing model that is already working well. A new researcher joins the project team and is tasked with a code improvement. They are new to the model, so you want to review their changes carefully before applying them to the main branch. It makes sense to have them create a branch for their development, and review it with the pull request.
- Concurrent Development: Your team has a couple of features that need to be developed in tandem. To prevent extensive conflicts and streamline the process of combining the new features, each feature gets its own branch.
- Project: If your team has a project that requires making project-specific changes, it could be a good idea to create a project branch, with no intention to merge the project branch into
main
. ⚠️However, for features that would be useful to have in themain
branch, implementing those features in the project branch would make it difficult to merge them back into themain
branch later on. In that case, it is best to implement those features in themain
branch or in a separate branch (to be merged with main), and merge themain
branch back into the project branch. It is difficult to selectively merge changes from the project branch into themain
branch.
2.5 If using branches, how do we want to handle code review?
It would be good to talk about the following questions with your team:
- Should code review be required for every change to the code?
- One example of this would be requiring that every change be made in a branch and submitted via Pull Request, and that at least one person who did not author the Pull Request must approve it in order for the branch to be merged into the main branch.
- Your team may decide to only require a Pull Request and code review for major features.
- Who is allowed/required to approve Pull Requests?
- One example would be to require every Pull Request created by a newer team member to be reviewed by a more experienced team member.