GEOG 489
Advanced Python Programming for GIS

1.8 Version control systems, Git, and GitHub


Version control systems

Software projects often grow in complexity and expand to include multiple developers. Version control systems (VCS) are designed to record changes to data and to support team collaboration. The data, often software code, is backed up to prevent loss, and every change made to it is tracked. VCS also facilitate teamwork by helping to merge different contributors’ changes. Version control [1] is also known as “revision control.” Version control tools like Git help development teams or individuals manage their projects in a logical, procedural way, without needing to email copies of files around and worry about who made which changes in which version.

Differences between centralized VCS and distributed VCS

Centralized VCS, like Subversion (SVN), Microsoft Team Foundation Server (TFS) and IBM ClearCase, use a centralized, client-server model for data storage and, to varying degrees, discourage “branching” of code (discussed in more detail below). These systems instead encourage a file check-out, check-in process and often have longer “commit cycles,” where developers work locally with their code for longer periods before committing their changes to the central repository for back-up and collaboration with others. Centralized VCS have a longer history in the software development world than distributed VCS (DVCS), which are comparatively newer. Some of these tools are difficult to compare solely on their VCS merits because they do more than version control. For example, TFS and ClearCase are not just VCS software, but integrate bug tracking and release deployment as well.

Distributed VCS (DVCS), like Git (what we’re focusing on) or Mercurial (hg), use a decentralized, peer-to-peer model in which each developer can copy (“clone”) an entire repository, including its history, to their local environment. This creates a system of distributed backup: if any one system becomes unavailable, the code can be reconstructed from a different developer’s copy of the repository. It also allows off-line editing of the repository code when a network connection to a central repository is unavailable. In addition, DVCS software encourages branching to allow developers to experiment with new functionality without “breaking” the main “trunk” of the code base.

A hybrid VCS might use the concept of a central main repository that can be branched by multiple developers using DVCS software, but where all changes are eventually merged back to the main trunk code repository. This is generally the model used by online code repositories like GitHub or Bitbucket.

Basics of Git

Git is a VCS that stores and tracks source code in a repository. A variety of data about code projects is tracked, such as what changes were made, who made them, and comments about the changes [3]. Past versions of a project can be accessed and reinstated if necessary. In shared projects, permissions (typically enforced by the server or hosting platform rather than by Git itself) control which changes get incorporated into the master repository. In projects with multiple people, one user is often designated as the project owner and approves or rejects changes as necessary.
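To make the idea of tracked history more concrete, the following is a minimal toy model written in plain Python. It is not how Git is actually implemented; the names Commit and ToyRepository are hypothetical and exist only for this illustration. Each commit records a snapshot, an author, a comment, and a link to the previous commit, which is enough to list the history and to retrieve an earlier version.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """One recorded snapshot in the toy history (not Git's real object format)."""
    snapshot: str           # full file contents at this point in time
    author: str             # who made the change
    message: str            # comment describing the change
    parent: Optional[str]   # id of the previous commit, or None for the first one

    @property
    def commit_id(self) -> str:
        # Derive the id from the commit's content, loosely mirroring the way
        # Git identifies commits by a hash of what they contain.
        payload = "\n".join([self.snapshot, self.author, self.message, self.parent or ""])
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:10]


class ToyRepository:
    """Hypothetical stand-in for a repository: a dictionary of commits plus a head pointer."""

    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {}
        self.head: Optional[str] = None

    def commit(self, snapshot: str, author: str, message: str) -> str:
        """Record a new snapshot on top of the current head and return its id."""
        new_commit = Commit(snapshot, author, message, self.head)
        self.commits[new_commit.commit_id] = new_commit
        self.head = new_commit.commit_id
        return new_commit.commit_id

    def log(self) -> List[Commit]:
        """Return the history from the newest commit back to the first one."""
        history = []
        current = self.head
        while current is not None:
            entry = self.commits[current]
            history.append(entry)
            current = entry.parent
        return history

    def checkout(self, commit_id: str) -> str:
        """Return the snapshot stored in an earlier commit."""
        return self.commits[commit_id].snapshot


repo = ToyRepository()
first = repo.commit("buffer_distance = 100\n", "alice", "Add buffer distance")
second = repo.commit("buffer_distance = 250\n", "bob", "Increase buffer distance")

for entry in repo.log():
    print(entry.commit_id, entry.author, entry.message)

# Reinstating a past version means reading back the older snapshot.
print(repo.checkout(first))
```

The loop prints the newer commit first, followed by the original one, and the final line prints the contents as they were at the first commit.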

Changes to the source code are handled by branches, merges, and commits. Branching lets a developer create a separate line of development within a repository that proceeds in parallel with the main trunk of the code base. (Forking is a related but distinct idea: it means copying an entire repository, usually on a hosting platform such as GitHub.) Branching is typically done to allow multiple developers to work separately and then merge their changes back into a main trunk code repository.
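The commit, branch, and merge cycle can be tried out from Python by calling the git command line through the standard subprocess module. The sketch below assumes that git is installed and on the PATH (it returns None otherwise). The helper names, the file name buffer_tool.py, the branch name feature, and the placeholder user identity are all made up for this example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional


def git(repo: Path, *args: str) -> str:
    """Run one git command inside the folder repo and return its standard output."""
    command = [
        "git",
        "-c", "user.name=Example Student",         # placeholder identity for the demo
        "-c", "user.email=student@example.com",
        *args,
    ]
    result = subprocess.run(command, cwd=repo, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def branch_and_merge_demo() -> Optional[List[str]]:
    """Create a throwaway repository, branch, commit, merge, and return the commit messages."""
    if shutil.which("git") is None:
        return None  # git is not available on this machine

    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        script = repo / "buffer_tool.py"

        # First commit on the main trunk.
        git(repo, "init")
        script.write_text("distance = 100\n")
        git(repo, "add", "buffer_tool.py")
        git(repo, "commit", "-m", "Add buffer tool")
        # The trunk may be called master or main depending on the Git configuration.
        trunk = git(repo, "rev-parse", "--abbrev-ref", "HEAD")

        # Work on a separate branch without touching the trunk.
        git(repo, "checkout", "-b", "feature")
        script.write_text("distance = 100\nunits = 'meters'\n")
        git(repo, "commit", "-am", "Add units")    # -a stages changes to tracked files

        # Switch back and merge. The trunk has not changed in the meantime,
        # so the merge only needs to move the trunk forward (a fast-forward).
        git(repo, "checkout", trunk)
        git(repo, "merge", "--ff-only", "feature")

        return git(repo, "log", "--format=%s").splitlines()


print(branch_and_merge_demo())
```

Because everything happens in a temporary folder, running the sketch leaves no repository behind. The returned list contains the commit messages that are on the trunk after the merge, newest first.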

Although Git is commonly used on code projects with multiple developers, the technology can be applied by any number of users (including one) working on any type of digital file. Git has also gained in popularity because it serves as the back end for GitHub and other platforms. Although other VCS exist, Git is frequently chosen since it is free, open source, and easy to set up.

Dictionary

Git has a few key terms to know moving forward [2]:

  • Repository (n.)- place where the history of work is stored
  • Clone (n.)- a copy you make of someone else’s repository which you may or may not intend to edit
  • Fork (v.)- the act of copying someone else’s repository, usually with the intent of making your own edits
  • Branch (n.)- a parallel line of development within a repository. Unlike a clone or a fork, a branch is not a separate copy of the whole repository. The intent with a branch is to make edits that are eventually either merged back into the parent branch or continued as a separate line of development.
  • Merge (v.)- integrating changes from one branch into another branch
  • Commit (n.)- an individual change to a file or set of files. It’s somewhat similar to hitting the “save” button.
  • Pull (v.)- integrating others’ changes into your local copy of files
  • Pull request (n.)- a request from another developer to integrate their changes into the repository
  • Push (v.)- sending your committed changes to a remote repository

Basic Git progression

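A Git repository begins as a folder, either one that already exists or one that is created specifically to house the repository. For the cleanest approach, this folder will only contain folders and files that contribute to one particular project. When a folder is designated as a repository, Git adds a hidden subfolder called .git that houses several folders and files. Many repositories also contain two text files at the top level, .gitignore (which lists files Git should not track) and .gitmodules (which appears when the project uses submodules). These two files are not created by default when a repository is initialized; .gitignore is usually added by the user or by a project template. All three items are highlighted in Figure 1.29.

screenshot of .git folder, .gitignore and .gitmodules
Figure 1.29 The highlighted portions are the .git folder that Git adds when a repository is created and the .gitignore and .gitmodules files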
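Since the hidden .git subfolder is what marks a folder as a repository, a script can check whether a path belongs to a repository by looking for it. The following is a minimal sketch using only the standard library. The function name find_repository_root is hypothetical, and the check is deliberately simple: it only tests that an entry named .git exists and does not verify that its contents are valid.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_repository_root(start: Path) -> Optional[Path]:
    """Return the nearest folder at or above start that contains a .git entry."""
    start = start.resolve()
    for folder in [start, *start.parents]:
        if (folder / ".git").exists():
            return folder
    return None  # no enclosing repository was found


# Build a fake project layout in a temporary folder to try the function out.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / ".git").mkdir()      # stands in for the folder Git would create
    (project / "scripts").mkdir()
    root = find_repository_root(project / "scripts")
    print(root == project.resolve())
```

Walking up through the parent folders mirrors what happens in practice: a script several levels deep inside a project is still part of the repository whose .git folder sits at the top of that project.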

The contents of the .git folder handle all of the version control and tracking as the user commits changes to Git. If the user does not commit their changes, the changes are not “saved” in the version control system. Because of this, it’s best to commit changes at fairly frequent intervals. At this point, the committed changes exist only in the repository on that user’s computer; they reach other developers only once they are pushed to a shared repository.

If the user is working on a branch of another repository, they will want to pull changes from the master repository fairly often to make sure they’re working on the most recent version of the code. If a conflict arises because the branch and the master have both changed the same place in different ways, Git flags the conflict and the user must decide how to resolve it. When the user wants to integrate their changes with the master repository, the user creates a pull request to the owner of the repository. The owner then reviews the changes made and any conflicts that exist, and either accepts the pull request to merge the edits into the master repository or sends the changes back for additional work. These workflow steps may happen hundreds or thousands of times throughout the lifetime of a code project.
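The notion of a conflict can be illustrated with a small toy example. The sketch below compares two edited versions of a file against their common starting point, line by line. It is a deliberate simplification and not Git's actual merge algorithm: it assumes all three versions have the same number of lines, and the function name three_way_merge and the conflict marker format are invented for this illustration.

```python
from typing import List, Tuple


def three_way_merge(base: List[str], ours: List[str], theirs: List[str]) -> Tuple[List[str], List[int]]:
    """Merge two edited versions line by line against their common base.

    Returns the merged lines and the indices of the lines that conflict.
    Toy simplification: all three versions must have the same number of lines.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this toy merge only handles versions of equal length")

    merged: List[str] = []
    conflicts: List[int] = []
    for index, (base_line, our_line, their_line) in enumerate(zip(base, ours, theirs)):
        if our_line == their_line:
            merged.append(our_line)        # unchanged, or both sides made the same edit
        elif our_line == base_line:
            merged.append(their_line)      # only the other side changed this line
        elif their_line == base_line:
            merged.append(our_line)        # only our side changed this line
        else:
            # Both sides changed the same line in different ways: a conflict
            # that a person has to resolve.
            merged.append(f"<<< {our_line} ||| {their_line} >>>")
            conflicts.append(index)
    return merged, conflicts


base = ["distance = 100", "units = 'meters'", "layer = 'roads'"]
branch = ["distance = 250", "units = 'meters'", "layer = 'roads'"]
master = ["distance = 500", "units = 'meters'", "layer = 'streets'"]

merged, conflicts = three_way_merge(base, branch, master)
print(merged)
print(conflicts)
```

In this example the third line was changed only in the master version, so that edit is accepted automatically. The first line was changed differently in both versions, so it is reported as a conflict at index 0.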

On its own, Git operates through a command line interface; users perform all actions by typing commands. Although this method is perfectly fine, visualizing what’s going on with the project can be a bit hard. To help with that, multiple graphical user interfaces (GUIs) have been created to visualize and thus simplify the version control process, and some IDEs include built-in version control hooks. Currently, GitHub is the most popular web-based hosting platform for Git repositories and offers a free tier for basic users.


Resources:
[1] https://en.wikipedia.org/wiki/Version_control
[2] https://github.com/kansasgis/GithubWebinar_2015 
[3] https://en.wikipedia.org/wiki/Git