PPOL564 - Data Science I: Foundations

Lecture 2

Version Control

Concepts we'll cover today:

  • Conceptualize what version control is and how it works;
  • Basics of git and operating it with command line arguments;
  • Generating repositories on GitHub;
  • Linking local repositories with online remotes;
  • Dealing with merge conflicts;
  • Advanced functionality:
    • Branching

Cheatsheets

Note that the purpose of this notebook is to serve as a cheatsheet/reference for future use, though there are better ones out there than what I've outlined here. I've listed a few below for your use.

Working from the Command Line

The command line offers an easy way in which to navigate the computer. From it, we can:

  • create, move, edit files
  • install new functionality onto our computer
  • run scripts in R or python

The "command line" can differ, however, depending on what machine you're running.

  • If you're on a Mac, a Unix command line comes installed on your machine. This is your terminal, an application available on all Macs.

  • If you're on a Windows machine, you'll need to activate your Ubuntu terminal by turning on the developer mode on your computer. Instructions on how to do that can be found here. (Note that there are also other alternatives, such as PuTTY.)

The command line offers more control when interacting with your machine. Moreover, we'll need to leverage the command line when using most cloud computing connections. It takes some getting used to, but it is well worth it once you get the hang of it.

The point of it (w/r/t our purposes) is that it'll help us:

  1. Understand file paths on our computer
  2. Work from a common hub
  3. Generate reproducible coding sequences (via running scripts)
  4. Streamline our workflow:
    • set projects up
    • work between languages
    • batch process heavy loads
  5. Talk to a computing cluster, work on a virtual machine, or ssh into a local computer

Common command line commands

The following outlines a few common commands that will be useful as you move forward. Disclaimer: some of these commands may differ given your operating system, but it only takes a quick Google search to find out how things are done on your machine.

  • `pwd`: check the working directory
  • `cd <path>`: change the working directory
    • `cd ..`: go up to the parent directory
    • `cd`: go to the top directory (with no argument, `cd` goes to your home directory)
    • `cd -`: go back to the directory you were in before
  • `ls`: list all files in the working directory
  • `mkdir <directory>`: make a directory
  • `mv <old-path> <new-path>`: move a file from the old path to the new path
  • `cp <old-path> <new-path>`: copy a file from the old path to the new path
  • `ctrl + c`: stop the current execution
  • `cat <file>`: print the entire file
  • `head`: view the first $N$ lines of a file
    • `head -n 3 file`
  • `tail`: view the last $N$ lines of a file
    • `tail -n 3 file`
  • Making a file:
    • `touch <file>`
    • `echo 'text' > file`
  • Renaming a file:
    • `mv <old-name> <new-name>`
  • Asking for help:
    • `man <command>`
    • `<command> -h`
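As a rough illustration, the commands above can be strung together into one short session. This is only a sketch: the throwaway directory from `mktemp -d` and the file names (`notes.txt`, `backup.txt`, `renamed.txt`) are made up for the example, and `>>` (append to a file) is a standard shell operator not listed above.

```shell
# Work inside a throwaway directory so nothing else on the machine is touched.
workdir="$(mktemp -d)"
cd "$workdir"

pwd                          # print the current working directory
mkdir project                # make a new directory
cd project                   # move into it

touch notes.txt              # create an empty file
echo 'line 1' > notes.txt    # write a first line (overwrites the file)
echo 'line 2' >> notes.txt   # append a second line
echo 'line 3' >> notes.txt   # append a third line

cat notes.txt                # print the entire file
head -n 1 notes.txt          # print only the first line
tail -n 1 notes.txt          # print only the last line

cp notes.txt backup.txt      # copy the file
mv notes.txt renamed.txt     # rename (move) the file
ls                           # list what is in the directory now

cd ..                        # go up to the parent directory
```

After running this, the `project` folder should hold `backup.txt` and `renamed.txt`, and `notes.txt` should be gone because it was renamed.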

Version Control with Git

[Everyday] Git commands

**Set up your identity**

  • `git config --global user.name "myname"`: set your user name
  • `git config --global user.email your-email@georgetown.edu`: set your email account

**Starting or getting a git repository on your machine**

  • `git init`: start a new repository from a working directory
  • `git clone <url>`: clone an existing repository
  • `git status`: get the current status of the repository

**Staging any changes you made**

  • `git add <file>`: stage a file to be committed
  • `git add .`: stage _all files_ to be committed
  • `git reset HEAD <file>`: un-stage a file (leave the file name off to un-stage all files)

**Saving the staged changes**

  • `git commit -m "some message"`: commit staged changes to the repository
  • `git commit`: commit staged changes to the repository (you will be prompted to leave a message)

**Getting the current state from the remote (GitHub) or sending changes to it**

  • `git fetch`: download recent changes in the remote repository (but do not merge them with your local version)
  • `git pull`: download recent changes in the remote repository and merge them with your local version
  • `git push`: push commits to the remote (e.g. a GitHub repository)

**Getting Help**

  • `git help <command>`
  • `man git-<command>`
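The everyday commands fit together roughly as follows. This is a minimal sketch, assuming git is installed. The temporary directory and the file name `analysis.txt` are hypothetical, and the identity is set without `--global` so that it applies to this practice repository only.

```shell
# Start from an empty throwaway directory.
repo="$(mktemp -d)"
cd "$repo"

git init                                  # start a new repository here

# Placeholder identity, stored for this repository only (no --global).
git config user.name "myname"
git config user.email "your-email@georgetown.edu"

echo 'first draft' > analysis.txt         # create a file to track
git status                                # the file shows up as untracked

git add analysis.txt                      # stage the file
git commit -m "Add first draft"           # save the staged snapshot

echo 'second draft' >> analysis.txt       # change the file
git add .                                 # stage everything that changed
git commit -m "Revise analysis"           # save a second snapshot

git status                                # nothing left to commit
```

Each `git add` followed by `git commit` records one snapshot, so this session ends with two commits in the history.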

[Advanced] Git commands

**Accessing the logs → who did what to which file and when?**

  • `git log`: look at the commit history
    • Useful arguments:
      • `--oneline`: view a condensed summary
      • `--all`: view the history of all branches, not just the current one
      • `--graph`: view a text graph of the commit sequence
      • `--stat`: abbreviated stats for each commit
      • `--since=2.weeks`: review commits within some temporal range
    • Easily format the log:
      • `git log --pretty=format:"%h - %an, %ar : %s"`
      • see [Git Basics on Viewing the Commit History](https://git-scm.com/book/en/v2/Git-Basics-Viewing-the-Commit-History) for more insight into the different possible configurations and customizations

**Tracking Differences**

  • `git diff`: explore the differences between files
  • Use the hexadecimal commit hashes to compare commits
    • e.g. `git diff 44d14b2 2adbea3`
  • `git whatchanged`

**Tracking Movement**: If we were to just rename or move a file, Git doesn't necessarily know that it was already tracking that file.

  • `git mv old-file-location new-file-location`: move files around so that the git history is retained
  • `git mv old-file-name new-file-name`: rename files so that the git history is retained

**Time Traveling**

  • `git checkout <commit>`: move to a prior snapshot of the project
  • `git revert <commit>`: undo the changes made in a prior commit by recording a new commit

**Branching**: A branch in Git is a lightweight, movable pointer to a commit. The default branch is named "_master_" (newer setups, including GitHub, often use "_main_" instead).

  • `git branch <branch>`: create a new branch
  • `git checkout <branch>`: check out a branch
  • `git checkout -b <branch>`: create and check out a branch simultaneously
  • `git merge <branch>`: merge a branch into the current one
  • `git branch -d <branch>`: delete a branch
  • `git branch -v`: see the last commit on each branch
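To see how the branching and log commands fit together, here is a small sketch, assuming git is installed. The file `report.txt` and the branch name `new-branch` are illustrative. The script looks up the default branch name instead of assuming "master", and `HEAD~1` is standard git shorthand for "the commit before the current one".

```shell
repo="$(mktemp -d)"
cd "$repo"
git init
git config user.name "myname"                      # placeholder identity
git config user.email "your-email@georgetown.edu"

echo 'intro' > report.txt
git add report.txt
git commit -m "Start report"

# Remember the name of the default branch (often master or main).
default_branch="$(git rev-parse --abbrev-ref HEAD)"

git checkout -b new-branch          # create and switch to a new branch
echo 'methods' >> report.txt
git add report.txt
git commit -m "Add methods section"

git checkout "$default_branch"      # switch back to the default branch
git merge new-branch                # bring in the commit from new-branch
git branch -d new-branch            # delete the branch now that it is merged

git log --oneline --graph --all     # condensed text graph of the history
git diff HEAD~1 HEAD                # what changed between the last two commits
```

Because the default branch did not change while `new-branch` was being edited, the merge here needs no manual work.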

Remotes

**Git Remote**

  • `git remote add origin https://github.com/user/repo.git`: connect a local git repository to a GitHub repository
    • generic version: `git remote add <name> <url>`
    • We can add another remote for another git repository service, like [bitbucket](https://bitbucket.org/).

**Looking at our different remotes**

  • `git remote`: print available remotes in the console
  • `git ls-remote`: display references available in a remote repository along with the associated commit IDs
  • `git remote -v`: show the URLs of the remotes

**Fetching from a remote**

  • `git fetch <remote>`

**Pushing changes to the remote**

  • `git push -u <remote> <branch>`: tell Git which remote and branch we are pushing to
  • `git push -u origin master`: push the master branch to origin and remember origin as its upstream

**Inspecting Remotes**

  • `git remote show origin`
  • `git remote show`

**Renaming Remotes**

  • `git remote rename origin my-go-to-remote`

**Removing Remotes**

  • `git remote remove <name>`
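The remote commands can be tried without a GitHub account by letting a bare repository on disk play the part of the remote. This is only a sketch under that assumption: the paths under `$base` and the file `README.txt` are made up, and with a real GitHub repository the URL would replace the local path.

```shell
base="$(mktemp -d)"

# A bare repository on disk stands in for the GitHub remote.
git init --bare "$base/remote.git"

# The local repository we actually work in.
git init "$base/local"
cd "$base/local"
git config user.name "myname"                      # placeholder identity
git config user.email "your-email@georgetown.edu"

echo 'hello' > README.txt
git add README.txt
git commit -m "Add README"

git remote add origin "$base/remote.git"   # link the local repo to the remote
git remote -v                              # show the URL behind each remote

# Push the current branch and remember origin as its upstream.
branch="$(git rev-parse --abbrev-ref HEAD)"
git push -u origin "$branch"

git remote show origin                     # inspect the remote
```

Once the upstream is set with `-u`, later pushes from this branch can be a plain `git push`.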

Merge & Merge Conflicts

Git's true power shines when it is used as a tool for project collaboration and coordination. Often we want to make local changes to a file and then push those changes to the online remote. In order to push our files to the remote, we'll need to merge our version of the repository with the current state of the remote repository. If none of the files we changed were also changed by others, the merge will occur smoothly and automatically.

However, sometimes there are conflicts between branches or remote versions of a repository. Say you changed some part of a file by deleting a function and a colleague changed the same file by modifying that function. This would be an example of a conflict. Git does not know which version is the correct one, so it will mark the file as having a conflict using special delimiters.

<<<<<<< HEAD
def my_function(x):
    a = []
    for i in x:
        a.append(i)
    return a
=======
def my_function(x):
    for i in x:
        print(i) 
>>>>>>> new-branch

It's up to you to manually decide the appropriate path of resolution. Above we have an example where two versions of a function's body disagree. The part between `<<<<<<< HEAD` and `=======` is our current version, and the part between `=======` and `>>>>>>> new-branch` is the incoming version from new-branch. We need to edit the file so that it contains the version we wish to keep (e.g. the upper or bottom part), delete the marker lines, and then stage and commit the result. The point is that Git is very careful to force you to check when and where discrepancies exist and resolve them yourself.
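A conflict like this can be produced and resolved on purpose in a practice repository. The following is a sketch, assuming git is installed: the file `script.py` and its contents are invented for the example, and the resolution simply keeps the version from `new-branch`.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init
git config user.name "myname"                      # placeholder identity
git config user.email "your-email@georgetown.edu"

echo 'print("original")' > script.py
git add script.py
git commit -m "Add script"
default_branch="$(git rev-parse --abbrev-ref HEAD)"

# Change the line on a new branch...
git checkout -b new-branch
echo 'print("from new-branch")' > script.py
git add script.py
git commit -m "Edit script on new-branch"

# ...and change the same line differently on the default branch.
git checkout "$default_branch"
echo 'print("from default branch")' > script.py
git add script.py
git commit -m "Edit script on default branch"

# The merge stops with a conflict; "|| true" keeps the script running.
git merge new-branch || true
cat script.py                # shows the <<<<<<<, =======, >>>>>>> markers

# Resolve by writing the version we want to keep, then stage and commit.
echo 'print("from new-branch")' > script.py
git add script.py
git commit -m "Resolve conflict in favor of new-branch"
```

In day-to-day work the resolution step is usually done in a text editor rather than by overwriting the file, but the stage-and-commit step afterwards is the same.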

When updating a local repository, we need to pull or fetch changes made to the remote. Note that fetch will download the available data without merging it into your current workflow, whereas pull will download and then integrate the versions.
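The difference between fetch and pull is easiest to see with two copies of the same repository. This sketch assumes git is installed and uses a bare repository on disk in place of GitHub. The directories `me` and `colleague`, the file `data.txt`, and the colleague's identity are all hypothetical.

```shell
base="$(mktemp -d)"
git init --bare "$base/remote.git"      # stands in for the GitHub remote

# Your copy: make one commit and push it.
git clone "$base/remote.git" "$base/me"
cd "$base/me"
git config user.name "myname"
git config user.email "your-email@georgetown.edu"
echo 'v1' > data.txt
git add data.txt
git commit -m "Add data"
branch="$(git rev-parse --abbrev-ref HEAD)"
git push -u origin "$branch"

# A colleague clones the remote, commits a change, and pushes it.
git clone "$base/remote.git" "$base/colleague"
cd "$base/colleague"
git config user.name "colleague"
git config user.email "colleague@example.com"
echo 'v2' >> data.txt
git add data.txt
git commit -m "Update data"
git push

# Back in your copy: fetch downloads the commit but leaves your files alone.
cd "$base/me"
git fetch
last_line_after_fetch="$(tail -n 1 data.txt)"

# pull downloads and merges, so the colleague's new line appears.
git pull
last_line_after_pull="$(tail -n 1 data.txt)"

echo "after fetch: $last_line_after_fetch, after pull: $last_line_after_pull"
```

After the fetch, the colleague's commit is available locally for inspection, but `data.txt` only changes once the pull merges it in.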

.gitignore: ignoring specific files or file types

Sometimes we do not want to track certain file types.

For example, GitHub blocks files larger than 100 MB, meaning that we wouldn't want to push really big data sources up to the repository. We might want to avoid uploading any data files to our GitHub repository for this reason. To do so, we can ignore specific file types, such as .csv (comma separated values) or .Rdata (an R data file type), by making a special file that Git reads to tell it which files not to track.

We can exclude these files by adding a .gitignore file to our project folder.

*.ipynb_checkpoints 
*.Rdata
*.csv
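One way to check that these patterns behave as expected is to try them in a practice repository. This is a sketch, assuming git is installed. The file names `big-data.csv` and `analysis.py` are hypothetical, and `git check-ignore` is a standard git command, not covered above, that reports whether a path is ignored.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init

# Write the three ignore patterns into .gitignore.
echo '*.ipynb_checkpoints' > .gitignore
echo '*.Rdata' >> .gitignore
echo '*.csv' >> .gitignore

# Create one file that should be ignored and one that should be tracked.
touch big-data.csv analysis.py

git status --short                  # lists .gitignore and analysis.py only
git check-ignore -v big-data.csv    # reports which pattern matched the file
```

Note that `.gitignore` only affects files Git is not already tracking, so a data file committed earlier stays in the history.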

Git Graphical User Interfaces (GUIs)

Keep in mind that there are also graphical ways to probe a repository's record.

References

Chacon, Scott and Ben Straub. (2014). ‘Pro Git’. Ed. 2: https://git-scm.com/book/en/v2