Den of Antiquity

Dusting off old ideas and passing them off as new.

Git & Gitorious

I’m trying to promote code sharing, code review, and open source ideas internally at my company. I set up a copy of Gitorious on our intranet and we’re starting to commit projects into it.

I whipped up a quick overview of Git and Gitorious to share with my coworkers, to make them more comfortable with the systems and get them started. It’s not entirely technically accurate, but it reflects how I view the systems.

Gitorious

Gitorious is web-based project and repository management software. It lets us create multiple repos and organize them into projects, teams and whatnot.

Warning: All Code Public
All code uploaded to the site is visible to everyone within the corporate network. So scrub code for passwords and other credentials before uploading.

Gitorious Elements

There are only 4 different major types of “things” in Gitorious.

  • Users
  • Teams
  • Projects
  • Repositories

Users

You and me. Just create your account. Don’t forget to add an ssh key after you log in so you can upload code.

Teams

This is just a group of users. A user can be in many teams at once. Teams are just logical units for giving access to whole projects and repositories at once, without having to specify individual members. Projects can also be ‘owned’ by a team instead of just a user.

Projects

Projects can be owned by teams or by users. A project is just a place to group your repositories. You may have a project for each repo, or put a couple of related repos in the same project.

Each project also gets its own mini wiki, which is not full-featured but is convenient for small documentation and for project and repo descriptions. The project’s mini wiki is itself versioned using git.

When you create a repo, it has to be created in an existing project, which is just good practice anyway. But when you clone a repo, the clone has no project associated with it (for convenience as well).

Repositories

Git repositories. This is the nitty-gritty: it’s where the source actually goes. There are some good practices outlined further below on how to decide what goes in a repository.

Gitorious provides several methods for accessing a repository. The 2 methods we have enabled are ssh and the git protocol. The git protocol has no authentication mechanism, so it provides read-only access to the repo. The ssh protocol uses a public key that you upload to Gitorious to authenticate you, and it allows read/write access to your repos (but only your repos).

Getting Started

1. Create an account

The site isn’t connected to LDAP so you’ll have to create your account manually. No email verification is required though, you can log in immediately.

2. Join or create any useful teams.

This is optional. You don’t need to be on a team in Gitorious to use the site to its fullest. But teams allow you to give write permission on your projects and repos to multiple people easily. They also act as a useful form of organization.

3. Create projects and repos

Generally you’ll create a repo for each app you want to share the source for, and you may create a project for it as well. But you may also create multiple repos in the same project, or put multiple applications/source trees in the same repo. It’s your personal preference.

4. Share your code.

You can upload your code to your repositories. Remember to clone the Gitorious repo first using the ssh URL, and to upload an ssh public key so you can get write access to your repo.

Create Away
All objects and metadata can be easily removed or replaced, so feel free to create them with reckless abandon.
We can always split projects and repos later, or move other things around.

Web Access to Repositories

Like any decent source repository management site, Gitorious provides a simple web-based repository browser, so that you can browse the repository without having to clone it first. You can also browse the history of the repo.

Access Control

Warning: All Code Public
All code uploaded to the site is visible to everyone within the corporate network. So scrub code for passwords and other credentials before uploading.

Projects

You can set who can edit and write to projects. You can add teams or individuals to a project.

Repositories

You can assign teams and individuals in 3 different capacities when it comes to repositories.

  • Committers
  • Reviewers
  • Administrators

Users and teams can be assigned any combination of these 3 roles for a repository. Committers can push changes to the git repository. Reviewers can manipulate merge requests. And administrators can edit the repo metadata in Gitorious.

Code Review and Fork-based Development

Cloning Repos

Just as you can use the 2 methods described further above to download/clone git repos to your local machine, you can also request that Gitorious clone the repo for you within Gitorious. When you do this, it can track metadata about the relationship between the repos and enable some more advanced features, such as watching repos and performing merge requests.

Merge-Requests

This is an advanced feature, but it bears mentioning. If you have a repo in Gitorious that was cloned from another repo also in Gitorious, and you’d like your changes to be pushed up the line, back to the origin of the source, you can submit a merge request. The original authors can then work with you in a simple form of code review.

Git

This isn’t a tutorial on using Git. It’s just an overview of DVCS, the concepts therein, and how a DVCS differs from a traditional VCS. You can find a great many presentations, tutorials, and pieces of documentation for git online, which will better serve your needs if you’re trying to get started using it.

Git Parlance
Git documentation and proponents use a lot of vocabulary that they pretend describes new concepts in version control, when in actuality they are just referring to concepts that have existed in VCS for decades but are simply more flexible now with the advent of DVCS. I’ll try to point out when I’m using this vocabulary.

DVCS vs VCS

“Traditional” version control, or VCS

à la Subversion or CVS.

  • A centralized authoritative location for the source code, defined by both the owners and the software.
  • Client and server are separate entities.

The New Way, or the “D” in DVCS

à la git or Mercurial.

The “D” stands for “distributed”, meaning decentralized, version control.

  • No central authoritative location is defined by the software.
  • The authoritative location for the source is defined only by convention.
  • No difference between client and server. We’ll just call it a “client” here though.

This means that the software and internals of the VCS are designed so that nothing in them enforces who controls the source.
Everyone working on the source is equal (in the eyes of the VCS). This includes the location that the team defines as the authoritative (canon) location for the source code.

It’s still common to have a central repository even in DVCS, where the team members submit their final code changes. But the benefits come from having the VCS designed around not mandating a central location.

Everyone has a repo

In traditional VCS the user usually gets just the minimal amount of data required to work on the code at the moment.

When they pull a “working copy” of the source tree, they get just the branch they need. And they get just the subtree that they need of that branch. They usually don’t get any history in their “working copy”, just the immediate files at the HEAD version. All of the other details are easily accessible from the central server if needed for a complex operation.

In git, the “repository” and the “working directory” are one and the same. When you get a working copy from some other repo, you’re just copying the entire repository to a new local one, and that repository becomes your working copy. The files are ready to be worked with. All the history and other VCS details are hidden in a .git subdirectory.

This may seem a rather heavyweight operation, but in git you tend to create smaller repositories. In a traditional VCS you often put all your projects and applications in the same repo; in git you’ll create a separate repo for each project or application, so the size remains manageable. Furthermore, while git can’t clone just a subtree of a repo, you can clone only a single branch or a subset of the repo’s branches. In fact it’s quite common to clone just the single branch you’re interested in, say the “master” branch (the same thing as “trunk” in svn or cvs).

If you’re working on the code in several locations, each location will actually be a separate repo, possibly with a different set of branches in it.

This is another important concept in git. Even though we all have a repo for the project, the repos are not identical. And they don’t need to be.

Each repo is different.

This is not just because we each have our own private changes in our repo that we haven’t yet shared with others. Each repo also knows it’s a different one from all the others. This is how we keep from stepping on each other’s toes when we are all working on similar branches/code. Git knows that Mark’s “master” branch is actually a different branch than Sue’s “master” branch, and you have to address them as such if you really want to work with both in the same operation.

Hashes as a GUID

To deal with the issues of multiple repos, git tracks everything in the repo using secure message digests (hashes) of their content. This acts as a good GUID for these objects so that no matter where they came from or when, if they’re identical, then it can identify them as such. This also makes comparing objects faster. The system uses this for files, directories, and history as well.
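
Here’s a minimal sketch of the idea in Python. The `blob <size>` header followed by a NUL byte, hashed with SHA-1, is how git has traditionally computed the ID of a file’s content; the function name `git_blob_id` and the Mark/Sue variables are just made up for this illustration.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 based ID git assigns to a file with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Identical content gets the same ID, no matter whose repo it lives in.
marks_copy = git_blob_id(b"hello world\n")
sues_copy = git_blob_id(b"hello world\n")

print(marks_copy)                                     # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(marks_copy == sues_copy)                        # True
print(git_blob_id(b"hello world!\n") == marks_copy)   # False
```

Comparing two files, however large, then comes down to comparing two short strings.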

Unfortunately there are a few gotchas to keep in mind with this. For instance, the same source tree can result from 2 different histories. That can cause some complications when deciding whose history is more useful to keep around. But most users don’t have to worry about that at the beginning.

DVCS Commands and Concepts

The Old

Generally DVCS work the same as VCS and you’ll see similar commands and concepts.

  • branch
  • commit
  • log/history
  • diff
  • tag

The New

Where DVCS is different tends to be in what it adds on top of the existing VCS. Here are some commands you’ll find new.

  • push
  • pull
  • clone
  • amend

But they’re not entirely new. push, pull, and clone are similar to checkout and commit, but between repositories, since there is no separate working copy. And amend is an advanced concept you may never use.
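
To make the “between repositories” part concrete, here’s a toy model in Python. It is only a sketch under loud assumptions: the `Repo` class and its methods are invented for this example, a “commit” is just a blob of bytes, and there is no merging. Real git only transfers the objects reachable from the branch being pushed.

```python
import hashlib

class Repo:
    """A toy repository: content-addressed objects plus branch names."""

    def __init__(self):
        self.objects = {}    # hash -> bytes
        self.branches = {}   # branch name -> hash

    def store(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self.objects[key] = data
        return key

    def clone(self) -> "Repo":
        """Copy everything into a brand new, fully independent repo."""
        copy = Repo()
        copy.objects = dict(self.objects)
        copy.branches = dict(self.branches)
        return copy

    def push(self, other: "Repo", branch: str) -> int:
        """Send the objects `other` lacks, then move its branch. Returns count sent."""
        missing = {k: v for k, v in self.objects.items() if k not in other.objects}
        other.objects.update(missing)
        other.branches[branch] = self.branches[branch]
        return len(missing)

    def pull(self, other: "Repo", branch: str) -> int:
        """A pull is just a push seen from the other side."""
        return other.push(self, branch)

central = Repo()
central.branches["master"] = central.store(b"version 1")

mine = central.clone()
mine.branches["master"] = mine.store(b"version 2")   # a local "commit"

print(mine.push(central, "master"))                            # 1
print(central.branches["master"] == mine.branches["master"])   # True
```

Notice that `central` is an ordinary `Repo` like any other. It’s only “central” because we agreed to push to it.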

For Subversion Users

I found this mapping between svn and git commands helpful, when I first switched to using git for some projects.

http://git.or.cz/course/svn.html

Git can also work with svn repos directly. You usually do this by replicating the svn repo into a git repo, then working with the git repo. But you can also do bi-directional replication between the two, and continue to use both repos for future versioning, side by side.

http://www.kernel.org/pub/software/scm/git/docs/git-svn.html

It can be a little complicated to use and there are better example workflows of it online as well.

Advanced Concepts

These are some advanced concepts that you don’t need to understand to use git, but that might be interesting if you’re going to delve further into it.

The Index

Git commands can work on a staging area for the changeset before it’s committed to the repository. This staging area is called the index. You can flag files/directories for committal as you’re working on the source. When you do, git actually copies the state (content) of those files and directories, at that time, into the staging area. Later, when you commit, that staging area is what is committed to the repository, not the actual current state of the source tree. This allows for a precisely controlled commit of only part of the source tree.

This can be useful or annoying. I find it the latter, and as such I completely ignore this feature. Instead I use git commands that bypass the staging area and commit the current state of the source tree, all or nothing, automatically committing changes to any files that were previously committed to the repo, and in some cases even committing new files to the repo automatically.

But using the commands that work with the index allows some more advanced usage of git, such as building a more fine-grained commit history, or quickly switching to temporary branches for quick fixes and then switching back to your main work.
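
The difference between the two styles of commit can be sketched with a toy model in Python. Everything here (`ToyRepo`, `add`, `commit`, `commit_all`) is invented for illustration and is not how git is implemented; `commit_all` is meant to be similar in spirit to what `git commit -a` does for files git already tracks.

```python
class ToyRepo:
    """Toy model of the working tree / index / commit relationship."""

    def __init__(self):
        self.working_tree = {}   # path -> content as it currently is on disk
        self.index = {}          # path -> snapshot taken when the file was staged
        self.commits = []        # each commit is a {path: content} snapshot

    def add(self, path):
        # Staging copies the content as it is right now.
        self.index[path] = self.working_tree[path]

    def commit(self):
        # A commit records the index, not the working tree.
        self.commits.append(dict(self.index))

    def commit_all(self):
        # Re-stage every file we already track, then commit.
        for path in list(self.index):
            self.add(path)
        self.commit()

repo = ToyRepo()
repo.working_tree["app.py"] = "v1"
repo.add("app.py")
repo.working_tree["app.py"] = "v2"   # edited again after staging
repo.commit()
print(repo.commits[-1]["app.py"])    # v1: the staged snapshot is what got committed

repo.commit_all()
print(repo.commits[-1]["app.py"])    # v2: the current working tree state was picked up
```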

Layers

The architecture of git is rather simple, and it helps to understand it when you start using git more heavily. It’s effectively composed of a series of layers.

  1. Efficient Blob Storage
  2. Blob Database
  3. Filesystem Trees
  4. Version History Graphs
  5. Other VCS Metadata: branches, tags, head…

I won’t go into the details of what’s stored in the blobs and the filesystem/version graphs, as there are good presentations online that describe these more effectively.

Efficient Blob Storage

In relational databases, a Binary Large Object (BLOB) is a chunk of arbitrary binary data of arbitrary size.

Git starts with a system that efficiently stores blobs by diffing related/similar blobs and then compressing them. But the details of this are always hidden from the user, as it’s not necessary to understand them. And every VCS implements such a system anyway.

Blob Database

Git then takes these blobs and puts them into a simple relational database of only one table, which contains the blob column and a few columns of metadata.

git uses a hash of the blob as the primary key and provides commands to manipulate the rows of this database directly from the command line. Most users never have to do this but sometimes it can become necessary if the repo becomes corrupt, data is deleted by accident, or someone messes up some advanced command.
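
As a rough sketch of that one-table view, here’s a toy version in Python. The `ObjectDB` class and its `put`/`get` methods are names I made up; the compression step uses zlib, which is also what git uses for its loose objects, but the on-disk layout here is not git’s.

```python
import hashlib
import zlib

class ObjectDB:
    """Toy one-table database: the primary key is the hash, the value is the blob."""

    def __init__(self):
        self.rows = {}   # hex hash -> zlib-compressed bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Storing the same content twice lands on the same row.
        self.rows.setdefault(key, zlib.compress(data))
        return key

    def get(self, key: str) -> bytes:
        return zlib.decompress(self.rows[key])

db = ObjectDB()
first = db.put(b"print('hello')\n")
second = db.put(b"print('hello')\n")   # same content, e.g. a copied file

print(first == second)   # True
print(len(db.rows))      # 1
print(db.get(first))     # b"print('hello')\n"
```

The real command-line counterparts of `get` and `put` are roughly `git cat-file` and `git hash-object -w`.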

File System Trees

Two types of data git will store in these blobs are files and directories.

How files are stored is obvious; it stores the content in the blob. By hashing the files, you get immediate duplicate reduction.

But to store directories it builds a tree of blobs, one for each directory. The blob contains the list of files and directories in the directory, and their hashes. This makes for rather efficient storage of directories in the table, as subtrees that don’t change never need to be modified or replicated in the database. Furthermore, identical subtrees can be automatically detected in this manner.
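
Here’s a small Python sketch of that effect. The encoding is a simplification I invented (a `file` or `dir` prefix, then the content or a sorted list of `name hash` lines), not git’s actual tree format, and `store` and `store_tree` are hypothetical helper names.

```python
import hashlib

def store(db: dict, data: bytes) -> str:
    """Put one blob in the table, keyed by its hash."""
    key = hashlib.sha1(data).hexdigest()
    db[key] = data
    return key

def store_tree(db: dict, node) -> str:
    """Store a file (bytes) or a directory (nested dict); return its hash."""
    if isinstance(node, bytes):
        return store(db, b"file\0" + node)
    lines = [f"{name} {store_tree(db, node[name])}" for name in sorted(node)]
    return store(db, b"dir\0" + "\n".join(lines).encode("utf-8"))

db = {}
version_1 = {
    "README": b"hi\n",
    "lib": {"a.py": b"A\n", "b.py": b"B\n"},
    "vendor": {"a.py": b"A\n", "b.py": b"B\n"},   # identical to lib
}
root_1 = store_tree(db, version_1)
size_after_v1 = len(db)
print(size_after_v1)          # 5: lib and vendor share one directory blob and two file blobs

# Change only the README and store the whole tree again.
version_2 = dict(version_1, README=b"hello\n")
root_2 = store_tree(db, version_2)
print(root_1 != root_2)       # True
print(len(db) - size_after_v1)   # 2: just the new README blob and the new root
```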

History Graph

Just as it does for directories, the system stores in blobs the directed acyclic graphs (DAGs) that make up the history trails for source versions. And by hashing the elements of the history, you get similar benefits to the filesystem storage.
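
A simplified sketch in Python: each commit is hashed from its source tree, the IDs of its parent commits, and a message. Real git commits also record the author, committer, and timestamps, which I leave out here; `commit_id` and `fake_tree` are invented helper names.

```python
import hashlib

def commit_id(tree, parents, message):
    """Hash a simplified commit: its tree, its parent IDs, and a message."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["", message]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()

def fake_tree(label):
    """Stand-in for the hash of a real source tree."""
    return hashlib.sha1(label.encode("utf-8")).hexdigest()

tree_a, tree_b, tree_c = fake_tree("A"), fake_tree("B"), fake_tree("C")

root = commit_id(tree_a, [], "initial import")

# Mark goes straight from tree A to tree C.
marks_tip = commit_id(tree_c, [root], "rewrite parser")

# Sue ends up at the very same tree C, but by way of tree B.
sues_mid = commit_id(tree_b, [root], "start rewrite")
sues_tip = commit_id(tree_c, [sues_mid], "finish rewrite")

# A merge is simply a commit with more than one parent.
merged = commit_id(tree_c, [marks_tip, sues_tip], "merge Sue's work")

print(marks_tip == sues_tip)   # False: same source tree, different history
```

Because each ID covers the parent IDs, a commit’s hash effectively identifies the entire history leading up to it.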

Other VCS Metadata

Finally, on top of all of this, git adds the other VCS metadata required, such as branches and tags, and which version is the HEAD. All of this rounds git out into a fully featured modern version control system.
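
This last layer is thin: names that point at commits. A toy sketch in Python follows. The `refs/heads/...` and `refs/tags/...` naming follows git’s convention, but the commit IDs `c1`, `c2`, `c3` are placeholders and the helper functions are made up for this example.

```python
def make_repo():
    """Refs are just names pointing at commit IDs; HEAD names the current branch."""
    return {
        "refs": {
            "refs/heads/master": "c2",
            "refs/heads/bugfix": "c1",
            "refs/tags/v1.0": "c1",
        },
        "HEAD": "refs/heads/master",
    }

def head_commit(repo):
    return repo["refs"][repo["HEAD"]]

def commit(repo, new_commit_id):
    # Committing moves only the branch that HEAD points at.
    repo["refs"][repo["HEAD"]] = new_commit_id

def checkout(repo, branch):
    # Switching branches just repoints HEAD.
    repo["HEAD"] = "refs/heads/" + branch

repo = make_repo()
commit(repo, "c3")
print(head_commit(repo))                # c3
print(repo["refs"]["refs/tags/v1.0"])   # c1: tags stay put
checkout(repo, "bugfix")
print(head_commit(repo))                # c1
```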

History Editing and Amending

Since a git repo is really just a kind of relational database of objects (blobs), git commands expose that database directly for you to play with, and in fact make it easy to play with the objects and graphs stored in it. One thing that git lets you do easily is modify commits that are already in the database, and change the history graphs that lead to a particular version of the source tree.

These features are usually avoided, because if you’re not careful you can permanently delete data that’s in the repository or, worse, surprise someone else who has already cloned the data that you’re changing.
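
The “surprise” follows directly from the hashing. In this simplified Python sketch (the `commit_id` helper and the tree names are invented, and real commits carry more fields), amending a commit message doesn’t edit anything in place. It produces a different commit with a different ID.

```python
import hashlib

def commit_id(tree, parent, message):
    """Hash a simplified commit from its tree, parent ID, and message."""
    text = "tree {}\nparent {}\n\n{}\n".format(tree, parent, message)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

base = commit_id("tree-0", "none", "initial import")
original = commit_id("tree-1", base, "fix tpyo in README")

# Sue clones at this point, so her master points at `original`.
sues_master = original

# Mark amends the commit message. A brand new commit object
# replaces the old one at the tip of his branch.
amended = commit_id("tree-1", base, "fix typo in README")
marks_master = amended

print(original == amended)           # False
print(sues_master == marks_master)   # False: Sue still holds the old commit
```

From Sue’s point of view, Mark’s branch now has a commit she has never seen, and hers has one that Mark’s no longer contains.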

Comments

Jason Stillwell
Thanks, I'm usually watchful of that, but I wrote this document quickly.
AHAntics
I'd (almost) do a search and replace s/your/you're/gi

You're is a contraction of "you are" - you're making a grammatical error and sending it to the team.
Your is a possessive pronoun - that repository is yours, not mine.
Jason Stillwell
Feedback is, of course, welcome.