How to understand a new codebase
18 January, 2020
We've all been there. You've started a new job, or been assigned to a new project. All of a sudden, you're looking at an existing codebase which has been worked on by many people. Not all of them are still working at the company. There are thousands of files, hundreds of thousands of lines of code, and you understand precisely none of it.
This is a particularly daunting situation when you're a junior software engineer, or you're wanting to make a good first impression in a new job. You want to feel like you're competent, and that you're making a valuable contribution to your team. But where to start?
This post is a collection of tips I've used in various roles and projects. It's my personal approach to reviewing a new codebase, and I'd love to hear about anything you'd suggest, so reach out via Twitter, or subscribe to my mailing list below!
Understanding a new codebase
Why are you writing this code?
First things first, understand what the code actually does. Not on the level of what a function returns, or how the data moves around the system, but what it is actually designed to do. Perhaps you're working for an online marketplace, so you understand that the code exists to make it easy for people to buy and sell something online. Or you're building a notification system, so a message needs to be sent from one entity to another. When you understand why you're making something, it becomes a lot easier to figure out what the code is doing and to anticipate what comes next. This is essential domain knowledge.
A lot of companies are terrible at providing this to software engineers during on-boarding as the assumption is that you only 'write code'. You need to know why you're writing code, so don't hesitate to ask about how the company works.
What is the busiest part of the code?
This might be a little controversial, but I think that you can learn a lot from a codebase by seeing which parts of it are the busiest. Some files don't get touched much once created, like small utility functions, whereas areas that deal with business logic, data, and routing are far more likely to be altered on a frequent basis. These often form the 'meat' of the application and I like to review them first.
A handy way to find out which areas of the code change the most is through the git effort command in Git Extras.
This shows you the number of commits per file, so you can see where the most activity is. You quickly get a visual overview of the codebase's activity.
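If you don't have Git Extras installed, you can get a rough approximation with plain Git and a few lines of Python. This is only a sketch of the idea, not how git effort itself works: the function names are my own, it assumes Git is on your PATH, and it treats a renamed file as two separate files.

```python
import subprocess
from collections import Counter


def count_commits_per_file(log_output):
    """Count how often each path appears in `git log --name-only` output."""
    counts = Counter()
    for line in log_output.splitlines():
        path = line.strip()
        if path:  # blank lines separate one commit's file list from the next
            counts[path] += 1
    return counts


def busiest_files(repo_dir=".", top=10):
    """Return the `top` most frequently committed files in a repository."""
    result = subprocess.run(
        # An empty --pretty format leaves only the file names in the output.
        ["git", "log", "--name-only", "--pretty=format:"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return count_commits_per_file(result.stdout).most_common(top)
```

Calling busiest_files() from the root of a repository gives you a list of (path, commit count) pairs, busiest first, which is a reasonable reading list for your first week.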
What has come before?
Now that you've had a look around some of the busier parts of the codebase, it's time to see what activity there has been in the repo. If your team is disciplined then you should have informative commit messages, pull requests, and a strong code review culture.
I start with the pull requests to see how features have taken shape, and the comments provide a lot of context in terms of why some code has come to be there. How often have you looked at some lines of code and wondered what drove an engineer to write it that way?
Before you criticise either internally or externally (something I have been guilty of in the past), take the time to try to understand why it ended up that way. Yeah, the author might have been incompetent, or, more likely, there was a particular reason for the code ending up like that.
I try to look at the last few months of activity to see what has been prioritised, and what hasn't. Hopefully there's discussion in the repo about the approach, tradeoffs, and what is left to do. Even better if your team uses an issue tracker.
The tests are a great place to see the logic in action. Depending on the size of your codebase, there may be thousands of them, so I'm not saying check them all. It is worthwhile, however, to see how the tests have been written and what the coverage looks like.
Also, tests are a brilliant way to get involved with the codebase in a non-blocking way. You can improve test coverage which forces you to start working within the application's logic and ensure that you get it right in order to have your test pass.
By writing some tests, you expose yourself to many different parts of the codebase and get to follow the logic around through various files which increases familiarity.
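As a rough illustration, here is the kind of small test I mean, written with Python's built-in unittest module. The order_total function is a made-up stand-in for whatever existing code you are trying to understand; in a real codebase you would import it rather than define it in the test file.

```python
import unittest


def order_total(prices, discount_percent=0):
    """Hypothetical stand-in for existing application code."""
    subtotal = sum(prices)
    return round(subtotal * (1 - discount_percent / 100), 2)


class OrderTotalTest(unittest.TestCase):
    """Each test pins down one behaviour I observed while reading the code."""

    def test_sums_prices_without_discount(self):
        self.assertEqual(order_total([10.0, 5.5]), 15.5)

    def test_applies_percentage_discount(self):
        self.assertEqual(order_total([100.0], discount_percent=20), 80.0)

    def test_empty_order_costs_nothing(self):
        self.assertEqual(order_total([]), 0)
```

Saved as a test file, this runs with python -m unittest. If a test fails, either you have misunderstood the code or you have found a bug, and both are worth knowing about.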
I find it helpful to have a look at how the database is structured in order to give more context to the way the application is written. Are we using a relational or non-relational structure? Are we using something like a graph database?
Understanding how data is structured and stored gives you more confidence when it comes to processing and transforming the data within your application. You can anticipate how it will be handled down the line, which helps you write more performant code.
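If the application happens to use a relational database, a quick way to get that overview is to list every table and its columns. The sketch below assumes SQLite purely because Python ships with a driver for it; other databases expose the same information differently (through information_schema, for example). The describe_schema name and the example tables are my own inventions.

```python
import sqlite3


def describe_schema(conn):
    """Return {table_name: [(column_name, column_type), ...]} for a SQLite database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(column[1], column[2]) for column in columns]
    return schema


# Demo against a throwaway in-memory database with two hypothetical tables.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
demo.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
for table, columns in describe_schema(demo).items():
    print(table, columns)
```

Pointing sqlite3.connect at a copy of a development database instead of ":memory:" gives you the same summary for real tables.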
Talk to your colleagues
Now that you have a pretty reasonable overview of the codebase, no doubt you'll have more specific questions. If so, now is the time to ask! If you can take a colleague out to lunch and let them know that you'd like to pick their brains on the code, that's great.
If you'd prefer a less direct approach, a quick email or DM asking if they could make some time to run through something with you works well.
Finally, submitting a pull request is a great time to add clarifying questions or comments. It has the added benefit of preserving knowledge for the next person who comes along and reads the commits.
Be the change you want to see in the world. If you felt there was information missing when you started, add it for the next person! Nobody is going to complain about having improved documentation and it is a valuable contribution to your team. It also has the added benefit of reducing the Bus Factor.
These are just a small collection of tips I've used in the past for getting used to a new codebase. There are obviously many ways of doing this, and different things work for different people.
If you have any suggestions then let me know! I'd love to hear what approaches other people take.