Grokking big unfamiliar codebases (jeremyong.com)
149 points by ibobev on Jan 26, 2023 | 86 comments



It is incredibly important to avoid putting tag-firebreaks in the code base — anything that stops a new developer from being able to navigate from symbol to definition, or from definition to all uses of that function.

A couple of examples:

  ACTIONS = {
    "swim": Swimmer,
    "skate": Skater,
  }
  
  # …pages of code…
 
  def f(action, *args):
    …
    mod = ACTIONS[action]
    mod(*args).do()
At face value, this pattern doesn’t do anything particularly wrong but anyone who arrives at f on their code safari won’t see any symbols for either Swimmer or Skater. It breaks their flow the same way switching languages might do. Sure, ACTIONS is only a hop away, but that’s a level of indirection that could have been avoided. Even worse, what if the class name was derived from a string?

The fact that the action implementations are chosen at runtime also makes life a bit awkward but the primary issue is not being able to immediately see the action classes and jump to their implementations.
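For contrast, a rough sketch (with stub classes standing in for the real ones) of a version that keeps the symbols visible at the call site, so jump-to-definition and find-all-uses still work:

  # Stub implementations so the sketch runs on its own.
  class Swimmer:
      def __init__(self, *args): self.args = args
      def do(self): print("swimming", self.args)

  class Skater:
      def __init__(self, *args): self.args = args
      def do(self): print("skating", self.args)

  # Plain if/elif dispatch: Swimmer and Skater appear as real symbols at the
  # call site, so a new reader can tag-hop straight to their definitions.
  def f(action, *args):
      if action == "swim":
          Swimmer(*args).do()
      elif action == "skate":
          Skater(*args).do()
      else:
          raise ValueError(f"unknown action: {action!r}")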

Another one is calling shell scripts. We all have projects, I’m sure, where we have to shell out to some common legacy tool:

  def f():
    run("legacy.sh")

  def g():
    run("legacy.sh")
Wrapping the script in a native function does wonders for making it easier to find, especially for the poor, sacred soul who deprecates it from the codebase one day:

  def f():
    run_legacy()

  def g():
    run_legacy()

  # …pages of code…

  def run_legacy():
    run("legacy.sh")
Any unfriendly pattern that breaks tag-hopping / IDE-integration is yet another small chafing of your colleagues' productivity. Think twice before you introduce that YAML file which defines behavior, instead of implementing it natively — your friends (and their jump-to-definition muscle memories) will thank you.


I agree, when learning new codebases, the ones that take me the longest time to learn are the ones with many layers of indirection.

I view over-indirection as an anti-pattern and I often see it come from engineers who take the DRY philosophy as dogma.

I have seen countless cases of your second example throughout my time programming, and I'm convinced it comes from the DRY mantra that first-year CS students get drilled into their brains.


Best example of "good practice" becoming a hassle: the "interface for everything" habit. Even when the majority of those interfaces have only one implementation.


I can't stand this. I once worked on a project where there were seven layers of interfaces that all led down to a wrapper around a library and each step was just adding another parameter onto the call in the library. The entire PROJECT didn't need to exist, the owning project could have easily just called the library directly as it already had all the context, maybe with one method to abstract in case we needed to swap the underlying library. I remember thinking I was a bad developer for not "getting it," but armed with 10 more years of experience I've realized it wasn't me.


I've come around on this one. I (now) like having a separate interface, having just method signatures separated out makes it easy to get an overview of the class. It also makes it a bit easier to agree on the interface.


You have an interface already, without separating it from the implementation: the public methods of the class. In java any IDE will quickly give you the class outline showing you the public methods.


DRY is the easiest rule to tell if you are following, so it is one that people are inclined to follow.


Especially in languages with string literal types! There's no excuse not to just have the thing be the thing when the type system can do the heavy lifting of preventing typos.
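For instance, a minimal Python sketch of the idea using typing.Literal (reusing the swim/skate names from the example above as hypothetical values):

  from typing import Literal

  Action = Literal["swim", "skate"]

  def perform(action: Action) -> None:
      print("performing", action)

  perform("swim")    # fine
  perform("skatte")  # a type checker (mypy, pyright) flags the typo before it ever runs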


omg yes. This is my single biggest complaint about large Python projects, or anything with complex implicit type conversions. Fun to write, a nightmare to read.


Really good. Think, at write time, about the programmer who will later read this.

Knowing how to write large code bases that can be understood is every bit as important as knowing how to read them.


Software Engineering at Google has lots of valuable advice and best practices to accomplish this - https://abseil.io/resources/swe-book/html/toc.html


I work in a large java code base, where I'm often exposed to brand new code (written a while ago, or written by other teams).

I built this tool to help me understand the unfamiliar: https://packagemap.co

There's a parser that extracts all the classes and methods from the java code and a site that renders these in a directed graph. I use a kind of query syntax to filter the graph, and added features based on how I want to explore the code.

The nodes in the graph carry a name like: "my.company.package.MyClass.myMethod"

Then I can filter by wildcards: *MyClass*

Or see just the classes with a line terminator: *.MyClass$

Or see the code that leads up to, or away from a method/type: ->*MyClass, *MyClass->

I've found making the problem visual has massively improved how I navigate the code. In code review I spot things that are very hard to see in an IDE or file-tree review UI (like github). Things like: where abstractions are being broken or leaking, or where types are being misused across packages.

I wrote a summary of the common cases I find using this tool in a blog post[0]

[0] https://blog.packagemap.co/posts/common-code-coupling-mistak...


I suggest you position the free tier more prominently on your pricing page. It took me a while to realize it's even there.


Yes! Thanks


Most of this makes sense, but ends up not being followed for non-technical reasons like the person's confidence, etc.

I'd say, just know two things:

1. (From the post) "Let your instincts guide you if you think something feels much more difficult than you’d expect", and talk to your team about it, and change it for them.

2. There is a LOT of dead (unreachable / redundant) code in any enterprise codebase. Especially recently launched flags that haven't been "cleaned up yet". Just start cleaning this stuff up. Even the very obvious stuff gets you very well introduced to the codebase.

Where I disagree with the post:

> Taking notes is non-negotiable for me, not strictly as a memory aid, but as a means to communicate findings and verify them later.

If you need to take a 2 line note, you're better off sending a 2 line change for code review instead. Even something simple like metric.add("DOES_THIS_HAPPEN?", 1); can be done with zero confidence and no testing, or, with minimal confidence and minimal testing, you could instead write Preconditions.checkState(..., "Will this be caught by our integration tests?");. The best notes are the ones that get pushed to production to validate assumptions.
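For illustration, a rough Python analogue of those two flavours of probe (the Order type and its fields are made up):

  import logging
  from dataclasses import dataclass
  from typing import Optional

  logger = logging.getLogger(__name__)

  @dataclass
  class Order:  # hypothetical stand-in for whatever domain object is in play
      id: int
      total: float
      currency: Optional[str]

  def handle(order: Order) -> None:
      # Zero-confidence probe: just record whether this branch ever runs in prod.
      if order.total < 0:
          logger.warning("DOES_THIS_HAPPEN? negative total on order %s", order.id)
      # Minimal-confidence probe: fail loudly so integration tests or a canary catch it.
      assert order.currency is not None, "Will this be caught by our integration tests?"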


> There is a LOT of dead (unreachable / redundant) code in any enterprise codebase. Especially recently launched flags that haven't been "cleaned up yet". Just start cleaning this stuff up. Even the very obvious stuff gets you very well introduced to the codebase.

Ooof, good way to get well known in the team when the smoke comes out. :)


Lol yes agreed completely. "Cleaning things up" before you understand the domain, codebase, and gotchas sounds like a highly unproductive way to get onboarded. I think suggesting that a new hire start week 1 by adding a whole bunch of metrics that add unnecessary overhead for the team and all users just to see "if something happens" is pretty bad advice, to the point of being something I'd probably wish I had screened at the interviewing stage.


One thing I like to do is go through everything in makefiles, and any commands in the readmes or documentation. In almost every case there's stuff that doesn't work, sometimes because it's out of date and you can fix it, sometimes you'll be missing some one-time-only configuration that everybody else already did so long ago they didn't remember to document it, sometimes the docs are just outdated so you can ask if it's no longer relevant and clean it up. And if you mess anything up it usually breaks very early in the pipeline, the tests won't run against the dev branch or something like that.


"Cleanup tasks" will always be appreciated I am sure, but again I would caution that there is a trap here. It's easy to clean up any number of things as you go, and construct an illusion that somehow, you are being productive.

When I hire people, aside from fixing mistakes in the documentation they encounter, I want them to have all the time they need to learn, ideally with a set of curated onboarding tasks that progressively give them expertise in the system they are intended to operate in. There is a time to "make things better" and, in my opinion, that's after getting up to speed in a sense that lets me collaborate with the team in a productive manner. I won't begrudge someone the opportunity to clean things up, but each individual is the ultimate arbiter of how to be effective at the end of the day.


If you have the time to create a “curated onboarding experience that results in progressively more expertise”, yes I agree please do that instead.

I’ve never onboarded to a team that had such an experience, and I’ve never heard of a team that maintains something like it consistently. Every time I’ve seen it proposed and tried, it takes time away from an experienced person (who could have done those same tasks faster) and goes out of date as soon as one new teammate completes the curated set of tasks.

If you can make it durable, my utmost respect!

In the meanwhile, deleting obviously dead code has never failed me as a strategy for myself or a new hire (typically steered by a simple oncall-type task to start the investigation off).

And best part, there is always more to clean up for the next new hire lol.


I don't mean refactor everything or spend six months until everything is perfect. I've just found it's a pretty reliable way to find low hanging fruit and get an easy first commit in. "Get your feet wet" I believe is the English expression. You have to get familiar with the build process anyway, you might as well improve it while your attention is on it.

If you have a curated onboarding experience that's great, but then you'll probably have good documentation and nice setups, then my suggestion obviously doesn't apply. I wish that was common but in my experience that has been a rarity, not the common case.


In enterprise settings where there is poor documentation and lots of tribal knowledge, noting down just those 2 lines for every new piece of info is a quick way to break down the knowledge gaps created by exactly that.

It is exactly due to such disdain for documentation that most people find it hard to navigate large codebases. Documentation is not just for noting things down pedantically but also a thinking tool and a temporary thought buffer.

And no one pushes code to production to validate assumptions. Not if you have 100 clients and you are not doing CD.


> And no one pushes code to production to validate assumptions.

I always have, with the rare issue occurring and by and large rewarded for it.

> Documentation is not just for noting things down pedantically but also a thinking tool and a temporary thought buffer.

Sure, but why not treat your codebase as a temporary thought buffer? I do, and it’s consistently worked and improved every system I’ve worked with. No teammate has ever complained about this strategy. If anything it’s typically adopted by teammates.

eg “oh, this list is never modified”: rather than taking a 2 line note, I’ll push a code change to use an ImmutableList.

The knowledge is now documented, enforced by the compiler and code reviews (if the type changes, people talk about it), and it allows me to keep improving that part of the code base months later without code conflicts or needing to re-make my changes from a notes file.

1-2 line refactors are exclusively better than 1-2 line notes. This scales to any number of lines where the size of the note is equivalent to the size of the possible code change.
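A rough Python rendering of the same move (names made up), where the "never modified" note becomes a type that a checker enforces:

  from typing import Final

  # Before (the note version):    SUPPORTED_ACTIONS = ["swim", "skate"]  # never modified!
  # After (the refactor version): the knowledge lives in the type, and a type
  # checker flags any attempt to mutate or rebind it.
  SUPPORTED_ACTIONS: Final[tuple[str, ...]] = ("swim", "skate")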

Meanwhile, please do take notes and document when it’s at least 10x shorter to grok than the current code or possible code change.


Re dead code: unless you've got an expert on hand and they know what they're talking about (less likely than anyone wants to admit), I've had fantastic success simply emitting metrics on any line of code I think is unused.

And then ignore it for a month and go do other things.

If it hasn't been hit by then, you've got pretty good evidence that it's not very important. Start deleting.

You can always replace it with an alerting metric if you really suspect it'll matter months later, since you can restore from history when you see it + have better knowledge of the system in the future. And if volume is truly that low and the alert fires, just do it by hand - if it matters by that point it's probably worth a lot of money, and that's time well spent.
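The mechanics can be as simple as a one-line "tombstone" helper; here's a rough Python sketch with logging standing in for whatever metrics client you actually have:

  import logging

  logger = logging.getLogger("suspected_dead_code")

  def tombstone(tag: str) -> None:
      # Stand-in for a real "was this ever reached?" metric; swap in your
      # actual client (statsd counter, Prometheus, etc.) as appropriate.
      logger.warning("tombstone hit: %s", tag)

  def maybe_dead_branch(value: int) -> int:
      if value > 10_000:
          tombstone("maybe_dead_branch.value_gt_10000")  # hypothetical tag
          return value // 2
      return value

If a tag never shows up after a month or three, that's the evidence to start deleting; keeping the tag name in the commit message keeps the decision searchable later.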


Has this ever backfired?


In minor ways, yes. In return though I've successfully used it to make some pretty significant cleanups, removing years-old abandonware that nobody was willing to touch. I consider it to be extremely worth the risk, and the utterly inconsequential cost - once you set it up, it takes less than a minute to decide "nope, let's just wait and find out" and write that one line of code, and make a calendar event a month or three later.

The main ways it has backfired have been:

1) Once, the removed-thing was used for one team which used it to make one annual report that they didn't realize they were still running. It threw a pretty major wrench in that one report... but they were surprised too, so they quickly modernized it (the old version had known issues), and then that team was able to remove a bunch of dead code because we (the only recipients of its output) could show that X had not been used in a year.

2) And once it revealed bugs in metrics systems / dependency chains... by leading to deleting code that we thought was unused, but was in fact used semi-regularly. An unfortunately-less-than-quick rollback fixed that, and we then went on to find a number of other bugs because moderate chunks of the codebase were receiving no-op metrics emitters for some reason.

Frankly I consider having only ^ those two cases to be luck, but they were both definitely worth it in aggregate. And 2 taught me to verify that surrounding metrics existed / correlated with logs before trusting a lack of signals. I've found multiple other gaps that way :|

3) More than once it has led to not removing code that everyone thought was unused (but we had developed a minor "check first" habit due to past successes), thus needing to maintain some convoluted garbage for almost zero benefit. Bleh, politics.

4) Others blindly applying the habit has led to some careless commits that added it to surprisingly hot loops, adding delay to the point that it caused significantly more timeouts (generally: rapidly filled buffers and paused to flush, then repeat 1000x). The good news is that once they've done this once, essentially every author immediately starts paying attention to their hot loops forever, and we moved more things to batch emissions rather than incremental. Also a net win, it's a pretty cheap lesson.


Depending on what your system does, there can be code that only gets run monthly, quarterly, or annually, or when the other thing that "never" changes gets changed.

There was recently a minor annoyance at my work where some firewall rules got deleted because they weren't used during an audit window, and then X months later a deploy got held up because the build server couldn't reach the target servers for that one rarely-changed system.


a person who doesn't understand a codebase shouldn't be adding "does this happen" all over the place. run in a debugger and set a breakpoint.


Agree on the first part, but there are going to be plenty of cases where running in a debugger isn't sufficient to satisfy oneself that something never happens. Unless we're talking about "does this happen under this specific scenario I know about".


if something never happens, that's not a part of the codebase you need to understand right now. If your task is cleaning up, sure, but your task here is learning what happens, not what doesn't happen.


And it's relevant in more important ways than "I'm cleaning up and this isn't used, so I should delete it". Quite often, when cleaning up, you find pieces of code that make assumptions that contradict each other. It's a real head scratcher until you figure out that one of these pieces of code is obsolete and never executed anymore (or has never been executed), so its assumptions are irrelevant. The cleanup resulting from that might not look like much, but it removes a major trap when trying to make sense of the code.


This really depends, but for things that really are confusing you / your team won't help, why not add the metric? You add observability and can learn something about how that flow is utilized


Imagine a metric firing on something that happens 1000 times a millisecond. Imagine a metric firing in a way that grabs a mutex at an inopportune time and creates a deadlock. Imagine a metric firing that suddenly spams your metrics server with data that nobody except you cares about in this one temporal moment. There are so many reasons this is a bad idea. If you must do it, do so locally, but there are so many easier ways to determine "if something is used."


If adding a few counters to your code base causes any of these issues, your current team is already dysfunctional and will get more productive by improving the system.

eg “Metrics server goes down”: metrics should be sent pre-aggregated and in batches, even possibly polled for. Having 10 metrics that fire 10^N times a second shouldn’t impact the metrics server, ever! Metrics servers are impacted by cardinality, so yes, don’t create 10^N unique metrics.

eg “something that fires 1000 times a millisecond”: if the rest of the team doesn’t know where their hot loops are and doesn’t have them commented or caught in code review, it’s very likely someone will add some logic to one that will hurt. The metric is just one such change, and you might as well find out sooner than later. If it will cripple your service, that is why CI/CD encourages canary deployments / other automated performance testing.

eg “metric causes a deadlock” - umm, this is just too contrived. I’ve never used or heard of any metric library that was built this way. This type of metrics client would cripple even a very experienced developer on that team.

Summary, if adding a metric can cripple your service or team, reconsider priorities. It will speed you up even in the medium term.
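To make the pre-aggregation point concrete, here's a minimal sketch (not any particular metrics library) of a counter that batches in memory, so a hot path only pays for a lock and a dict increment:

  import threading
  import time
  from collections import Counter

  class BatchedCounters:
      """Aggregate counts in memory and flush them periodically, so hot code
      paths pay for a dict increment rather than a network call."""

      def __init__(self, flush_interval_s: float = 10.0) -> None:
          self._counts: Counter[str] = Counter()
          self._lock = threading.Lock()
          self._interval = flush_interval_s
          threading.Thread(target=self._flush_loop, daemon=True).start()

      def increment(self, name: str, n: int = 1) -> None:
          with self._lock:
              self._counts[name] += n

      def _flush_loop(self) -> None:
          while True:
              time.sleep(self._interval)
              with self._lock:
                  counts, self._counts = self._counts, Counter()
              for name, n in counts.items():
                  # Stand-in for the real backend call: one pre-aggregated send per flush.
                  print(f"flush {name}={n}")

With something like this, the metrics backend sees one pre-aggregated batch per flush interval no matter how hot the instrumented path is.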


I guess every game I've ever shipped was shipped on a dysfunctional team!


I don't mean to diminish any of your achievements. Shipping amazing products is orthogonal to creating a robust codebase/system.

I also don't mean to be negative towards you or your experience. Instead, I'll pose the question to you: What would you call a team/system that describes itself as "at risk of being crippled by a single line change like adding a metric"?

At the end of the day, the product is the only thing that matters and you can be and should be proud of the products you've launched. That said, are you proud of the way you and your team built them? Have you or anyone on your team proclaimed it was a joy to build/improve that product?

idk... after all I'm just some guy on the internet. Best of wishes to you! I truly hope you did not, do not, and never do work on a dysfunctional team/system.


Recording a metric is not a cheap operation in many contexts. I work with code we typically instrument at a microsecond to millisecond granularity. Sampling profilers typically record instruction samples at most every 200 us or so, so adding instrumentation at a finer granularity than this has a dramatic effect on performance. If a game operates at 60 fps, adding a metric at an inopportune time will render the game unplayable.

No, I haven't been on teams that dysfunctional. These restrictions are born out of necessity. My point was that you seem to have a somewhat narrow view on software as a whole, and you are extrapolating from your personal experience general advice that simply doesn't apply.


This is not what the post was talking about, though. If you have a service that truly is doing 1000 things a millisecond, I can almost guarantee you already have other metrics up, and thus won't need anything else.


I work with realtime graphics. Thousands of things are indeed happening every millisecond and I sure wouldn't want a new hire blindly instrumenting those things just to "see if they happen." There are appropriate tools to measure performance at this granularity.


This is a fantastic post on the challenges associated with diving into large codebases (sometimes called "source diving" or "code archaeology"). The vast majority of code in the world is in large private codebases. The size of legacy private code dwarfs that of the open source world (sizable as that might be these days) and the rate at which you can read/understand code is the limiting factor here for dev velocity.

Some companies have invested in dedicated tools (like Code Search and Grok at Google) because they recognize the importance of the problem. Such systems were a primary source of inspiration for us when we started Sourcegraph—we wanted to bring the tech for large-scale code understanding to every company. In-depth posts like this are a big source of inspiration and ideas for us, so thanks to the author for writing this up.


I've long had a goal to go through a similar explanation of how to do this, and I figured the best way to do this was to pick a large project whose source I had never seen before and record myself fixing a bug in it. This plan is foiled by the difficulty of finding large projects whose source I had never perused before just for fun, especially in my core area of competence.

But some salient points I think I can make are:

* A lot of people have pointed this out, and I think it's the most important thing: don't go into the project asking the high-level question "how does this work?" No, you want to start by figuring out something specific. If you're picking apart a command-line tool, for example, start by asking where one specific subcommand is implemented, and what it actually does. Once you've navigated yourself to such a starting position, you can start building a mental map from that particular location, and expand it to higher or lower levels of detail from there as necessary.

* Use code search (e.g., grep) liberally, and start guessing for the keywords you want to search for in the code to reach your next piece of information. In my previous example of looking for a subcommand implementation, searching for a probably-unique part of the help message is a good way to guess where the code might be located.

* If you don't understand when code fires, something like assert(false && "Did I happen?"); is useful for figuring out examples that execute a particular codepath (assuming, of course, that your codebase has a competent testsuite). Now you have tests that you know execute that code path, and you can start using a debugger to probe what program state looks like at that point. Also, get very acquainted very early in the process with how to satisfactorily dump out that state.

* Play the git blame game. Look up the history of the code you don't understand--track down the exact revision that added that code, and then go spelunking in code review or bug tracker history to understand why it came to be. If code seems to no longer be used, and you don't understand why, play the reverse blame game--look at where it was added, where it was used when it was added, and find where that code was removed, and (as above) go spelunking in more detail to understand why it came not to be.

* Practice, practice, practice! While I don't think these are difficult skills to acquire, it's hard to talk about this in generalities because so much of the process relies on your experience knowing what things generally should be like as you try to match up the codebase to your (possibly incorrect) mental models of how it works.


> don't go into the project asking the high-level question "how does this work?" No, you want to start by figuring out something specific

Should this be an or? For me, understanding big code bases has always been a pendulum between the big picture and details. Survey the overall scope, and then pick some very specific thing and examine it all the way through. Then go back to the big picture, integrate what I've learned, and come up with another very specific thing to investigate.

For me this cycle is especially fun when on the trail of a bug. Get the big picture, and then dive in on a specific bug and beat it to death. Then ask what the bug told me about not just the code base and the architecture, but the development practices and the company culture. Pick another bug and run it to ground, seeing which of my notions were validated and which were challenged. Eventually, produce recommendations for broad, long-term changes. It's been a while since I did a gig like that, but they're such a joy when I get the chance.


> assuming, of course, that your codebase has a competent testsuite)

A big, very big, assumption!


> assert(false && "Did I happen?");

If i may offer a simplification of that:

assert(!"Did I happen?");


I hate the experience of a new code base when everything is so alien that at first it seems horrible. Then when you learn the ideas they're using you start to like the way everything is put together. Then you start finding navigation really easy. You're able to make changes quickly and can be sure they're safe. Then that project ends and you start again. Each time it gets easier as you see the same patterns, but so much time gets sunk into learning something that you'll never see again.


> I hate the experience of a new code base when everything is so alien

Personally I love that experience, of a fresh new codebase with some patterns and other things I haven't seen before, seeing what works well and what doesn't, what stinks and what shines.

Once the exploration has been done and it starts to become obvious how to achieve things, I lose a bit of motivation, as I can solve the problems in my head and now the boring part of actually typing it out begins.


Ha, I wish there was a way we could split the labor on that. You do the learning, somehow transfer that to me and I do the bug fixing. I love fixing up a legacy system but don't enjoy the first few weeks of feeling unproductive. That has more to do with the pressure to get things done than the actual exploration.


I do that with my teams. I have a technique where I turn existing code bases into literate programs. Largest I ever did this on was around 250k SLOC, but that didn't take me too long (you get fast with it if you practice). The result was a set of documents that I turned into presentations and diagrams (better than autogenerated ones) on the system to share with the team.

I'll be doing that with one part of our system starting tomorrow, actually, because no one understands it and everyone is afraid of it (there is one automated test which is a massive end-to-end test that exercises, theoretically, every capability and takes 2 hours or so to run).

The underlying code and repo are unaltered; this is a collection of org files that sit alongside them. I run org-tangle to verify that I haven't actually altered the code in my various adjustments to the literate program version. So people still see their regular cpp or whatever files and don't have to learn the tools I use (though I'd share if they asked; no one ever has). But they get the benefit of understanding the programs better.

EDIT: By "didn't take me too long", it still took a few weeks. But it was a new-to-us system that the original contractor provided sans tests (there were directories and references to numerous automated tests that we weren't given and they wouldn't provide) and with useless auto-generated UML diagrams as "documentation". But a few focused weeks was probably a lot better than a few years of confusion and frustration.


Ok I'll ask, what tools do you use?

Is your technique described somewhere?


Tools used: emacs, org mode, org babel, and git. `git grep` or an IDE (these days mostly an IDE because it works well) to do code searches to find references to functions/classes/structs/whatevers.

Org mode lets you create code blocks like:

  #+BEGIN_SRC cpp :noweb yes :tangle yes
    // place all code here
  #+END_SRC
By default org-tangle tangles a file foo.org into foo.cpp or whatever appropriate extension, for each code type you have in a block set to tangle. You can be more explicit with:

  #+BEGIN_SRC cpp :noweb yes :tangle foobar.cpp
Useful for explicitness or if the name is different than the org file (for my purposes, I try to keep it one-to-one).

For every source file I generate a .org file that contains a single code block which will start as the contents of the original corresponding file. I may not always do this automatically, sometimes I do it manually (it takes a few seconds) as I step through if it's a more focused effort (vice trying to understand the entire program).

After that I select various points of interest. I find `main` or other entry points, if there's a known issue I may dive in there to start with but eventually it gets back to some equivalent of `main`. I generate a todo list which is just a list of all the files, it will be expanded over time. In org mode you can link a file with:

  [[file:path/to/foo.org][foo.org]]
So the todo list actually becomes an index to various points in the program. I can add text if appropriate, though a lot of files are named well enough that that's not always necessary. Sometimes I delete things that aren't really that important but are linked elsewhere. I may create a table of contents that's more focused than the raw index if I want to preserve the raw index (it is convenient).

Diving into a particular source file I start extracting portions out. Org supports noweb syntax for references. Naming a block I can reference it in its original location by surrounding the name with `<<` and `>>`:

  #+NAME: main
  #+BEGIN_SRC cpp
    // copy of main()
  #+END_SRC

  The rest of the source:

  #+BEGIN_SRC cpp :noweb yes :tangle footer.cpp
    // bunch of code

    <<main>>

    // remaining source
  #+END_SRC
Periodically run org-tangle and git diff. If the code is changed, more than whitespace (sometimes I lose blank lines, that doesn't change the meaning of programs in any language I use), then I botched the extraction. Go back and fix it. You can present a file path, not just the name, in the filename part of `:tangle` so you can do this in a parallel git repo so the work is under version control but not impacting the real project repo.

Repeat this, shifting file contents around to draw attention to interesting, important, or complex bits. Uninteresting and boilerplate stuff gets shoved to the bottom in "appendices", as I usually name them:

  * Appendix I: All the includes, nothing interesting

  #+BEGIN_SRC cpp
    // `,` needed before # within org-babel blocks, but doesn't show up in generated file
    ,#include <iostream>
  #+END_SRC
I use links and references to cross reference most of it, but probably not all since, as with most things, there is a point of diminishing returns. Org can generate HTML and other document formats, so I take advantage of that to produce something shareable. I add documentation that covers critical things, especially non-obvious or complex ones. Got a complex set of equations? I write them out in TeX notation so it's clearer than the raw code, explaining the variables, or adding a reference document.

The todo list gets expanded with subtasks (to the containing file) as I see the contents of files. These might be class names, function names, or a name capturing some trait or purpose of a collection of functions. Not every function is worth documenting, many are obvious. But any longer or more complex ones will usually get an entry and a link and be extracted to their own source block. Their contents may be further extracted since LP permits it so I can draw attention to the things that I think are most important.

----------

Primary deficiency of this method: I'm the only one who does it, it's a separate repo, and if I'm not a primary contributor to the actual project this will not be maintained. If the system is somewhat stable that's not a big deal. But it will become out of date and scrapped eventually. It's good for kickstarting a project though because you can either guess at what you're doing, or try to understand the system and be deliberate about your changes, extensions, etc. I prefer to be deliberate. The guessing approach never worked out well for me.

----------

This approach also lets me identify certain critical points in the system, like "seams" in Michael Feathers terms. This is helpful for getting to the actual work (writing or changing code), which will usually require introducing tests that don't exist or were removed by some asshole contractors. I'll document these things, since I'm not usually actively changing the code yet these are notes. I'll also draw attention to potential insecure code or questionable code.

----------

This isn't the only thing I do (or try to do). I've described it in the past as dissecting and vivisecting. The above is dissection. The code isn't "live". The whole process can, and should, be paired with various tracing and debugging tools to actually exercise the system and get the real control flow. Especially if how a function/method is triggered isn't obvious by looking at the static system. Which series of actions on the real system will bring us to this point? Ok I can make a note of it, and maybe it's actually statically traceable I just missed something but the trace or stepping through with the debugger gives me the details I missed.

----------

For smaller programs (I deal with systems of systems these days, so individual programs may actually be pretty small even if the whole system is still "large" by some definition) I may put this into a single org file for all the source files. The explicit filename parameter to `:tangle` is helpful here, but it's also a lot easier to link everything together.

I'll also use a single org file to create a focused document presenting some critical thing. How do we get from main to X? Here's the path, eliminating everything else. This version may not tangle into a compilable solution because of what I leave out, but it's a good presentation format.

----------

Text-based graphical presentation tools like graphviz and plantuml play well with this method. I can embed graphs and diagrams as texts in the same org file which will be rendered when exported to HTML or another format.


You could totally get a great article out of that.


Thanks, maybe one day. I've started pushing for "lunch & learn" events at work again so I may dig out a smaller program to demo the process on and a medium sized one to show a more substantial result. If I use non-proprietary source code for it then I could throw it on my github to share.


I'd say that is one way to have a 10x impact.


I like the section on Document and verify.

Particularly in startup situations without processes (i.e. formal product requirements documentation, SW architecture descriptions), the codebase can be largely undocumented.

As a new member of the team, it can be beneficial to start generating this documentation to:

- buy more time to review the code, since having some sort of work product assures management that you are indeed coming up to speed on the code rather than playing video games

- is actually beneficial to the team, especially as it grows, since new additions to the team will likely need onboarding documentation


When working in a big unfamiliar code base, I:

* have a goal to do something specific, like saving some image data

* just dig around trying to find where the right place is to get the data I need

* hack in code to see if I can get the damn thing to work at all

* once it's working, try to do it neatly if I can

* don't worry too much about not understanding everything - it can take months to work out how a system works. This only comes through doing lots and lots of iterations of hacking in little fixes and along the way learning what you can


You can send this list to everyone, and people will still struggle. Sometimes I'm just amazed at some folks that join the org and are able to finish tickets in no time. Other times, it doesn't matter how much help you provide: they start to get productive in one part of the system, but moving to another component is a challenge. Without help, they are incapable of advancing.

> Let your instincts guide you if you think something feels much more difficult than you’d expect, drawing from prior experience as necessary.

I wonder if this is the knack that so few software engineers possess, and I'm not talking about the 10x bunch.


I think I have the knack. I don’t think it’s anything special.

Read the code. Believe the code. Not what you think it’s doing, not what the comments imply, not even what variable/function/class names imply. What is the code actually doing?

Great, now you can fix it.

And remember: when you’ve eliminated the impossible, whatever is left, no matter how weird, is what the code is doing. Don’t be afraid to console log.

I see all the time engineers wasting minutes and hours trying to reason about code when a good print statement could answer their question in seconds.


> Read the code. Believe the code. Not what you think it’s doing, not what the comments imply, not even what variable/function/class names imply. What is the code actually doing?

100% agree, but in my experience there's one more step after this. What should the code be doing? If you're lucky, you have tests or human-readable documentation that explains business rules, but in legacy applications, odds are the only way to know is find an enduser that possesses the knowledge. These are the worst time sinks.


One way to be sure what the code is doing is writing unit tests.

Just dumb black boxing is fine: `assert f(0) == 0`. When it returns "Pancake", change to `assert f(0) == "Pancake"`

...repeat until you understand :)
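e.g. a tiny pytest-style characterization test, with a made-up f standing in for the mystery function:

  # Hypothetical stand-in for the unfamiliar function; in practice you'd import
  # it from the real codebase instead of defining it here.
  def f(x):
      return "Pancake" if x <= 0 else x

  def test_f_with_zero():
      assert f(0) == "Pancake"  # started life as `assert f(0) == 0`, pinned after the first failure

  def test_f_with_positive_input():
      assert f(5) == 5          # another guess, updated once the real output is known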


I don't think many can read code efficiently, so people who can do that are special. People who can't have to rely on documentation or slowly trudging through.


There is a mix. Some things you can learn from the code; other things you need documentation for.

Often it's the domain knowledge that is hard to glean from the code. Especially when bug fixing. For example:

   The domain expert told me this output was wrong but the code appears to be doing exactly what it claims to. 
Trying to get a domain expert to sit down and explain why the output is wrong is a skill that took me way too long to develop. Now, sitting down with pen and paper and running through the logic with an expert is one of my most treasured information-gathering techniques.


You are right. I think that part comes with experience.

I’ve been coding my whole life (since 9 yo, am 35 now) so a lot of this stuff is very deeply ingrained.


There seems to be a large contingent of people who learn almost exclusively by rote (this is not just in software engineering). A major problem they have is generalizing that knowledge to other systems or other parts of the same system. See the many people who struggle to learn a new programming language. I have worked with people who were baffled by *nix systems (especially the terminal interfaces) because their comprehension of "folders" or "directories" was a strictly graphical one (usually based on Windows' model, but they could usually adapt to another file browser at least). They even struggled to interact with filesystems programmatically as a consequence.

I've had some success helping people break out of this, but it usually boils down to individual motivation. These are the people I struggle to work with (in a mentoring/teaching capacity) the most because they get extremely frustrated by variations and novelty. Give them a well-written procedure, though, and they're usually good. Just hope there aren't any knowledge assumptions you made in writing it that don't hold.


I wonder if we can use the vector representations from the inner layers of LLMs to create some kind of semantic code and text analysis tools. Maybe such a tool can highlight salient areas, and tag regions of code as "boilerplate", "standard idiom in this language", "business logic", "performance optimization", et alia.

As for taking notes, is there an IDE (or better yet a Vim plugin) that lets me annotate specific lines or regions with my own notes and comments? Or even annotate functions and classes that way, so I can see those comments in a pop-up box whenever I hover over an annotated function.


One of the best pieces of writing I've come across that I still come back to about grokking complex codebases was written by Mitchell Hashimoto.

https://mitchellh.com/writing/contributing-to-complex-projec...


My two big pieces of advice (which I’ve also given to team members growing into senior roles):

1. Learn how and when to step into “external” code when debugging. The “how” is mainly about navigating call stacks that might be obfuscated by default, while building pattern recognition for cases where problems in your code show up in third party call sites versus simple cases of misconfiguration.

2. Sorry but… learn regex. Even if your language is static everything, you’re probably going to hit a wall using code navigation tooling. If you know the general format of things you’re looking to find and call stacks aren’t getting you there, regex is an incredibly powerful way to find what you’re looking for. Don’t rely on it for aggressive mass changes, but definitely get comfortable with it for finding potential root causes.


> Sorry but… learn regex.

Do not be sorry.

find and grep.

How is it possible to exist without them?


I’m not really sorry, but people really hate regex. For reasons I comprehend but I use them on the job pretty much daily.


I think something like Github Copilot can be a huge boost to getting up to speed in big unfamiliar codebases -- it will autocomplete what you're trying to do with lots of "unknown unknowns", patterns and APIs in the codebase that you don't yet know about.


I've been thinking about LLMs' ability to help with this. If they really did "memorize" data as well as some people think they do, it'd be a great boost to enterprise software development.


I've been working on a semantic search engine for code that's particularly useful to those navigating unfamiliar codebases. Just write an English description of whatever you're looking for, and it'll show you the most relevant/similar functions from the repository.

After all, it's often useful to see how things are done, and why. Search at least makes the former possible.

Email me at govind <dot> gnanakumar <at> outlook <dot> com if you'd like to try out a beta—on account of potentially being hugged to death, I don't want to publicly display the link.


Huh, I wrote about this a few months ago too (with a lot more whimsy): Eating Elephants -- https://docs.google.com/document/d/1c07-Zj6bUbYPwx7Zttd1N74o...

The thrust of both our articles is similar, except I think I ended up using many more words.


My first step is to learn the data model for the application. Then everything starts to make sense as code to read/write/present that model.


Yes. If there is a coherent data model.


This is a parallel topic: how do you grok an unfamiliar API?

You have API access and a blank piece of paper, what are the first three things you’d write down?


No source, no docs? I'm assuming some app uses it: connect my phone to a proxy (I use mitmweb to inspect), close all apps except that one, and tinker with the app to see what requests come up.


Aside - where was the diagram made?


Looks like something one could make pretty easily with Excalidraw (https://excalidraw.com/) although the font seems slightly different.


Maybe not the case, but often that style was made using Excalidraw: https://excalidraw.com. Playing around in it, the font at least appears the same.


Haha, I had the same thought about the program and the opposite conclusion about the font! Many characters like g and p look similar, but the lower case t has a curl on the bottom in Excalidraw, and is straight in the article's diagram.

Edit: on second look, it is the same t. I saw some images in a search for Excalidraw with a different t, but when I tried the actual site it was the same. Sorry for the noise!


I'll reply here (because I simultaneously replied to your other comment but removed it). I also thought it was funny we had come to opposite conclusions based on the same data.

Type out "Bottom-up" in excalidraw, it matches perfectly. The lower case `t` has a curl on the bottom in the diagram too.

EDIT: Re Edit:

No worries about noise. I had a good chuckle with our near simultaneous replies which was nice after the way our project has been going this year.


This sounds like an extremely fun job (or jobs?), how can I get it?


Just ask chatgpt to summarize it. How hard can it be?


It can't do that yet but I'm really excited about the day coming when AI can understand an entire code base.

ChatGPT has already completely revolutionized programming for me; I look forward to further advances.

It's truly unbelievable, like having a permanent pair programmer with me who I can ask how things work, how to fix problems, and more importantly than anything else, ask for examples of how to do things, which I then hack into my code.


I too look forward to machines rewriting my code and telling me I'm not smart enough to understand it when I object to their questionable output. I can see a 0.1x PeonGPT telling me that it got a ShipIt from a 10x TeamLeadGPT and we are moving forward with the code. Any concerns? Create a jira ticket which JiraGPT will promptly label as "tech debt" and will throw into the "backlog" for it never to look at it again.


A whole new take on `make world`



