5 quick steps to immediately make an (older) NLP research codebase more usable

Reproducing studies is a big part of AI degree curricula. If we are lucky enough that the authors published some of their code alongside the paper, that code is often outdated and difficult to run, with a complicated flow or just a blob of undocumented code. Here are five things that can bring immediate improvement.

Linting

Whenever we try to understand a new codebase, we spend a lot of time reading code. Reading code is an essential skill and one that will carry us forward in any project really quickly. A tidy, (for Python) PEP8-compliant codebase eases the friction caused by weird naming, inconsistent comments and long-forgotten #todos.

I like to use Black (strictly speaking an auto-formatter rather than a linter), since that is what I am used to. There are plenty of other formatters and linters out there to pick from.
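As a tiny, made-up illustration (neither line comes from a real codebase), this is roughly the kind of cleanup Black performs automatically:

```python
# A typical "before" line from research code, kept here as a comment:
# results={'acc':0.91,'f1':0.88};print( 'eval',results )

# Roughly what Black produces: one statement per line, consistent quotes and spacing.
results = {"acc": 0.91, "f1": 0.88}
print("eval", results)
```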

I am also excited about Dr. Felienne Hermans's new book on reading code (she is the author of "The Programmer's Brain") and plan to use it to deepen my practice, so check it out if you are interested.

Clear out all the warnings

Whether they point to deprecations, suspicious calculations, or hardware issues, warnings are there to tell us something might be wrong with our code. If it is a deprecation warning, you might be missing out on a newer, better way to do the exact same thing you are doing. If you get warnings about odd values and you are doing deep learning, that is definitely something to check out. Even if a warning is something you can't fix or will end up ignoring, it is worth spending time on it to make that decision consciously.
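Python's built-in warnings module helps make that decision explicit: you can promote warnings to errors while you investigate, and then silence only the specific ones you have consciously decided to live with. A minimal sketch (the filter choices and message pattern are just examples):

```python
import warnings

# While investigating: turn deprecation warnings into errors so they cannot be missed.
warnings.filterwarnings("error", category=DeprecationWarning)

# Once a warning has been looked at and deliberately accepted, silence only that one,
# ideally with a comment explaining why (the message pattern here is hypothetical).
warnings.filterwarnings(
    "ignore",
    message=".*some upstream deprecation we cannot fix yet.*",
    category=DeprecationWarning,
)
```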

Clearing out warnings gives you code that is more reliable and that stays updated and maintainable for longer. Additionally, I know that I love clean run logs: output that is not littered with unnecessary messages makes it much easier to spot the information from my learning algorithms that I actually need.

In the end, clearing warnings from other people’s code might make you think twice about some of their choices, and this is worthwhile in itself.

Use low-hanging fruit from optimisation guides

All of the reasons above are a great incentive to optimise your code, but one more might be even more important to students, academic researchers, and people who are learning: computational resources. If you'd like results in a reasonable time, or at least want to attempt that deadline, the models need to run faster on the little compute you have.

There is a lot written online on how to do this, and it is all worthwhile reading. Some optimisations require more digging and a real understanding of what the code does (for example, if you change the way some core calculation is made). Others, like the low-hanging fruit found in optimisation guides, can often be dropped straight into that outdated code with little effort.

Here is an example from the PyTorch documentation.
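To give a flavour, here is a small, self-contained sketch of the kind of low-effort settings PyTorch's performance tuning guide mentions; the model and data are placeholders, and whether each setting actually helps depends on your hardware and code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data, only there to make the sketch runnable.
model = nn.Linear(128, 2)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))

# Low-hanging fruit: let cuDNN pick fast kernels when input sizes do not change.
torch.backends.cudnn.benchmark = True

if __name__ == "__main__":
    # Load batches in worker processes and pin memory for faster host-to-GPU copies.
    loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for inputs, targets in loader:
        optimizer.zero_grad(set_to_none=True)  # cheaper than zeroing gradient tensors
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

    # At evaluation time, disabling gradient tracking saves memory and time.
    with torch.no_grad():
        predictions = model(torch.randn(8, 128)).argmax(dim=1)
```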

Documenting

Do you think you will remember all the little things about the new codebase you are only slowly discovering? Which data files need to be generated first, which scripts do what, and how do you get it all running? And what if you have to move to another machine because this one has the wrong CUDA version and you don't have admin rights, or your computer suddenly dies? How long would it take to replicate the setup from memory alone? All the little packages that needed installing, the environment variable settings, and so on and so on.

The solution to all those problems: documentation. Dumping what you learn into that README, in a hopefully organised manner, will pay off many times over. Not to mention how much it will help the people after you who attempt exactly the same thing.
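One small, concrete habit (my own sketch, not something from the original codebase): dump the interpreter version and installed packages to a file you can paste into the README's setup section instead of reconstructing it from memory later. The ENVIRONMENT.md name is just an example:

```python
import platform
import sys
from importlib.metadata import distributions

# Write the Python version and installed packages to a file that can be
# pasted into the README's setup section (the file name is arbitrary).
with open("ENVIRONMENT.md", "w") as f:
    f.write(f"Python {platform.python_version()} on {sys.platform}\n\n")
    for dist in sorted(distributions(), key=lambda d: (d.metadata["Name"] or "").lower()):
        f.write(f"- {dist.metadata['Name']}=={dist.version}\n")
```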

Improve the train/test workflow

Software project organisation is a skill, and one that sometimes requires upfront work that cannot always be justified under tight research deadlines. Essential parts of the project might only be discovered as the researchers go, hitting the limits of compute, data usefulness, model performance, and any other surprise under the sun. That often means last-minute scripts and extra steps bolted onto the training/test workflow. Researchers, unlike software engineers working on company code, have little incentive to come back and streamline or properly integrate those changes once a project is finished and they have moved on to the next big thing.

Turning the training/test workflows into integrated scripts that fetch and transform the data, package everything, and finally run train/test might bring the biggest change to your everyday work on this particular project. It is highly likely that you will have to run those scripts many, many times.
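As an illustration (every function and flag name here is hypothetical), the entry point does not need to be fancy; a single script that chains the steps behind a couple of flags already goes a long way:

```python
import argparse


def fetch_data():   # placeholder: download or locate the raw data
    print("fetching data ...")


def preprocess():   # placeholder: build the files training expects
    print("preprocessing ...")


def train():        # placeholder: call the existing training code
    print("training ...")


def evaluate():     # placeholder: call the existing evaluation code
    print("evaluating ...")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run the full train/test workflow.")
    parser.add_argument("--skip-data", action="store_true", help="reuse already prepared data")
    parser.add_argument("--eval-only", action="store_true", help="skip training, only evaluate")
    args = parser.parse_args()

    if not args.skip_data:
        fetch_data()
        preprocess()
    if not args.eval_only:
        train()
    evaluate()
```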

Bonus: Make careful incremental changes

At this stage of rapid development and big, impactful changes, from optimisation to workflow restructuring to simply making the code run, making small, contained, incremental changes might not be the first thing on your mind.

But git revert is beautiful, and it will feel amazing when a problem comes up: you will always know which change broke the run, and it will always be easy to undo. Wasting time on debugging would be counter-productive and understandably frustrating in this initial stage of working on an older codebase.

Thanks for reading! Here are two more articles on reproducing NLP/ML studies: how to make a successful reproduction project from more of a project management perspective, and how to take it beyond that stage with your own contribution.
