Ahh, the ocean.
During a recent trip to the Mediterranean Sea, I found myself lying on the beach, staring into the waves. Lady Luck was having a good day: the sun glared down from a blue and cloudless sky, heating the sand and the salty sea around me. For the first time in a while, I had downtime. There was nothing related to ML in the remote region where I was, where the rough roads would have scared away anybody who's used to the even pavements of western countries.
Then, away from work and, partially, civilization, somewhere between zoning out and full-on daydreaming, my thoughts began to drift. In our day-to-day business, we're too, well, busy to spend time doing nothing. But "nothing" is a strong word here: as my thoughts drifted, I first recalled recent events, then contemplated work, and then, eventually, arrived at machine learning.
Maybe traces of my previous article (where I reflected on 6.5 years of "doing" ML) were still lingering in the back of my mind. Or maybe it was simply the complete absence of anything technical around me, where the sea was my only companion. Whatever the reason, I mentally started rehearsing the years behind me. What had gone well? What had gone sideways? And, most importantly, what do I wish somebody had told me at the start?
This post is a collection of those things. It's not meant to be a list of dumb mistakes that I urge others to avoid at all costs. Instead, it's my attempt to write down the things that would have made my journey a bit smoother (but only a bit; uncertainty is necessary to make the future just that: the future). Parts of my list overlap with my previous post, and for good reason: some lessons are worth repeating, and learning again.
Here's Part 1 of that list. Part 2 is currently buried in my sandy, sea-water-stained notebook. My plan is to follow up with it in the next couple of weeks, once I have enough time to turn it into a quality article.
1. Doing ML Mostly Means Preparing Data
This is a point I try not to think too much about, or it will tell me: you didn't do your homework.
When I started out, my internal monologue was something like: "I just want to do ML." Whatever that meant; I had visions of plugging neural networks together, combining methods, and running large-scale training. While I did all of that at one point or another, I found that "doing ML" often means spending a lot of time just preparing the data so that you can eventually train a machine learning model. Model training, paradoxically, is often the shortest and last part of the whole process.
Thus, every time I finally get to the model training step, I mentally breathe a sigh of relief, because it means I've made it through the invisible part: preparing the data. There's nothing "sellable" in merely preparing the data. In my experience, preparing the data is not noticeable in any way (as long as it's done well enough).
Here's the usual pattern:
- You have a project.
- You get a real-world dataset. (If you get to work with a well-curated benchmark dataset, then you're lucky!)
- You want to train a model.
- But first… data cleaning, fixing, merging, validating.
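In code, that "but first" step often looks something like the following sketch. It uses pandas, and all the column names, values, and thresholds are made up for illustration; real pipelines are messier:

```python
import numpy as np
import pandas as pd

# Hypothetical raw measurements from two sources (in practice: files, APIs, archives).
sensors = pd.DataFrame({"station": ["A", "A", "B", "C"],
                        "temp_c": [21.4, np.nan, 19.8, 55.0]})   # one gap, one outlier
stations = pd.DataFrame({"station": ["A", "B", "C"],
                         "lat": [48.1, 52.5, 50.9]})

# Cleaning: drop gaps, reject physically implausible readings.
sensors = sensors.dropna(subset=["temp_c"])
sensors = sensors[sensors["temp_c"].between(-60, 50)]

# Merging: attach station metadata; validate that the join is many-to-one.
merged = sensors.merge(stations, on="station", how="left", validate="many_to_one")

# Validating: every surviving reading must have found its station.
assert merged["lat"].notna().all(), "reading without station metadata"
```

None of this is glamorous, and none of it shows up in the final model, but skipping any of these steps tends to come back later as a much more expensive bug.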
Let me give you a personal example, one that I've since told as a funny story (which it is now; back then, it meant redoing a few days of machine learning work under time pressure…).
I once worked on a project where I wanted to predict vegetation density (using the NDVI index) from ERA5 weather data. ERA5 is a large gridded dataset, freely available from the European Centre for Medium-Range Weather Forecasts. I merged this dataset with NDVI satellite data from NOAA (basically, the American weather agency), carefully aligned the resolutions, and everything seemed fine: no shape mismatches, no errors thrown.
Then, I called the data preparation done and trained a Vision Transformer model on the combined dataset. A few days later, I visualized the results and… surprise! The model thought Earth was upside down. Literally: my input data was right-side up, but the target vegetation density was flipped at the equator.
What had happened? A subtle bug in my resolution translation flipped the latitude orientation of the vegetation data. I hadn't noticed it because I was already spending a lot of time on data preparation, and wanted to get to the "fun part" quickly.
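A cheap sanity check on the coordinate axes, run before any training, would have caught this in seconds. Here is a minimal sketch of such a check with NumPy; the shapes and names are hypothetical, not my original pipeline:

```python
import numpy as np

def align_to_ascending_lat(lat, grid):
    """Return (lat, grid) with latitude ascending, flipping the data rows if needed.

    Assumes grid has shape (..., lat, lon) and lat is monotonic.
    """
    if lat[0] > lat[-1]:                 # stored north-to-south: flip both together
        return lat[::-1], grid[..., ::-1, :]
    return lat, grid

# Hypothetical 1-degree grids: ascending latitude on one side,
# descending on the other (the exact trap I fell into).
lat_a = np.linspace(-90, 90, 181)
lat_b = np.linspace(90, -90, 181)        # opposite orientation!
ndvi = np.random.rand(12, 181, 360)      # (time, lat, lon)

lat_b, ndvi = align_to_ascending_lat(lat_b, ndvi)
assert np.allclose(lat_a, lat_b), "latitude axes disagree after alignment"
```

Merging on array indices alone passes silently even when the axes point in opposite directions; asserting on the coordinate values themselves is what makes the flip visible.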
This kind of mistake drives home an important point: real-world ML projects are data projects. Especially outside academic research, you're not working with CIFAR or ImageNet. You're working with messy, incomplete, partially labelled, multi-source datasets that require:
- Cleaning
- Aligning
- Normalizing
- Debugging
- Visual inspection
And even more; that list is non-exhaustive. Then repeating all of the above.
Getting the data right is the work. Everything else builds on that (sadly invisible) foundation.
2. Writing Papers Is Like Preparing a Sales Pitch
Some papers just read well. You might not be able to explain why, but they have a flow, a logic, a clarity that's hard to ignore. That's rarely by accident*. For me, it turned out that writing papers resembles crafting a very specific kind of sales pitch. You're selling your idea, your approach, your insight to a skeptical audience.
This was a surprising realization for me.
When I started out, I thought most papers looked and felt the same. They were all "scientific writing" to me. But over time, as I read more papers, I began to notice the differences. It's like that saying: to outsiders, all sheep look the same; to the shepherd, each one is distinct.
For example, compare these two papers that I came across recently:
Both use machine learning. But they speak to different audiences, with different levels of abstraction, different narrative styles, and even different motivations. The first one assumes that technical novelty is central. The second focuses on relevance for applications. Obviously, there is also the visual difference between the two.
The more papers you read, the more you realize: there is no single way to write a "good" paper. There are many ways, and the right one varies depending on the audience.
And unless you're one of those very rare brilliant minds (think Terence Tao or somebody of that caliber), you'll likely need help to write well. Especially when tailoring a paper for a specific conference or journal. In practice, that means working closely with a senior ML person who understands the field.
Crafting a good paper is like preparing a sales pitch. You need to:
- Frame the problem the right way
- Understand your audience (i.e., the target venue)
- Emphasize the parts that resonate most
- And polish until the message sticks
3. Bug Fixing Is the Way Forward
Years ago, I had this romantic idea of ML as exploring elegant models, inventing new activation functions, or crafting clever loss functions. That may be true for a small set of researchers. But for me, progress often looked like: "Why doesn't this code run?" Or, even more frustrating: "That code just ran a few seconds ago; why does it not run now?"
Let's say your project requires using Vision Transformers on environmental satellite data (i.e., the model side of Section 1 above). You have two options:
- Implement everything from scratch (not recommended unless you're feeling particularly adventurous, or need to do it for course credit).
- Find an existing implementation and adapt it.
In 99% of cases, option 2 is the obvious choice. But "just plug in your data" almost never works. You'll run into:
- Different compute environments
- Assumptions about input shapes
- Preprocessing quirks (such as data normalization)
- Hard-coded dependencies (of which I'm guilty, too)
Quickly, your day can become an endless series of debugging, backtracking, testing edge cases, modifying dataloaders, checking GPU memory**, and rerunning scripts. Then, slowly, things begin to work. Eventually, your model trains.
But it's not fast. It's bug fixing your way forward.
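To make the "assumptions about input shapes" point concrete, here is a toy NumPy sketch (not any particular repo's code) of what happens when a borrowed ViT-style patch projection expects 3 RGB channels but your satellite tiles carry more spectral bands:

```python
import numpy as np

def patchify(img, patch=16):
    """Split a (channels, H, W) image into flat ViT-style patches:
    shape (num_patches, channels * patch * patch)."""
    c, h, w = img.shape
    img = img.reshape(c, h // patch, patch, w // patch, patch)
    img = img.transpose(1, 3, 0, 2, 4)            # (h_p, w_p, c, patch, patch)
    return img.reshape(-1, c * patch * patch)

# The borrowed repo's input projection was sized for 3 RGB channels:
w_rgb = np.random.rand(3 * 16 * 16, 768)
# Our (hypothetical) satellite tile has 10 spectral bands instead:
tokens = patchify(np.random.rand(10, 224, 224))   # shape (196, 2560)

try:
    tokens @ w_rgb                                # 2560 != 768: shapes don't match
except ValueError as e:
    print("input-shape assumption broken:", e)

# The usual fix: replace only the input projection, keep the rest of the model.
w_sat = np.random.rand(10 * 16 * 16, 768)
embedded = tokens @ w_sat                         # (196, 768), ready for the encoder
```

The error itself takes seconds to hit; finding where in a large, unfamiliar codebase the "3" is hard-coded is what eats the afternoon.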
4. I (Very Likely) Won't Make That Breakthrough
You've undoubtedly heard of them. The Transformer paper. The GANs. Stable Diffusion. There's a small part of me that thinks: maybe I'll be the one to write the next transformative paper. And sure, somebody has to. But statistically, it probably won't be me. Or you, apologies. And that's fine.
The works that cause a field to change rapidly are exceptional by definition. That these works are exceptional directly implies that most work, even good work, is barely recognized. Sometimes, I still hope that one of my projects will "blow up." But, so far, most haven't. Some didn't even get published. But, hey, that's not failure; it's the baseline. If you expect every paper to be a home run, then you are on the fast lane to disappointment.
Closing thoughts
To me, machine learning often appears as a glossy, cutting-edge field, one where breakthroughs are just around the corner and where the "doing" means brilliant people making magic with GPUs and math. But in my day-to-day work, it's rarely like that.
More often, my day-to-day work consists of:
- Handling messy datasets
- Debugging code pulled from GitHub
- Redrafting papers, again and again
- Not producing novel results, again
And that’s okay.
Footnotes
The previous article mentioned above: https://towardsdatascience.com/lessons-learned-after-6-5-years-of-machine-learning/
* In case you are interested, my favorite paper is this one: https://arxiv.org/abs/2103.09762. I read it one year ago on a Friday afternoon.
** To this day, I still get email notifications about how clearing the GPU memory is impossible in TensorFlow. This 5-year-old GitHub issue gives the details.

