AI Papers to Read in 2025

I’m back with my series of AI paper recommendations. My long-term followers may recall the four earlier editions ([1], [2], [3], and [4]). I’ve been away from writing for quite a while, and I couldn’t think of a better way to return than resuming my most successful series, and the one I enjoyed writing the most.

For the uninitiated, this is a very opinionated list, full of views and tangents, meant to keep you updated on AI as a whole. This isn’t a list of state-of-the-art models but of real insights on what to look for in the coming years and what you might have missed from the past. The goal is to help you think critically about the state of AI.

In total, there are ten paper suggestions, each with a brief description of the paper’s contribution and explicit reasons why these papers are worth reading. Moreover, each has a dedicated further reading section with a number of tangents to explore.

Before we move on, back in my 2022 article, I kicked off by saying “we don’t need larger models; we need solutions” and “don’t expect me to suggest GPT nonsense here.” Back then, I was pretty sure I would repeat myself in the future, that a new GPT model would just be a larger and marginally better model, but far from groundbreaking. However, credit where credit is due. Since release, ChatGPT has sparked many new solutions and is certainly a turning point in all of computer science.

Last but not least, as a small disclaimer, most of my AI work centers around Computer Vision, so there are likely many excellent papers out there on topics such as Reinforcement Learning, Graphs, and Audio that are just not on my radar. If there’s any paper you believe I should know, please let me know ❤.

Let’s go!

#1 DataPerf: A Benchmark for Data-Centric AI

Mazumder, Mark, et al. “Dataperf: Benchmarks for data-centric ai development.” arXiv preprint arXiv:2207.10062 (2022).

From 2021 to 2023, Andrew Ng was very vocal about data-centric AI: shifting our focus from evolving models over static datasets towards evolving the datasets themselves, while keeping models static or mostly unchanged. In their own words, our current model-centric research philosophy neglects the fundamental importance of data.

In practical terms, it’s often the case that increasing the dataset size, correcting mislabeled entries, and removing bogus inputs is far more effective at improving a model’s output than increasing its size, number of layers, or training time.
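To make the "correcting mislabeled entries" point a bit more concrete, here is a minimal sketch, not taken from the DataPerf paper, of one simple way to surface suspicious labels: flag every sample whose label disagrees with most of its nearest neighbours. The function name `flag_suspect_labels`, the value of `k`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def flag_suspect_labels(X, y, k=5):
    """Return indices whose label disagrees with the majority of their k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances (fine for small datasets).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    agree = (y[neighbours] == y[:, None]).mean(axis=1)
    return np.flatnonzero(agree < 0.5)

# Toy data (hypothetical): two well-separated clusters, one label flipped on purpose.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
y[3] = 1  # inject a labelling error
suspects = flag_suspect_labels(X, y, k=5)
print(suspects)  # [3]
```

On real data the flagged entries would only be candidates for manual review, since a disagreeing neighbourhood can also mean a genuinely hard example.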

In 2022, the authors proposed DataPerf, a benchmark for data-centric AI development, including tasks on speech, vision, debugging, acquisition, and adversarial problems, alongside the DataPerf working group. The initiative aims to foster data-aware methods and seeks to close the gap between the data departments of many companies and academia.

Reason 1: Most, if not all, companies working on niche topics end up developing internal datasets. It’s wild how little research exists on how to do this properly/better.

Reason 2: A reflection: how many papers present a solid 2% improvement over the State-of-the-Art (SOTA) nowadays? How much additional data would you need to boost your accuracy by 2%?

Reason 3: For the rest of your career, you might wonder, what if instead of doing the proposed X, we just collected more data?

Reason 4: If you are in academia, stuck with some X or Y dataset, trying to figure out how to get a 0.1% improvement over SOTA, know that life can be much more than that.

Further Reading: In 2021, it all began with Deeplearning.AI hosting a data-centric AI competition. You can read about the winner’s approach here. Since then, there has been plenty of work dedicated to the subject by other authors, for instance, 2023’s Data-centric Artificial Intelligence: A Survey. Lastly, if you are a Talks kind of person, there are many by Andrew Ng on YouTube championing the topic.

#2 GPT-3 / LLMs are Few-Shot Learners

Brown, Tom, et al. “Language models are few-shot learners.” Advances in Neural Information Processing Systems 33 (2020): 1877–1901.

This NeurIPS paper presented GPT-3 to the world. OpenAI’s third-gen model was in almost every way just a bigger GPT-2, with 116 times more parameters and trained on 50 times more data. Their biggest finding wasn’t that it was just “better” but that how you prompted it could drastically improve its performance on many tasks.

Machine Learning models are often expressed as predictable functions: given the same input, they will always yield the same output. Current Large Language Models (LLMs), on the other hand, can pose and answer the same question in many different ways: wording matters.
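For readers who have never assembled one, a few-shot prompt is just the task description followed by a handful of solved examples and then the open question. The sketch below is a minimal illustration: the helper `build_few_shot_prompt` and its "Input:/Output:" layout are assumptions of mine, not a format prescribed by the paper, and no model is called.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a plain-text few-shot prompt: instruction, worked examples, then the open query."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```

Dropping the examples list to zero or one entry gives the zero-shot and one-shot variants of the same prompt.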

Reason 1: Previously, we discussed keeping models static while we evolve the dataset. With LLMs, we can evolve the questions we ask.

Reason 2: GPT-3 sparked the field of prompt engineering. After it, we started seeing authors proposing techniques like Chain-of-Thought (CoT) and Retrieval-Augmented Generation (RAG).

Reason 3: Prompting well is far more important than knowing how to train or finetune LLMs. Some people say prompting is dead, but I don’t see that happening ever. Ask yourself: do you phrase requests the same way when addressing your boss vs your mom or friends?

Reason 4: When transformers came out, most research targeted their training/inference speed and size. Prompting is a genuinely fresh topic in natural language processing.

Reason 5: It’s funny when you realize that the paper doesn’t really propose anything; it just makes an observation. Has 60k citations, though.

Further Reading: Prompting reminds me of ensemble models. Instead of repeatedly prompting a single model, we could train multiple smaller models and aggregate their outputs. Now nearly three decades old, the AdaBoost paper is a classic on the topic and a read that will take you back to way before even word embeddings were a thing. Fast forward to 2016, a modern classic is XGBoost, which is now on its v3 upgrade.

#3 Flash Attention

Dao, Tri, et al. “FlashAttention: Fast and memory-efficient exact attention with io-awareness.” Advances in Neural Information Processing Systems 35 (2022): 16344–16359.

Since the 2017 groundbreaking paper “Attention is All You Need” introduced the Transformer architecture and the attention mechanism, several research groups have dedicated themselves to finding a faster and more scalable alternative to the original quadratic formulation. While many approaches were devised, none has truly emerged as a clear successor to the original work.

The original Attention formulation. The softmax term represents how important each token is to each query (so for N tokens, we have N² attention scores). The “transform” (in the name Transformer) is the multiplication between this N² attention map and the N-sized V vector (much like a rotation matrix “transforms” a 3D vector).
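For reference, the formulation the caption describes is the scaled dot-product attention from “Attention is All You Need”, written here with Q, K, and V as the query, key, and value matrices and d_k as the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```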

In this work, the authors don’t propose a new formulation or a clever approximation to the original formula. Instead, they present a fast GPU implementation that makes better use of the (complicated) GPU memory structure. The proposed method is significantly faster while having little to no drawbacks over the original.
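To see why a re-implementation can stay exact, here is a rough NumPy sketch of the running-softmax idea: keys and values are visited one block at a time, so the full N × N score matrix is never stored, yet the result matches the naive computation. This is only the arithmetic, under my own simplifications; the actual FlashAttention is a fused GPU kernel that also tiles the queries and keeps the tiles in fast on-chip memory. The function names and block size are hypothetical.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference attention: materialises the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Same result, but K and V are visited block by block with a running softmax."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    row_max = np.full((n, 1), -np.inf)   # running maximum of the scores per query
    row_sum = np.zeros((n, 1))           # running softmax denominator per query
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                      # only an n x block tile
        new_max = np.maximum(row_max, S.max(axis=1, keepdims=True))
        scale = np.exp(row_max - new_max)              # rescale what was accumulated so far
        P = np.exp(S - new_max)
        row_sum = row_sum * scale + P.sum(axis=1, keepdims=True)
        out = out * scale + P @ Vb
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 3))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))
```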

Reason 1: Many research papers get rejected because they are just new implementations or not “novel enough”. Sometimes, that’s all we need.

Reason 2: Research labs crave the attention of being the new Attention, to the point it’s hard for any new Attention to ever get enough attention. In this instance, the authors only improve what already works.

Reason 3: In retrospect, ResNet was groundbreaking for CNNs back in the day, proposing the residual block. In the following years, many proposed improvements to it, varying the residual block idea. Despite all that effort, most people just stuck with the original idea. In such a crowded research field as AI, it’s best to remain cautious about all things that have many proposed successors.

Further Reading: From time to time, I consult Sik-Ho Tsang’s list of papers he reviews here on Medium. Each section shows the leading ideas for each area over the years. It’s a bit sad how many of these papers might have seemed groundbreaking and are now completely forgotten. Back to Attention, as of 2025, the hottest attention-replacement candidate is the Sparse Attention by the DeepSeek team.

#4 Training NNs with Posits

Raposo, Gonçalo, Pedro Tomás, and Nuno Roma. “Positnn: Training deep neural networks with mixed low-precision posit.” ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

Taking a turn to the world of hardware and low-level optimization, some of the most important (but least sexy) advances in AI training are related to floating points. We went from boring floats to halves, then 8-bit and even 4-bit floats (FP4). The horsepower driving LLMs today are eightfold ponies.

The future of number formats goes hand-in-hand with matrix-matrix multiplication hardware. However, there can be much more to this topic than just halving bit-depth. This paper, for instance, explores an entirely new number format (posits) as a potential replacement for good old IEEE-754 floats. Can you imagine a future sans floats?
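As a small, hedged illustration of what is given up when bit-depth is halved, the snippet below inspects NumPy’s IEEE-754 float types and shows two classic half-precision effects. NumPy has no posit type, so posits themselves are not shown; the specific numbers used (2049, 4096 additions) are just convenient examples of mine.

```python
import numpy as np

# Each format trades range and precision for size.
for dtype in (np.float64, np.float32, np.float16):
    info = np.finfo(dtype)
    print(f"{dtype.__name__:8s} bits={info.bits:2d} "
          f"eps={float(info.eps):.3e} max={float(info.max):.3e}")

# float16 has an 11-bit significand, so not every integer above 2048 is representable.
as_half = np.float16(2049)
print(float(as_half))  # 2048.0

# Accumulating many small values in float16 stalls once the sum is large enough:
# 2048 + 1 rounds back to 2048, so the counter never gets past it.
total = np.float16(0)
for _ in range(4096):
    total = np.float16(total + np.float16(1))
print(float(total))  # 2048.0
```

This is one reason low-precision training recipes typically keep some accumulations in a wider format.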

Reason 1: While new algorithms take time to find widespread adoption, hardware improves consistently every year. All ships rise with the hardware tide.

Reason 2: It’s worth wondering how far we would be today if we didn’t have as many GPU improvements over the past ten years. For reference, the AlexNet authors broke all ImageNet records in 2012 using two high-end GTX 580 GPUs, a total of 3 TFLOPs. Nowadays, a mid-range GPU, such as an RTX 5060, boasts ~19 TFLOPs, about 6 times more.

Reason 3: Some technologies are so common that we take them for granted. All things can and should be improved; we don’t owe anything to floats (or even Neural Networks for that matter).

Further Reading: Since we’re mentioning hardware, it’s also a good time to talk about programming languages. If you haven’t been keeping up with the news, the Python team (especially Python’s creator) is focused on optimizing Python. However, optimization nowadays seems to be slang for rebuilding stuff in Rust. Last but not least, some hype was devoted to Mojo, an AI/speed-focused superset of Python; however, I barely see anyone talking about it today.

#5 AdderNet

Chen, Hanting, et al. “AdderNet: Do we really need multiplications in deep learning?.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

What if we didn’t do matrix multiplication at all? This paper goes a completely different route, showing it’s possible to have effective neural networks without matrix multiplication. The main idea is to replace convolutions with computing the L1 distance between the input and the sliding filters.
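A minimal single-channel sketch of that idea, assuming NumPy 1.20+ for `sliding_window_view`: the usual filter response multiplies each patch by the filter and sums, while the adder-style response is the negative L1 distance between patch and filter, which needs only subtractions, absolute values, and additions. The helper names are mine, and the real AdderNet layers add further pieces (multiple channels, batch normalization, a modified gradient) that are left out here.

```python
import numpy as np

def sliding_patches(x, k):
    """All k x k patches of a 2D array, shape (H-k+1, W-k+1, k, k)."""
    return np.lib.stride_tricks.sliding_window_view(x, (k, k))

def conv_response(x, w):
    """Standard cross-correlation: multiply-accumulate between each patch and the filter."""
    return (sliding_patches(x, w.shape[0]) * w).sum(axis=(-2, -1))

def adder_response(x, w):
    """Adder-style similarity: negative L1 distance between each patch and the filter."""
    return -np.abs(sliding_patches(x, w.shape[0]) - w).sum(axis=(-2, -1))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
w = x[2:5, 1:4].copy()          # use one patch of the input as the filter
r = adder_response(x, w)
# The response peaks (at exactly 0) where the patch equals the filter.
print(np.unravel_index(r.argmax(), r.shape))
```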

I like to think of this paper as the “alternate world” neural networks. In some parallel universe, NNs evolved based on addition, and amidst it all, someone proposed a multiplication-based model; however, it never got traction since all the tooling and hardware were neck deep in optimizing massive matrix addition and subtraction operators.

Reason 1: We easily forget there are still other algorithms out there we have yet to find, besides CNNs and Transformers. This paper shows that an addition-based neural network is possible. How cool is that?

Reason 2: A lot of our hardware and cloud infrastructure is tuned for matrix multiplication and neural networks. Can new models still compete? Can non-neural networks still make a comeback?

Further Reading: Many of you might not be familiar with what existed before NNs took over most fields. Most people know staples like Linear Regression, Decision Trees, and XGBoost. Before NNs became popular, Support Vector Machines were all the rage. It’s been a while since I last saw one. In this regard, a cool paper to read is Deep Learning is Not All You Need.

Support Vector Machines learn to separate two groups of points with the best separation line possible. By using the Kernel Trick, these points are cast into a higher-dimensional space, in which a better separation plane might be found, achieving a non-linear decision boundary while maintaining the linear formulation. It’s a great solution worth learning about. Source.
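The “trick” part can be checked numerically. In the sketch below, a degree-2 polynomial kernel on 2D points gives exactly the same inner product as an explicit mapping into 3D, without ever building the 3D vectors. The feature map `phi` and the sample points are illustrative choices of mine; an SVM would use such a kernel in place of every dot product.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D point."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, z):
    """Same inner product, computed without ever building the 3D features."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
explicit = float(np.dot(phi(x), phi(z)))
implicit = poly_kernel(x, z)
print(explicit, implicit)
```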

#6 Interpolation vs Extrapolation

Balestriero, Randall, Jerome Pesenti, and Yann LeCun. “Learning in high dimension always amounts to extrapolation.” arXiv preprint arXiv:2110.09485 (2021).

Some time ago, I used to think the big names in AI were visionaries or had very good educated guesses on the future of the field. This changed with this paper and all the debate that followed.

Back in 2021, Yann LeCun pushed this discussion about interpolation vs extrapolation, claiming that in high-dimensional spaces, such as those all neural networks operate in, what we call “learning” is data extrapolation. Right after publication, many renowned names joined in, some claiming this was nonsense, some that it was still interpolation, and some taking the extrapolation side.
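A quick way to get a feel for the claim is a crude numerical proxy. The paper treats a new sample as interpolation when it falls inside the convex hull of the training set; the sketch below checks the axis-aligned bounding box instead, which contains the hull, so the fraction it reports is only a lower bound on how often a sample lands outside the hull. The uniform data, sample counts, and function name are all assumptions of this sketch.

```python
import numpy as np

def fraction_outside_box(n_train, dim, n_test=2000, seed=0):
    """Share of test points with at least one coordinate outside the training range."""
    rng = np.random.default_rng(seed)
    train = rng.uniform(size=(n_train, dim))
    test = rng.uniform(size=(n_test, dim))
    lo, hi = train.min(axis=0), train.max(axis=0)
    outside = ((test < lo) | (test > hi)).any(axis=1)
    return float(outside.mean())

for dim in (1, 10, 100, 1000):
    print(dim, fraction_outside_box(n_train=1000, dim=dim))
```

With the training set size held fixed, the reported fraction grows with the dimension, which is the intuition behind the paper’s title.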

If you never heard about this discussion… it shows how pointless it really was. As far as I could see (and please write me if you think otherwise), no company changed course, no new extrapolation-aware model was devised, nor did it spark new relevant training techniques. It came and it went.

Reason 1: To be honest, you can just skip this one. I just needed to rant about this for my own peace of mind.

Reason 2: From a purely academic perspective, I consider this an interesting take on learning theory, which is indeed a cool topic.

Further Reading: Yoshua Bengio, Geoffrey Hinton, and Yann LeCun were awarded the 2018 Turing Award for their pioneering work on Deep Learning foundations. Back in 2023 or so, LeCun was focused on self-supervised learning, Hinton was concerned with Capsule Networks, and Bengio was working on Generative Flow Networks. By late 2025, LeCun moved towards world models while Hinton and Bengio moved towards AI Safety. If you are second-guessing your academic choices, keep in mind that even the so-called godfathers change gears.

#7 DINOv3 / Foundation Vision Models

Siméoni, Oriane, et al. “DINOv3.” arXiv preprint arXiv:2508.10104 (2025).

While the world of language processing has evolved to use large general models that work for every task (aka foundation models), the field of image processing is still working its way up to that. In this paper, we see the current iteration of the DINO model, a self-supervised image model designed to be the foundation for Vision.

Reason 1: Self-supervised pretraining is still evolving in other problem areas when compared to text, especially if done entirely within the problem domain (as opposed to adding text descriptions to help it).

Reason 2: Don’t read only language papers, even if your job is working with LLMs. Variety is key.

Reason 3: Language models can only go so far towards AGI. Vision is paramount for human-like intelligence.

Further Reading: Continuing on the Vision topic, it’s worth knowing about YOLO and the Segment-Anything Model. The former is a staple for object detection (but also boasts versions for other problems) while the latter is for image segmentation. Regarding image generation, I find it funny that a few years back we would all talk about GANs (generative adversarial networks), and nowadays it’s likely that many of you have never heard of one. I even wrote a list like this for GAN papers a couple of years ago.

#8 Small Language Models are the Future

Belcak, Peter, et al. “Small Language Models are the Future of Agentic AI.” arXiv preprint arXiv:2506.02153 (2025).

The field of “Generative AI” is quickly being rebranded to “Agentic AI”. As people try to figure out how to make money with that, they bleed VC money running behemoth models. In this paper, the authors argue that Small Language Models (< 10B params, by their definition) are the future for Agentic AI development.

In more detail, they argue that most subtasks executed in agentic solutions are repetitive, well-defined, and non-conversational. Therefore, LLMs are somewhat overkill. If you include fine-tuning, SLMs can easily become specialized agents, while LLMs thrive on open tasks.

Reason 1: What we call “large” language models today might just as well be the “small” of tomorrow. Learning about SLMs is future-proofing.

Reason 2: Many people claim AI today is heavily subsidized by VC money. In the near future, we might see a huge increase in AI costs. Using SLMs might be the only option for many businesses.

Reason 3: This is super easy to read. In fact, I think it’s the first time I’ve read a paper that so explicitly defends a thesis.

Further Reading: Smaller models are the only option for edge AI / low-latency execution. When applying AI to video streams, the model + post-processing needs to execute in less than 33 ms for a 30fps stream. You can’t roundtrip to a cloud or batch frames. Nowadays, there are several tools like Intel’s OpenVINO, NVIDIA’s TensorRT, or TensorFlow Lite for fast inference on limited hardware.
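The 33 ms figure is simply the frame interval, 1000 ms divided by the frame rate. A tiny sketch of that budget check follows; the function names and the example timings are hypothetical.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, for a stream at the given frame rate."""
    return 1000.0 / fps

def fits_realtime(model_ms, post_ms, fps=30):
    """True if inference plus post-processing fits within one frame interval."""
    return model_ms + post_ms <= frame_budget_ms(fps)

print(round(frame_budget_ms(30), 1))                       # 33.3
print(fits_realtime(model_ms=25.0, post_ms=5.0))           # True
print(fits_realtime(model_ms=25.0, post_ms=5.0, fps=60))   # False
```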

#9 The Lottery Ticket Hypothesis (2019)

Frankle, Jonathan, and Michael Carbin. “The lottery ticket hypothesis: Finding sparse, trainable neural networks.” arXiv preprint arXiv:1803.03635 (2018).

As a follow-up to small models, some authors have shown that we most likely aren’t training our networks’ parameters to their fullest potential. This is “humans only use 10% of their brains” applied to neural networks. In this literature, the Lottery Ticket Hypothesis is definitely one of the most intriguing papers I’ve seen.

Frankle et al. found that if you (1) train a big network, (2) prune all low-magnitude weights, (3) roll back the pruned network to its untrained state, and (4) retrain, you’ll get a better performing network. Putting it differently, what training does is uncover a subnetwork whose initial random parameters are aligned to solving the problem; all else is noise. By leveraging this subnetwork alone, we can surpass the original network performance. Unlike basic network pruning, this improves the result.
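The four steps can be sketched in a few lines. To keep it runnable without a deep learning framework, the toy below uses a logistic regression on synthetic data rather than a deep network, prunes once instead of iteratively, and makes no attempt to reproduce the paper’s accuracy gains; it only shows the mechanics of train, prune by magnitude, rewind to the original initialisation, and retrain under the mask. All names, sizes, and hyperparameters are assumptions of this sketch.

```python
import numpy as np

def train(w, X, y, mask, steps=300, lr=0.5):
    """Gradient descent on the logistic loss; masked-out weights stay at zero."""
    w = w * mask
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def accuracy(w, X, y):
    """Fraction of samples on the correct side of the decision boundary."""
    return float((((X @ w) > 0) == (y == 1)).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))
true_w = np.zeros(50)
true_w[:5] = rng.normal(size=5) * 3.0          # only 5 of the 50 features matter
y = (X @ true_w > 0).astype(float)

w_init = rng.normal(scale=0.1, size=50)        # remember the initialisation
dense = train(w_init, X, y, mask=np.ones(50))  # (1) train the dense model
threshold = np.sort(np.abs(dense))[-10]        # (2) keep the 10 largest-magnitude weights
mask = (np.abs(dense) >= threshold).astype(float)
ticket = train(w_init, X, y, mask=mask)        # (3) rewind to w_init, (4) retrain with the mask
print(accuracy(dense, X, y), accuracy(ticket, X, y), int(mask.sum()))
```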

Reason #1: We’re accustomed to “bigger models are better but slower” while “small models are dumb but fast”. Maybe we’re the dumb ones who insist on big models always.

Reason #2: An open question is how underutilized our parameters are. Likewise, how can we use our weights to their fullest? Or even, is it possible to measure a NN’s learning potential?

Reason #3: How many times have you cared about how your model parameters were initialized before training?

Further Reading: While this paper is from 2018, there is a 2024 survey on the hypothesis. On a contrasting note, “The Role of Over-Parameterization in Machine Learning — the Good, the Bad, the Ugly (2024)” discusses how over-parametrization is what really powers NNs. On the more practical side, this survey covers the topic of Knowledge Distillation, using a big network to train a smaller one to perform as close to it as possible.
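For the curious, the core of knowledge distillation fits in one loss function. The sketch below follows the commonly used formulation of a KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature; it is a generic illustration rather than the recipe of any particular paper in the survey, and the logits used are made up.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis, with an optional temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, averaged over the batch."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)

teacher = np.array([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]])
student = np.array([[1.0, 0.5, -0.5], [0.0, 1.0, 0.5]])
print(distillation_loss(student, teacher))
print(distillation_loss(teacher, teacher))  # a student that matches the teacher has zero loss
```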

#10 AlexNet (2012)

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems. 2012.

Can you believe all this Neural Network content we see today really started just 13 years ago? Before that, NNs were somewhere in between a joke and a failed promise. If you wanted a good model, you’d use SVMs or a bunch of hand-engineered tricks.

In 2012, the authors proposed the use of GPUs to train a large Convolutional Neural Network (CNN) for the ImageNet challenge. To everyone’s surprise, they won first place, with a ~15% Top-5 error rate, against ~26% for the second place, which used state-of-the-art image processing techniques.

Reason #1: While most of us know AlexNet’s historical significance, not everyone knows which of the techniques we use today were already present before the boom. You might be surprised by how familiar many of the concepts introduced in the paper are, such as dropout and ReLU.

Reason #2: The proposed network had 60 million weights, complete insanity by 2012 standards. Nowadays, trillion-parameter LLMs are around the corner. Reading the AlexNet paper gives us a great deal of insight into how things have evolved since then.
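As a reminder of how small the two ideas named above are, here is a NumPy sketch of ReLU and dropout. One caveat: this uses the “inverted” dropout common in today’s frameworks, which rescales during training, whereas the AlexNet paper instead scales the outputs at test time. The function signatures are my own.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative activations are clamped to zero."""
    return np.maximum(x, 0.0)

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of activations and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = relu(np.array([-1.0, 0.0, 2.0]))
print(activations)
dropped = dropout(np.ones(1000), rate=0.5, rng=rng)
print(np.unique(dropped))  # surviving units are scaled up to 2.0, the rest are 0.0
```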

Further Reading: Following the history of ImageNet champions, you can read the ZF Net, VGG, Inception-v1, and ResNet papers. This last one achieved super-human performance, solving the challenge. After it, other competitions took over the researchers’ attention. Nowadays, ImageNet is mainly used to validate radical new architectures.

The original portrayal of the AlexNet architecture. The top and bottom halves are processed by GPU 1 and 2, respectively. An early form of model parallelism. Source: The AlexNet Paper

That is all for now. Feel free to comment or connect with me if you have any questions about this article or the papers. Writing such lists is A LOT OF WORK. If this was a rewarding read for you, please be kind and share it among your peers. Thanks!
