Authors: Antoni Olbrysz, Karol Struniawski, Tomasz Wierzbicki
Table of Contents
- Introduction
- New Dataset of Pollen Images
- Extraction of Individual Pollen Images
- Classification of Individual Pollen Images
- Conclusions
- Acknowledgments
1. Introduction
Pollen classification is a fascinating area of visual image recognition, with a broad range of use cases across ecology and biotechnology, such as studies of plant populations, climate change, and pollen structure. Despite this, the subject is relatively unexplored: few datasets of pollen images have been compiled, and those that exist are often lackluster or otherwise insufficient for training a proper visual classifier or object detector, especially for images containing mixtures of various pollens. Besides providing a sophisticated visual identification model, our project aims to fill this gap with a custom dataset. Visual pollen classification is often difficult to solve without machine vision, as even modern biologists are often unable to differentiate between pollen of different plant species based on images alone. This makes quickly and efficiently recognising harvested pollen extremely challenging when the pollen's source is unknown beforehand.
1.1 Available Datasets of Pollen Images
This section highlights the parameters of several freely available datasets and compares them to the properties of our custom set.
Dataset 1
Link: https://www.kaggle.com/datasets/emresebatiyolal/pollen-image-dataset
Number of classes: 193
Images per class: 1-16
Image quality: Separated, clear images, sometimes with text labels
Image colour: Varied
Notes: The dataset appears to be composed of incongruent images taken from multiple sources. While broad in classes, each contains only a few photos, insufficient for training any image detection model.
Dataset 2
Link: https://www.kaggle.com/datasets/andrewmvd/pollen-grain-image-classification
Number of classes: 23
Images per class: 35 (one class has 20)
Image quality: Well separated, slightly blurry, no text on images
Image colour: Uncoloured, consistent
Notes: A localised, well-prepared dataset for the classification of Brazilian Savannah pollen. The image source is consistent, yet the number of images per class may pose issues when aiming for high accuracy.
Dataset 3
Link: https://www.kaggle.com/datasets/nataliakhanzhina/pollen20ldet
Number of classes: 20
Images per class: Very large
Image quality: Clear images, with both separated and overlapping pollen grains
Image colour: Dyed, consistent
Notes: A vast number of well-labelled, consistent, high-quality images makes this the strongest dataset of the three. However, the dye colouring may be an issue in specific applications. Additionally, the magnification and the tendency of the pollen grains to overlap may pose problems in mixed-pollen scenarios.
2. New Dataset of Pollen Images
Our dataset is a collection of high-quality microscope images of four different classes of pollen belonging to common fruit crops: the European gooseberry, the haskap berry, the blackcurrant, and the shadbush. These plant species have not been part of any previous dataset, so ours contributes new data towards visual pollen classification.
Each class contains 200 images of multiple grains of pollen, all without dye. The dataset was obtained in collaboration with the National Institute of Horticultural Research in Skierniewice, Poland.
Number of classes: 5 (4 pollens + mixed)
Images per class: ~200
Image quality: Clear images, each containing multiple pollen grains; mixed-pollen images present
Image colour: Undyed, consistent
Our dataset focuses on locally available pollens, class balance, and an abundance of images to train on without added dye, which may make the classifier unsuitable for some tasks. Additionally, our proposed solution includes images with mixtures of different pollen types, aiding the training of detection models for field-collection applications. Example images from the dataset are shown in Figures 1-4.
[Figures 1-4: Example images of the four pollen classes]
The full dataset is available from the corresponding author on reasonable request. The data acquisition steps, comprising sample preparation and the taking of microscopic images, were performed by Professor Agnieszka Marasek-Ciołakowska and Ms. Aleksandra Machlańska from the Department of Applied Biology at the National Institute of Horticultural Research, for which we are very grateful. Their efforts have proven invaluable to the success of our project.
3. Extraction of Individual Pollen Images
To train various models to recognise pollen, we first extracted images of individual pollen grains from the photographs in our dataset. Each of these photographs contained multiple pollen grains, as well as other lifeforms and contaminants, making identifying the pollen species far harder. We used YOLOv12, a state-of-the-art attention-centric real-time object detection model available through the Ultralytics framework.
3.1 Fine-Tuning YOLOv12
Thanks to YOLOv12's innovative design, it can be trained even on tiny datasets, which we experienced firsthand. To prepare our own dataset, we manually labelled the pollen locations on ten images from each of the four classes using CVAT, later exporting the labels into .txt files corresponding to individual images. Then, we organised our data into a YOLOv12-compatible format: we divided it into a training set (7 image-label pairs per class, 28 in total) and a validation set (3 image-label pairs per class, 12 in total), and added a .yaml file pointing towards our dataset. Notably, the dataset really was very small. We downloaded the model weights (YOLO12s) from the YOLOv12 website and started training. A resulting image in prediction mode, with detected individual pollen grains and the confidence overlay, is shown in Fig. 5.
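As a rough illustration, a minimal version of this setup with the Ultralytics API might look like the following sketch; the paths, class-name spellings, epoch count, and image size are our assumptions, not the exact values used.

```python
from pathlib import Path

from ultralytics import YOLO

# Hypothetical dataset layout: CVAT-exported .txt labels next to their images,
# split into train/ (28 pairs) and val/ (12 pairs).
Path("pollen.yaml").write_text(
    "path: datasets/pollen\n"
    "train: images/train\n"
    "val: images/val\n"
    "names: [gooseberry, haskap_berry, blackcurrant, shadbush]\n"
)

# Fine-tune the small YOLO12 variant on the hand-labelled images.
model = YOLO("yolo12s.pt")
model.train(data="pollen.yaml", epochs=100, imgsz=640)
```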
[Figure 5: YOLOv12 prediction with detected individual pollen grains and the confidence overlay]
The model proved to detect pollen grains with very high accuracy, but there was one more thing to consider: the model's confidence. For every detected pollen grain, the model also outputs a value expressing how certain it is of its prediction. We had to decide whether to use a lower confidence threshold (more images, but a higher risk of malformed or non-pollen crops) or a higher one (fewer images, but a lower chance of non-pollen crops). We ultimately settled on trying out two thresholds, 0.8 and 0.9, to evaluate which one would work better when training classification models.
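In Ultralytics terms, the threshold is simply the `conf` argument at prediction time. A sketch under the same assumptions as above (the weights path and source folder are hypothetical):

```python
from ultralytics import YOLO

# Load the fine-tuned detector (assumed default Ultralytics output path).
model = YOLO("runs/detect/train/weights/best.pt")

# Run detection once per candidate threshold; save_crop writes every
# detection out as an individual cropped image.
for conf in (0.8, 0.9):
    model.predict(source="images/gooseberry", conf=conf, save_crop=True)
```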
3.2 Exporting the Individual-Pollen Datasets
To build the individual-pollen datasets, we ran the model's prediction on all the class-specific images in our dataset. This worked very well, but after exporting we encountered another issue: some crops were cut off, even at the higher threshold. As a result, we added another step before exporting the individual pollen grains: we eliminated images with a disproportionate aspect ratio (see the example in Fig. 6). Specifically, dividing the smaller dimension by the larger one had to yield at least 0.8.
[Figure 6: Example of a cut-off pollen crop rejected by the aspect-ratio filter]
Then, we resized all the images to 224×224 pixels, the standard input size for deep learning models.
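A minimal sketch of this filtering-and-resizing step; the directory layout and constant name are ours:

```python
from pathlib import Path

from PIL import Image

MIN_ASPECT = 0.8  # smaller dimension divided by larger must reach this value

def keep_crop(img: Image.Image) -> bool:
    """Reject disproportionate (likely cut-off) crops."""
    w, h = img.size
    return min(w, h) / max(w, h) >= MIN_ASPECT

src, dst = Path("crops/gooseberry"), Path("dataset/gooseberry")
dst.mkdir(parents=True, exist_ok=True)
for path in src.glob("*.jpg"):
    img = Image.open(path)
    if keep_crop(img):
        # 224x224 is the standard input size for the classifiers used later.
        img.resize((224, 224)).save(dst / path.name)
```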
3.3 Individual-Pollen Datasets: A Short Analysis
We ended up with two datasets, one made with a confidence threshold of 0.8 and the other with 0.9:
- 0.8 Threshold:
- gooseberry — 7788 images
- haskap berry — 3582 images
- blackcurrant — 4637 images
- shadbush — 4140 images
Total — 20147 images
- 0.9 Threshold:
- gooseberry — 2301 images
- haskap berry — 2912 images
- blackcurrant — 2438 images
- shadbush — 1432 images
Total — 9083 images
A quick look at the numbers shows that the 0.9-threshold dataset is less than half the size of the 0.8-threshold one. Neither dataset is balanced: the 0.8 one because of gooseberry and the 0.9 one because of shadbush.
YOLOv12 was an effective tool for segmenting our images into two single-pollen image datasets, even though we encountered some difficulties. The newly created datasets may be unbalanced, yet their size should compensate for this drawback, particularly since every class is extensively represented. They hold plenty of potential for future training of classification models, but we would have to see for ourselves.
4. Classification of Individual Pollen Images
4.1 An Overview of Model Evaluation Metrics
To properly approach training models, whether classical ones working on statistical features or more complex approaches such as convolutional neural networks or vision transformers, one must choose metrics to measure performance. Over the years, many methods have been devised for this task, from statistical measures such as F1, precision, and recall, to more visual ones such as Grad-CAM, which allow deeper insight into a model's inner workings. This section covers the grading methods used for our models, without going into unnecessary detail.
Recall
Recall is the ratio of correctly identified images of a class to all images that actually belong to that class (see Eq. 1). It measures what percentage of a class's true members the model finds. Because it works on separate classes, it is helpful on both balanced and imbalanced datasets.

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Eq. 1: Formula for recall, where TP and FN are the counts of true positives and false negatives.
Precision
As opposed to recall, precision is the proportion of correctly classified images among all images the model assigned to the class (see Eq. 2). It measures what percentage of the model's guesses for a class were correct. This metric behaves similarly to recall.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Eq. 2: Formula for precision, where FP is the count of false positives.
F1 Rating
The F1 score is simply the harmonic mean of precision and recall (see Eq. 3). It combines the two into one concise measurement, so it still performs excellently even on unbalanced datasets. For example, with precision 0.75 and recall 0.60, F1 = 2 · (0.75 · 0.60) / (0.75 + 0.60) ≈ 0.667.

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Eq. 3: Formula for the F1 score.
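In practice, these three metrics rarely need to be computed by hand. A scikit-learn sketch with made-up toy labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy example: true vs. predicted labels for six images.
y_true = ["gooseberry", "haskap", "blackcurrant", "shadbush", "gooseberry", "haskap"]
y_pred = ["gooseberry", "haskap", "gooseberry", "shadbush", "gooseberry", "shadbush"]

# Macro averaging computes each metric per class and then averages the classes,
# so rare classes are not drowned out on imbalanced datasets.
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(recall_score(y_true, y_pred, average="macro", zero_division=0))
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```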
Confusion Matrix
The confusion matrix is a visual measure comparing the number of predictions made for each class against the actual number of images in that class. It helps illustrate the errors made by the model, which may struggle only with specific classes (see Fig. 7).

[Figure 7: Example confusion matrix]
Grad-CAM
Grad-CAM is a method for inspecting CNN behaviour that visualises which areas of the image influence the prediction. To do this, it computes the gradients flowing into one convolutional layer and derives an activation map that is visually layered on top of the image. It greatly aids in understanding and explaining the model's "reasons" for labelling a particular image as a specific class (see the example in Fig. 8).

[Figure 8: Example Grad-CAM visualisation]
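As an illustration, one way to produce such an overlay with the `pytorch-grad-cam` package; the layer choice and file names are assumptions, and in practice the fine-tuned pollen model would replace the stock ResNet:

```python
import numpy as np
import torch
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from torchvision import models, transforms

model = models.resnet50(weights="IMAGENET1K_V2").eval()
target_layers = [model.layer4[-1]]  # last convolutional block, a common choice

img = np.array(Image.open("pollen.jpg").resize((224, 224))) / 255.0
input_tensor = transforms.ToTensor()(img).unsqueeze(0).float()

# Compute the class-activation map (defaults to the top predicted class)
# and blend it over the original image as a heatmap.
cam = GradCAM(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=input_tensor)[0]
overlay = show_cam_on_image(img.astype(np.float32), heatmap, use_rgb=True)
Image.fromarray(overlay).save("pollen_gradcam.jpg")
```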
These metrics are only a few from a vast sea of measurements and visualisation methods used in machine learning. Yet, they have proven sufficient for measuring the performance of our models. Further metrics will be brought up accordingly as new classifiers are used and introduced in the project.
4.2 Individual Pollen Classification with Standard Models
With our images preprocessed, we could move on to the next stage: classifying individual pollen grains into species. We tried three approaches: standard, simple classifiers based on features extracted from images; Convolutional Neural Networks; and Vision Transformers. This section outlines our work on standard models, including the kNN classifier, SVMs, MLPs, and Random Forests.
Feature extraction
To make our classifiers work, we first had to obtain features on which they could base their predictions. We settled on two main types of features. One was statistical measures of the pixel values in each colour channel (one from the RGB model) of an image, such as the mean, standard deviation, median, quantiles, skew, and kurtosis; we extracted these for every colour layer. The other was GLCM (Grey Level Co-occurrence Matrix) features: contrast, dissimilarity, homogeneity, energy, and correlation. These were obtained from grayscale-converted images, and we extracted each at different angles. Every single image yielded 21 statistical features and 20 GLCM-based features, which amounts to 41 features per image.
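A sketch of this extraction, assuming 7 statistics per RGB channel (with the 25% and 75% quantiles) and the 5 GLCM properties at 4 angles, which is our reading of the counts above:

```python
import numpy as np
from scipy.stats import kurtosis, skew
from skimage.color import rgb2gray
from skimage.feature import graycomatrix, graycoprops

GLCM_PROPS = ("contrast", "dissimilarity", "homogeneity", "energy", "correlation")
ANGLES = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]

def extract_features(rgb: np.ndarray) -> np.ndarray:
    """Return 21 colour statistics + 20 GLCM features for one RGB image."""
    stats = []
    for ch in range(3):  # 7 statistics per colour channel
        layer = rgb[..., ch].ravel()
        stats += [layer.mean(), layer.std(), np.median(layer),
                  np.quantile(layer, 0.25), np.quantile(layer, 0.75),
                  skew(layer), kurtosis(layer)]
    gray = (rgb2gray(rgb) * 255).astype(np.uint8)
    glcm = graycomatrix(gray, distances=[1], angles=ANGLES,
                        symmetric=True, normed=True)
    # 5 properties x 4 angles = 20 texture features
    texture = [graycoprops(glcm, p)[0, a] for p in GLCM_PROPS for a in range(4)]
    return np.array(stats + texture)
```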
k-Nearest-Neighbors (kNN)
kNN is a classifier that uses a spatial representation of the data, predicting a sample's label from the labels of the k nearest neighbours of its feature vector. This classifier is fast, yet other methods outperform it.
kNN Metrics:
0.8 Dataset:
F1: 0.6454
Precision: 0.6734
Recall: 0.6441
0.9 Dataset:
F1: 0.6961
Precision: 0.7197
Recall: 0.7151
Support Vector Machine (SVM)
Similar to kNN, the SVM represents data as points in a multi-dimensional space. However, instead of finding the nearest neighbours, it algorithmically tries to separate the data with a hyperplane. This yields better results than kNN, but introduces randomness and is still outclassed by other solutions.
SVM Metrics:
0.8 Dataset:
F1: 0.6952
Precision: 0.7601
Recall: 0.7025
0.9 Dataset:
F1: 0.8556
Precision: 0.8687
Recall: 0.8597
Multi-Layered Perceptron (MLP)
The Multi-Layered Perceptron is a model inspired by the human brain and its neurons. It passes inputs through a network of layers of neurons with their own individual weights, which are altered during training. When well-optimised, this model can sometimes achieve great results for a conventional classifier. However, pollen recognition was not one of those cases: it performed poorly compared to other solutions and was inconsistent.
MLP Metrics:
0.8 Dataset:
F1: 0.8131
Precision: 0.8171
Recall: 0.8173
0.9 Dataset:
F1: 0.7841
Precision: 0.8095
Recall: 0.7940
Random Forest
The random forest is a model well known for its explainability: it is based on decision trees, which classify data using thresholds that humans can analyse far more easily than, for instance, weights in neural networks. The Random Forest performed fairly well and consistently; we found that 200 trees was optimal. However, it was outclassed by more complex classifiers.
RF Metrics:
0.8 Dataset:
F1: 0.8211
Precision: 0.8210
Recall: 0.8233
0.9 Dataset:
F1: 0.8150
Precision: 0.8202
Recall: 0.8216
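All four classical models were trained and scored on the same extracted features. A condensed scikit-learn sketch of this comparison; the split ratio and all hyperparameters other than the 200 trees are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (n_images, 41) matrix from extract_features; y: species labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "kNN": KNeighborsClassifier(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=200),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, classification_report(y_test, clf.predict(X_test)), sep="\n")
```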
The classical models exhibited varied degrees of performance: some performed worse than expected, while others delivered fairly good metrics. However, this is not yet the end. We still had advanced deep learning models to try out, namely Convolutional Neural Networks and Vision Transformers, which we anticipated would perform considerably better.
4.3 Individual Pollen Classification with Convolutional Neural Networks
Classical models such as MLPs, Random Forests, and SVMs yielded mediocre to fairly good results in individual pollen classification. The next approach we decided to try was Convolutional Neural Networks (CNNs): models that learn features by processing images directly and are known for their effectiveness.
Instead of training CNNs from scratch, we used a transfer learning approach: we took pre-trained models, specifically ResNet50 and ResNet152, and fine-tuned them on our dataset. This approach makes training considerably faster and less resource-demanding. It also makes classification much easier, since the models have already been professionally trained on large datasets. Before training, we also had to normalise the images.
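A minimal PyTorch sketch of this setup; the ImageNet normalisation constants are standard for these weights, while the optimiser and learning rate are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# ImageNet normalisation, matching the pre-trained weights; applied to
# every crop by the (omitted) DataLoader.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Load pre-trained ResNet50 and swap the head for a 4-class layer.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 4)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
# ...standard training loop over DataLoader batches of normalised crops...
```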
In terms of metrics, in addition to standard ones such as the F1 score, precision, and recall, we used Grad-CAM, a method that attempts to highlight the areas of an image that influenced a model's prediction the most. We also included confusion matrices to see whether our CNNs struggle with any particular class.
ResNet50
ResNet50 is a CNN architecture developed at Microsoft Research Asia in 2015, which was a significant step towards creating far deeper and more efficient neural networks. It is a residual network (hence the name ResNet) that uses skip connections to allow data to flow directly between non-adjacent layers, which mitigates the vanishing gradient problem.
We expected this model to perform worse than ResNet152. Our expectations were quickly subverted, as the model delivered predictions on the same level as ResNet152 on both datasets, as shown in the metrics below and the confusion matrices (see Fig. 9 and Fig. 10), as well as the Grad-CAM visualisation (see Fig. 11).
ResNet50 Metrics:
0.8 Dataset:
F1: 0.98
Precision: 0.98
Recall: 0.98
0.9 Dataset:
F1: 0.99
Precision: 0.99
Recall: 0.99
[Figures 9-11: ResNet50 confusion matrices for the 0.8 and 0.9 datasets, and a Grad-CAM visualisation]
Regarding Grad-CAM, it did not provide any worthwhile insights into the model's inner workings: the highlighted zones included the background and seemingly random spots. Given the very high accuracy it achieves, the network appears to pick up patterns undetectable by the human eye.
ResNet152
Also a development of Microsoft's researchers, ResNet152 is a residual network and CNN architecture of significant depth, with deep learning capabilities far exceeding those of ResNet50.
Consequently, our expectations for this model were higher than for ResNet50. We were disappointed to see that it performed merely on par with it, although it still performed excellently (see Fig. 12 and Fig. 13 for confusion matrices and Fig. 14 for Grad-CAM visualisations).
ResNet152 Metrics:
0.8 Dataset:
F1: 0.98
Precision: 0.98
Recall: 0.98
0.9 Dataset:
F1: 0.99
Precision: 0.99
Recall: 0.99
[Figures 12-14: ResNet152 confusion matrices for the 0.8 and 0.9 datasets, and Grad-CAM visualisations]
Grad-CAM was not helpful for ResNet152 either; we experienced the enigmatic nature of deep learning models, which achieve high accuracy but cannot be explained easily.
We were surprised that the more complex ResNet152 did not outperform ResNet50 on the 0.9 dataset. Both achieved the highest metrics of any models we had tried so far: they trumped the classical models, with the gap between the best classical model and the CNNs exceeding 10 percentage points. It was time to test the most innovative model: the Vision Transformer.
4.4 Individual Pollen Classification with Vision Transformers
For individual pollen classification, we first tried simple models, which provided varied levels of performance, from insufficient to satisfactory. Then, we implemented convolutional neural networks, which completely trumped their performance. Now it was time to try out the innovative model known as the Vision Transformer.
Transformers in general originate from the well-known 2017 paper "Attention Is All You Need" by researchers at Google, but they were initially used primarily for natural language processing. In 2020, the transformer architecture was applied to computer vision, yielding the ViT (Vision Transformer). Its excellent performance marked the beginning of the end of Convolutional Neural Networks' reign in the area.
Our approach here was similar to the one we used when training CNNs. We imported a pre-trained model, vit-base-patch16-224-in21k, trained on ImageNet-21k. Then, we normalised our dataset images, fine-tuned the model, and noted down the resulting metrics and confusion matrices (see Fig. 15 and Fig. 16).
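A sketch of loading and fine-tuning this checkpoint with Hugging Face Transformers; the Trainer arguments are assumptions, and the train/validation datasets are prepared elsewhere:

```python
from transformers import (Trainer, TrainingArguments,
                          ViTForImageClassification, ViTImageProcessor)

ckpt = "google/vit-base-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(ckpt)  # resizes and normalises images
model = ViTForImageClassification.from_pretrained(ckpt, num_labels=4)

args = TrainingArguments(output_dir="vit-pollen", num_train_epochs=3,
                         per_device_train_batch_size=16)
# train_ds / val_ds: datasets of processor outputs plus integer labels.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```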
vit-base-patch16-224-in21k results:
0.8 Dataset:
F1: 0.98
Precision: 0.98
Recall: 0.98
0.9 Dataset:
F1: 1.00
Precision: 1.00
Recall: 1.00
[Figures 15-16: Vision Transformer confusion matrices for the 0.8 and 0.9 datasets]
On the 0.8 dataset, the Vision Transformer delivered a level of performance that did not exceed that of the Residual Networks, and it struggled with similar problems; for instance, it misclassified gooseberry as blackcurrant. However, on the 0.9 dataset, it achieved a nearly perfect score. We witnessed innovation overcome more dated solutions, which prompted us to save the model and designate it as our model of choice for more demanding tasks.
4.5 Comparison of Metrics for Various Models
For our pollen classification tasks, we used many models: traditional ones, including kNN, SVM, MLP, and Random Forest; Convolutional Neural Networks (ResNet50 and ResNet152); and a Vision Transformer (vit-base-patch16-224-in21k). This section serves as an overview and a performance ranking (see Tab. 1).
[Table 1: Metric comparison across all models]
Ranking
6. kNN (k-Nearest-Neighbors)
The simplest classifier. As expected, it trained quickly but performed the worst.
5. MLP (Multi-Layered Perceptron)
The mannequin’s structure is predicated on the human nervous system. The MLP was outperformed by different normal fashions, which we didn’t anticipate.
4. RF (Random Forest)
The Random Forest classifier performed with the highest consistency of all models, but its metrics were far from ideal.
3. SVM (Support Vector Machine)
The unexpected winner among the conventional classifiers. Its performance varied between runs, but on the 0.9 dataset it yielded good results for a conventional classifier.
2. ResNet50 and ResNet152 (Residual Networks)
Both architectures achieved the same high results thanks to their complexity, far exceeding the capabilities of any standard classifier on both datasets.
1. ViT (Vision Transformer)
The most innovative solution we tried trumped the classical models and caught up with the Residual Networks on the 0.8 dataset. Yet the true challenge was the 0.9 dataset, where the CNNs had reached a seemingly insurmountable accuracy of 0.99. To our surprise, the Vision Transformer's results were so high that they rounded to 1.00, a perfect score. Its results are a true testament to the power of innovation.
Note: the classification report rounded the model's metrics; they are not exactly equal to 1, as that would mean that every image without exception was classified correctly. We settled for this value because only a marginal 5 images (0.27%) were misclassified.
By comparing different classifiers in the field of visual pollen recognition, we got to experience the history and evolution of machine learning first-hand. We tested models of varying degrees of innovation, from the simplest classifiers up to the attention-based Vision Transformer, and observed how their results improved along with their novelty. Based on this comparison, we unanimously elected the ViT as our model of choice for working with pollen.
5. Conclusions
The task of visually classifying pollen, which has eluded biologists around the world and lain outside the grasp of human ability, has finally been proven possible thanks to the power of machine learning. The models presented in our publication have all shown potential to classify the pollens, with varying degrees of accuracy. Some, such as the CNNs and the Vision Transformer, have reached near perfection, recognising pollen with precision unseen in humans.
To better illustrate why this accomplishment is so impressive, see Fig. 17.

[Figure 17: Sample pollen images from the four classes]

It is highly likely that most readers cannot correctly classify these images into the four classes mentioned previously. Our models, on the other hand, have proven able to recognise them with almost perfect accuracy, reaching a top F1 score of over 99%.
One may wonder what such a classifier could be used for, or why it was trained in the first place. The applications of this technique are numerous, from monitoring plant populations to measuring airborne allergen levels on a local scale. We built the models not only to provide a tool for palynologists to classify the pollen they collect, but also to offer a research platform for other machine learning enthusiasts to build upon, and to demonstrate the ever-expanding applications of this field.
On that note, this is the end of this publication. We sincerely hope the reader finds this information helpful in their research endeavours and that our articles have sparked ideas for projects using this technology.
6. Acknowledgments
We’re very grateful to Professor Agnieszka Marasek-Ciołakowska from the Nationwide Institute of Horticultural Analysis, Skierniewice, Poland, for making ready samples and taking microscopic pictures of them utilizing the Keyence VHX-5000 microscope. The authors possess the whole, non-restricted copyrights to the dataset used on this analysis and all pictures used inside this text.