MSc Group Project
Introduction
Guide Dogs UK has long used a simple estimate of 7 months for the interval between breeding seasons. However, the last 10 years of breeding data show significant variation between individual dogs. This project uses data analysis techniques to produce a machine learning model that uses each bitch’s details to give a personalised interval estimate.
Data Overview
There are 4693 instances in the dataset, each with 22 features, such as dog ID, date of birth, breed name, colour, pedigree IDs, age, weight and diet. The table below provides a fabricated example.
Data ID | Dog ID | Name | Date of Birth | Breed Name | Colour | Pregnant Last Season | Pedigree (Sire) | Pedigree (Dam) | Season Start | Age at season | Time from previous season | Diet | BCS | Weight | HR Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
23 | 54321 | Doggo | 23/03/2003 | Labrador | Black | 0 | Mike (12345) | Kim (12435) | 11/11/2007 | 4.641 | 233 | Brand3 Light | 4 | 27 | Mating season ? On medication X. |
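As a minimal sketch of how the two date-derived columns in the fabricated row could be computed: the function names below are hypothetical, and dividing by 365 days per year is an assumption (it happens to reproduce the 4.641 shown above; the project's actual convention is not stated). The previous season date is also hypothetical, chosen only so that the interval matches the example row.

```python
from datetime import date


def age_at_season(date_of_birth, season_start, days_per_year=365):
    """Age in years at the start of a season (365 days/year is an assumption)."""
    return round((season_start - date_of_birth).days / days_per_year, 3)


def time_from_previous_season(previous_start, season_start):
    """Interval in days between two consecutive season start dates."""
    return (season_start - previous_start).days


# Values taken from the fabricated example row.
dob = date(2003, 3, 23)
season = date(2007, 11, 11)
age = age_at_season(dob, season)

# Hypothetical previous season start, not given in the table.
previous_season = date(2007, 3, 23)
tfps = time_from_previous_season(previous_season, season)

print(age, tfps)
```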
The feature this project aims to predict is Time from previous season (TFPS), measured in days. An exploratory analysis of the data informed the design of our models. Key points:
- Mean TFPS ≈ 216.7 days ≈ 7 months.
- TFPS range: 14 to 781 days.
- Average TFPS by breed:
  - Min: German Shepherd at 174 days
  - Max: Labrador x Golden Retriever* at 250 days
- Highest correlation coefficient with TFPS: weight, at -0.12
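The summary statistics listed above can be sketched with the Python standard library. The records here are hypothetical toy values, not project data, and the helper `pearson` is written out only for illustration; the project's own tooling is not described in this summary.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

# Hypothetical toy records: (breed, weight in kg, TFPS in days). Not real data.
records = [
    ("Labrador", 27.0, 233),
    ("Labrador", 29.5, 210),
    ("German Shepherd", 31.0, 180),
    ("German Shepherd", 33.0, 170),
    ("Golden Retriever", 28.0, 225),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


weights = [r[1] for r in records]
tfps = [r[2] for r in records]

overall_mean = mean(tfps)              # mean interval in days
tfps_range = (min(tfps), max(tfps))    # shortest and longest interval

# Group intervals by breed, then average within each group.
by_breed = defaultdict(list)
for breed, _, days in records:
    by_breed[breed].append(days)
breed_means = {breed: mean(values) for breed, values in by_breed.items()}

weight_corr = pearson(weights, tfps)   # correlation of weight with TFPS

print(overall_mean, tfps_range, breed_means, round(weight_corr, 2))
```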
Methodology
Results
Model | Mean AE (days) | Median AE (days) | Max AE (days) | RMSE (days) | Explained Variance | |
---|---|---|---|---|---|---|
Baseline | 41.517590 | 29.400500 | 350.400500 | 59.602455 | -0.012004 | 0.000000 |
Linear Regression | 27.666271 | 18.391864 | 304.292438 | 43.263306 | 0.466796 | 0.468531 |
Support Vector Regression | 30.706676 | 20.025322 | 345.603623 | 48.107523 | 0.340705 | 0.374717 |
Gaussian Process Regression | 27.939698 | 17.446348 | 311.712937 | 44.641516 | 0.432283 | 0.436579 |
Bagging K-NN Regression | 32.828029 | 21.709373 | 332.046778 | 50.697218 | 0.267813 | 0.277047 |
Random Forest Regression | 26.452921 | 17.948695 | 282.529606 | 40.515314 | 0.532381 | 0.536497 |
AdaBoost + Linear Regression | 35.831447 | 28.038653 | 290.827322 | 49.533579 | 0.301038 | 0.320889 |
AdaBoost + Decision Tree Regression | 28.560564 | 19.391304 | 289.521739 | 43.195489 | 0.468466 | 0.469669 |
Gradient Boosting Regression | 28.001512 | 20.156420 | 295.187005 | 41.264630 | 0.514924 | 0.517879 |
Neural Network | 26.541662 | 17.421860 | 290.635681 | 41.589225 | 0.507262 | 0.507433 |
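For readers unfamiliar with the named metrics in the table, the following is a minimal sketch of their standard definitions using only the Python standard library. The function name `regression_metrics`, the toy intervals, and the constant 213-day prediction are all hypothetical; this is not the project's evaluation code, and the unlabelled final column of the table is not covered.

```python
from math import sqrt
from statistics import mean, median, pvariance


def regression_metrics(y_true, y_pred):
    """Standard point-estimate error metrics for paired true/predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    return {
        "mean_ae": mean(abs_errors),
        "median_ae": median(abs_errors),
        "max_ae": max(abs_errors),
        "rmse": sqrt(mean([e * e for e in errors])),
        # Explained variance = 1 - Var(y - y_hat) / Var(y)
        "explained_variance": 1 - pvariance(errors) / pvariance(y_true),
    }


# Hypothetical intervals in days and a hypothetical constant prediction.
y_true = [200, 220, 250, 180]
y_pred = [213] * len(y_true)

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Mean, median and maximum absolute error and RMSE are all in the units of the target (days here), while explained variance is unitless.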
Conclusion
The initial analysis in this work confirmed the currently accepted 7-month average for oestrus intervals in domesticated bitches. It also found that any model capable of capturing the full variation around this average would need to be complex and multidimensional in order to give accurate predictions of future interval times. Over the course of this study, machine learning models were successfully built to predict bitches’ oestrus intervals. All point estimation models developed reduced error significantly relative to the baseline, despite considerable noise in the data. Of the models tested, the random forest and the neural network developed for this project reduced mean absolute error the most (from 41.5 days to 26.5 days), whilst linear regression proved a suitable method for those looking for a simpler implementation (mean absolute error of 27.7 days).
Abstract
It has long been known that the domesticated bitch comes into season approximately once every 7 months. Whilst previous research has looked at which features of a bitch might cause variation from this mean, results have often been inconclusive or contradictory. This study uses several machine learning techniques to produce predictive models which estimate the time between each bitch’s oestrus periods, based on her unique features. Additionally, the paper comments on which features influence this interval the most, based on automated relevance detection methods. All data provided for this study come from the Guide Dogs UK breeding programme, with the aim of improving colony management and supporting their production of assistance dogs. The data analysed consisted of 4693 observations of oestrus, across 877 unique bitches, over the years 2002 to 2019. Features analysed included age, breed, diet and 19 more. The best interval prediction model limited the mean error to 26.45 days. This was a significant improvement over the mean error of 41.52 days produced by the current method. The best performing models were random forest regression, linear regression and a neural network built for this problem, with random forest regression achieving the smallest mean error. On feature importance, the automated models found that the average of a bitch’s previous seasons, whether a bitch had attempted mating or been pregnant last season, and the bitch’s breed had the most significant impact on the length of her interval. Despite previous studies’ support for the concept, we did not find any evidence of seasonality in the oestrus intervals of these bitches.
Contacts
Callum Illkiw