
1. Pick a player who has played over 50 matches
Answer: The player selected is Aryna Sabalenka.
a. What is their win-loss record?
Answer: Aryna Sabalenka has won 53 matches and lost 16 matches.
b. Which is their best surface & why?
Answer: Aryna Sabalenka performs best on hard courts, where her win rate is highest at 79.5%. Likely factors are her aggressive playing style, which suits the surface’s characteristics, and her extensive experience on hard courts.
c. How well do they serve when under pressure?
Answer: Aryna Sabalenka has a break points saved rate of 60%, which suggests her serve holds up reasonably well under pressure.
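As a quick check on the figures above, 53 wins from 69 matches is an overall win rate of about 76.8%. The sketch below shows one way the three answers for question 1 could be computed in base R. The data frame `player_matches` and its columns (`surface`, `won`, `bp_faced`, `bp_saved`) are hypothetical stand-ins with made-up values, not the dataset or code used for this report.

```r
# Hypothetical toy data for one player (column names and values are assumptions).
player_matches <- data.frame(
  surface  = c("Hard", "Hard", "Hard", "Clay", "Clay", "Grass"),
  won      = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  bp_faced = c(5, 0, 8, 2, 4, 1),
  bp_saved = c(3, 0, 5, 1, 2, 1)
)

# (a) Win-loss record.
wins   <- sum(player_matches$won)
losses <- sum(!player_matches$won)

# (b) Win rate and number of matches per surface.
win_rate     <- tapply(player_matches$won, player_matches$surface, mean)
n_played     <- table(player_matches$surface)
best_surface <- names(which.max(win_rate))

# (c) Break points saved, pooled over all matches so that matches with
# no break points faced do not produce a 0/0 ratio.
bp_saved_rate <- sum(player_matches$bp_saved) / sum(player_matches$bp_faced)

# Overall win rate implied by the reported 53-16 record.
reported_rate <- 53 / (53 + 16)

print(c(wins = wins, losses = losses))
print(round(100 * win_rate, 1))
print(n_played)
print(best_surface)
print(bp_saved_rate)
print(round(100 * reported_rate, 1))
```

In the toy data the "best" surface is the one with a single match, which is why it helps to report the match count next to each win rate before drawing conclusions.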
2. Who is the best at serving in each of the men’s & women’s games?
Answer Men’s Game: John Isner, with an average serve ability score of approximately 1267.9 over 30 matches.
Answer Women’s Game: Ashleigh Barty, with an average serve ability score of approximately 1087.5 over 45 matches.
a. What are the differences between these players?
Answer: The main difference lies in their average serve ability scores. John Isner's score (1267.9) is about 180 points higher than Ashleigh Barty's (1087.5), which is consistent with his dominance in serve power and effectiveness in the men’s game. Barty's score is lower than Isner's but is the highest in the women’s game, suggesting that her serve contributes significantly to her success.
b. Create two different visualisations that demonstrate the difference between the top 10 male and female servers.
Answer:
Matches played between 01/06/2019 and 31/08/2023. Twitter: @AndresAnalytics.

Top 10 male servers

| Player | Matches | Average serve ability score |
|---|---|---|
| John Isner | 30 | 1,267.9 |
| Milos Raonic | 18 | 1,239.6 |
| Nick Kyrgios | 30 | 1,209.1 |
| Roger Federer | 33 | 1,203.7 |
| Reilly Opelka | 25 | 1,173.6 |
| Ivo Karlovic | 9 | 1,167.0 |
| Novak Djokovic | 103 | 1,164.1 |
| Matteo Berrettini | 60 | 1,150.9 |
| Juan Martin del Potro | 4 | 1,116.0 |
| Stefanos Tsitsipas | 67 | 1,115.1 |

Top 10 female servers

| Player | Matches | Average serve ability score |
|---|---|---|
| Ashleigh Barty | 45 | 1,087.5 |
| Naomi Osaka | 35 | 1,047.9 |
| Serena Williams | 43 | 996.4 |
| Elena Rybakina | 54 | 976.5 |
| Johanna Konta | 22 | 968.0 |
| Iga Swiatek | 81 | 946.6 |
| Aryna Sabalenka | 69 | 944.5 |
| Jennifer Brady | 25 | 931.0 |
| Petra Kvitova | 48 | 928.6 |
| Caroline Garcia | 43 | 921.6 |
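The figures above could be turned into two charts along the following lines. This is a minimal base R sketch, not the code behind the original visualisations: the values are copied from the two top-10 lists, while the data frame `servers`, the helper `plot_servers()` and the colour choices are illustrative assumptions.

```r
# Values copied from the top-10 lists above; object names are illustrative.
servers <- data.frame(
  player = c("John Isner", "Milos Raonic", "Nick Kyrgios", "Roger Federer",
             "Reilly Opelka", "Ivo Karlovic", "Novak Djokovic",
             "Matteo Berrettini", "Juan Martin del Potro", "Stefanos Tsitsipas",
             "Ashleigh Barty", "Naomi Osaka", "Serena Williams",
             "Elena Rybakina", "Johanna Konta", "Iga Swiatek",
             "Aryna Sabalenka", "Jennifer Brady", "Petra Kvitova",
             "Caroline Garcia"),
  tour    = rep(c("Men", "Women"), each = 10),
  matches = c(30, 18, 30, 33, 25, 9, 103, 60, 4, 67,
              45, 35, 43, 54, 22, 81, 69, 25, 48, 43),
  score   = c(1267.9, 1239.6, 1209.1, 1203.7, 1173.6, 1167.0, 1164.1,
              1150.9, 1116.0, 1115.1,
              1087.5, 1047.9, 996.4, 976.5, 968.0, 946.6, 944.5,
              931.0, 928.6, 921.6),
  stringsAsFactors = FALSE
)

plot_servers <- function(df) {
  ord <- order(df$score)
  old_par <- par(mar = c(4, 9, 2, 1))  # widen left margin for player names
  on.exit(par(old_par))

  # Visualisation 1: horizontal bar chart of all 20 players, coloured by tour.
  barplot(df$score[ord], names.arg = df$player[ord], horiz = TRUE, las = 1,
          cex.names = 0.7,
          col = ifelse(df$tour[ord] == "Men", "steelblue", "tomato"),
          xlab = "Average serve ability score")
  legend("bottomright", legend = c("Men", "Women"),
         fill = c("steelblue", "tomato"))

  # Visualisation 2: box plot comparing the two score distributions.
  boxplot(score ~ tour, data = df, ylab = "Average serve ability score")
}

# Draw only in an interactive session; call plot_servers(servers) directly
# to write to the default graphics device otherwise.
if (interactive()) plot_servers(servers)

# Mean score per tour across the two top-10 lists.
tour_means <- tapply(servers$score, servers$tour, mean)
print(round(tour_means, 1))
```

Some averages rest on very few matches (for example 4 for Juan Martin del Potro and 9 for Ivo Karlovic), so showing or filtering on `matches` may be worth considering.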
3. Build and evaluate a simple model that predicts the number of aces in a given match.
Call:
lm(formula = total_aces ~ gender + surface + gender:surface,
    data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-18.794  -4.042  -0.937   2.958  56.206

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                11.3417     0.3434  33.030   <2e-16 ***
genderfemale               -7.4046     0.4864 -15.224   <2e-16 ***
surfaceGrass                8.4646     0.5158  16.411   <2e-16 ***
surfaceHard                 8.4522     0.4270  19.793   <2e-16 ***
genderfemale:surfaceGrass  -6.3597     0.7302  -8.709   <2e-16 ***
genderfemale:surfaceHard   -6.0707     0.6088  -9.972   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.523 on 3405 degrees of freedom
Multiple R-squared:  0.433,	Adjusted R-squared:  0.4322
F-statistic:   520 on 5 and 3405 DF,  p-value: < 2.2e-16
Coefficients
Intercept (11.3417): The average number of aces for the reference group, males on clay, is about 11.34.
Gender (female, -7.4046): On clay, the reference surface, female players serve approximately 7.4 fewer aces than male players. Because the model includes interaction terms, the gap on grass and hard courts also includes the corresponding interaction coefficient.
Surface (Grass, 8.4646 and Hard, 8.4522): For male players, playing on grass or hard courts increases the number of aces by about 8.46 and 8.45 respectively, compared to clay.
Interaction Terms:
Gender (female) and Surface (Grass, -6.3597): The increase in aces due to playing on grass is about 6.36 less for females than for males, leaving a net increase of about 2.10 over clay.
Gender (female) and Surface (Hard, -6.0707): Similarly, the increase in aces for playing on hard surfaces is about 6.07 less for females, leaving a net increase of about 2.38 over clay.
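To make the interpretation concrete, the coefficients can be combined into a predicted mean for each gender and surface combination. The sketch below does this arithmetic in base R with the estimates copied from the model summary; the object names are illustrative.

```r
# Coefficient estimates copied from the model summary above.
b <- c(intercept = 11.3417, female = -7.4046,
       grass = 8.4646, hard = 8.4522,
       female_grass = -6.3597, female_hard = -6.0707)

# Predicted mean of total_aces for each gender/surface cell.
cell_means <- rbind(
  male = c(
    clay  = b[["intercept"]],
    grass = b[["intercept"]] + b[["grass"]],
    hard  = b[["intercept"]] + b[["hard"]]
  ),
  female = c(
    clay  = b[["intercept"]] + b[["female"]],
    grass = b[["intercept"]] + b[["female"]] + b[["grass"]] + b[["female_grass"]],
    hard  = b[["intercept"]] + b[["female"]] + b[["hard"]] + b[["female_hard"]]
  )
)

print(round(cell_means, 2))
#         clay grass  hard
# male   11.34 19.81 19.79
# female  3.94  6.04  6.32
```

Since the model contains only these two categorical predictors and their interaction, these six values are the only predictions it can produce.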
Assumptions for Linear Regression Model
- Linearity: The relationship between the predictor variables and the response variable should be linear.
- Homoscedasticity: The variance of the residuals should be constant across all levels of the predictor variables.
A plot of residuals versus fitted values shows a pattern in the spread of the residuals, so the variance is not constant across levels of the predictor variables. This violates the assumption of homoscedasticity. Non-constant variance does not bias the coefficient estimates themselves, but it makes the standard errors unreliable, which can lead to incorrect inferences from the t-tests and p-values. To address this issue, transformations of the response variable or robust regression techniques may be necessary.
- Normality of Residuals: The residuals should be approximately normally distributed.
Shapiro-Wilk normality test
data: resid(mod_updated)
W = 0.89811, p-value < 2.2e-16
Since the p-value from the Shapiro-Wilk test is far below any conventional significance level, we reject the null hypothesis that the residuals are normally distributed. With a sample of this size the test is sensitive to even small departures from normality, but the long right tail of the residuals (a maximum of 56.2 against a minimum of -18.8) points to genuine skew. Non-normal residuals mainly affect the reliability of p-values and confidence intervals, so further investigation is warranted. A transformation of the response variable, or a model that does not rely on the normality assumption, such as a generalized linear model, may be necessary.
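The two issues above (non-constant variance and non-normal residuals) are typical of count outcomes. The sketch below illustrates this on simulated data and fits a Poisson GLM as one possible alternative. It is a hedged illustration only: the data are simulated, the cell means are assumptions that loosely echo the fitted values of the linear model, and a Poisson model is one candidate among several, not the method used in this report.

```r
set.seed(42)

# Simulated stand-in data: six gender/surface cells with assumed means.
cells <- expand.grid(gender  = c("male", "female"),
                     surface = c("Clay", "Grass", "Hard"))
cells$mu <- c(11.3, 3.9, 19.8, 6.0, 19.8, 6.3)
sim <- cells[rep(seq_len(nrow(cells)), each = 200), ]
sim$total_aces <- rpois(nrow(sim), lambda = sim$mu)

# Linear model: the residual spread per cell grows with the cell mean,
# which is the heteroscedasticity pattern described above.
lin_fit  <- lm(total_aces ~ gender * surface, data = sim)
resid_sd <- tapply(resid(lin_fit), interaction(sim$gender, sim$surface), sd)
print(round(resid_sd, 2))

# Poisson GLM with a log link: the variance is modelled as equal to the mean.
pois_fit <- glm(total_aces ~ gender * surface,
                family = poisson(link = "log"), data = sim)

# Exponentiated coefficients are multiplicative effects on the expected count.
rate_ratios <- exp(coef(pois_fit))
print(round(rate_ratios, 2))

# Pearson dispersion: values well above 1 would suggest overdispersion,
# in which case a quasi-Poisson or negative binomial model may fit better.
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) /
  df.residual(pois_fit)
print(round(dispersion, 2))
```

On simulated Poisson data the dispersion is close to 1 by construction; real match data may well be overdispersed, so the same check would need to be run on the actual training set.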
Cook’s distance for detecting influential observations
Cook’s distance was calculated for each observation, and no influential points were identified.
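For reference, a Cook's distance check can be sketched as follows. The data are simulated, with one extreme observation planted on purpose, and the 4/n cut-off is a common rule of thumb assumed here for illustration; the report does not state which threshold was applied.

```r
set.seed(7)

# Simulated data with a clear linear relationship.
sim <- data.frame(x = rnorm(100))
sim$y <- 2 + 3 * sim$x + rnorm(100)

# Plant one observation far from the trend so there is something to detect.
sim$x[100] <- 6
sim$y[100] <- -20

fit <- lm(y ~ x, data = sim)

# Cook's distance for every observation, flagged against a 4/n rule of thumb
# (the threshold is an assumption, not taken from the report).
cooks     <- cooks.distance(fit)
threshold <- 4 / nrow(sim)
flagged   <- which(cooks > threshold)

print(unname(flagged))
```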
Results and Accuracy
MSE: 52.81551
RMSE: 7.267428
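The two metrics are related by RMSE = sqrt(MSE), and the reported values agree with each other. The sketch below checks that and shows how both are computed from prediction errors; the `actual` and `predicted` vectors are made-up numbers for illustration, not results from this model.

```r
# Consistency check on the reported metrics.
rmse_from_report <- sqrt(52.81551)
print(rmse_from_report)

# Made-up example values (not from the report's data).
actual    <- c(3, 10, 25, 40)
predicted <- c(4, 12, 20, 20)

errors <- actual - predicted   # -1, -2, 5, 20
mse    <- mean(errors^2)       # (1 + 4 + 25 + 400) / 4 = 107.5
rmse   <- sqrt(mse)

print(c(mse = mse, rmse = rmse))
```

In the made-up example the single large underestimate (40 predicted as 20) dominates the MSE, which is how a few high-ace matches can inflate the error of a model that otherwise predicts typical matches well.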
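Conclusion: The model is more accurate for matches with low ace counts but consistently underestimates high ones. With only gender and surface as predictors it can produce just six distinct predictions, one per gender and surface combination, and it explains about 43% of the variance in the training data (R-squared 0.433). An RMSE of about 7.27 aces indicates that the model needs review and adjustment to improve its predictive accuracy and reliability across the full range of data.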
General Methods Used
For each task, the general methods used include:
Task 1: Pick a Player Who Has Played Over 50 Matches (libraries used: dplyr)
Win-Loss Record: Calculated the win-loss record for the selected player.
Best Surface Analysis: Analyzed the player’s performance on different surfaces to determine their best surface and provided reasoning behind it.
Serve Under Pressure: Evaluated the player’s performance under pressure, particularly focusing on their serving ability during critical moments in matches.
Task 3: Building and Evaluating a Simple Model for Predicting Aces
Model Building: Built and evaluated a simple model to predict the number of aces in a given match, focusing on the approach rather than the accuracy of the model.
Model Evaluation: Assessed the model’s performance using metrics such as Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).
Interpretation and Communication: Interpreted the model results and communicated findings effectively, potentially using visualizations to aid in understanding.