Twenty First Group - Tennis

Author

Andres Gonzalez

1. Pick a player who has played over 50 matches

Answer: The player selected is Aryna Sabalenka.

a. What is their win-loss record?

Answer: Aryna Sabalenka has won 53 matches and lost 16 matches.
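A win-loss record like this can be computed by counting how often the player appears as winner and as loser. The sketch below is illustrative only: the `matches` data frame, its `winner` and `loser` columns, and the `win_loss()` helper are hypothetical, and the rows are toy data rather than the dataset behind this report.

```r
# Hypothetical match-level data: one row per match (toy values, not the real dataset).
matches <- data.frame(
  winner = c("Aryna Sabalenka", "Iga Swiatek", "Aryna Sabalenka", "Elena Rybakina"),
  loser  = c("Iga Swiatek", "Aryna Sabalenka", "Elena Rybakina", "Iga Swiatek"),
  stringsAsFactors = FALSE
)

# Count appearances as winner and as loser for one player.
win_loss <- function(data, player) {
  c(wins = sum(data$winner == player), losses = sum(data$loser == player))
}

record <- win_loss(matches, "Aryna Sabalenka")
print(record)
```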

b. Which is their best surface & why?

Answer: Aryna Sabalenka performs best on hard courts, where she has her highest win rate (79.5%). Likely reasons are her aggressive playing style, which suits the surface’s characteristics, and her extensive experience on hard courts.
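The per-surface comparison can be sketched as a grouped mean of a win indicator. The `player_matches` data frame and its `surface` and `won` columns are assumptions for illustration, filled with toy values.

```r
# Hypothetical data: one row per match played by the chosen player (toy values).
player_matches <- data.frame(
  surface = c("Hard", "Hard", "Hard", "Hard", "Clay", "Clay", "Grass", "Grass"),
  won     = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Win rate in percent for each surface (the mean of a logical is the share of TRUE).
win_rate <- tapply(player_matches$won, player_matches$surface, mean) * 100

# Surface with the highest win rate.
best_surface <- names(which.max(win_rate))

print(round(win_rate, 1))
print(best_surface)
```

With only a handful of matches on a surface, the win rate is noisy, so it is worth reading the rate alongside the match count per surface.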

c. How well do they serve when under pressure?

Answer: Aryna Sabalenka saves 60% of the break points she faces, which indicates that she generally holds her serve under pressure.
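One way to compute a break points saved rate is to pool break points across matches and divide saved by faced. The column names `bp_faced` and `bp_saved` are hypothetical, the values are toy data, and the report does not state whether its figure was pooled like this or averaged per match.

```r
# Hypothetical per-match serve statistics for one player (toy values).
serve_stats <- data.frame(
  bp_faced = c(5, 8, 2, 5),
  bp_saved = c(3, 5, 1, 3)
)

# Pooled rate: total break points saved divided by total break points faced.
bp_saved_rate <- 100 * sum(serve_stats$bp_saved) / sum(serve_stats$bp_faced)
print(bp_saved_rate)
```

Pooling weights every break point equally. Averaging per-match percentages instead would give short matches with few break points the same weight as long ones.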

2. Who is the best at serving in each of the men’s & women’s games?

Answer Men’s Game: John Isner, with an average serve ability score of approximately 1267.9 over 30 matches.

Answer Women’s Game: Ashleigh Barty, with an average serve ability score of approximately 1087.5 over 45 matches.

a. What are the differences between these players?

Answer: The main difference lies in their average serve ability scores. John Isner’s score (1267.9) is about 180 points higher than Ashleigh Barty’s (1087.5), which reflects his serve power and effectiveness in the men’s game. Barty’s score, although lower than Isner’s, is the highest in the women’s game and points to a consistent, well-placed serve that contributes significantly to her success.

b. Create two different visualisations that demonstrate the difference between the top 10 male and female servers.

Answer:

TOP 10 MALE SERVERS
Matches played between 01/06/2019 and 31/08/2023. Twitter: @AndresAnalytics.

| PLAYER                | MATCHES | SERVE ABILITY |
|-----------------------|---------|---------------|
| John Isner            | 30      | 1,267.9       |
| Milos Raonic          | 18      | 1,239.6       |
| Nick Kyrgios          | 30      | 1,209.1       |
| Roger Federer         | 33      | 1,203.7       |
| Reilly Opelka         | 25      | 1,173.6       |
| Ivo Karlovic          | 9       | 1,167.0       |
| Novak Djokovic        | 103     | 1,164.1       |
| Matteo Berrettini     | 60      | 1,150.9       |
| Juan Martin del Potro | 4       | 1,116.0       |
| Stefanos Tsitsipas    | 67      | 1,115.1       |

TOP 10 FEMALE SERVERS
Matches played between 01/06/2019 and 31/08/2023. Twitter: @AndresAnalytics.

| PLAYER          | MATCHES | SERVE ABILITY |
|-----------------|---------|---------------|
| Ashleigh Barty  | 45      | 1,087.5       |
| Naomi Osaka     | 35      | 1,047.9       |
| Serena Williams | 43      | 996.4         |
| Elena Rybakina  | 54      | 976.5         |
| Johanna Konta   | 22      | 968.0         |
| Iga Swiatek     | 81      | 946.6         |
| Aryna Sabalenka | 69      | 944.5         |
| Jennifer Brady  | 25      | 931.0         |
| Petra Kvitova   | 48      | 928.6         |
| Caroline Garcia | 43      | 921.6         |
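Rankings of this kind can be produced by averaging a per-match score for each player and sorting within each gender. The sketch below assumes a hypothetical `serve` data frame with a `serve_ability` column. How that score is defined is not stated in this report, so the values are toy numbers and `top_n_servers()` is an illustrative helper, not the code behind the tables.

```r
# Hypothetical per-match serve scores (toy values; the real score definition is not given).
serve <- data.frame(
  player = c("A", "A", "B", "B", "C", "D", "D", "E"),
  gender = c("male", "male", "male", "male", "male", "female", "female", "female"),
  serve_ability = c(1200, 1300, 1100, 1000, 900, 1000, 1100, 950)
)

# Average score and number of matches per player.
avg    <- aggregate(serve_ability ~ player + gender, data = serve, FUN = mean)
counts <- aggregate(serve_ability ~ player + gender, data = serve, FUN = length)
avg$matches <- counts$serve_ability  # both aggregate() calls return groups in the same order

# Top n players of one gender, sorted by average score (highest first).
top_n_servers <- function(tbl, g, n = 10) {
  sub <- tbl[tbl$gender == g, ]
  head(sub[order(-sub$serve_ability), ], n)
}

top_male <- top_n_servers(avg, "male", n = 2)
print(top_male)
```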

3. Build and evaluate a simple model that predicts the number of aces in a given match.


Call:
lm(formula = total_aces ~ gender + surface + gender:surface, 
    data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.794  -4.042  -0.937   2.958  56.206 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                11.3417     0.3434  33.030   <2e-16 ***
genderfemale               -7.4046     0.4864 -15.224   <2e-16 ***
surfaceGrass                8.4646     0.5158  16.411   <2e-16 ***
surfaceHard                 8.4522     0.4270  19.793   <2e-16 ***
genderfemale:surfaceGrass  -6.3597     0.7302  -8.709   <2e-16 ***
genderfemale:surfaceHard   -6.0707     0.6088  -9.972   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.523 on 3405 degrees of freedom
Multiple R-squared:  0.433, Adjusted R-squared:  0.4322 
F-statistic:   520 on 5 and 3405 DF,  p-value: < 2.2e-16

Coefficients

Intercept (11.3417): The average number of aces served by males on clay surfaces is about 11.34.

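Gender (female, -7.4046): On clay, the reference surface, female players serve approximately 7.4 fewer aces than male players. Because the model includes interaction terms, the gap on grass and hard courts is this value plus the corresponding interaction coefficient.

Surface (Grass, 8.4646 and Hard, 8.4522): For male players, the reference gender, playing on grass or hard courts increases the number of aces by about 8.46 and 8.45 respectively, compared to clay.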

Interaction Terms:

Gender (female) and Surface (Grass, -6.3597): The increase in aces due to playing on grass is about 6.36 less for females compared to males.

Gender (female) and Surface (Hard, -6.0707): Similarly, the increase in aces for playing on hard surfaces is about 6.07 less for females.
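To make the interaction terms concrete, the fitted coefficients can be combined by hand into a predicted number of aces for each gender and surface combination. The sketch below uses only the estimates printed in the model summary. The `coefs` vector and `predict_aces()` helper are illustrative names, not part of the original analysis.

```r
# Coefficient estimates copied from the model summary above.
coefs <- c(intercept = 11.3417, female = -7.4046, grass = 8.4646, hard = 8.4522,
           female_grass = -6.3597, female_hard = -6.0707)

# Predicted aces: intercept plus whichever main effects and interactions apply.
predict_aces <- function(gender, surface) {
  female <- as.numeric(gender == "female")
  grass  <- as.numeric(surface == "Grass")
  hard   <- as.numeric(surface == "Hard")
  unname(coefs["intercept"] + coefs["female"] * female +
           coefs["grass"] * grass + coefs["hard"] * hard +
           coefs["female_grass"] * female * grass +
           coefs["female_hard"] * female * hard)
}

grid <- expand.grid(gender = c("male", "female"),
                    surface = c("Clay", "Grass", "Hard"),
                    stringsAsFactors = FALSE)
grid$predicted_aces <- round(unname(mapply(predict_aces, grid$gender, grid$surface)), 2)
print(grid)
```

Under these estimates, male players are predicted about 11.34 aces on clay, 19.81 on grass and 19.79 on hard courts, while female players are predicted about 3.94, 6.04 and 6.32. The surface effect for female players is therefore only about 2.1 to 2.4 aces, compared with about 8.5 for male players.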

Assumptions for Linear Regression Model

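  1. Linearity: The relationship between the predictor variables and the response variable should be linear. Because both predictors are categorical and the model includes their interaction, it estimates a separate mean for each gender and surface combination, so this assumption holds by construction.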

  2. Homoscedasticity: The variance of the residuals should be constant across all levels of the predictor variables.

After plotting residuals versus fitted values, we observe a pattern in the spread of residuals, indicating that the variance is not constant across all levels of the predictor variables. This violates the assumption of homoscedasticity. Non-constant variance does not bias the coefficient estimates, but it makes the standard errors unreliable, which can lead to incorrect inferences. To address this issue, a transformation of the response variable or robust regression techniques may be necessary.
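As a numeric complement to the residual plot, the spread of the residuals can be compared across the fitted groups. The sketch below runs on simulated Poisson counts, not the report's data. The `sim` data frame and the group means used to generate it are assumptions chosen only to mimic count data whose variance grows with its mean.

```r
set.seed(1)
n <- 600

# Simulated matches (assumption: toy data standing in for the real training set).
sim <- data.frame(
  gender  = sample(c("male", "female"), n, replace = TRUE),
  surface = sample(c("Clay", "Grass", "Hard"), n, replace = TRUE)
)

# Poisson counts: the variance equals the mean, so high-ace groups are more spread out.
mu <- ifelse(sim$gender == "male", 12, 4) + ifelse(sim$surface == "Clay", 0, 6)
sim$total_aces <- rpois(n, mu)

# Same model structure as in the report: main effects plus interaction.
mod <- lm(total_aces ~ gender * surface, data = sim)

# Residual standard deviation per group; similar values would support homoscedasticity.
spread <- tapply(resid(mod), list(sim$gender, sim$surface), sd)
print(round(spread, 2))
```

In this simulation the groups with higher expected ace counts show a clearly larger residual spread, which is the pattern described above.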

  3. Normality of Residuals: The residuals should be approximately normally distributed.


    Shapiro-Wilk normality test

data:  resid(mod_updated)
W = 0.89811, p-value < 2.2e-16

Since the p-value from the Shapiro-Wilk test (< 2.2e-16) is far below any conventional significance level, we reject the null hypothesis that the residuals are normally distributed. This is not surprising for count data such as aces, which cannot be negative and are right-skewed: the residuals above range from -18.8 to 56.2. Non-normal residuals mainly undermine the p-values and confidence intervals, so the inference from this model should be treated with caution. A transformation of the response variable, or an alternative such as a generalized linear model that does not rely on the normality assumption, may be necessary to address this.
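As one possible form of the generalized linear model mentioned above, a Poisson regression with a log link is a common starting point for count outcomes. The sketch below is a minimal illustration on simulated data. It is not a model fitted in this report, and the `sim` data frame and its generating means are assumptions.

```r
set.seed(42)
n <- 600

# Simulated matches (assumption: toy data, not the report's training set).
sim <- data.frame(
  gender  = factor(sample(c("male", "female"), n, replace = TRUE),
                   levels = c("male", "female")),
  surface = factor(sample(c("Clay", "Grass", "Hard"), n, replace = TRUE))
)
mu <- ifelse(sim$gender == "male", 12, 4) + ifelse(sim$surface == "Clay", 0, 6)
sim$total_aces <- rpois(n, mu)

# Poisson GLM with the same predictors as the linear model.
pois_mod <- glm(total_aces ~ gender * surface,
                family = poisson(link = "log"), data = sim)

# Predicted mean aces for a men's match on a hard court.
new_match <- data.frame(
  gender  = factor("male", levels = levels(sim$gender)),
  surface = factor("Hard", levels = levels(sim$surface))
)
pred <- predict(pois_mod, newdata = new_match, type = "response")
print(pred)

# Pearson dispersion ratio: values well above 1 suggest overdispersion,
# in which case a negative binomial model may be a better fit.
dispersion <- sum(residuals(pois_mod, type = "pearson")^2) / df.residual(pois_mod)
print(dispersion)
```

Real ace counts are often more variable than a Poisson model allows, so the dispersion ratio is worth checking before relying on the standard errors.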

Calculate Cook’s distance for detecting influential observations

named integer(0)

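The check returns an empty result, meaning no observation exceeds the Cook’s distance threshold used, so there are no influential points.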
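A check of this kind can be sketched with `cooks.distance()` from base R. The threshold of 1 used below is an assumption, since the report does not state which cutoff was applied, and the model is fitted to simulated data for illustration.

```r
set.seed(7)

# Simulated data without outliers (assumption: toy example).
x <- rnorm(100)
y <- 2 * x + rnorm(100)
mod <- lm(y ~ x)

# Cook's distance for every observation.
cooks <- cooks.distance(mod)

# Assumed cutoff of 1; 4 / n is a common, stricter alternative.
threshold <- 1
influential <- which(cooks > threshold)
print(influential)
```

When no observation exceeds the threshold, `which()` returns an empty named integer vector, which R prints as `named integer(0)`.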

Results and Accuracy

MSE: 52.81551 
RMSE: 7.267428 
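Both metrics follow directly from the prediction errors: MSE is the mean of the squared errors and RMSE is its square root. The `actual` and `predicted` vectors below are toy values chosen for illustration, not the report's test set.

```r
# Toy values (assumption): observed and predicted aces for five matches.
actual    <- c(3, 10, 25, 6, 40)
predicted <- c(4, 12, 20, 6, 20)

# Mean squared error and its square root.
mse  <- mean((actual - predicted)^2)
rmse <- sqrt(mse)
print(c(MSE = mse, RMSE = rmse))

# Consistency check on the reported figures: sqrt(52.81551) is about 7.2674.
print(sqrt(52.81551))
```

In the toy data a single match that is underestimated by 20 aces contributes most of the MSE, which shows how strongly both metrics react to high-ace matches.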

Conclusion: The model is more accurate for matches with few aces but consistently underestimates matches with many aces. Because it predicts a single value for each gender and surface combination, it cannot capture differences between players within a group. Improving its accuracy and reliability across the full range of data would require reviewing and adjusting the model.

General Methods Used

For each task, the general methods used include:

Task 1: Pick a Player Who Has Played Over 50 Matches (libraries used: dplyr)

  1. Win-Loss Record: Calculated the win-loss record for the selected player.

  2. Best Surface Analysis: Analyzed the player’s performance on different surfaces to determine their best surface and provided reasoning behind it.

  3. Serve Under Pressure: Evaluated the player’s performance under pressure, particularly focusing on their serving ability during critical moments in matches.

Task 2: Identifying the Best Servers in Men’s & Women’s Games (libraries used: dplyr, gt, gtExtras)

  1. Comparison of Players: Analyzed the differences between the best servers in men’s and women’s games, considering factors such as serve ability, match statistics, and playing style.

  2. Visualization: Created two different visualizations to demonstrate the difference between the top 10 male and female servers, such as scatter plots comparing serve ability or bar plots showing serve statistics.

Task 3: Building and Evaluating a Simple Model for Predicting Aces

  1. Model Building: Built and evaluated a simple model to predict the number of aces in a given match, focusing on the approach rather than the accuracy of the model.

  2. Model Evaluation: Assessed the model’s performance using metrics such as Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).

  3. Interpretation and Communication: Interpreted the model results and communicated findings effectively, potentially using visualizations to aid in understanding.