--- title: 'prettyglm: Beautiful Visualisations for Generalized Linear Models' author: - name: Jared Fowler output: html_document: toc: yes toc_float: collapsed: false smooth_scroll: true toc_depth: 2 number_sections: no vignette: > %\VignetteIndexEntry{prettyglm: Beautiful Visualisations for Generalized Linear Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r include=FALSE} # knitr::opts_chunk$set( # collapse = TRUE, # comment = "#>" # ) ``` # Introduction When working with Generalized Linear Models it is often useful to create informative and beautiful summaries of the fitted model coefficients. The goal of `prettyglm` is to provide a set of functions to visualize the Generalized Linear Models coefficients and performance in interactive plots which can easily be embedded in rmarkdown reports or separately exported and shared with stakeholders. This document introduces `prettyglm`’s main sets of functions, and shows you how to apply them. Please see the website [prettyglm]( https://jared-fowler.github.io/prettyglm/) for more detailed documentation with html outputs, some of the outputs have been excluded from this documentation for publication on CRAN. If you don't find the function you are looking for in `prettyglm` consider checking out some other great packages which help visualize the output from glms: * `tidycat` * `jtools` # Installation You can install the latest CRAN release with: ``` r install.packages('prettyglm') ``` # Important Pre-Processing ### Model Building To explore the functionality of `prettyglm` we will use the titanic data set to perform logistic regression. This data set was sourced from [kaggle](https://www.kaggle.com/c/titanic/data) and contains information about passengers aboard the titanic, and a target variable which indicates if they survived. ```{r load data, echo=TRUE, message=FALSE, warning=FALSE} library(dplyr) library(prettyglm) data('titanic') head(titanic) %>% select(-c(PassengerId, Name, Ticket)) %>% knitr::kable(table.attr = "style='width:10%;'" ) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) ``` ### Pre-processing A critical step for this package to work is to **set all categorical predictors as factors**. ```{r preprocessing 1, echo=TRUE, message=FALSE, warning=FALSE} # Easy way to convert multiple columns to a factor. columns_to_factor <- c('Pclass', 'Sex', 'Cabin', 'Embarked', 'Cabintype') meanage <- base::mean(titanic$Age, na.rm=T) titanic <- titanic %>% dplyr::mutate_at(columns_to_factor, list(~factor(.))) %>% dplyr::mutate(Age =base::ifelse(is.na(Age)==T,meanage,Age)) ``` ### Building a glm For this vignette we will use `stats::glm()` to build a logistic regression model. Currently working on support for `parsnip` and `workflow` model objects which use the `glm` model engine. ```{r build model, echo=TRUE} survival_model <- stats::glm(Survived ~ Pclass + Sex + Fare + Age + Embarked + SibSp + Parch, data = titanic, family = binomial(link = 'logit')) ``` # `pretty_coefficients()` ## Table of model coefficients {.tabset} The function `pretty_coefficients()` allows you to create a pretty table of model coefficients, which by default includes categorical base levels. ### Simple Example The simplest way to call this function is just with the model object. ```{r visualise coefficients, echo=TRUE, eval=FALSE, include=TRUE} pretty_coefficients(model_object = survival_model) ``` ### Type III Significance Tests You can also complete a type III test on the coefficients by specifying a `type_iii` argument. Warning `Wald` type III tests will fail if there are aliased coefficients in the model. You can change the significance level highlighted in the table with `significance_level`. ```{r visualise coefficients type iii, echo=TRUE, eval=FALSE, include=TRUE} pretty_coefficients(survival_model, type_iii = 'Wald', significance_level = 0.1) ``` ### Changing Variable Importance Method By default `pretty_coefficients` shows "model" variable importance. But `vimethod` also accepts "permute" and "firm" methods from \link[vip]{vi}. Additional parameters for these methods should also be passed into `pretty_coefficients`. ```{r visualise coefficients vi,echo=TRUE, eval=FALSE, include=TRUE} pretty_coefficients(model_object = survival_model, type_iii = 'Wald', significance_level = 0.1, vimethod = 'permute', target = 'Survived', metric = 'auc', pred_wrapper = predict.glm, reference_class = 0) ``` # `pretty_relativities()` `pretty_relativities()` will create a plot of the desired model variable. A different plot will be generated depending on the class of the variable. ## Plot coefficients {.tabset} A model relativity is a transform of the model estimate. By default `pretty_relativities()` uses 'exp(estimate)-1' which is useful for GLM's which use a log or logit link function. The term 'relativity' is some times referred to as "odds-ratio" or "Likelihood". You can customize the label with the `relativity_label` input. ### Categorical Variables For categorical variables `pretty_relativities()` creates an interactive duel axis plot, which plots the fitted relativity on one y axis, and the number of records in that category on the other y axis. ```{r visualise rels, echo=TRUE, eval=FALSE, include=TRUE} pretty_relativities(feature_to_plot= 'Embarked', model_object = survival_model, relativity_label = 'Liklihood of Survival' ) ``` ### Continuous Variables For continuous variables `pretty_relativities` will plot the relativity over the variables range, and the density of that variable on a duel axis. If desired you can cut off the tail end of the distributions with `upper_percentile_to_cut` or `lower_percentile_to_cut`. ```{r visualise rels 2, echo=TRUE, eval=FALSE, include=TRUE} pretty_relativities(feature_to_plot= 'Fare', model_object = survival_model, relativity_label = 'Liklihood of Survival', upper_percentile_to_cut = 0.1) ``` ## Plot interactions To highlight some more of `prettyglm`'s functionality we will now build a logistic regression model with some interactions. ### Model Building ```{r build model 2, echo=TRUE} survival_model2 <- stats::glm(Survived ~ Pclass:Fare + Age + Embarked:Sex + SibSp + Parch, data = titanic, family = binomial(link = 'logit')) ``` ### Factor:Factor Interactions {.tabset} #### Facet You can also choose to facet the plots by one of the variables. ```{r visualise ff F, echo=TRUE, eval=FALSE, include=TRUE} pretty_relativities(feature_to_plot= 'Embarked:Sex', model_object = survival_model2, relativity_label = 'Liklihood of Survival', iteractionplottype = 'facet', facetorcolourby = 'Sex' ) ``` #### Colour You can also choose to colour the plots by one of the variables. ```{r visualise ff C, echo=TRUE, eval=FALSE, include=TRUE} pretty_relativities(feature_to_plot= 'Embarked:Sex', model_object = survival_model2, relativity_label = 'Liklihood of Survival', iteractionplottype = 'colour', facetorcolourby = 'Embarked' ) ``` #### Standard You can create these relativity plots as you would for a non-interaction. ```{r visualise ff N, echo=TRUE, eval=FALSE, include=TRUE} pretty_relativities(feature_to_plot= 'Embarked:Sex', model_object = survival_model2, relativity_label = 'Liklihood of Survival' ) ``` ### Continuous:Factor Interactions {.tabset} #### Colour By default continuous and factor interaction plots will colour by the factor variable. ```{r visualise cf C, echo=TRUE, eval=FALSE, include=TRUE} pretty_relativities(feature_to_plot= 'Pclass:Fare', model_object = survival_model2, relativity_label = 'Liklihood of Survival', upper_percentile_to_cut = 0.03 ) ``` #### Facet You can also facet by the factor variable. ```{r visualise cf F, echo=TRUE, eval=FALSE, include=TRUE} pretty_relativities(feature_to_plot= 'Pclass:Fare', model_object = survival_model2, relativity_label = 'Liklihood of Survival', iteractionplottype = 'facet', upper_percentile_to_cut = 0.03, height = 800 ) ``` ## Plot splines To highlight some more of `prettyglm`'s functionality we will now build a logistic regression model with a **spline**. ### Create the splines `prettyglm` includes a function `splineit` to help construct splines. This can be incorporated in the dplyr workflow as follows. For splines to work nicely in `prettyglm` use the naming convention Variable#Start#End where # represents your desired separator. ```{r fit the splines} titanic <- titanic %>% dplyr::mutate(Age_0_18 = prettyglm::splineit(Age,0,18), Age_18_35 = prettyglm::splineit(Age,18,35), Age_35_120 = prettyglm::splineit(Age,35,120)) %>% dplyr::mutate(Fare_0_55 = prettyglm::splineit(Fare,0,55), Fare_55_600 = prettyglm::splineit(Fare,55,600)) ``` ### Fit Model With Splines ```{r build model 4, echo=TRUE} survival_model4 <- stats::glm(Survived ~ Pclass + Sex:Fare_0_55 + Sex:Fare_55_600 + Age_0_18 + Age_18_35 + Age_35_120 + Embarked + SibSp + Parch, data = titanic, family = binomial(link = 'logit')) ``` ### Creating a table of model coefficients with pretty_coefficients For interactions variables are grouped on the left pane. ```{r visualise coefficients type spline, echo=TRUE, eval=FALSE, include=TRUE} pretty_coefficients(survival_model4, significance_level = 0.1, spline_seperator = '_') ``` ### Create plots of fitted coefficients using pretty_relativities #### Splines You also need to provide a `spline_seperator` input in `pretty_relativities`. ```{r visualise age spine, echo=TRUE, eval=FALSE, include=TRUE} pretty_relativities(feature_to_plot= 'Age', model_object = survival_model4, relativity_label = 'Liklihood of Survival', spline_seperator = '_' ) ``` #### Interacted Splines {.tabset} ##### Colour By default `pretty_relativities` will colour by the factor variable. ```{r visualise fare spine, echo=TRUE, eval=FALSE, include=TRUE} pretty_relativities(feature_to_plot= 'Sex:Fare', model_object = survival_model4, relativity_label = 'Liklihood of Survival', spline_seperator = '_', upper_percentile_to_cut = 0.03 ) ``` ##### Facet If you prefer to facet by the factor variable, change `iteractionplottype` to "facet" ```{r visualise fare Facet, echo=TRUE, eval=FALSE, include=TRUE} pretty_relativities(feature_to_plot= 'Sex:Fare', model_object = survival_model4, relativity_label = 'Liklihood of Survival', spline_seperator = '_', upper_percentile_to_cut = 0.03, iteractionplottype = 'facet' ) ``` # `one_way_ave()` ## Visualising one-way performance {.tabset} ### Continuous Variable For continuous variables `one_way_ave` will bucket value into 30 buckets by default, and plot the density on a dual axis. ```{r oneway cts, echo=TRUE, eval=FALSE, include=TRUE} one_way_ave(feature_to_plot = 'Age', model_object = survival_model4, target_variable = 'Survived', data_set = titanic, upper_percentile_to_cut = 0.1, lower_percentile_to_cut = 0.1) ``` ### Discrete Variable ```{r oneway discrete, echo=TRUE, eval=FALSE, include=TRUE} one_way_ave(feature_to_plot = 'Cabintype', model_object = survival_model4, target_variable = 'Survived', data_set = titanic) ``` ## Customising one-way performance plots {.tabset} ### Faceting You can facet the `one_way_ave` plot by providing a variable to facet by in `facetby`. ```{r oneway cts facet, echo=TRUE, eval=FALSE, include=TRUE} one_way_ave(feature_to_plot = 'Age', model_object = survival_model4, target_variable = 'Survived', facetby = 'Sex', data_set = titanic, upper_percentile_to_cut = 0.1, lower_percentile_to_cut = 0.1) ``` ### Custom Predict Function By default `one_way_ave` uses \link[stats]{predict.glm}. If you would like to use `one_way_ave` with another model type (which is not compatible with predict.glm), or provide modified predictions, `one_way_ave` allows a custom prediction function. This function must return a data.frame with two columns: "Actual_Values" and "Predicted_Values". ```{r custom predict, echo=TRUE, eval=FALSE, include=TRUE} # Custom Predict Function and facet a_custom_predict_function <- function(target, model_object, dataset){ dataset <- base::as.data.frame(dataset) Actual_Values <- dplyr::pull(dplyr::select(dataset, tidyselect::all_of(c(target)))) if(class(Actual_Values) == 'factor'){ Actual_Values <- base::as.numeric(as.character(Actual_Values)) } Predicted_Values <- base::as.numeric(stats::predict(model_object, dataset, type='response')) to_return <- base::data.frame(Actual_Values = Actual_Values, Predicted_Values = Predicted_Values) to_return <- to_return %>% dplyr::mutate(Predicted_Values = base::ifelse(Predicted_Values > 0.4,0.4,Predicted_Values)) return(to_return) } one_way_ave(feature_to_plot = 'Age', model_object = survival_model4, target_variable = 'Survived', data_set = titanic, upper_percentile_to_cut = 0.1, lower_percentile_to_cut = 0.1, predict_function = a_custom_predict_function) ``` # `actual_expected_bucketed()` ## Actual vs Expected Bucketed By Prediction Percentile {.tabset} ### Standard ```{r bucketed aves, echo=TRUE, eval=FALSE, include=TRUE} actual_expected_bucketed(target_variable = 'Survived', model_object = survival_model4, data_set = titanic) ``` ### Faceted ```{r bucketed aves facet, echo=TRUE, eval=FALSE, include=TRUE} actual_expected_bucketed(target_variable = 'Survived', model_object = survival_model4, data_set = titanic, facetby = 'Sex') ```