Browsing by Department "Statistics"
Now showing 1 - 20 of 149
Item: A behavioral choice model of the use of car-sharing and ride-sourcing services (2017-05-04)
Ferreira Dias, Felipe; Lin, Tse-min; Bhat, Chandra R.
There are a number of disruptive mobility services that are increasingly finding their way into the marketplace. Two key examples of such services are car-sharing services and ride-sourcing services. In an effort to better understand the influence of various exogenous socio-economic and demographic variables on the frequency of use of ride-sourcing and car-sharing services, this paper presents a bivariate ordered probit model estimated on a survey data set derived from the 2014-2015 Puget Sound Regional Travel Study. Model estimation results show that users of these services tend to be young, well-educated, higher-income, working individuals residing in higher-density areas. There are significant interaction effects reflecting the influence of children and the built environment on disruptive mobility service usage. The model developed in this paper provides key insights into factors affecting market penetration of these services, and can be integrated into larger travel forecasting model systems to better predict the adoption and use of mobility-on-demand services.

Item: A hybrid reduced approach to handle missing values in type 2 diabetes prediction (2016-05-06)
You, Xinqi; Saar-Tsechansky, Maytal; Gawande, Kishore
Diabetes is gaining more attention among medical institutions and health care organizations as its prevalence increases around the world. In the United States, 29.1 million people, or 9.3% of the population, are diagnosed with diabetes. About 86 million people are categorized as pre-diabetic, and 15-30% of them will develop diabetes within 5 years. To tackle this challenge, the National Diabetes Prevention Program (DPP) was introduced in 2002; it reduces the risk of diabetes by 58% through a lifestyle change program.
In order to help select a better group of prediabetic patients for intervention and maximize the cost-effectiveness of the program, we propose a Hybrid Reduced approach to handle missing values when predicting type 2 diabetes. This approach deals with four challenges in electronic medical records: missing values, missingness not at random, class imbalance, and prediction over a longer window (2 years). We select three ensemble predictive models: AdaBoost.M1, Gradient Boosting, and Extremely Randomized Trees, and apply the approach across 7 years of data to assess its robustness. The Hybrid Reduced approach includes two sub-approaches: Hybrid Reduced Organic and Hybrid Reduced Imputed. Throughout the experiments, Hybrid Reduced Imputed is the best performer and achieves a 5-7% improvement in precision. By using this approach, we could save $278 million in healthcare costs and improve people's health conditions.

Item: A prototype-oriented framework for deep transfer learning applications (2023-04-05)
Tanwisuth, Korawat; Zhou, Mingyuan (Assistant professor); Mueller, Peter; Ho, Nhat; Qian, Xiaoning
Deep learning models achieve state-of-the-art performance in many applications but often require large-scale data. Deep transfer learning studies the ability of deep learning models to transfer knowledge from source tasks to related target tasks, enabling data-efficient learning. This dissertation develops novel methodologies that tackle three different transfer learning applications for deep learning models: unsupervised domain adaptation, unsupervised fine-tuning, and source-private clustering. The key idea behind the proposed methods relies on minimizing the distributional discrepancy between the prototypes and target data within a transport framework. For each scenario, we design our algorithms to suit different data and model requirements.
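The prototype-transport idea can be illustrated with a minimal numpy sketch: class prototypes, unlabeled target features, and the expected cost under a soft transport plan. Euclidean cost and a softmax assignment are simplifying assumptions here; all names are illustrative, not the dissertation's actual code.

```python
import numpy as np

def transport_cost(prototypes, targets, temperature=1.0):
    """Expected cost of moving target features onto class prototypes.

    prototypes : (K, d) array, one row per class prototype
    targets    : (n, d) array of unlabeled target features
    Returns the average transport cost under a softmax assignment.
    """
    # Pairwise squared Euclidean costs, shape (n, K)
    cost = ((targets[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    # Soft assignment of each target point to the prototypes
    logits = -cost / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    plan = np.exp(logits)
    plan /= plan.sum(axis=1, keepdims=True)
    # Expected cost under the soft transport plan
    return float((plan * cost).sum(axis=1).mean())

rng = np.random.default_rng(0)
prototypes = np.array([[0.0, 0.0], [5.0, 5.0]])
targets = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))
# Targets sit near the first prototype, so the expected cost is small
print(transport_cost(prototypes, targets))
```

Minimizing this quantity with respect to the feature extractor (or the prototypes) is the adaptation step; in the dissertation the cost and assignment are learned within a deep model rather than fixed as here.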
In unsupervised domain adaptation, we leverage the source domain data to construct class prototypes and minimize the transport cost between the prototypes and target data. In unsupervised fine-tuning, we apply our framework to prompt-based zero-shot learning to adapt large pre-trained models directly on the target data, bypassing the source data requirement. In source-private clustering, we incorporate a knowledge distillation framework with our prototype-oriented clustering to address the problem of data and model privacy. All three approaches show consistent performance gains over the baselines.

Item: Accounting for multiple membership data in adolescent social networks : an analysis of simulated data (2016-05)
Peek, Jaclyn Kara; Beretvas, Susan Natasha; Powers, Daniel A.
Multilevel modeling allows for the modeling of nested structures such as students nested within middle schools and middle schools nested within high schools. These kinds of hierarchies are common in social science research. Pure hierarchies may exist, where one variable is completely nested within another. Multiple membership (MM) structures occur when some lower-level units are members of more than one higher-level clustering unit (e.g., a student attends more than one high school). An extension of the conventional multilevel model, the multiple membership random effects model (MMREM), can be used to handle MM data. We compare a random effects model with and without multiple membership effects to demonstrate the possible benefit of accounting for the MM structure. We replicate an existing study on student academic outcomes (Tranmer et al., 2013), which assumes a multiple membership data structure, and add a comparison to a non-MM (i.e., single membership) model in order to assess the improvement in model fit.
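The multiple membership structure can be made concrete by simulating data in which each student's outcome draws on a weighted combination of school random effects. The report's simulation is done in R; the following is a numpy sketch of the same data-generating idea, with all parameter values illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n_students, n_schools = 500, 20

# School-level random effects
u = rng.normal(0.0, 2.0, size=n_schools)

# Each student belongs to one or two schools; membership weights sum to 1
weights = np.zeros((n_students, n_schools))
for i in range(n_students):
    schools = rng.choice(n_schools, size=rng.integers(1, 3), replace=False)
    weights[i, schools] = 1.0 / len(schools)

# Outcome: fixed effects + weighted school effects + residual noise
x = rng.normal(size=n_students)
beta0, beta1 = 50.0, 3.0
y = beta0 + beta1 * x + weights @ u + rng.normal(0.0, 1.0, size=n_students)

print(y.shape)
```

A single-membership model corresponds to forcing each row of `weights` to have exactly one nonzero entry; fitting both models to data generated this way is what allows the DIC comparison described below.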
The original study investigated the effect of school, area, and social network membership in friendship dyads and triads on academic achievement in adolescents, with age, gender, and ethnicity as covariates. Our models retain the MM structure found in the original social network data. The original data are confidential and unavailable for use; therefore, a major component of this report is the simulation of this dataset in R. Results indicate that modeling multiple membership does not necessarily lead to better goodness-of-fit as measured by DIC. Accounting for the MM data structure initially produced a worse-fitting model. Artificially inflating the fixed and random effects that generated the simulated academic performance outcome led to the opposite effect. We conclude that the scale of random effects is important in determining the DIC measure of fit, and propose a full simulation study to more conclusively test our original hypothesis.

Item: Analysis of circular data in the dynamic model and mixture of von Mises distributions (2013-05)
Lan, Tian, active 2013; Carvalho, Carlos Marinho, 1978-
Analysis of circular data is becoming increasingly popular in many fields of study. In this report, I present two statistical analyses of circular data using von Mises distributions. First, the expectation-maximization algorithm is reviewed and used to classify and estimate circular data from a mixture of von Mises distributions. Second, the forward filtering backward smoothing method via particle filtering is reviewed and implemented for circular data appearing in dynamic state-space models.

Item: The analysis of influence factors of GDP in the United States (2015-05)
Ping, Ying, M.S. in Statistics; Greenberg, Betsy S.
Government expenditures have a close relationship with the gross domestic product (GDP). Understanding the contribution of different expenditures to GDP and improving the efficiency of fiscal policy is important for a country's development.
The Rahn Curve theory (Rahn & Fox, 1996) suggests that there is a level of government spending that maximizes economic growth; insufficient spending or overspending will hold back the economy. The main purpose of this report is to determine, through analysis of the impact of government spending during the financial crisis, how and why different spending levels affect the economy. The data we analyzed include the GSP (gross state product) and the composition of expenditures in each state of the United States for the year 2009. We use multiple linear regression to build our model and a stepwise method to select the variables to include. The results show that pension, education, and transportation spending have a significant positive effect on GSP. This indicates that expenditures on pensions, education, and transportation play an important role in the US economy, especially during a recession. Consequently, instead of increasing welfare spending, which may reduce people's motivation to work and does little to stimulate the economy, this paper recommends allocating more expenditure to services which can stimulate the economy and create job opportunities, such as pensions, education, and transportation. In particular, according to the survey referenced by Diana G. Carew and Dr. Michael Mandel, transportation, which received less investment over the last decade than its actual importance and contribution to the economy warrant, should get more attention from policy makers.

Item: Approaches to modeling self-rated health in longitudinal studies : best practices and recommendations for multilevel models (2012-05)
Sasson, Isaac; Powers, Daniel A.; Umberson, Debra J.
Self-rated health (SRH) is an outcome commonly studied by demographers, epidemiologists, and sociologists of health, typically measured using an ordinal scale.
SRH is analyzed in cross-sectional and longitudinal studies for both descriptive and inferential purposes, and has been shown to have significant validity with regard to predicting mortality. Despite the widespread use of this measure, only limited attention is explicitly given to its unique attributes in the case of longitudinal studies. While self-rated health is assumed to represent a latent continuous and dynamic process, SRH is actually measured discretely and asymmetrically. Thus, the validity of methods ignoring the scale of measurement remains questionable. We compare three approaches to modeling SRH with repeated measures over time: linear multilevel models (MLM or LGM), including corrections for non-normality; and marginal and conditional ordered-logit models for longitudinal data. The models are compared using simulated data and illustrated with results from the Health and Retirement Study. We find that marginal and conditional models result in very different interpretations, but that conditional linear and non-linear models result in similar substantive conclusions, albeit with some loss of power in the linear case. In conclusion, we suggest guidelines for modeling self-rated health and similar ordinal outcomes in longitudinal studies.

Item: Areas of endemism for rare fauna in karst regions of Hays County, Texas (2014-08)
Mainali, Kumar Prasad; Powers, Daniel A.
An area of endemism contains many species restricted to the area and is therefore rich in species diversity; consequently, it is an area of high conservation priority. An area of endemism is always determined with reference to a larger landscape using various algorithms and mathematical approaches. Using parsimony analysis of endemicity (PAE) and endemism analysis (NDM), this study analyzed the distribution of 45 rare fauna -- aquatic and terrestrial salamanders and arthropods -- in karst regions of Hays County, Texas.
PAE searched heuristically for the most parsimonious solutions, creating 97,216 trees; the method stored the 16 best solutions, from which a consensus was generated. NDM analyzed 285 potential areas of endemism. The area of endemism with the highest endemicity score determined by NDM and the consensus tree generated by PAE select the identical geographic range as the best area of endemism. The two methods differ in many specifics of how they determine endemicity but share a common fundamental principle: determining geographic ranges with many species largely confined to them. The two methods select 12% of the karst region with species records as the area of endemism, which holds 64% of the total species, with 38-40% of species being endemic to the area.

Item: Assessing and measuring the impact of self-accountability activation on prosocial choice : can efforts to encourage ethical purchases be counter-productive? (2015-08)
Cleveland, Heath Foster; Keitt, Timothy H.; Irwin, Julie
In this report, I discuss one method of prosocial marketing; evaluate it from a theoretical perspective; identify significant questions about its measurement and application; present a study and explain how the design and measurements included in that study could elucidate answers to the identified questions, pending some analysis; and discuss my current data collection plans. The method, ethical self-accountability activation, was proposed and evaluated by John Peloza, Katherine White, and Jingzhi Shang in their article titled "Good and Guilt-Free: The Role of Self-Accountability in Influencing Preferences for Products with Ethical Attributes," published in the January 2013 issue of the Journal of Marketing.

Item: Assigning g in Zellner's g prior for Bayesian variable selection (2015-05)
Wang, Mengjie; Walker, Stephen G., 1945-; Lin, Lizhen
There are numerous frequentist variable selection methods, such as stepwise regression, AIC, and BIC.
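For reference, both criteria can be computed directly from the residual sum of squares of a Gaussian linear model; the following is a minimal sketch with simulated data, illustrative only (constants common to both models are dropped).

```python
import numpy as np

def aic_bic(y, X):
    """AIC and BIC for an OLS fit with Gaussian errors (up to constants).
    k counts the regression coefficients, including the intercept."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(((y - X @ beta) ** 2).sum())
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
noise_cols = rng.normal(size=(n, 5))   # irrelevant predictors
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise_cols])

# BIC's heavier penalty should favor the smaller (true) model here
print(aic_bic(y, X_small))
print(aic_bic(y, X_big))
```

The log(n) factor in the BIC penalty is what connects it, asymptotically, to the Bayes factor approach discussed next.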
In particular, the latter two criteria include a penalty term which discourages overfitting. Within the framework of Bayesian variable selection, a popular approach is the Bayes factor (Kass & Raftery, 1995), which also has a natural built-in penalty term (Berger & Pericchi, 2001). Zellner's g prior (Zellner, 1986) is a common prior for coefficients in the linear regression model because it yields analytic posterior solutions that are fast to compute. However, the choice of g is a problem which has attracted a lot of attention. Zellner (1986) pointed out that if g is unknown, a prior can be introduced and g can be integrated out. One such choice is the hyper-g prior proposed by Liang et al. (2008). Instead of proposing a prior for g, we assign a fixed value for g based on controlling the Type I error of the test based on the Bayes factor. Since we use the Bayes factor for model selection, the Bayes factor is the test statistic. Every test comes with a Type I error, so it is reasonable to restrict this error below a benchmark value such as 0.1 or 0.05. This approach automatically involves setting a value of g; based on this idea, a fixed g can be selected, avoiding the need to find a prior for g.

Item: Associations between health behaviors and adolescent life satisfaction using structural equation modeling (SEM) (2016-08)
Wang, Wanyi; Lin, Lizhen, Ph. D.; Whittaker, Tiffany A.
Life satisfaction is an important indicator of suicidal behavior. The purpose of this study was to investigate the influences of health-related behaviors on adolescent life satisfaction using structural equation modeling (SEM). Data were obtained from the Health Behavior in School-Age Children (HBSC) survey, 2001-2002. Because of the complex nature of the data, SEM was preferred over regression models in the present study.
The results indicated that good eating habits and high scores on self-reported health played the greatest roles in promoting life satisfaction. The effects of both factors on life satisfaction were also mediated by academic achievement. Physical activity was a positive predictor of life satisfaction, but its effect appears to be mediated by health and academic achievement, rather than affecting life satisfaction directly. Moreover, physical activity was positively associated with good eating habits. The results generated from SEM were also compared with those from multiple linear regressions. Slight differences in the standardized coefficients for the total effects between SEM and the regression models were detected, due to the latent variable present in SEM, but the general proportion of variance accounted for in each outcome variable was similar across the two analyses. In summary, although there were some limitations in the study design and the building of the model, this study suggests that good dietary habits may be beneficial for improvements in health and academic achievement, which in turn lead to positive adolescent life satisfaction scores. Frequent physical activity and low BMI were weak but acceptable predictors of life satisfaction.

Item: Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem (2010-10)
Scott, James G.; Berger, James O.
This paper studies the multiplicity-correction effect of standard Bayesian variable-selection priors in linear regression. Our first goal is to clarify when, and how, multiplicity correction happens automatically in Bayesian analysis, and to distinguish this correction from the Bayesian Ockham's-razor effect. Our second goal is to contrast empirical-Bayes and fully Bayesian approaches to variable selection through examples, theoretical results, and simulations. Considerable differences between the two approaches are found.
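The automatic multiplicity correction at issue here arises from placing a prior on the common variable-inclusion probability and integrating it out; a toy illustration of the resulting beta-binomial(1, 1) model prior (a standard special case, not the paper's full analysis):

```python
from math import comb

def model_prior(k, p):
    """Prior probability of a specific model with k of p variables included,
    after integrating a uniform prior over the common inclusion probability.
    This is the beta-binomial(1, 1) model prior: 1 / ((p + 1) * C(p, k))."""
    return 1.0 / ((p + 1) * comb(p, k))

# Prior odds penalty for moving from a 1-variable to a 2-variable model:
# the penalty is (p - 1) / 2, so it grows with the number of candidates p.
for p in (10, 100, 1000):
    print(p, model_prior(1, p) / model_prior(2, p))
```

The growing odds penalty as p increases is the multiplicity correction: adding a variable gets automatically harder when more candidate variables are searched over.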
In particular, we prove a theorem that characterizes a surprising asymptotic discrepancy between fully Bayes and empirical Bayes. This discrepancy arises from a different source than the failure to account for hyperparameter uncertainty in the empirical-Bayes estimate. Indeed, even at the extreme, when the empirical-Bayes estimate converges asymptotically to the true variable-inclusion probability, the potential for a serious difference remains.

Item: Bayesian approaches for inference after selection and model fitting (2020-09-11)
Woody, Spencer Arlen; Scott, James (Statistician); Murray, Jared S.; Carvalho, Carlos M.; Zigler, Corwin M.; Hoff, Peter
This thesis presents a set of methods unified around the theme of providing valid inference when data are used to answer multiple questions of interest. The first portion takes on the case where the data are used twice: first to select targets of inference, and then a second time to form estimates for these targets. The proposed method uses a Bayesian formulation to give more efficient (shorter) confidence intervals which properly account for selection in order to retain nominal frequentist coverage. The second portion, comprising the bulk of this thesis, formalizes the approach of posterior summarization, unifying a set of ideas originating from the early 2000s. Posterior summarization is the process by which a model is fit to the relevant underlying outcome and then interpreted through post hoc exploration via lower-dimensional functionals. The data are used only once, to fit the model in the first stage. This approach is applied to interpret predictive trends within nonparametric regression models, select important confounders and perform model specification sensitivity analyses in linear models for causal effect estimation, and detect the presence of heterogeneous treatment effects in observational studies.
These methods are applied to several real and simulated datasets.

Item: Bayesian forecasting of motor recovery following cortical infarcts (2015-12)
Woodie, Daniel Aaron; Walker, Stephen G., 1945-; Jones, Theresa
Globally, about 15 million people suffer a stroke each year. Of those affected, about 5 million die and another 6 million are left with long-term disability. This disability is often due to motor, or muscle, impairments that make everyday tasks like walking or opening a door difficult or even impossible. Improvement in motor function after an injury is due in large part to reorganization of spared neural tissue. To better understand the physiological changes relevant to recovery of motor function, experimental stroke models have been developed. Many studies have focused on neural reorganization as it relates to improvements in motor function following stroke, but little has been done to explore neurovascular remodeling as it relates to these alterations in motor function. To better understand the relationship between restoration of cortical blood flow and improvements in motor function, we first developed a mouse model of stroke that results in recoverable forelimb impairments and then constructed statistical models to best link stroke severity and functional outcomes.

Item: Bayesian hierarchical linear modeling of NFL quarterback rating (2015-05)
Hernandez, Steven V.; Walker, Stephen G., 1945-; Mahometa, Michael J.
With endless amounts of statistics in American football, there are numerous ways to evaluate quarterback performance in the National Football League. Owners, general managers, and coaches are always looking for ways to improve quarterback play to increase overall team performance. In doing so, one may ask: does performance in the first quarter have any effect on fourth quarter performance?
This paper investigates the linear dependence of fourth quarter NFL QB rating on first quarter NFL QB rating for 17 NFL starting quarterbacks from the 2014-2015 season. The aim is to use Bayesian hierarchical linear modeling to obtain slope and intercept estimates for each quarterback in the study and attempt to determine what is causing the dependence, if any. Then, if a linear dependence is detected, we investigate whether or not the statistic used is a viable measure of performance.

Item: Bayesian hierarchical modelling of pavement performance (2015-05)
Serigos, Pedro Antonio, M.S. in Statistics; Müller, Peter, 1963 August 9-; Prozzi, Jorge A.
A challenge currently faced by local, state, and federal transportation agencies is constantly increasing traffic demand combined with a slower-growing availability of funds for the maintenance of highway infrastructure. A key factor for the success of a pavement management system is that it contains accurate and reliable pavement performance models; inadequate prediction of the future condition of highway infrastructure can lead to an inappropriately estimated budget or misallocation of funds. This study had the main objectives of quantifying the uncertainty of pavement performance model parameters and proposing a hierarchical model specification in order to account for heterogeneity across different subpopulations of pavements. The uncertainty of each pavement performance parameter was quantified by estimating its marginal posterior distribution using both a non-hierarchical and a hierarchical specification of the model. The posterior distribution of each model parameter was sampled using a combination of the Gibbs and Metropolis-Hastings techniques. The hierarchical model was specified to capture the different damaging effects that environmental factors and traffic characteristics have on pavements in the subpopulations with thinner and thicker hot-mix asphalt layers.
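Combining Gibbs and Metropolis-Hastings steps, as described above, means drawing each parameter from its full conditional where that conditional is tractable and falling back on a Metropolis step where it is not. The following toy sketch does this for a plain normal model with flat priors; the target and all settings are illustrative, not the pavement model itself.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(3.0, 1.5, size=100)
n, ybar = len(data), data.mean()

def log_post_sigma(sigma, mu):
    """Unnormalized log posterior for sigma given mu (flat prior on sigma)."""
    if sigma <= 0:
        return -np.inf
    return -n * np.log(sigma) - ((data - mu) ** 2).sum() / (2 * sigma**2)

mu, sigma = 0.0, 1.0
samples = []
for it in range(2000):
    # Gibbs step: mu | sigma is conjugate normal under a flat prior on mu
    mu = rng.normal(ybar, sigma / np.sqrt(n))
    # Metropolis step: random-walk proposal for sigma
    prop = sigma + rng.normal(0.0, 0.2)
    if np.log(rng.uniform()) < log_post_sigma(prop, mu) - log_post_sigma(sigma, mu):
        sigma = prop
    samples.append((mu, sigma))

burned = np.array(samples[500:])
print(burned.mean(axis=0))  # posterior means, near the true (3.0, 1.5)
```

In the hierarchical pavement model the same alternation applies, with conjugate draws for the normal layers of the hierarchy and Metropolis steps for the non-conjugate performance-curve parameters.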
The results of the study showed significant dispersion of the pavement performance parameters. In addition, accounting for the heterogeneous effect between subpopulations resulted in a significant improvement in model fit compared with assuming complete pooling across pavement sections.

Item: Bayesian hierarchical parametric survival analysis for NBA career longevity (2012-05)
Lakin, Richard Thomas; Scott, James (Statistician); Powers, Daniel
In evaluating a prospective NBA player, one might consider past performance in the player's previous years of competition. In doing so, a general manager may ask: do certain characteristics of a player's past statistics play a role in how long a player will last in the NBA? In this study, we examine data from players who entered the NBA in a five-year period (the 1997-1998 through 2001-2002 seasons), looking at attributes from their collegiate careers to see if they have any effect on career longevity. We look at basic statistics taken for each of these players, such as field goal percentage, points per game, rebounds per game, and assists per game. We aim to use Bayesian survival methods to model these event times while exploiting the hierarchical nature of the data. We consider two types of models and perform model diagnostics to determine which of the two we prefer.

Item: Bayesian inference for random partitions (2013-08)
Sundar, Radhika; Müller, Peter, 1963 August 9-
I consider statistical inference for clustering, that is, the arrangement of experimental units in homogeneous groups. In particular, I discuss clustering for multivariate binary outcomes. Binary data are not very informative, making it less meaningful to proceed with traditional (deterministic) clustering methods. Meaningful inference needs to account for and report the considerable uncertainty associated with any reported cluster arrangement.
I review and implement an approach that was proposed in the recent literature.

Item: Bayesian mediation analysis for partially clustered designs (2013-05)
Chu, Yiyi; Beretvas, Susan Natasha
Partially clustered designs are common in medicine, the social sciences, and intervention and psychological research. With some participants clustered and others not, the structure of partially clustered data is not parallel. Despite its common occurrence in practice, limited attention has been given to the evaluation of intervention effects in partially clustered data. Mediation analysis is used to identify the mechanism underlying the relationship between an independent variable and a dependent variable via a mediator variable. While most of the literature focuses on conventional frequentist mediation models, no research has yet studied a Bayesian mediation model in the context of a partially clustered design. Therefore, the primary objectives of this paper are to address conceptual considerations in estimating mediation effects in partially clustered randomized designs, and to examine the performance of the proposed model using both simulated data and real data from the Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K). A small-scale simulation study was also conducted, and the results indicate that under large sample sizes, negligible relative parameter bias was found in the Bayesian estimates of the indirect effects and of the covariance between the components of the indirect effect. Coverage rates for the 95% credible intervals for these two estimates were found to be close to the nominal level.
These results support use of the proposed Bayesian model for partially clustered mediation when the sample size is moderately large.

Item: Bayesian methods for complex data structures, with applications to precision medicine in women's healthcare (2020-05)
Starling, Jennifer Elizabeth; Scott, James (Statistician); Murray, Jared S.; Carvalho, Carlos M.; Aiken, Abigail R. A.
This thesis explores novel Bayesian nonparametric regression techniques for data with complex structures, developed in response to challenges in women's health and obstetrics. Nearly all pregnancy-related research shares a key statistical issue: most outcomes vary smoothly with gestational age. Models which reflect this smoothness aid interpretability by aligning model choices with clinical knowledge; from a statistical perspective, smoothing can reduce variance without inflating bias. Existing models tend to smooth over all covariates, or require specification of parametric forms and interactions based on a priori knowledge of maternal and fetal covariates, and the current literature does not provide an especially nuanced characterization of these functional forms. Chapter 1 frames these issues in the context of current statistical modeling practices in women's health and obstetrics. Chapter 2 introduces a model for estimating patient-specific stillbirth risk over the course of gestation, with the aim of helping obstetricians prevent fetal mortality. In this chapter, we introduce BART with Targeted Smoothing (tsBART), a nonparametric regression model which extends the Bayesian Additive Regression Trees (BART) prior to introduce smoothness over a single target covariate t. tsBART extends BART by parameterizing each tree's terminal nodes with smooth functions of t, rather than independent scalars. Both BART and tsBART capture complex nonlinear relationships and interactions among the predictors, but tsBART guarantees that the response surface is smooth in the target covariate.
This improves interpretability and helps regularize the estimate. After introducing and benchmarking the tsBART model, we apply it to pregnancy outcomes data from the National Center for Health Statistics. Our aim is to provide patient-specific estimates of stillbirth risk across gestational age (t), based on maternal and fetal risk factors (x). The results of our analysis show the clear superiority of the tsBART model for quantifying stillbirth risk, thereby providing patients and doctors with better information for managing the risk of fetal mortality. Chapter 3 extends these ideas into the causal inference setting to analyze a new clinical protocol for early medical abortion. We introduce Targeted Smooth Bayesian Causal Forests (tsBCF), a nonparametric Bayesian approach for estimating heterogeneous treatment effects which vary smoothly over a single covariate in the observational data setting. The tsBCF method also induces smoothness by parameterizing terminal tree nodes with smooth functions, and allows for separate regularization of treatment effects versus prognostic effect of control covariates. Smoothing parameters for prognostic and treatment effects can be chosen to reflect prior knowledge or tuned in a data-dependent way. Our aim is to assess the relative effectiveness of simultaneous versus interval administration of mifepristone and misoprostol over the first nine weeks of gestation. The model reflects our expectation that the relative effectiveness varies smoothly over gestation, but not necessarily over other covariates. We demonstrate the performance of the tsBCF method on benchmarking experiments. In Chapter 4, we aim to characterize the relationship between birth weight and maternal pre-eclampsia across gestation at a large maternity hospital in urban Uganda. 
Key scientific questions we investigate include: 1) how pre-eclampsia compares to other maternal-fetal covariates as a predictor of birth weight; and 2) whether the impact of pre-eclampsia on birth weight varies across gestation. We propose a nonparametric regression model called Projective Smooth BART (psBART), which addresses several key statistical challenges. First, our model correctly encodes the prior medical knowledge that birth weight should vary smoothly and monotonically with gestational age. It also avoids assumptions about functional forms and about how birth weight varies with other covariates. Finally, psBART accounts for the fact that a high proportion (83%) of birth weights in our dataset are rounded to the nearest 100 grams. Such extreme data coarsening is rare in maternity hospitals in high-resource obstetrics settings but common for datasets collected in low- and middle-income countries (LMICs); it introduces a substantial extra layer of uncertainty into the problem and is a major reason why we adopt a Bayesian approach. The results of our analysis show that pre-eclampsia is a dominant predictor of birth weight in this urban Ugandan setting and is therefore an important risk factor for perinatal mortality. Chapter 5 summarizes our contributions and describes directions for future research.
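The birth-weight coarsening described in Chapter 4 can be handled by treating each rounded record as an interval; the following is a minimal sketch of the resulting likelihood for a plain normal model, a deliberate simplification of the psBART treatment, with all numbers illustrative.

```python
from math import erf, log, sqrt

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def coarsened_loglik(recorded, mu, sigma, width=100.0):
    """Log likelihood for weights rounded to the nearest `width` grams:
    each record y contributes P(y - width/2 < Y <= y + width/2)."""
    ll = 0.0
    for y in recorded:
        p = normal_cdf(y + width / 2, mu, sigma) - normal_cdf(y - width / 2, mu, sigma)
        ll += log(p)
    return ll

birth_weights = [3100.0, 2900.0, 3300.0, 3000.0, 3200.0]
# The interval likelihood prefers a mean near the recorded data
print(coarsened_loglik(birth_weights, 3100.0, 400.0))
print(coarsened_loglik(birth_weights, 2500.0, 400.0))
```

Replacing the point density with this interval probability is what propagates the rounding uncertainty into the posterior, rather than pretending the recorded values are exact.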