
Overview

Inverse Probability Weighting (IPW) is widely used because it corrects for unequal sampling through sample weights and is easy to implement. However, more complex tools such as generalized raking can offer greater precision. To bridge this gap, Dr. Gustavo Amorim developed a novel method that combines IPW with influence functions, aiming for a tool that offers the power of generalized raking with the simplicity of IPW. To evaluate the new method, a simulation was conducted to compare the performance of all three methods.

Methods

  • Inverse Probability Weighting (IPW)

    • A statistical method that weights each sampled observation by the inverse of its probability of being selected, so that groups under-represented in the sample still count in proportion to their share of the population.

  • Generalized raking (GR)

    • This method serves a similar purpose to IPW but demands a more intricate setup and more specialized knowledge, which is a hurdle for domain experts without statistical training who might otherwise benefit from it.

  • New proposed method

    • This method combines IPW with influence functions to achieve more precise results than IPW alone, while remaining more accessible than GR.

How the Data was simulated

Strata Generation

  • Strata represent different subgroups in the data; for example, a stratum could be the part of Nashville an observation comes from.

  • You can choose the number of strata and the size of each one when creating the simulated population
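
As a rough illustration of this step, the sketch below builds a stratum label for every observation from a list of stratum sizes. The function name `make_strata` and the equal sizes of 2,500 are assumptions made for the example; the original simulation may have built its strata differently.

```python
def make_strata(stratum_sizes):
    """Return one stratum label per observation, given the size of each stratum."""
    labels = []
    for stratum_id, size in enumerate(stratum_sizes):
        # Every observation in this block belongs to the same stratum.
        labels.extend([stratum_id] * size)
    return labels

# Hypothetical setting: 8 equally sized strata, 20,000 observations in total.
strata = make_strata([2500] * 8)
```

Passing a list of unequal sizes gives strata of different sizes with no other change.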

​

Creating “clean data”

  • In this step you choose how many variables you want and whether each one is binary or continuous

  • Continuous variables are generated from a normal distribution and binary variables from a binomial distribution

  • These are the true values for each observation and serve as the base for creating the error-prone data
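
A minimal sketch of this step, assuming standard normal draws for the continuous variables and a success probability of 0.5 for the binary ones. Neither value is stated on this page, and `make_clean_data` is a hypothetical helper name.

```python
import random

def make_clean_data(n_obs, n_continuous, n_binary, p_binary=0.5, seed=1):
    """Return one row per observation: continuous columns first, then binary ones."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_obs):
        # Continuous variables: assumed standard normal.
        continuous = [rng.gauss(0.0, 1.0) for _ in range(n_continuous)]
        # Binary variables: a single binomial trial with assumed probability p_binary.
        binary = [1 if rng.random() < p_binary else 0 for _ in range(n_binary)]
        rows.append(continuous + binary)
    return rows

clean = make_clean_data(n_obs=20_000, n_continuous=3, n_binary=1)
```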

​

Creating “error prone data”

  • This data simulates the case where some of the continuous data is measured with error

  • First a copy of the continuous variable is created; then a value drawn at random from a normal distribution is added to each entry of the copy, giving the error-prone version of the variable
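
The copy-then-add-noise step could look something like the sketch below. The error standard deviation of 0.5 and the name `add_measurement_error` are assumptions for illustration only.

```python
import random

def add_measurement_error(values, error_sd=0.5, seed=2):
    """Return an error-prone copy of `values`; the clean list is left untouched."""
    rng = random.Random(seed)
    error_prone = list(values)  # copy of the clean variable
    # Add independent, mean-zero normal noise to every entry of the copy.
    return [v + rng.gauss(0.0, error_sd) for v in error_prone]

clean_x = [0.2, -1.3, 0.8, 1.5]
noisy_x = add_measurement_error(clean_x)
```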

​

B1 generation

  • The B1 values determine how strongly each variable affects the outcome variable

  • They are set at the discretion of the person running the simulation

​

Outcome Variable

  • The probability of the binary outcome was computed for each observation from the clean version of the variables and their B1 values

  • The outcome variable is either 0 or 1
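
One common way to turn the clean variables and B1 values into a binary outcome is a logistic link, sketched below. The logistic form, the intercept of -2.2, and the example B1 values are all assumptions chosen for illustration; this page does not state the exact formula used.

```python
import math
import random

def simulate_outcome(rows, b1_values, intercept, seed=3):
    """Draw a 0/1 outcome for each row from a logistic model on the clean variables."""
    rng = random.Random(seed)
    outcomes = []
    for row in rows:
        # Linear predictor: intercept plus each variable times its B1 value.
        linear = intercept + sum(b * x for b, x in zip(b1_values, row))
        prob = 1.0 / (1.0 + math.exp(-linear))  # probability that the outcome is 1
        outcomes.append(1 if rng.random() < prob else 0)
    return outcomes

rows = [[0.5, -1.0, 0.3, 1], [1.2, 0.4, -0.7, 0], [-0.3, 0.0, 2.1, 1]]
b1_values = [0.5, -0.3, 0.8, 1.0]  # hypothetical effect sizes
y = simulate_outcome(rows, b1_values, intercept=-2.2)
```

A more negative intercept makes the outcome rarer, which is one way to steer the overall rate of 1s.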

​

Sampling 

  • Observations were chosen at random for the model-training subsample according to predetermined settings

  • The data we were trying to simulate had a 10% rate of 1s in the outcome variable, so within each stratum sampled, 90% of the chosen observations had 0 for the outcome and 10% had 1

  • Subject to these restrictions, observations were then randomly chosen for the subsample

  • The total number of observations sampled across all strata is approximately 500
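
The sampling rules above can be sketched as follows: within each stratum, a fixed number of observations is drawn, split 10% / 90% between outcome 1 and outcome 0. The helper `draw_subsample`, the per-stratum size of 62, and the toy population are assumptions for the example, not the original settings.

```python
import random
from collections import defaultdict

def draw_subsample(strata, outcomes, n_per_stratum, case_fraction=0.10, seed=4):
    """Return sorted indices of sampled observations, drawn within stratum-by-outcome cells."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for idx, (s, y) in enumerate(zip(strata, outcomes)):
        cells[(s, y)].append(idx)
    n_cases = round(n_per_stratum * case_fraction)  # observations with outcome 1
    n_controls = n_per_stratum - n_cases            # observations with outcome 0
    chosen = []
    for s in sorted(set(strata)):
        # min() guards against cells smaller than the requested count.
        chosen += rng.sample(cells[(s, 1)], min(n_cases, len(cells[(s, 1)])))
        chosen += rng.sample(cells[(s, 0)], min(n_controls, len(cells[(s, 0)])))
    return sorted(chosen)

# Toy population: 8 strata of 2,500 observations with roughly 10% ones.
pop_rng = random.Random(0)
strata = [s for s in range(8) for _ in range(2500)]
outcomes = [1 if pop_rng.random() < 0.10 else 0 for _ in strata]
sample_idx = draw_subsample(strata, outcomes, n_per_stratum=62)
```

With 8 strata and 62 observations per stratum the subsample has 496 observations, close to the target of about 500.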

​

Weighting 

  • Two types of weights were calculated: known and estimated

  • Known weights were calculated from the actual full-data proportions of the outcome variable

  • Estimated weights were calculated from the estimated probability of being in the subsample given the observation’s stratum

  • Both are IPW weights
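
In both cases the weight is one over a selection probability. The sketch below computes that probability as the fraction sampled within each cell, where a cell could be a stratum, an outcome level, or a combination of the two. Which cell definition matches the known and the estimated weights in the original simulation is an assumption left open here, and `inverse_probability_weights` is a hypothetical name.

```python
from collections import Counter

def inverse_probability_weights(cells, in_sample):
    """Weight each sampled observation by 1 / (fraction of its cell that was sampled)."""
    population = Counter(cells)
    sampled = Counter(c for c, s in zip(cells, in_sample) if s)
    weights = []
    for c, s in zip(cells, in_sample):
        if s:
            # Selection probability is sampled[c] / population[c]; the weight is its inverse.
            weights.append(population[c] / sampled[c])
        else:
            weights.append(0.0)  # unsampled observations do not enter the weighted model
    return weights

cells = ["A"] * 6 + ["B"] * 4
in_sample = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
w = inverse_probability_weights(cells, in_sample)
```

In this toy example the sampled observations in cell A each stand in for 3 population members and the one in cell B for 4, so the weights add up to the population size of 10.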

Influence Function Calculation

  • The influence functions were calculated from the error-prone data

  • These influence functions were then used to build IPW weights (the method is the same as for the estimated weights, except that the influence function values are also used)
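
For readers unfamiliar with influence functions, here is a minimal sketch of the standard influence function for the slope of a logistic regression with one covariate, fitted to error-prone data. The one-covariate model, the hand-written Newton-Raphson fit, and every name and number below are assumptions made to keep the example self-contained; the original simulation may have calculated its influence functions differently.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson fit of logit P(y = 1) = b0 + b1 * x; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = expit(b0 + b1 * xi)
            r, v = yi - p, p * (1.0 - p)
            g0 += r
            g1 += r * xi       # score (gradient) terms
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi  # information matrix terms
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det  # Newton step: inverse information times score
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def slope_influence(x, y):
    """Per-observation influence on the slope: (mean information)^-1 times the score."""
    b0, b1 = fit_logistic(x, y)
    n = len(x)
    h00 = h01 = h11 = 0.0
    resid = []
    for xi, yi in zip(x, y):
        p = expit(b0 + b1 * xi)
        v = p * (1.0 - p)
        h00 += v / n
        h01 += v * xi / n
        h11 += v * xi * xi / n
        resid.append(yi - p)
    det = h00 * h11 - h01 * h01
    # Second row of the inverse 2x2 information matrix applied to the score (r, r * x).
    return [(h00 * r * xi - h01 * r) / det for xi, r in zip(x, resid)]

# Hypothetical error-prone covariate and outcome for 500 observations.
rng = random.Random(5)
x_star = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [1 if rng.random() < expit(-1.0 + 0.8 * xi) else 0 for xi in x_star]
influence = slope_influence(x_star, y)
```

Each value says roughly how much one observation pulls the fitted slope up or down, and the values sum to approximately zero at the fitted coefficients. This sketch stops at computing the influence values and does not show how they feed into the weights.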

Models used for regression

  • A logistic regression model on all of the error-prone data, with no weights

  • A logistic regression model on the clean subsample data with the design weights

  • A logistic regression model on the clean subsample data with the estimated weights

  • A logistic regression model on the clean subsample data with the influence function weights

  • A generalized raking model with the influence function weights

***Note that all weighted models use a form of inverse probability weighting

​

Simulation settings

  • There were 3 continuous variables and 1 binary variable

  • The population size was 20,000

  • There was a high-strata simulation and a low-strata simulation

    • The high-strata simulation had 20 strata, while the low-strata simulation had 8

    • Even though the stratification differed, the population size was approximately the same

  • The simulation was run 100 times

Results & discussion

***The numbers are percent bias. A lower percentage means the model's estimate of a variable's effect was less biased, that is, closer on average to the true effect
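
For reference, percent bias is commonly computed as the average estimate across simulation runs minus the true value, divided by the true value, times 100. The sketch below uses that definition as an assumption, since the exact formula is not given on this page, and the estimates are made-up numbers.

```python
def percent_bias(estimates, true_value):
    """Percent bias of the mean estimate relative to the true coefficient."""
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * (mean_estimate - true_value) / true_value

# Hypothetical estimates of a coefficient whose true value is 0.5.
estimates = [0.48, 0.53, 0.51, 0.52]
pb = percent_bias(estimates, true_value=0.5)  # about 2 percent
```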

The difference in bias between generalized raking and the target model (the IPW model with influence function weights) is minimal, and the target model has markedly less bias than the other 3 models. The results also show that for the target, generalized raking, and error-prone models the number of strata seems to affect the amount of bias, whereas the design and estimated weight models seem to have consistent bias across stratification types. Lastly, the differences in bias seem to become more extreme for the binary variable.

Conclusion

The final simulation results show that adding an influence function to inverse probability weighting can substantially decrease bias compared with standard inverse probability weighting methods, and that the resulting bias is comparable to that of generalized raking even though influence function weighting is less complicated.

© 2023 by naijport. All rights reserved.
