Machine Learning Analysis of Women’s Representation in Science, Technology, Innovation, and Policy (STIP)
Introduction
This article summarizes a study that uses machine learning to examine the representation of women in STIP across 60 countries. The analysis rests on a small, curated dataset with five numerical features and one categorical feature, and it looks for patterns in how that representation varies.
Methodology Overview
The research framed the task as supervised regression, with the Percentage of Women in STIP (PWS) as the dependent variable. To handle missing data (1.70% for STEM degrees and 33.30% for business degrees), KNN imputation was employed.
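As an illustration of the imputation step, the following is a minimal sketch using scikit-learn's KNNImputer. The toy matrix, its column meanings, and the choice of n_neighbors=2 are assumptions made for the example; the study's actual data and settings are not given here.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix: rows are countries, columns are numeric features.
# np.nan marks a missing value (for example, a missing degree percentage).
X = np.array([
    [30.0, 12.0, 45.0],
    [32.0, np.nan, 47.0],
    [28.0, 10.0, 44.0],
    [55.0, 25.0, np.nan],
    [53.0, 24.0, 70.0],
])

# Each missing entry is replaced by the mean of that feature over the
# n_neighbors closest rows that have the feature observed. Distances are
# computed only over features present in both rows.
imputer = KNNImputer(n_neighbors=2)  # n_neighbors=2 is an assumed setting
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Observed values pass through unchanged; only the missing cells are filled. With a small dataset, the result can be sensitive to the number of neighbors, so it may be worth checking a few values.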
Feature Engineering
Feature engineering used three autoencoder variants (basic, variational, and denoising) to generate additional features, expanding the dataset from its original dimensions to 27 columns. Feature selection then combined Random Forest Feature Importance, LASSO regression, and Sequential Feature Selection to identify influential predictors.
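The text does not say how the three selection methods were combined. One plausible reading is a simple vote, sketched below with scikit-learn on synthetic data. The synthetic dataset, the cutoff k=8, and the "at least two of three methods" rule are all assumptions for illustration, not the study's procedure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 60-row, 27-column engineered dataset.
X, y = make_regression(n_samples=60, n_features=27, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)
k = 8  # assumed number of features each method may nominate

# 1) Random forest: keep the k features with the highest importance.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
rf_top = set(np.argsort(rf.feature_importances_)[-k:].tolist())

# 2) LASSO: keep features whose coefficient is not shrunk to zero.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
lasso_top = set(np.flatnonzero(lasso.coef_ != 0).tolist())

# 3) Forward sequential selection wrapped around a linear model.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=k,
                                direction="forward", cv=5).fit(X, y)
sfs_top = set(np.flatnonzero(sfs.get_support()).tolist())

# Assumed combination rule: keep a feature if at least two methods agree.
votes = {j: (j in rf_top) + (j in lasso_top) + (j in sfs_top)
         for j in range(X.shape[1])}
selected = sorted(j for j, v in votes.items() if v >= 2)
print("Selected feature indices:", selected)
```

Voting is only one option; taking the intersection or union of the three sets would be equally easy to express and gives a stricter or looser selection.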
Dimensionality Reduction Techniques
The research reduced dimensionality with correlation analysis and Principal Component Analysis (PCA), keeping enough components to retain 95% of the variance. This reduces noise while preserving most of the information in the data.
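The 95% variance criterion can be sketched with scikit-learn, where passing a fraction as n_components asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. The synthetic data below (four latent factors spread over 27 columns) is an assumption chosen so the reduction is visible.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 rows, 27 correlated columns driven by 4 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 4))
mixing = rng.normal(size=(4, 27))
X = latent @ mixing + 0.05 * rng.normal(size=(60, 27))

# PCA is scale-sensitive, so standardize the columns first.
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) selects the number of components needed to explain
# at least that share of the total variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Variance retained:", pca.explained_variance_ratio_.sum())
```

In a modeling pipeline the scaler and PCA would normally be fitted inside each cross-validation fold, so that the held-out fold does not influence the components.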
Model Evaluation and Sensitivity Analysis
The experimental design compared several regression models, including Ridge Regression, SVR, Linear Regression, ElasticNet, and Lasso. Each model was evaluated with 10-fold cross-validation, and hyperparameters were tuned with GridSearchCV.
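A comparison of this kind might look like the sketch below, which runs GridSearchCV with a 10-fold split over the five named models. The synthetic data, the hyperparameter grids, the R² scoring choice, and the scaling step are assumptions; the study's grids are not listed in this summary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a 60-country dataset.
X, y = make_regression(n_samples=60, n_features=8, noise=10.0, random_state=0)

# Assumed (illustrative) grids; "model__" targets the pipeline step named "model".
candidates = {
    "Ridge": (Ridge(), {"model__alpha": [0.1, 1.0, 10.0]}),
    "Lasso": (Lasso(max_iter=10000), {"model__alpha": [0.01, 0.1, 1.0]}),
    "ElasticNet": (ElasticNet(max_iter=10000),
                   {"model__alpha": [0.01, 0.1, 1.0],
                    "model__l1_ratio": [0.2, 0.5, 0.8]}),
    "SVR": (SVR(), {"model__C": [1.0, 10.0, 100.0],
                    "model__kernel": ["linear", "rbf"]}),
    "LinearRegression": (LinearRegression(), {}),  # nothing to tune
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, (estimator, grid) in candidates.items():
    # Scaling lives inside the pipeline so it is refit on each training fold.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(pipe, grid, cv=cv, scoring="r2")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

for name, (score, params) in results.items():
    print(f"{name}: mean CV R2 = {score:.3f}, best params = {params}")
```

Note that the best cross-validated score from a grid search is an optimistic estimate; with only 60 rows, a nested cross-validation or a held-out set would give a less biased figure.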
Performance Metrics
Multiple metrics, including R², mean cross-validation standard deviation, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE), were used to assess model performance. The study also used statistical analyses to explore the relationships between diversity quotas and women’s representation in STEM and policymaking, supported by outlier analysis as a robustness check.
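For reference, the three error metrics can be computed directly from their standard definitions, as in the short sketch below. The function name regression_metrics and the sample values are hypothetical and are not taken from the study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical PWS values (in percent) and model predictions.
y_true = [30.0, 35.0, 40.0, 45.0]
y_pred = [32.0, 33.0, 41.0, 44.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)  # mae = 1.5, rmse ≈ 1.581, r2 = 0.92
```

MAE and RMSE are in the same units as the target (percentage points here), while R² is unitless; RMSE penalizes large errors more heavily than MAE.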
Data Collection and Sources
The dataset covered metrics on women’s participation in various sectors, focusing on Women in STEM Percentage (WSP), Women in Policymaking (WP), and Diversity and Inclusion Quota Systems (DIQS). Data were gathered from organizations such as UNESCO, the World Bank, UN Women, and the OECD, spanning the 1980s to 2024.
Machine Learning’s Role in Social Sciences
The application of machine learning is reshaping social sciences, aiding in recognizing complex patterns and facilitating informed policy-making. This study underscores the synergy between human expertise and machine learning capabilities, particularly in understanding gender disparities in STIP.
The Role of Institutional Theory
By focusing on institutional mechanisms rather than individual behaviors, this research draws on institutional theory to scrutinize how gender disparities are sustained within formal and informal institutions, even in the face of equality policies.
Research Questions Addressed
The research aimed to answer two questions. First, how accurately can machine learning models predict the percentage of women in STIP while accounting for domestic data gaps? Second, what impact do Diversity and Inclusion Quota Systems have on female representation in STIP sectors?
Concluding Insights
This study applies a structured machine learning methodology to the question of gender representation in STIP. By setting out a framework for data analysis and model validation, the work contributes to understanding how diversity initiatives can enhance women’s participation in these fields.
