Overview
Over the past few years, our research group has advanced a number of computational approaches to measuring the scope and impact of open source software (OSS), including a method that evaluates the resource costs of source code development on online platforms (e.g., Robbins et al. 2018). The goal of the current project is to examine how different software types may affect economic evaluations of OSS. During the 2021 Data Science for the Public Good Young Scholars Summer Program, our team began developing a methodology to help researchers study different software types using computational text analysis. Drawing on 10+ million repositories scraped from GitHub, the world’s largest code hosting platform, we detail an approach that classifies software into categories using information provided in each repository, such as README files and repository descriptions. The categories are based on Fleming’s (2021) proposed classifications for software price indices and on the categories used by SourceForge, another prominent code hosting platform. After describing these category types, we discuss how we use dictionary-based and unsupervised computational text analysis to classify the GitHub repositories. More specifically, we plan to probabilistically match repositories to predefined categories using text-based similarity metrics. Finally, we discuss potential use cases for this approach and its potential impact on developing novel economic evaluations of OSS tools.
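As a rough illustration of what dictionary-based matching with a text similarity metric could look like, the Python sketch below scores a repository's README text against a few keyword lists using cosine similarity over term counts, then normalizes the scores so they sum to one. It is only a sketch under stated assumptions: the category names, the keyword lists, and the function names (tokenize, cosine, match_categories) are hypothetical placeholders, not the categories from Fleming (2021) or SourceForge and not the project's actual implementation.

```python
import math
import re
from collections import Counter

# Hypothetical category dictionaries for illustration only; the project's
# actual categories and terms are not reproduced here.
CATEGORY_TERMS = {
    "database": ["database", "sql", "query", "storage"],
    "web": ["web", "http", "server", "browser"],
    "machine_learning": ["model", "training", "neural", "learning"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_categories(readme_text):
    """Return a score per category; scores sum to 1 when any term matches."""
    doc = Counter(tokenize(readme_text))
    sims = {
        name: cosine(doc, Counter(terms))
        for name, terms in CATEGORY_TERMS.items()
    }
    total = sum(sims.values())
    if total == 0:
        # No dictionary term appears in the text: leave every score at zero.
        return {name: 0.0 for name in sims}
    return {name: score / total for name, score in sims.items()}


if __name__ == "__main__":
    example = "A fast SQL database with a query planner and storage engine"
    for name, score in match_categories(example).items():
        print(f"{name}: {score:.2f}")
```

Normalizing the similarities is one simple way to read the scores as relative weights across categories; other choices, such as TF-IDF weighting or a similarity threshold, would fit the same outline.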
Fellow
Crystal Zang
University of Pittsburgh Graduate School of Public Health
Interns
Cierra Oliveira
Clemson University
Stephanie Zhang
University of Virginia
Mentors
Brandon Kramer
Postdoctoral Research Associate, Biocomplexity Institute, University of Virginia
Gizem Korkmaz
Research Associate Professor, Biocomplexity Institute, University of Virginia
Stakeholders
Carol Robbins
Senior Analyst, National Center for Science and Engineering Statistics
Ledia Guci
Science Resources Analyst, National Center for Science and Engineering Statistics