Over the past few years, our research group has developed several computational approaches to measure the scope and impact of open source software (OSS), including a method that evaluates the resource costs of source code development on online platforms (e.g., Robbins et al. 2018). The goal of the current project is to examine how different software types may affect economic evaluations of OSS. During our 2021 Data Science for the Public Good Young Scholars Summer Program, our team began developing a methodology to help researchers study different software types through computational text analysis. Drawing on more than 10 million repositories scraped from GitHub, the world’s largest code hosting platform, we detail an approach that classifies software into categories using the text available in each repository, such as README files and repository descriptions. The categories are based on Fleming’s (2021) proposed classifications of software price indices and on the categories used by SourceForge, another prominent code hosting platform. After describing these category types, we discuss how we use dictionary-based and unsupervised computational text analysis to classify the GitHub repositories. More specifically, we plan to probabilistically match repositories to predefined categories using text-based similarity metrics. Finally, we discuss potential use cases for this approach and its potential impact on developing novel economic evaluations of OSS tools.
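To make the matching step concrete, here is a minimal sketch of dictionary-based matching with a text similarity metric. It is an illustration under stated assumptions, not our project's actual pipeline: the category names, the term lists in `CATEGORY_TERMS`, and the function names are all hypothetical, and plain term-count cosine similarity stands in for whichever similarity metric is ultimately used. Scores are normalized so they can be read as rough membership weights across categories.

```python
import math
import re
from collections import Counter

# Hypothetical category dictionaries; the names and terms are illustrative only.
CATEGORY_TERMS = {
    "database": "database sql query storage table index",
    "web": "web http server browser html frontend",
    "machine_learning": "model training neural network classifier dataset",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_categories(readme_text):
    """Score a repository text against every category.

    Scores are rescaled to sum to 1 when at least one dictionary term
    appears; otherwise every category receives 0.0.
    """
    doc = Counter(tokenize(readme_text))
    sims = {
        name: cosine(doc, Counter(tokenize(terms)))
        for name, terms in CATEGORY_TERMS.items()
    }
    total = sum(sims.values())
    if total == 0:
        return {name: 0.0 for name in sims}
    return {name: s / total for name, s in sims.items()}


scores = match_categories("A lightweight HTTP server that renders HTML in the browser")
best = max(scores, key=scores.get)
```

In this toy example only the hypothetical "web" dictionary shares terms with the text, so it receives all of the weight. A repository whose README overlaps several dictionaries would instead spread its weight across them, which is the sense in which the matching is probabilistic rather than a hard assignment.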




Crystal Zang 

University of Pittsburgh Graduate School of Public Health 





Cierra Oliveira 

Clemson University 




Stephanie Zhang 

University of Virginia





Brandon Kramer

Postdoctoral Research Associate, Biocomplexity Institute, University of Virginia

Gizem Korkmaz

Research Associate Professor, Biocomplexity Institute, University of Virginia


Carol Robbins

Senior Analyst, National Center for Science and Engineering Statistics

Ledia Guci

Science Resources Analyst, National Center for Science and Engineering Statistics