Classifying and Measuring Open Source Software Projects on GitHub

Overview

Over the past few years, our research group has developed a number of computational approaches to measure the scope and impact of open source software (OSS), including a method that evaluates the resource costs of source code development on online platforms (e.g., Robbins et al. 2018). The goal of the current project is to examine how different software types may affect economic evaluations of OSS. During our 2021 Data Science for the Public Good Young Scholars Summer Program, our team began developing a methodology to help researchers study different software types using computational text analysis. Drawing on 10+ million repositories scraped from GitHub, the world’s largest code hosting platform, we detail an approach that classifies software into categories using information provided in each repository, such as README files and repository descriptions. The categories are based on Fleming’s (2021) proposed classifications for software price indices and on those of SourceForge, another prominent code hosting platform. After describing these category types, we discuss how we use dictionary-based and unsupervised computational text analysis to classify these GitHub repositories. More specifically, we plan to probabilistically match repositories to predefined categories using text-based similarity metrics. Finally, we discuss potential use cases for this approach and its potential impact on developing novel economic evaluations of OSS tools.
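
To make the matching step concrete, the following is a minimal sketch of dictionary-based classification with a text-based similarity metric. It is an illustration under stated assumptions, not the project's actual pipeline: the category names, the term lists in CATEGORY_TERMS, the helper functions, and the sample README text are all hypothetical, and cosine similarity over raw term counts stands in for whichever similarity metric is ultimately used. Scores are normalized to sum to one so that they can be read as rough category probabilities.

```python
import math
import re
from collections import Counter

# Hypothetical category dictionaries (illustrative only; not the
# Fleming (2021) or SourceForge category definitions).
CATEGORY_TERMS = {
    "database": ["database", "sql", "query", "storage", "schema"],
    "web": ["web", "http", "server", "browser", "frontend"],
    "machine_learning": ["model", "training", "neural", "learning", "dataset"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def category_probabilities(text, category_terms=CATEGORY_TERMS):
    """Score a repository text against each category dictionary and
    normalize the similarities so that they sum to one."""
    doc = Counter(tokenize(text))
    sims = {
        name: cosine(doc, Counter(terms))
        for name, terms in category_terms.items()
    }
    total = sum(sims.values())
    if total == 0:
        # No dictionary term appears in the text: leave it unclassified.
        return {name: 0.0 for name in sims}
    return {name: score / total for name, score in sims.items()}


# Hypothetical README excerpt used only to exercise the sketch.
readme = (
    "A lightweight SQL database engine with a fast query planner "
    "and schema migrations."
)
probs = category_probabilities(readme)
best = max(probs, key=probs.get)
print(best, probs)
```

In practice the term counts would likely be replaced by weighted representations (for example, TF-IDF or embeddings), and repositories whose README and description contain no dictionary terms would need separate handling, which is where the unsupervised methods mentioned above could help.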