R&D Text Corpora Filtering and Data Mining

Overview

We use administrative data on federal grants to discover research topics and their trends in artificial intelligence (AI). Our data source is Federal RePORTER, a database of federally funded research grants that includes project abstracts and other project data, such as funding agency and start year. We filter the Federal RePORTER project abstracts to keep those that describe AI projects. Because AI is a complex and hard-to-define theme, this filtering problem is challenging. We use three different filtering methods: 1) an AI term-matching method proposed by the Organisation for Economic Co-operation and Development (OECD), 2) a method by Eads et al. that combines term matching and topic modeling, and 3) a Sentence-BERT (Bidirectional Encoder Representations from Transformers) method that measures the similarity between the AI Wikipedia page and each grant abstract. Each filtering method produces an AI-themed corpus, on which we run a non-negative matrix factorization (NMF) topic model. Using linear regression and visualization, we analyze the topic model results to discover AI research trends in federally funded projects.
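
To make the shape of the pipeline concrete, the following is a minimal, standard-library-only sketch of two of its steps: a term-matching filter and a linear trend fit. It is an illustration under stated assumptions, not the project's implementation. The term list, the record field names, and the toy abstracts are all hypothetical, and the topic_weight values are invented stand-ins for the document-topic weights that an NMF model would produce. The real OECD vocabulary, the Eads et al. method, and the Sentence-BERT filter are not reproduced here.

```python
import re
from collections import defaultdict

# Hypothetical term list for illustration only; NOT the OECD vocabulary.
AI_TERMS = ("artificial intelligence", "machine learning", "neural network")

# Toy stand-ins for Federal RePORTER rows. The field names and the
# "topic_weight" values (a placeholder for one column of an NMF
# document-topic matrix) are invented for this sketch.
PROJECTS = [
    {"year": 2016, "abstract": "We apply machine learning to protein folding.", "topic_weight": 0.10},
    {"year": 2016, "abstract": "A field survey of coastal wetland birds.", "topic_weight": 0.00},
    {"year": 2017, "abstract": "Neural network models for speech recognition.", "topic_weight": 0.20},
    {"year": 2018, "abstract": "Artificial intelligence for radiology triage.", "topic_weight": 0.25},
    {"year": 2018, "abstract": "Machine learning methods for crop yield.", "topic_weight": 0.35},
]


def is_ai_abstract(abstract, terms=AI_TERMS):
    """Return True if any term starts at a word boundary in the abstract."""
    text = abstract.lower()
    return any(re.search(r"\b" + re.escape(term), text) for term in terms)


def filter_corpus(projects):
    """Keep only the projects whose abstract matches at least one AI term."""
    return [p for p in projects if is_ai_abstract(p["abstract"])]


def mean_weight_by_year(projects):
    """Average the (placeholder) topic weight over the projects in each year."""
    by_year = defaultdict(list)
    for p in projects:
        by_year[p["year"]].append(p["topic_weight"])
    return {year: sum(ws) / len(ws) for year, ws in by_year.items()}


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    ai_projects = filter_corpus(PROJECTS)
    means = mean_weight_by_year(ai_projects)
    years = sorted(means)
    slope = ols_slope(years, [means[y] for y in years])
    print(f"{len(ai_projects)} AI abstracts; topic trend slope per year: {slope:.3f}")
```

A positive slope would suggest that the topic's average weight grows over the start years, and a negative slope that it declines. In practice the weights would come from the fitted topic model rather than from hand-entered values.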