This corpus represents the efforts of the Chicago Text Lab to build a significant collection of contemporary American fiction spanning the period 1880-2000. It contains nearly 9,000 novels, selected according to the number of library holdings recorded in WorldCat. They represent a diverse array of authors and genres, including both highly canonical and mass-market works. About 7,000 authors are represented in the corpus, with peak holdings around 1900 and the 1980s.
In addition, we've recently added genre metadata for these novels, which we produced using text classification methods from the open-source machine learning library scikit-learn. We trained a classifier to distinguish between 12 genres, ranging from detective to historical fiction, using Library of Congress tags as training data. The classifier was able to assign genre tags to more than 4,000 untagged novels with a high degree of precision. You can read about this process on our blog.
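For readers unfamiliar with this kind of workflow, here is a minimal sketch of how a genre classifier might be trained on tagged texts and then applied to untagged ones in scikit-learn. It is not our actual pipeline: the TF-IDF features, the logistic regression model, the two-genre toy texts, and the hypothetical `tag_untagged` helper with its `min_confidence` cutoff are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for novels that already carry a Library of Congress genre tag.
tagged_texts = [
    "the detective examined the murder clues at the crime scene",
    "the inspector questioned the murder suspect about the crime",
    "the king led his army to war in the ancient kingdom",
    "the queen ruled the ancient kingdom during the war",
]
tagged_genres = ["detective", "detective", "historical", "historical"]

# Assumed feature and model choice: TF-IDF word weights fed to logistic regression.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(tagged_texts, tagged_genres)


def tag_untagged(model, texts, min_confidence=0.5):
    """Return one (genre or None, confidence) pair per text.

    The genre is None when the top class probability is below min_confidence,
    so that low-confidence novels stay untagged (a hypothetical policy).
    """
    probabilities = model.predict_proba(texts)
    results = []
    for row in probabilities:
        best = row.argmax()
        confidence = float(row[best])
        genre = str(model.classes_[best]) if confidence >= min_confidence else None
        results.append((genre, confidence))
    return results


# Made-up stand-ins for novels without a genre tag.
untagged_texts = [
    "a detective found clues about the murder",
    "the army of the king marched to war",
]
for text, (genre, confidence) in zip(untagged_texts, tag_untagged(model, untagged_texts)):
    print(f"{genre!s:>12}  {confidence:.2f}  {text}")
```

Keeping a confidence cutoff is one way to trade coverage for precision: raising `min_confidence` leaves more novels untagged but makes the assigned tags more reliable.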