US Novel Corpus

This corpus represents the efforts of the Chicago Text Lab to build a significant collection of contemporary American fiction spanning the period 1880-1924. The public corpus contains over 1,200 novels, selected according to the number of library holdings recorded in WorldCat. They represent a diverse array of authors and genres, including both highly canonical and mass-market works. About 500 authors are represented in the corpus.

In addition, we've recently added genre metadata for these novels, which we produced using text classification methods from the open source machine learning library scikit-learn. We trained a classifier to differentiate between 12 genres, ranging from detective to historical fiction, using Library of Congress tags as training data. The classifier was able to assign genre tags to many untagged novels with a high degree of precision. You can read about this process on our blog.
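
For readers curious what this kind of genre tagging can look like in code, here is a minimal sketch using scikit-learn. It is an illustration under stated assumptions, not the pipeline we actually ran: the TF-IDF features, the naive Bayes classifier, the confidence threshold, the toy training sentences, and the names train_genre_classifier and tag_untagged are all hypothetical choices made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for novels that already carry a genre tag.
# (Invented sentences; real training data would be full texts.)
TAGGED = [
    ("The detective examined the clue and questioned the murder suspect.", "detective"),
    ("An inspector followed the clue to solve the murder.", "detective"),
    ("The king led his knights to war against the rival castle.", "historical"),
    ("A knight defended the castle while the king went to war.", "historical"),
]


def train_genre_classifier(tagged):
    """Fit a TF-IDF + naive Bayes pipeline on (text, genre) pairs."""
    texts, genres = zip(*tagged)
    # Assumption: word-level TF-IDF features and multinomial naive Bayes.
    model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
    model.fit(list(texts), list(genres))
    return model


def tag_untagged(model, texts, min_confidence=0.6):
    """Return one genre per text, or None when the model is not confident.

    Withholding low-confidence predictions trades coverage for precision:
    fewer novels get a tag, but the tags that are assigned are more reliable.
    """
    tags = []
    for row in model.predict_proba(texts):
        best = row.argmax()
        if row[best] >= min_confidence:
            tags.append(str(model.classes_[best]))
        else:
            tags.append(None)
    return tags


model = train_genre_classifier(TAGGED)
```

Calling tag_untagged(model, texts) on a list of untagged texts returns a list of the same length. A text whose vocabulary the model has never seen gets None rather than a guess, which is one simple way to keep precision high while leaving some novels untagged.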

Search the public collection (1,245 texts)