Data for Research
A Text Mining service from JSTOR

Sample Datasets

We’re happy to provide sample datasets for use in research and teaching. The datasets contain open access content on JSTOR and are suited to research as well as to teaching and practicing text mining techniques.

Early Journal Content dataset

The Early Journal Content (EJC) on JSTOR includes public domain journal articles published in the United States before 1923 and articles published in other countries before 1870. It covers discourse and scholarship in the arts and humanities, economics and politics, and mathematics and other sciences. The EJC dataset includes full-text OCR and article-level metadata.

Download EJC Dataset (12.2 GB) - Last updated December 2017

Open Access Ebooks dataset

We have partnered with leading presses on a project to add open access ebooks to JSTOR. Thousands of titles are now available from publishers such as University of California Press, Cornell University Press, NYU Press, and University of Michigan Press; most books in this group were published between 2000 and 2017. The open access ebooks dataset includes full-text OCR and title-level metadata.

Download OA Books Dataset (1.5 GB) - Last updated December 2017