We’re happy to provide sample datasets for research and teaching. These datasets contain open access
content on JSTOR and can be used for research or for teaching and practicing text mining techniques.
Early Journal Content dataset
The Early Journal Content (EJC) on JSTOR includes public domain journal articles published in the United States before
1923 and articles published in other countries before 1870. It covers discourse and scholarship in the arts and
humanities, economics and politics, and mathematics and other sciences. The EJC dataset includes full-text OCR
and article-level metadata.
Download EJC Dataset (12.2 GB) - Last updated December 2017
Open Access Ebooks dataset
We have partnered with leading presses on a project to add open access ebooks to JSTOR. Thousands of titles
are now available from publishers such as University of California Press, Cornell University Press, NYU Press,
and University of Michigan Press. Most books in this group were published between 2000 and 2017.
The open access ebooks dataset includes full-text OCR and title-level metadata.
Download OA Books Dataset (1.5 GB) - Last updated December 2017