Statement of Corpus Expansion Plans

Page history last edited by Alan Liu 6 years, 5 months ago

[Excerpted from WE1S's 2017 grant proposal]

 

1. Corpus Expansion

 

WE1S will expand the range and representativeness of its primary corpus of contemporary journalistic publications (defined as newspapers, magazines, and radio/TV transcripts of news or talk shows available in English across multiple nations). Through institutional subscriptions to commercial databases--e.g., LexisNexis Academic, ProQuest (including News and Newspapers, ProQuest Historical Newspapers, ProQuest's ethnic, race, and gender news databases)--WE1S's researchers have access to over 2,500 English-language newspapers from which full-text digital articles of the past few decades can be collected (through a combination of manual and automated means conforming to source licensing terms) and converted for text analysis operations into non-consumptive-use datasets.

WE1S plans to devote research at the beginning of its timeline to determining which specific sources will be most representative and useful for the project's goals. While the criteria for representativeness and usefulness will evolve iteratively as the project team begins its research on potential sources (see under Project Year 1 in section II.e, Activities and Timeline), WE1S has initially identified two key areas for corpus expansion:

 

  • The first is the geographical and national scope of its corpus of materials: WE1S will investigate expanding the range of its sources by including materials from Anglophone newspapers located outside North America. Such newspapers include The Times, The Sunday Times, and The Independent in the United Kingdom; The Australian and The Daily Telegraph in Australia; The New Zealand Herald in New Zealand; and The Times of India in India. Initial criteria for inclusion include a publication's value for representing a part of the world not previously included; its national or regional circulation; and the technical feasibility of collecting and processing its articles. WE1S will also draw on current research on media impact to help it develop a strategic rationale for selection of materials (e.g., the approaches to defining and measuring the impact of journalistic media surveyed by Schiffrin and Zuckerman).
  • The second area for corpus expansion concerns what may be called the social scope of WE1S's materials. An especially high-priority goal is to include sources that allow WE1S to ask research questions about how the humanities are viewed by, or in relation to, different social groups (racial, ethnic, gender, immigrant, and age). This is a diversity aim that is organic to WE1S's core research. Because both historical and contemporary anecdotal evidence suggests that particular groups channel themselves (or are channeled) into career choices that make the humanities a lesser priority during first-to-college or first-generation-immigrant stages in their social trajectory, WE1S hypothesizes that researching "what everyone says about the humanities" in particular groups can add meaningfully to society's more common talking points about numbers of humanities majors, career goals, or the relation of the humanities to the sciences or business. To facilitate such research, WE1S will include in its primary corpus journalistic materials provided by databases such as Ethnic NewsWatch, ProQuest Black Newspapers, and ProQuest U.S. Hispanic Newsstand. These are the sources for this purpose that WE1S has so far identified by canvassing the databases available to its researchers through institutional subscription and through initial consultation with scholars and university administrators working in race and ethnic studies. WE1S will seek further resources. If feasible, WE1S will also attempt a small-scale experiment in topic-modeling a limited sample of articles from Spanish-language newspapers, though existing topic modeling and other text analysis methods are not capable of integrating multilingual materials in the same model.
Criteria for inclusion of materials in WE1S's research corpus will be a source's value for representing part of the "social scope" of the humanities not previously included, the publication’s circulation and intended audience, and the technical feasibility of collecting its articles.

In addition to expanding its primary corpus of materials as outlined above, WE1S plans to extend the range of research questions it can pose by collecting smaller "sub-corpora" of other kinds of sources that can be folded into, or separated from, its main corpus as needed for computational analysis. Particular sub-corpora will be chosen after detailed research at the beginning of the project timeline. Steps in such research will involve consulting scholars and university administrators as well as WE1S's advisory board; reading and discussion of sample materials from potential sub-corpora; assessment of technical feasibility (i.e., can a source be used in a way that fits practically into the project's technical workflow); and assessment of strategic value (e.g., does a sub-corpus add meaningfully to the representativeness of the project's materials or provide needed perspective on questions that emerge in analysis of previously gathered materials). Sub-corpora are likely to include some of the following:

  • Historical newspaper coverage of the humanities from earlier in the 20th century (gathered through ProQuest Historical Newspapers; the Library of Congress's Chronicling America resource; and, in some cases, through the archives and APIs of individual newspapers);
  • Government and political documents (gathered through resources such as Congress.gov, Whitehouse.gov, the U.S. Government Publishing Office, and the archives of individual states, with data gathering assisted by APIs from the Sunlight Foundation)[1];
  • Reports and publications by scholarly and professional associations as well as grant agencies and foundations[2];
  • Public documents of higher-education institutions that mention the humanities (e.g., so-called university "viewbooks"; mission statements of humanities centers; and speeches by campus presidents and deans);
  • Scholarly research articles discussing the humanities (collected from JSTOR). A particularly rich avenue of research will be to use the recently introduced JSTOR Labs Text Analyzer service to discover research articles relevant to sample materials from the WE1S corpus. (Because the Text Analyzer builds on JSTOR Labs' own usage of topic modeling, there may also be ways that WE1S can use Text Analyzer to corroborate or extend WE1S analyses of topic models.)

 


[1] For WE1S's preliminary scoping study of U.S. Congress, White House, and selected state documents related to the humanities, see 4Humanities.org, "What U.S. Politicians Say About the Humanities--A Data Set and Analysis."

[2] An example is the 2013 report titled The Heart of the Matter from the American Academy of Arts & Sciences' Commission on the Humanities and Social Sciences. For WE1S's topic-model study of this document, see 4Humanities.org, "The Heart of the Matter Topic-Modeled (A Preliminary Experiment)."
