
Document Collection Strategy Planning Document

Page history last edited by Alan Liu 10 years, 2 months ago

The WhatEvery1Says corpus was started by manually collecting examples of public discourse about the humanities in newspapers, magazines, blogs, white papers, etc.  The selection was suggested by correspondents, Twitter, etc.  By themselves, such manually collected materials can be valuable for study and data-mining to detect patterns of discourse about the humanities.

 

However, as suggested by Harold Marcuse at 4Humanities@UCSB's meeting of Nov. 21, 2013, a more authoritative corpus would in principle be collected through a broader selection strategy that  depends less on human-filtered criteria of relevance:

 

Harold Marcuse (email of Nov. 24, 2013):

 

I poked around a bit and found two things that appear to me promising, specific questions that beg explanation, and which might get at the kind of developments and connections we are looking for.

A simple ngram of "humanities" (I attach one case-insensitive) has (for me) surprising results: since 1994 (peak of culture wars?) the use of the term has been falling off. The corpus ends in 2008, and that final year may not be reliable, but the trend is quite clear. Capitalized Humanities gradually catches up and finally surpasses lower case humanities in 1976 on its way to that 1994 peak. BTW this is identical to the US corpus, which is very different than the British one, which rises very slowly but continuously over the entire period (tweak the smoothing if you want to see that).

But if the clear and steady drop-off since 1994 is surprising, what begs explanation is the use of the term in the Wall St. Jnl (corpus 1984-present). As the second attached image shows (see bar graph in lower right of it), in 2010 the use of the term SKYROCKETED--exactly opposite the ngram trend (which of course ends in 2008). The default ProQuest search results are by relevance, and just looking at the titles and content excerpt (I presume these to be the main factors in the ProQuest relevance algorithm) of the first search results, these are ALL EXACTLY the kind of texts we are looking for--a representative but also complete sample of a certain discourse since 1984, sortable by date, relevance (and after downloading maybe length), with lots of possibilities for showing topic clusters moving over time, etc. And the 3900 results start with the dates 1999, 2012, 1985, 1985, 5x2013, 1994, 2012, 1987, 1989, 2002, 2010--an intriguing spread.

What I'm saying is that I think it would be better to focus energy on figuring out how to scrape this database (and/or other newspapers) into a processable database, since these are methodologically "clean" and homogeneous corpuses: easier to work with, and with the promise of yielding useful results.

Once that is done, working on a heterogeneous corpus like the WhatEvery1Says might be easier, but presumably so would applying the process to any of the other ProQuest/newspaper databases (NYT, Washington Post, Chr. Sci Monitor, London Times, LA Times, ...).

 

 

[Figures: Google Ngram Viewer - "Humanities"; Wall Street Journal - "Humanities"]

 

 

 

 

Toward a broader algorithmic or semi-algorithmic document collection strategy:

 


  • (A) Documents from defined archives:

 

    • (1) In addition to The Wall Street Journal, what other archives of newspapers, magazines, books, etc. from the U. S. and other nations do we have comprehensive or systematic full-text digital access to? [Task: identify possible archives]

 

    • (2) Some resources (such as ProQuest) have search engines with relevance algorithms that should allow us to identify documents for the WhatEvery1Says corpus.  But what are the best terms and concatenations of terms to search on?  For example, not every document that is about the humanities may discuss "the humanities" by that name.  Some will discuss the "liberal arts" or "liberal studies," and some will discuss "English majors," "history majors," etc. as metonyms for the broader issue of the humanities.  We should do some testing to see what combination of terms provides the clearest "signal" of documents or sections of documents that focus on the humanities. [Task: testing search-term strategies]

 

    • (3) Is there an algorithmic way to set a threshold for inclusion of a document in WhatEvery1Says? For example, we probably don't want to include a document if it mentions the word "humanities" once in the course of discussing something else entirely or a long compendium of other issues.

 

 

    • (5) What is the best temporal / longitudinal strategy for document collection?  One possibility, for example, would be to collect documents from ten-year slices going back as far as the archives allow.

 

    • (6) What is the best way to scrape text from documents identified for collection? (This issue overlaps with text preparation planning.)
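
For questions A.2 and A.3, one possible starting point is a weighted term-density score: count occurrences of "humanities" and its metonyms, normalize by document length, and include a document only if the score clears a cutoff. The sketch below is only an illustration under stated assumptions. The weights in `TERM_WEIGHTS`, the function names, and the cutoff of 5 weighted hits per 1,000 words are all hypothetical placeholders that would need to be tuned by the testing proposed in A.2.

```python
import re

# Hypothetical weights: the phrases come from question A.2, the numbers are
# placeholders and are not derived from any testing.
TERM_WEIGHTS = {
    "humanities": 1.0,
    "liberal arts": 0.8,
    "liberal studies": 0.8,
    "english majors": 0.5,
    "history majors": 0.5,
}


def term_density(text, term_weights=TERM_WEIGHTS):
    """Return weighted term hits per 1,000 words of text."""
    lowered = text.lower()
    n_words = len(re.findall(r"\w+", lowered))
    if n_words == 0:
        return 0.0
    score = 0.0
    for term, weight in term_weights.items():
        # Whole-phrase, case-insensitive match (text is already lowercased).
        pattern = r"\b" + re.escape(term) + r"\b"
        score += weight * len(re.findall(pattern, lowered))
    return 1000.0 * score / n_words


def include_document(text, threshold=5.0):
    """Assumed rule: keep a document whose density meets the cutoff."""
    return term_density(text) >= threshold
```

Under this rule, a 1,000-word article that mentions "humanities" once scores 1.0 and is excluded, while a short piece focused on the topic scores far higher. A density measure cannot distinguish a long compendium with one focused section from a passing mention, so scoring sections or paragraphs separately may be worth testing as well.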

 

 


  • (B) Documents from undefined archives:

 

    • (1) What is the best way of using general-purpose search engines such as Google to identify documents for collection in WhatEvery1Says?

 

    • (2) [Variations of questions A.2-5 above apply to B.]

 

 


  • (C) Use of Google Books Ngram Viewer:

 

    • (1) What is the best way to use Google Books Ngram Viewer to complement the collection and analysis of documents in WhatEvery1Says?

 

    • (2) Ngram Viewer provides the beginning of a pipeline from Ngram frequency results to books for "interesting" years in those results.  (From "About Ngram Viewer": "Below the graph, we show 'interesting' year ranges for your query terms. Clicking on those will submit your query directly to Google Books").  What is the feasibility of building a script that compares the humanities-relevant books surfaced by Ngram Viewer for "interesting" years with the books that we have full-text digital access to, so that these books or the relevant chapters can be collected for WhatEvery1Says?
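
The comparison step in C.2 could be sketched as a match on normalized titles and publication years. This is a minimal sketch that assumes both lists have already been gathered into simple records with `title` and `year` fields. How to export the Google Books results and our own holdings is left open here, and the function names and record layout are hypothetical.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())
    return " ".join(cleaned.split())


def match_books(candidates, holdings):
    """Return candidate records whose (title, year) also appears in holdings.

    Both arguments are lists of dicts with "title" and "year" keys
    (an assumed record layout, not a format defined by any service).
    """
    held = {(normalize_title(b["title"]), b["year"]) for b in holdings}
    return [
        b for b in candidates
        if (normalize_title(b["title"]), b["year"]) in held
    ]
```

Exact matching on normalized titles will miss variant editions and subtitles, so a fuzzier comparison or a shared identifier such as an ISBN, where available, would likely be needed in practice.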

 


Text Preparation and Topic Modeling
(go to planning page)

 

 

 
