
Planning Notes (Douglass, Mar 3, 2014)

Last edited by Alan Liu 10 years, 7 months ago

On "Editorial Policy"

 

My phrase "editorial policy" was a bit broad, so let me be more explicit about what I feel is relevant here.

 

Let us assume that we have a hard drive with a collection of "original" (unedited by us) whatevery1says sources, saved offline along with metadata on each in our spreadsheet -- this is a PDF, that is an HTML file, etc. We're already halfway there with our existing spreadsheet. Let's call the rules that we use to determine how we store our sources "archival policies" -- e.g., if we could save a PDF, rich HTML, plain HTML, or any combination of the three, which do we do?

 

Okay, we've done that. A copy of each source is sitting on the hard drive. However, our topic modeler doesn't accept hybrid HTML / PDF / whatever input. It accepts plain text files. So now we want to create a plain text document for each rich document in our collection.
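
As a rough illustration of that conversion step for the HTML case only, here is a minimal sketch using Python's standard-library `html.parser`. The class name `TextExtractor`, the function `html_to_plaintext`, and the choice of which tags count as block-level are all assumptions made for this example, not something we have settled on. PDFs would need a separate tool.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script and style contents."""

    SKIP = {"script", "style"}
    # Assumed list of block-level tags; a break is inserted around these
    # so that adjacent blocks do not run together into one word.
    BLOCK = {"p", "div", "br", "li", "tr", "title", "blockquote",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_plaintext(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Note that this sketch keeps whatever is in the `<title>` element and anything else on the page (navigation, comments, and so on). Deciding whether those belong in "the text" is exactly the editorial question, not something the transformation itself should settle.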

 

This is where editorial policy comes in -- what will the topic modeler see as "the text"?

 

Will the plaintext document that we feed into the modeler contain...

 

The title? Author? Dateline? Pull quotes? Blog comment streams? Etc.

 

Some of these things (like datelines) we should expect to have minimal impact on our topics. Others (like comment streams on public pages) have big implications -- including them captures a lot of public discourse (literally what everyone says) but partially dissolves our ability to talk about authorial text when interpreting per-document results. Excluding them, on the other hand, may be work -- editorial work, which might or might not be worth automating.

 

Let's say that we decide that all plaintext files should contain the title and author at the top, and include any magazine-style "pull quotes" in-line with the text. When we process a text file, it might (for example) contain the word "humanities" three times--once in the title, once in the body text, and once in a pull quote (a graphic repetition of the body text).

 

If we decide to include titles only in the file name and not include pull quotes (they are decorative, and not part of the essay proper), then the same file contains the word "humanities" only once. This becomes a bigger deal when many documents are inconsistent in what they make available to save -- some include titles, some don't, some include their comment stream in the PDF and others download only the body text. High per-document inconsistency limits our ability to make claims about what we have modeled and what our findings might mean.
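
To make the "three times versus once" example concrete, here is a small sketch. The document record (a dict with `title`, `body`, and `pull_quotes` keys), the sample text, and the function names `build_plaintext` and `count_word` are all made up for illustration. Our sources are not actually stored this way.

```python
import re


def build_plaintext(doc, include_title=True, include_pull_quotes=True):
    """Assemble the text the modeler would see under a given editorial policy."""
    parts = []
    if include_title:
        parts.append(doc["title"])
    parts.append(doc["body"])
    if include_pull_quotes:
        parts.extend(doc["pull_quotes"])
    return "\n".join(parts)


def count_word(text, word):
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


# Hypothetical document: the pull quote repeats a sentence from the body.
doc = {
    "title": "Why the Humanities Matter",
    "body": "Public debate about the humanities often turns on funding.",
    "pull_quotes": ["Public debate about the humanities often turns on funding."],
}

inclusive = build_plaintext(doc)
exclusive = build_plaintext(doc, include_title=False, include_pull_quotes=False)

print(count_word(inclusive, "humanities"))  # 3
print(count_word(exclusive, "humanities"))  # 1
```

The same source yields different word counts depending only on the policy flags, which is why the policy needs to be applied uniformly across the whole collection.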

 

So, another way of imagining editorial policy is: We made plaintext copies of all our sources as given -- now, what steps do we need to take in order to make them all consistent *enough* with each other to meet our needs and let us expect interpretable results from analysis?
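
One way to get a first read on how inconsistent the plaintext copies are would be a simple audit over per-document metadata. The sketch below assumes, hypothetically, that our spreadsheet could be extended with true/false columns such as `has_title` and `has_comments`. Those column names, and the function names, are invented for this example.

```python
from collections import Counter


def audit_components(records, components):
    """Count how many documents contain each component."""
    counts = Counter()
    for rec in records:
        for comp in components:
            if rec.get(comp):
                counts[comp] += 1
    return {comp: counts[comp] for comp in components}


def inconsistent_components(records, components):
    """List components present in some documents but not in all of them."""
    total = len(records)
    counts = audit_components(records, components)
    return [comp for comp in components if 0 < counts[comp] < total]


# Hypothetical metadata rows, one per saved source.
records = [
    {"has_title": True, "has_body": True, "has_comments": True},
    {"has_title": False, "has_body": True, "has_comments": False},
    {"has_title": True, "has_body": True, "has_comments": False},
]
components = ["has_title", "has_body", "has_comments"]

print(audit_components(records, components))
print(inconsistent_components(records, components))
```

Anything flagged as inconsistent is a candidate for an editorial rule: either strip it everywhere or make sure it is present everywhere.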

 

1. Online resource

2. Offline saved version(s) -- archival policy

3. Unedited plaintext -- standardized transformation from source type

4. Edited plaintext for computational analysis -- editorial policies -- and, given our hybrid sources, what of this can be automated?

5. [magic happens]

6. Result data

7. Profit
