The Canonical Entity (of the past) — the card catalog
MARC was not created for online catalogs; it was created for card catalogs. That is a fundamental flaw, and readable reformatting of MARC isn't much better.
The classic bib record
- a collection of statements
- taken from the piece itself
- sometimes enhanced with inferred parentheticals
- or additional statements not on the piece (e.g., subject headings)
- where punctuation, which may or may not be present, is used (inconsistently) to convey structure
- We’re dealing with uncontrolled text strings that are only loosely connected to anything else
Actually a number of problems
- identification problems (titles aren’t enough; names aren’t enough)
- linkage problems (web problem; language problem)
- quality problems (legacy problems: strings are not controlled terms, and often they cannot be turned into them)
Example problems
- The Hamlet problem: no clues, too many results, and the interface isn't helping
- People searches like "John Rock" (the name alone isn't enough)
First define ALL THE THINGS
- Work: a distinct intellectual or artistic creation
- Entity: a thing with distinct and independent existence (someone who created a work)
- Relationship: the way in which two or more people or things are connected
Record: War and Peace
- Entity: type: work, War and Peace
- Entity: type: person, Leo Tolstoy (the author)
- Entity: type: place, where the story took place
Entities of Initial Focus
- person
- place
- object
- concept
- organization
- work
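A minimal sketch of the definitions and entity types above as TypeScript types; the names here are hypothetical, and OCLC's actual model is certainly richer:

```typescript
// The six entity types of initial focus.
type EntityType =
  | "person"
  | "place"
  | "object"
  | "concept"
  | "organization"
  | "work";

// An entity: a thing with distinct and independent existence,
// identified by a unique identifier rather than a text string.
interface Entity {
  id: string; // e.g. a VIAF or FAST URI
  type: EntityType;
  label: string; // a human-readable name; not the identity itself
}

// A relationship: the way in which two things are connected,
// expressed as a subject-predicate-object triple of identifiers.
interface Relationship {
  subject: string; // id of the entity the assertion is about
  predicate: string; // e.g. "author", "subject", "sameAs"
  object: string; // id of the related entity
}
```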
Relationships between entities are established (expressed as triples; see the sketch after this list)
- person-author
- work-subject-concept
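Using those types, the War and Peace record above shreds into entities and relationship triples roughly like this (the URIs are illustrative placeholders, not real identifiers):

```typescript
// Illustrative URIs; real data would use WorldCat/VIAF/FAST identifiers.
const warAndPeace: Entity = {
  id: "http://example.org/work/war-and-peace",
  type: "work",
  label: "War and Peace",
};

const tolstoy: Entity = {
  id: "http://example.org/person/leo-tolstoy",
  type: "person",
  label: "Leo Tolstoy",
};

const russia: Entity = {
  id: "http://example.org/place/russia",
  type: "place",
  label: "Russia",
};

// person-author and work-subject style relationships, stated between
// identifiers rather than loose text strings.
const assertions: Relationship[] = [
  { subject: warAndPeace.id, predicate: "author", object: tolstoy.id },
  { subject: warAndPeace.id, predicate: "about", object: russia.id },
];
```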
Critical: using authoritative sources whenever possible (e.g., the Virtual International Authority File, VIAF)
And linking to other authoritative data sources
This process is called "shredding." Try not to use the word "record" anymore; instead, think of a set of assertions about a thing: a place, a person, a work, an organization. Every instance of "William Shakespeare" resolves to the same entity.
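Linking to an authoritative source then amounts to adding a sameAs-style assertion. The viaf.org URI pattern below is real, but the zeroed ID is an illustrative placeholder:

```typescript
// "Shredding": the record dissolves into assertions about things, and
// each local entity is tied to an authoritative hub where one exists,
// so every instance of a name resolves to one entity.
const authorityLinks: Relationship[] = [
  {
    subject: tolstoy.id,
    predicate: "sameAs",
    object: "http://viaf.org/viaf/00000000", // VIAF cluster (placeholder ID)
  },
];
```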
From Records to Entities: Works
- All manifestations of Hamlet under one umbrella
- Making sausage
- Some of the info from all the manifestations flows up to the work record (sketched below)
- In a WorldCat record, look for the Linked Data section
- Work example: all the manifestations of the work
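A hedged sketch of that "flow up" step, assuming a hypothetical workKey that clusters manifestations of the same work:

```typescript
// A manifestation record carrying a key for the work it embodies.
interface Manifestation {
  workKey: string; // hypothetical clustering key
  title: string;
  subjects: string[]; // subject entity ids
}

// Gather all manifestations of a work under one umbrella and let
// selected information (here, subjects) flow up to the work level.
function rollUp(manifestations: Manifestation[]): Map<string, Set<string>> {
  const workSubjects = new Map<string, Set<string>>();
  for (const m of manifestations) {
    const subjects = workSubjects.get(m.workKey) ?? new Set<string>();
    for (const s of m.subjects) {
      subjects.add(s);
    }
    workSubjects.set(m.workKey, subjects);
  }
  return workSubjects;
}
```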
WorldCat: Linked Data & Entities + LCSH; VIAF; FAST
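The Linked Data section of a WorldCat record embeds schema.org-style markup, very roughly this shape; the specific identifiers below are zeroed placeholders, though the URI patterns are real:

```typescript
// Roughly the shape of the embedded markup: schema.org types, with
// LCSH/FAST/VIAF URIs standing in for uncontrolled text headings.
const linkedDataSection = {
  "@type": "schema:Book",
  "schema:name": "War and Peace",
  "schema:author": { "@id": "http://viaf.org/viaf/00000000" }, // VIAF
  "schema:about": [
    { "@id": "http://id.loc.gov/authorities/subjects/sh00000000" }, // LCSH
    { "@id": "http://id.worldcat.org/fast/0000000" }, // FAST
  ],
};
```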
Knowledge Vault data flow:
- Data sources: Enhanced WorldCat; VIAF; FAST; Etc.
- Extracts
- Collective: Knowledge Triples
- Fusion
- ???
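A hedged sketch of the fusion step, assuming each extracted triple carries a per-source confidence and that corroborating sources reinforce one another; the real Knowledge Vault scoring is more sophisticated:

```typescript
// A triple extracted from one source, with that extractor's confidence.
interface ExtractedTriple extends Relationship {
  source: string; // e.g. "Enhanced WorldCat", "VIAF", "FAST"
  confidence: number; // 0..1, assigned by the extractor
}

// Fusion: merge duplicate triples from different sources into a single
// assertion whose belief grows as sources corroborate one another.
function fuse(extracts: ExtractedTriple[]): Map<string, number> {
  const fused = new Map<string, number>();
  for (const t of extracts) {
    const key = `${t.subject}|${t.predicate}|${t.object}`;
    const prior = fused.get(key) ?? 0;
    // noisy-OR: treat sources as independent evidence for the triple
    fused.set(key, 1 - (1 - prior) * (1 - t.confidence));
  }
  return fused;
}
```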
So What?
- Exposure: as linked data has been released, traffic has improved
- Improving discovery: a screen for an author pulls a description from Wikipedia and knows both his works and those about him
- By making links & making them explicit & unique, connections like these become easy, not difficult
- Entity cards: person, org, place, concept, creative work, event, plus related people & orgs, related concepts, related places, related works (see the sketch after this list)
- We're actually using the work we've been doing for the last 50 years, the machine-readable data, in new ways that aren't ambiguous: not text strings but unique identifiers
- Linked data sets: EntityJS. War Between the States collection example: person, organization, concept, place, event, work. Also found a way to rank these sets of data
- Can make links dynamically & in batch
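A sketch of the entity card shape implied above, with related entities ranked; the field names are hypothetical:

```typescript
// One entry on a card: a related entity plus a rank score, so the
// strongest connections surface first.
interface RankedLink {
  entity: Entity;
  rank: number; // higher = more strongly related
}

// An entity card: the entity itself plus its ranked neighborhoods.
interface EntityCard {
  entity: Entity; // person, org, place, concept, creative work, or event
  relatedPeopleAndOrgs: RankedLink[];
  relatedConcepts: RankedLink[];
  relatedPlaces: RankedLink[];
  relatedWorks: RankedLink[];
}
```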
Why This Work Matters
- Fiction Finder — Solving the Hamlet Problem
- Embedding authority control: VIAF authority files embedded into Wikipedia articles. Solving the Wang/Li/Zhang problem (the three most common surnames in China; a name isn't enough)
- Helps with translation work — solving the language problem
- MARC usage in WorldCat: cleaning up and normalizing the data. Variation in cataloging over time, mistakes, and differences in interpretation of the rules, all accumulated over 50 years, expose the quality problem. Quality is a pursuit, not a destination
It’s not the record, it’s the linkage
- requires a new kind of thinking: “sets of assertions about something” NOT a “record”
- Requires much more than simply translating a record from MARC into a new format; it is not a 1:1 translation
- There are things we can do now to make MARC "linked data ready" (upcoming OCLC webinar)
- Quality is a pursuit, not an endgame
For a lot more information: Library Linked Data in the Cloud
National Library of Medicine doing a lot of work with Linked Data, too.
This is way out on the cutting edge; libraries don't necessarily have to go out and do anything yet. Sit back and wait. OCLC is using the Schema.org standard rather than the BIBFRAME schema that LOC is developing, but it supports the LOC BIBFRAME standard under development as well.
What about RDA? How does it fit into all of this? RDA adds incremental benefit, but he can't characterize what that is; he's not sure how to qualify or quantify it.
Large data projects have used crowdsourcing to help; has OCLC considered this? The EntityJS and Knowledge Vault work will include a way to do this.
New world: data flow, not static records.