Best Practices for Capturing and Describing Web Content

Scope

Best Practices for Capturing and Describing Web Content outlines digital capture specifications and the application of qualified Dublin Core metadata elements to describe digital content for inclusion in the AUC Web Archive. [1]

The AUC Web Archive was established in March 2009. Collections include the American University in Cairo; Coptic Religion and Culture; Egyptian Arts, Culture, and Society; Egyptian Business and Economy; Egyptian and Middle Eastern Architecture; January 25th Revolution; and Migration and Refugee Studies. The Web archive's collection policies and content are managed by the Rare Books and Special Collections Library, and technical support is supplied by Archive-It, a subscription service available from the Internet Archive. The Rare Books and Special Collections Library's current subscription level allows us to crawl 13,000,000 documents not exceeding 1,024 GB of storage. [2] [3]

Settings

Frequency

Individual URLs may be crawled with varying degrees of frequency.

  1. One Time
  2. Twice Daily
  3. Daily
  4. Weekly
  5. Monthly
  6. Bi-monthly
  7. Quarterly
  8. Semiannually
  9. Annually

Select the crawl frequency on a case-by-case basis. Consider how frequently a site is updated, and also consider the Seed Type: you may only need to crawl one page. For example, if you are crawling the Twitter feed of a frequent poster, crawl one page only, once or twice a day, to avoid exceeding our data and document budgets.
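
To make the trade-off between frequency and budget concrete, the following is a rough sketch, not an Archive-It feature. The crawls-per-year figures are approximations, "Bi-monthly" is assumed to mean every two months, and comparing a year of captures against the 13,000,000-document figure is an assumption, since the period the budget covers is not stated here. The function names are hypothetical, and the 1,024 GB storage limit is not modeled.

```python
# Approximate number of crawls per year for each frequency option.
# Assumption: "Bi-monthly" means once every two months.
CRAWLS_PER_YEAR = {
    "One Time": 1,
    "Twice Daily": 730,
    "Daily": 365,
    "Weekly": 52,
    "Monthly": 12,
    "Bi-monthly": 6,
    "Quarterly": 4,
    "Semiannually": 2,
    "Annually": 1,
}

# Document budget quoted in the Scope section.
DOCUMENT_BUDGET = 13_000_000


def yearly_documents(frequency: str, docs_per_crawl: int) -> int:
    """Estimate how many documents one seed captures in a year."""
    return CRAWLS_PER_YEAR[frequency] * docs_per_crawl


def budget_share(frequency: str, docs_per_crawl: int) -> float:
    """Fraction of the document budget one seed would use in a year."""
    return yearly_documents(frequency, docs_per_crawl) / DOCUMENT_BUDGET
```

For example, yearly_documents("Twice Daily", 1) gives 730 documents for a one-page seed, while the same seed crawled with 1,000 documents per crawl would capture 730,000.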

Activation Status

Public Site

Seed Type

  • Default [4]
  • News/RSS Feed [5]
  • Crawl One Page Only [6]

Group

Do not assign URLs to groups. Instead, use the descriptive metadata to categorize content.

Crawl Scope

Host Constraints

Use host constraints to block specific domains, ignore robots.txt, and limit the number of documents from a specific domain to be archived per crawl.

Option Notes
Block All
  • Use this option to block capture of all documents from a particular host. For example, we block aucegypt.edu in all Web Archive collections except the American University in Cairo Web sites collection, because we don't need to collect this data twice.
  • By looking at crawl reports, you may also notice a site that is being captured that we don't need. For example, www.toyotaegypt.com.eg documents were being captured in the January 25th Revolution Web sites collection. By blocking this host, we no longer collect extraneous data.
Block URLs if
  • Use this option to block specific URLs within a site.
Doc Limit
  • Use this to limit the number of documents from a specific domain to be archived per crawl. For example, we simply cannot archive all of YouTube. Limit the number of YouTube documents captured per crawl to make sure we don't exceed our subscription.
  • Don't make your Doc Limit too small. Most Web pages have formatting files, e.g. CSS, that must be captured in order for the page to display properly in the Web archive.
Ignore Robots.txt
  • The robots exclusion standard is a way for a webmaster to direct a web crawler (aka robot or spider) not to crawl all or specified parts of their website. The webmaster places their request in a specific text file that is easily found on their website.
  • If you want to crawl a page or site that is blocked in this way, e.g. Twitter, Facebook, or YouTube, be sure to modify the host constraint for it.
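
The effect of Block All and Doc Limit on a single crawl can be illustrated with a small sketch. This is not how Archive-It implements host constraints (they are set in the Archive-It web application). The function name, the exact-match comparison of host names, and the sample limit are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def apply_host_constraints(urls, blocked_hosts, doc_limits):
    """Return the URLs a crawl would keep under the given constraints.

    blocked_hosts: hosts whose documents are never captured (Block All).
    doc_limits: maximum documents to keep per host (Doc Limit).
    Hosts are compared by exact name; this is a simplification.
    """
    kept = []
    captured = Counter()  # documents kept so far, per host
    for url in urls:
        host = urlsplit(url).hostname or ""
        if host in blocked_hosts:
            continue  # Block All: skip every document from this host
        limit = doc_limits.get(host)
        if limit is not None and captured[host] >= limit:
            continue  # Doc Limit reached for this host
        captured[host] += 1
        kept.append(url)
    return kept


# Hypothetical settings: block one host and cap another at 2 documents.
example = apply_host_constraints(
    [
        "http://www.toyotaegypt.com.eg/offers",
        "https://www.youtube.com/watch?v=1",
        "https://www.youtube.com/watch?v=2",
        "https://www.youtube.com/watch?v=3",
    ],
    blocked_hosts={"www.toyotaegypt.com.eg"},
    doc_limits={"www.youtube.com": 2},
)
```

With these settings, example holds only the first two YouTube URLs. A limit this low would also cut off the CSS and other formatting files a page needs, which is why the Doc Limit should not be set too small.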

Metadata

Description

The descriptive metadata scheme used for describing cultural heritage materials is derived from the fifteen terms of the Dublin Core Metadata Element Set, Version 1.1 and their refining properties defined in DCMI Metadata Terms. Best Practices for CONTENTdm and other OAI-PMH compliant Repositories: Creating Shareable Metadata, authored by the CONTENTdm Metadata Working Group, was also consulted. [7] [8] [9]

Web Page or Site

Use for seed URL and document metadata.

Label Notes Required
Title
  • Capitalize the first word of the title and enter in lowercase all other words except proper nouns.
  • Do not include initial articles, unless an integral part of the title and/or provided by the creator as the true title. This allows the system to index by title alphabetically. Optionally, you may add the title containing the article to the Alternative Title field.
  • Use the Grab Title tool to pull Web site/page titles directly from the resource. Edit the titles as necessary.
Creator
  • Enter the name(s) of entities responsible for the creation of the item.
  • Look up names in the Library of Congress Name Authority File or the Getty Union List of Artist Names. [10] [11]
  • If an authorized name is not found in the Library of Congress Name Authority File or the Getty Union List of Artist Names, create one based on AACR2 rules. [10] [11]
Subject
  • Use Library of Congress Subject Headings to record specific topical information about the resource and its context within the collection. [10]
  • Include terms for places in this field if the object being described relates directly to the location.
  • Do not include names. Record names in the Names field.
  • Exclude geographic headings and subdivisions if that information can be appropriately included in the Location (coverage.spatial) field. Exceptions exist for items such as maps and newspapers, in cases where the main topic of the item is a geographic place.
Description
  • Enter supplemental descriptive information such as a free text summary.
  • End field with a period.
  • Avoid simply restating the title.
Publisher
  • Use to record the publisher.
  • Look up names in the Library of Congress Name Authority File. [10]
  • If an authorized name is not found in the Library of Congress Name Authority File, create one based on AACR2 rules. [10]
  • List multiple entries on separate lines in alphabetical order.
Date
  • Format the date according to the ISO 8601 standard:
    • YYYY-MM-DD
    • YYYY-MM
    • YYYY
    • YYYY-YYYY
  • Enter the year that you began capturing a specific Web site. If captures for a specific URL have ended, enter inclusive dates, e.g. 2011-2012.
Type
  • Enter the characteristic and general type of content of the resource.
  • Look up terms using the Dublin Core Metadata Initiative Type Vocabulary. [12]
Format
  • Enter Internet Media Type. [13]
Source
  • Enter the domain of the website.
Relation
  • Enter the title of the digital collection to which the resource belongs.
Coverage
  • Enter each place as its own term, in order from specific to general.
  • Look up terms in the Getty Thesaurus of Geographic Names. [14]
  • Do not include “World” unless it specifically is world-related, like world maps.
  • Only include names of geographic places like cities, counties, countries, etc., and omit named places such as Tahrir Square or Matḥaf al-Miṣrī. Record named places in the Topic element instead.
  • Use the broadest term that applies for the page/site. For example, if the contents of the page/site refer to Egypt and the Middle East, enter Middle East.
Collector
  • Enter the name of the collector.
Language
  • Enter the ISO 639-1 Language Code for items with linguistic content. [10]
  • Leave this field blank if there is no linguistic content.
  • List multiple entries on separate lines in alphabetical order.
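
The four date patterns listed under Date can be checked before a record is saved. The sketch below is one possible check, not part of Archive-It. The function name is hypothetical, and the rule that the first year of a YYYY-YYYY range must not be later than the second is an assumption.

```python
import re
from datetime import date

# YYYY, YYYY-MM, or YYYY-MM-DD
_SINGLE_DATE = re.compile(r"([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?")
# YYYY-YYYY (inclusive range of years)
_YEAR_RANGE = re.compile(r"([0-9]{4})-([0-9]{4})")


def is_valid_date_value(value: str) -> bool:
    """Check a Date value against the four patterns listed above."""
    match = _SINGLE_DATE.fullmatch(value)
    if match:
        year, month, day = match.groups()
        try:
            # Missing month or day defaults to 1 so the calendar check works.
            date(int(year), int(month or 1), int(day or 1))
        except ValueError:
            return False  # e.g. month 13 or February 30
        return True
    match = _YEAR_RANGE.fullmatch(value)
    if match:
        # Assumption: the range must run forward in time.
        return int(match.group(1)) <= int(match.group(2))
    return False
```

Under this check, "2011-01-25", "2011-01", "2011", and "2011-2012" pass, while "2011-13", "2011-02-30", and "25/01/2011" are rejected.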

Collection

Use for Web archive collection-level metadata.

Label Notes Required
Title
  • Capitalize the first word of the title and enter in lowercase all other words except proper nouns.
  • Do not include initial articles, unless an integral part of the title and/or provided by the creator as the true title. This allows the system to index by title alphabetically. Optionally, you may add the title containing the article to the Alternative Title field.
  • Use the Grab Title tool to pull Web site/page titles directly from the resource. Edit the titles as necessary.
Creator
  • Enter the name(s) of entities responsible for the creation of the item.
  • Look up names in the Library of Congress Name Authority File or the Getty Union List of Artist Names. [10] [11]
  • If an authorized name is not found in the Library of Congress Name Authority File or the Getty Union List of Artist Names, create one based on AACR2 rules. [10] [11]
Subject
  • Use Library of Congress Subject Headings to record specific topical information about the resource and its context within the collection. Additionally, when describing items that are graphic in nature, such as photographs, use Subjects. [10]
  • Include terms for places in this field if the object being described relates directly to the location.
  • Do not include names. Record names in the Names field.
  • Exclude geographic headings and subdivisions if that information can be appropriately included in the Location (coverage.spatial) field. Exceptions exist for items such as maps and newspapers, in cases where the main topic of the item is a geographic place.
Description
  • Enter supplemental descriptive information such as a free text summary.
  • End field with a period.
  • Avoid simply restating the title.
Date
  • Format the date according to the ISO 8601 standard:
    • YYYY-MM-DD
    • YYYY-MM
    • YYYY
    • YYYY-YYYY
  • Enter the year that you began capturing content in the collection. If captures have ended, enter inclusive dates, e.g. 2011-2012.
Type
  • Enter the characteristic and general type of content of the resource.
  • Look up terms using the Dublin Core Metadata Initiative Type Vocabulary. [12]
Format
  • Enter Internet Media Type. [13]
Relation
  • Enter the title of the digital collection to which the resource belongs.
Coverage
  • Enter each place as its own term, in order from specific to general.
  • Look up terms in the Getty Thesaurus of Geographic Names. [14]
  • Do not include “World” unless it specifically is world-related, like world maps.
  • Only include names of geographic places like cities, counties, countries, etc., and omit named places such as Tahrir Square or Matḥaf al-Miṣrī. Record named places in the Topic element instead.
  • Use the broadest term that applies for the page/site. For example, if the contents of the page/site refer to Egypt and the Middle East, enter Middle East.
Collector
  • Enter the name of the collector.
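
The Title rule about initial articles, which applies to both tables above, can be sketched as a small helper. It is illustrative only: the function name is hypothetical, only the English articles "a", "an", and "the" are handled, and deciding whether an article is an integral part of the title is left to the cataloger through the keep_article flag. Lowercasing the remaining words is not attempted, because proper nouns cannot be detected reliably.

```python
# Assumption: only English initial articles are considered.
INITIAL_ARTICLES = ("a", "an", "the")


def split_initial_article(title: str, keep_article: bool = False):
    """Return (title, alternative_title) following the Title rule.

    If an initial article is dropped, the original title is returned
    as the alternative title; otherwise the alternative title is None.
    """
    original = title.strip()
    if keep_article:
        # The cataloger judged the article integral to the title.
        return original, None
    first_word, _, rest = original.partition(" ")
    rest = rest.strip()
    if first_word.lower() in INITIAL_ARTICLES and rest:
        # Capitalize the new first word; leave the other words untouched.
        return rest[0].upper() + rest[1:], original
    return original, None
```

For example, split_initial_article("The daily news Egypt") returns ("Daily news Egypt", "The daily news Egypt"), so the title indexes under "D" and the full form can go in the Alternative Title field.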

Workflow

Name/Label Scope Note Required
Date Entered
  • Enter the date, in YYYY-MM-DD format, that the metadata record is created.
Date Reviewed-Curator
  • Enter the date, in YYYY-MM-DD format, that the metadata record is reviewed by the RBSCL University Archivist.
Date Reviewed-Archivist
  • Enter the date, in YYYY-MM-DD format, that the metadata record is reviewed by the RBSCL Digital Collections Archivist.
Date Added
  • Enter the date, in YYYY-MM-DD format, that the item is added to the digital collection.
Date Deactivated
  • Enter the date, in YYYY-MM-DD format, that capture of the item is stopped.

References

  1. AUC Web Archive
  2. Archive-It
  3. Internet Archive
  4. For Web sites, pages, blogs, etc.
  5. Use this for Google News feeds to capture daily news about a topic without inputting individual seed URLs for each site.
  6. Use this if you only want to capture the URL itself. This option may be useful for Twitter feeds that are captured daily; however, if the Twitter feed includes links to a lot of photos or videos, you may want to use the Default setting.
  7. Dublin Core Metadata Element Set, Version 1.1
  8. DCMI Metadata Terms
  9. Best Practices for CONTENTdm and other OAI-PMH compliant Repositories: Creating Shareable Metadata
  10. Library of Congress Authorities
  11. Getty Union List of Artist Names
  12. Dublin Core Metadata Initiative Type Vocabulary
  13. Internet Media Types
  14. Getty Thesaurus of Geographic Names