Identifying, selecting, and organizing the attributes of Web resources

Pasch, Grete

The basic human approach to referring to the real world is to represent observed objects by their attributes, whether in natural language or in formal data models. Library cataloging likewise uses attributes to represent information resources, but its approach to data modeling is implicit and provides no methodology for attribute analysis. This is a critical problem when representing web resources, since they differ significantly from the kinds of resources typically handled by library catalogs. The purpose of this dissertation is to systematically identify, select, and organize the attributes of web resources by means of an alternative to the traditional, library-based bibliographic model. The alternative methodology explored here combines data modeling principles from information systems theory, concepts from bibliographic modeling, and Gérard Genette's paratextual theory. The proposed methodology is applied to a working collection of 300 web resources listed by LANIC (the Latin American Network Information Center) of the Institute of Latin American Studies, University of Texas at Austin. A semi-automatic process is used to extract attributes from the HEAD and BODY sections of the HTML code, the information provided by the browser, the data about each locally stored file, and the LANIC directory pages. Attributes are also manually marked up and extracted from each pageview. In total, 290 attributes were identified and selected. The attributes were then organized according to two methods. First, a direct mapping into Dublin Core (DC) highlights the shortcomings of the traditional approach: two-thirds of the attributes found do not match any DC element, and the mapping raises questions about the structure and meaning of the DC elements. Second, matching each attribute to its parent entity resulted in a model with 35 entities grouped into four categories: agents, binders, components, and original documents. These entities highlight the origin of each attribute, help model the life cycle of the information entities, and offer an alternative source for attribute values. The 37 unmatched attributes (expressive, navigational, and directive attributes) hint at the possible application of a social relativist approach for modeling them further.
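
As an illustration only, the automatic part of the extraction step and the direct Dublin Core mapping described above might be sketched as follows. This is not the dissertation's actual tooling, which the abstract does not describe. The names `HeadAttributeParser`, `extract_head_attributes`, `split_by_dc`, and the `DC_MAP` table are hypothetical, and the sample page is invented for the example. The sketch covers only the HEAD section; the BODY, browser, file, and directory sources are left out.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HEAD attribute names to Dublin Core elements.
# The mapping actually used in the dissertation is not given in the abstract.
DC_MAP = {
    "title": "dc:title",
    "author": "dc:creator",
    "description": "dc:description",
    "keywords": "dc:subject",
}


class HeadAttributeParser(HTMLParser):
    """Collect <title> text and <meta name=... content=...> pairs from HEAD."""

    def __init__(self):
        super().__init__()
        self.attributes = {}
        self._in_head = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag == "title" and self._in_head:
            self._in_title = True
        elif tag == "meta" and self._in_head:
            meta = dict(attrs)
            name, content = meta.get("name"), meta.get("content")
            if name and content:
                self.attributes[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.attributes["title"] = data.strip()


def extract_head_attributes(html):
    """Return a dict of attribute name -> value found in the HEAD section."""
    parser = HeadAttributeParser()
    parser.feed(html)
    return parser.attributes


def split_by_dc(attributes):
    """Separate attributes that map to a DC element from those that do not."""
    matched = {DC_MAP[k]: v for k, v in attributes.items() if k in DC_MAP}
    unmatched = {k: v for k, v in attributes.items() if k not in DC_MAP}
    return matched, unmatched


# Invented sample page, for demonstration only.
SAMPLE = """<html><head>
<title>Example Page</title>
<meta name="author" content="A. Writer">
<meta name="generator" content="SomeEditor 1.0">
</head><body><p>Hello</p></body></html>"""

attrs = extract_head_attributes(SAMPLE)
matched, unmatched = split_by_dc(attrs)
print(matched)
print(unmatched)
```

In this toy case the `generator` value has no counterpart in the assumed mapping and falls into the unmatched group, which is the kind of gap the direct DC mapping is reported to expose.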