Data set versioning through linked data models
Authors
Lee, Benno
Issue Date
2018-08
Type
Electronic thesis
Thesis
Language
ENG
Keywords
Computer science
Abstract
Our versioning model, implemented as VersOn, enables more complete change capture using linked-data and exposes the information stored in change logs as linked-data. The model achieves distinct AIM change capture, version-attribute connectivity, and version continuity. Including VersOn entries in the change log reduced log performance by more than 50% because modifications were not summarized. This result demonstrates the importance of being able to summarize change when generating manageable change logs. Our implementation of VersOn enables more precise assessment of change than dot-decimal identifiers by counting specific changes rather than using categorical assessments. Using VersOn, we were able to demonstrate methods to more accurately assess data set change rates by using the change time rather than the version publication time and by using the VersOn change rate rather than the version publication rate. Through developing VersOn, we additionally found that the ontology encoding of the AIM model enables data consumers to take more control over the change assessment process and become more independent of the version information defined by the data producer. Analyzing data sets through VersOn also revealed that data sets have a tendency towards a certain type of change as well as specific categories of versioning schedules.
Data sets invariably require versioning systems to manage changes due to an imperfect collection environment. Data versioning systems are employed to manage changes to data, logging new data sets and communicating that change to data consumers. Versioning discussion remains imprecise, lacking standardization or formal specifications. Many works tend to define versions based on previously established data project traditions rather than on an analysis of the versioning characteristics of the data set. Provenance ontologies have begun to address some of the issues in capturing change information as linked-data, but remain incomplete. To improve completeness, a linked-data model was developed to capture addition, invalidation, and modification (AIM) changes, connect versions to changing attributes, and maintain continuity between versions. The Versioning Ontology (VersOn), the instantiation of the model, enabled change logs, the documents that explain the differences between versions, to be exposed as linked-data. Encoding VersOn into the change log affects the performance of the log because more data is included in the document. Data versioning traditions have popularized the use of dot-decimal identifiers, which indicate a categorical difference from the immediately prior version. Using the change counts from applying VersOn, a more precise method of evaluating change distance was generated compared to the categorical differences enabled by dot-decimal identifiers. The greater precision enables changes to be assessed across time rather than across versions, improving the accuracy of a data set’s change rate assessment.
A complete versioning model needed to satisfy three requirements. Requiring a model to capture AIM changes means that all types of changes are captured by the model. Requiring a model to connect versions to attributes means the changing parts of a version are attributed to the correct data object. Requiring that version continuity be preserved means that the version model can capture not just the interaction between two objects, but also a series of versions as commonly seen in large data versioning systems. A provenance-based model was only able to partially achieve version-attribute connections and version continuity. A log-based model was then developed but could not satisfy version continuity. A third, hybrid approach was created but did not meet the version-attribute connection requirement. A fully connected model was studied, but that model could not differentiate between AIM changes. The final model, instantiated as VersOn, separated the fully connected model into three forms to meet the AIM change capture requirement. VersOn explicitly connects version objects to the underlying attributes, fulfilling the version-attribute connectivity requirement. By propagating changes across attributes and maintaining relations between versions and changes in cases of missing attributes, our model maintains version continuity, meeting all requirements for completeness.
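The three requirements can be illustrated with a minimal Python sketch. It is not the VersOn ontology itself: the `Change` record, the `link_change` helper, and the version and attribute identifiers are hypothetical, and the rule that a missing attribute on one side marks an addition or an invalidation is one plausible reading of the "three forms" described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """Hypothetical change record linking two versions through their attributes."""
    kind: str                          # "addition", "invalidation", or "modification"
    previous_version: str
    next_version: str
    previous_attribute: Optional[str]  # None when the attribute did not exist before
    next_attribute: Optional[str]      # None when the attribute no longer exists


def link_change(previous_version, next_version, previous_attribute, next_attribute):
    """Build a change whose AIM form is assumed to follow from which side has an attribute."""
    if previous_attribute is None and next_attribute is None:
        raise ValueError("a change must touch at least one attribute")
    if previous_attribute is None:
        kind = "addition"
    elif next_attribute is None:
        kind = "invalidation"
    else:
        kind = "modification"
    return Change(kind, previous_version, next_version, previous_attribute, next_attribute)


# A series of three versions: the change still relates v2 to v3
# even though the attribute is missing on the v3 side.
changes = [
    link_change("v1", "v2", None, "v2/row-9"),
    link_change("v1", "v2", "v1/row-4", "v2/row-4"),
    link_change("v2", "v3", "v2/row-9", None),
]
kinds = [c.kind for c in changes]
```

Because every change keeps both version identifiers, a chain such as v1, v2, v3 stays connected even where an attribute is absent on one side.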
The instantiation of the versioning model enabled VersOn to be encoded into change logs using the Resource Description Framework in Attributes (RDFa) and JSON for Linked Data (JSON-LD). Adding more data into the change log had an impact on the performance, measured by content per storage size, but the reduction was expected to be no more than 50%. Scripts were used to automatically generate a change log standardized by VersOn, but the impact to performance exceeded expectations. The RDFa change log experienced a 90% reduction in performance while the JSON-LD change log experienced more than 95% reduction in performance. The increased storage utilization by the JSON-LD change log enabled less restricted implementation of VersOn into the change log.
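The performance measure above can be made concrete with a short sketch. The abstract gives only the measure (content per storage size) and the reported reductions, so the entry count and file sizes below are invented for illustration, and counting "content" as the number of change entries is an assumption.

```python
def content_per_byte(entries: int, size_bytes: int) -> float:
    """Performance as content per storage size (content assumed to be entry count)."""
    return entries / size_bytes


def performance_reduction(baseline: float, annotated: float) -> float:
    """Fractional drop in performance relative to the baseline log."""
    return 1.0 - annotated / baseline


# Hypothetical numbers: the same 1000 entries stored in a log that grows tenfold
# once linked-data markup is added.
baseline = content_per_byte(1000, 50_000)
annotated = content_per_byte(1000, 500_000)
reduction = performance_reduction(baseline, annotated)
```

Under this definition, a 90% reduction corresponds to a log roughly ten times larger for the same content, and a 95% reduction to one roughly twenty times larger.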
Versioning graphs instantiated using VersOn separate changes into individual instances organized by type of change, which can be counted to create a change distance. Dot-decimal identifiers only separate the amount of change from the prior version into major, minor, or smaller categories. Using the change counts in the versioning graph increases the precision with which change between versions is assessed. Extending the changes enabled the use of domain knowledge to additionally improve the precision of change assessment as appropriate to the domain. Through analysis of the Global Change Master Directory (GCMD) Keywords Version 8.5, we found that VersOn enables data consumers to assess data change in ways relevant to the consumer and independent of the producer’s assessment of change as indicated by the dot-decimal identifier assigned to the version.
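A count-based change distance can be sketched as follows. The `Change` record, the attribute identifiers, and the choice of a plain sum as the total are assumptions made for illustration; the thesis may weight or combine the counts differently.

```python
from collections import Counter
from dataclasses import dataclass

AIM_KINDS = ("addition", "invalidation", "modification")


@dataclass(frozen=True)
class Change:
    kind: str       # one of AIM_KINDS
    attribute: str  # hypothetical identifier of the changed attribute


def change_distance(changes):
    """Count changes per AIM type and report their sum as a simple distance."""
    counts = Counter(c.kind for c in changes)
    distance = {kind: counts[kind] for kind in AIM_KINDS}
    distance["total"] = sum(distance.values())
    return distance


# Hypothetical changes between two versions of a keyword list.
changes = [
    Change("addition", "keyword/101"),
    Change("addition", "keyword/102"),
    Change("modification", "keyword/7"),
    Change("invalidation", "keyword/3"),
]
distance = change_distance(changes)
```

A consumer who cares mostly about invalidations, for example, can read that count directly instead of relying on whether the producer labeled the release major or minor.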
Change counts collected using VersOn were made across versions rather than across time. Using the version publication time, the change counts of GCMD Keywords were distributed across time. Some of the versions were discovered to bundle changes from much earlier than the publication of the previous version, meaning that the change rate computed from version publication time did not accurately reflect the rate of the bundled changes. Using the start time of the change rather than that of the version more accurately captures the change rate of the data. The data sets collected by the Earth Observing Laboratory (EOL) were found to have AIM change rates consistently different from the version publication rate. Since change counts measure the individual changes in the data set, the VersOn-enabled change counts integrated with time provide a more accurate assessment of the data set change rate.
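The difference between the two time bases can be seen in a small sketch. The dates are invented, and grouping by calendar year is an assumption chosen only to keep the example short.

```python
from collections import Counter
from datetime import date


def changes_per_year(dates):
    """Count changes per calendar year for the given timestamps."""
    return dict(Counter(d.year for d in dates))


# Hypothetical records: (start date of the change, publication date of the
# version that bundled it). All four changes ship in one 2017 release.
records = [
    (date(2015, 3, 1), date(2017, 6, 1)),
    (date(2015, 9, 15), date(2017, 6, 1)),
    (date(2016, 4, 2), date(2017, 6, 1)),
    (date(2017, 5, 20), date(2017, 6, 1)),
]

by_change_time = changes_per_year(start for start, _ in records)
by_publication_time = changes_per_year(published for _, published in records)
```

Counting by publication time places all four changes in 2017, while counting by change start time spreads them over 2015 to 2017, which is the kind of discrepancy described for the bundled GCMD Keywords changes.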
Description
August 2018
School of Science
Publisher
Rensselaer Polytechnic Institute, Troy, NY