A Linked Data Representation for Summary Statistics and Grouping Criteria

McCusker, Jamie
Dumontier, Michel
Chari, Shruthi
McGuinness, Deborah L.
Terms of Use
Attribution-NonCommercial-NoDerivs 3.0 United States
Full Citation
Jim McCusker, Michel Dumontier, Shruthi Chari, Joanne Luciano and Deborah L. McGuinness. A Linked Data Representation for Summary Statistics and Grouping Criteria. Semantic Statistics (SemStats). Co-located with the International Semantic Web Conference, Auckland, NZ, October, 2019.
Summary statistics are fundamental to data science and are the building blocks of statistical reasoning. Most of the data and statistics made available on government web sites are aggregate; however, until now, we have not had a suitable linked data representation available. We propose a way to express summary statistics across aggregate groups as linked data using Web Ontology Language (OWL) class-based sets, where members of the set contribute to the overall aggregate value. Additionally, many clinical studies in the biomedical field rely on demographic summaries of their study cohorts and the patients assigned to each arm. While most data query languages, including SPARQL, allow for computation of summary statistics, they do not provide a way to integrate those values back into the RDF graphs they were computed from. We represent this knowledge, which would otherwise be lost, through the use of OWL 2 punning semantics, the expression of aggregate grouping criteria as OWL classes with variables, and constructs from the Semanticscience Integrated Ontology (SIO) and the World Wide Web Consortium's provenance ontology, PROV-O, providing interoperable representations that are well supported across the web of Linked Data. We evaluate these semantics using a Resource Description Framework (RDF) representation of patient case information from the Genomic Data Commons, a data portal from the National Cancer Institute.
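To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of how an aggregate value might be attached to the group it was computed from. Each group is typed as an OWL class, its members are typed as instances of that class, and the same IRI is then used as an individual that carries the statistic (the punning pattern the abstract mentions). The patient records, the `http://example.org/` namespace, and the `hasAttribute`, `hasValue`, and `Mean` terms are all assumptions made for illustration; the paper uses SIO and PROV-O constructs, whose actual IRIs are not reproduced here.

```python
from statistics import mean

# Hypothetical namespace and term names, for illustration only.
# The paper relies on SIO and PROV-O; these placeholders stand in for them.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"

# Made-up patient records (not taken from the Genomic Data Commons).
patients = [
    {"id": "p1", "arm": "treatment", "age": 60},
    {"id": "p2", "arm": "treatment", "age": 70},
    {"id": "p3", "arm": "control", "age": 50},
]


def summarize(records, group_key, value_key):
    """Group records, compute the mean per group, and return triples
    that attach each mean to the group it was computed from."""
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record)

    triples = []
    for name, members in sorted(groups.items()):
        group_iri = EX + "group/" + name
        stat_iri = group_iri + "/mean_" + value_key

        # The group is declared as an OWL class ...
        triples.append((group_iri, RDF_TYPE, OWL_CLASS))
        # ... and every contributing record is an instance of it.
        for member in members:
            triples.append((EX + "patient/" + member["id"], RDF_TYPE, group_iri))

        # Punning: the same IRI is now used as an individual that
        # carries the aggregate value as an attribute.
        triples.append((group_iri, EX + "hasAttribute", stat_iri))
        triples.append((stat_iri, RDF_TYPE, EX + "Mean"))
        triples.append((stat_iri, EX + "hasValue",
                        mean(m[value_key] for m in members)))
    return triples


triples = summarize(patients, "arm", "age")
for triple in triples:
    print(triple)
```

Because the group membership triples and the statistic share one graph, a consumer can trace a mean back to the records that contributed to it, which is the information a bare SPARQL aggregate result would discard.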