Datasets
========

Creating a dataset
******************

Datasets can be created by clicking the "New Dataset" button on the Data Platform page.

.. figure:: /images/dataplatformhomepage.png
   :scale: 90 %
   :alt: Data Platform Main Page

**Title**

Select a meaningful title for the dataset. The title also forms the basis of the default permalink for the dataset.

**Permalink**

If left blank, the default permalink will be a URL-friendly version of the title (e.g. "My Dataset" would produce the permalink "my-dataset").

.. note:: The permalink of a dataset is used as the base for all data resource URIs. It cannot be changed after initial creation.

**Isolate This Dataset**

Tick this box if you do not want data in this dataset to be returned in queries unless it is explicitly queried.

Dataset Metadata
****************

After a dataset has been created, it is best practice to fill out as much metadata for the dataset as possible. Filling out this information creates a description of the dataset using the VoID vocabulary for describing semantic data. You can read more about the VoID vocabulary in the `VoID specification <https://www.w3.org/TR/void/>`_.

**Title**

The name of the dataset.

**Description**

A textual description of the dataset.

**Creator**

A person, organisation, or service that is primarily responsible for creating the dataset.

**Publisher**

A person, organisation, or service that is responsible for making the dataset available.

**Contributors**

A person, organisation, or service that is responsible for making contributions to the dataset.

.. note:: Creators, Publishers and Contributors are selected from the :ref:`Address Book `

**Created**

Date of creation of the dataset.

**Modified**

Date on which the dataset was changed.

**Issued**

Date of formal issuance (e.g., publication) of the dataset.

**License**

Data without an explicit license is a potential legal liability and leaves consumers unclear about the usage conditions.
Therefore, it is very important that publishers make explicit the terms under which the dataset can be used. The URIs of some licenses designed specifically for data are:

* Public Domain Dedication and License (PDDL) - "Public Domain for data/databases" http://www.opendatacommons.org/licenses/pddl/
* Open Data Commons Attribution (ODC-By) - "Attribution for data/databases" http://www.opendatacommons.org/licenses/by/
* Open Database License (ODC-ODbL) - "Attribution Share-Alike for data/databases" http://www.opendatacommons.org/licenses/odbl/
* CC0 1.0 Universal - "Creative Commons public domain waiver" http://creativecommons.org/publicdomain/zero/1.0/

The use of other licenses that are not designed specifically for data is discouraged, because they may not have the intended legal effect when applied to data.

**Norms**

The community norms for access and use of a resource. Norms are not legally binding, but represent the general principles or "code of conduct" adopted by a community for access and use of resources. Best practice is to use the URI of a document describing these norms as the value of this property.

**Waiver**

The waiver of rights over a resource. Best practice is to use the URI of a waiver legal document as the value of this property.

**URI Lookup Endpoint**

Besides the SPARQL protocol, a simple URI lookup protocol for accessing a dataset can also be described using VoID.

**URI Space**

Used to state that all entity URIs in a dataset start with a given string; in other words, they share a common URI namespace.

**URI Regex Pattern**

In cases where a simple string prefix match is insufficient, the void:uriRegexPattern property can be used. It expresses a regular expression pattern that matches the URIs of the dataset's entities.

**Root Resource**

Many datasets are structured in a tree-like fashion, with one or a few natural "top concepts" or "entry points", and all other entities reachable from these root resources in a small number of steps.
One or more such root resources can be named. Naming a resource as a root resource implies:

* that it is a central entity of particular importance in the dataset; and
* that the entire dataset can be crawled by resolving the root resource(s) and recursively following links to other URIs in the retrieved RDF responses.

Root resources make good entry points for crawling an RDF dataset.

**Example Resource**

For documentation purposes, it can be helpful to name some representative example entities for a dataset. Looking up these entities allows users to quickly get an impression of the kind of data that is present in the dataset.

**SPARQL Endpoint**

A SPARQL endpoint provides access to a dataset via the SPARQL protocol. When a dataset contains semantic data and is published, the Data Platform automatically generates the address of the SPARQL endpoint based on the URI of the dataset's VoID description.

**Data Dumps**

If an RDF dump of the dataset is available, its location can be announced using void:dataDump. The Data Platform fills out the data dump values automatically by providing a link to the exported data dump (these are created after a dataset is published, or when an administrator has used the "Export" function).

Upload Data
***********

Once you have your RDF data files ready, upload them using the "Upload" page. This starts a background task which imports all the data into the dataset. The status of the task can be viewed on the "Background Tasks" tab of the Data Platform page.

Dataset statistics
******************

When semantic data has been uploaded into the dataset, certain statistics are calculated about the data, and those statistics are added to the VoID description of the dataset:

**Triples** : The total number of triples contained in the dataset.

**Entities** : The total number of entities that are described in the dataset.

**Classes** : The total number of distinct classes in the dataset.
**Properties** : The total number of distinct properties in the dataset.

**Distinct Subjects** : The total number of distinct subjects in the dataset.

**Distinct Objects** : The total number of distinct objects in the dataset.

As the calculation of statistics requires queries to be run over the entire set of data, it is done as a background task. You can view the statistics on the "Stats" page, where you can also recalculate the statistics at any time.

Exporting Data
**************

Exporting a dataset's semantic data is done as a background task and creates a file in the website's Downloads folder. The addresses of the files to be downloaded are added to the dataset's "Data Dump" property. Data exports are performed automatically when a dataset is published, but can also be triggered at any other time by an administrator from the "Export Data" page.

Publishing a dataset
********************

When you are happy with the amount of information you have added to a dataset, you can publish that dataset to be viewable by visitors to your website. The public view of a dataset contains the title and all metadata. There are also automatically generated links to the SPARQL endpoint for that dataset, along with links to the RDF of the VoID description.

Deleting Data
*************

On the "Delete Data" page, you can choose to delete only the data from the dataset (leaving the metadata and placeholder in the Data Platform, allowing you to add data again), or you can delete all data *and* all metadata about that dataset.
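The dataset statistics described on this page correspond directly to terms in the VoID vocabulary (void:triples, void:entities, void:classes, void:properties, void:distinctSubjects, void:distinctObjects). As a rough illustration of how such figures can be derived, here is a minimal Python sketch that computes them over a hand-built set of triples and feeds them into a VoID description. The example URIs, the ``void_stats`` helper, and the regular expression are all hypothetical, chosen for illustration; they are not part of the Data Platform.

```python
import re

# Hypothetical in-memory "dataset": (subject, predicate, object) triples.
# All URIs below are invented for illustration.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

triples = {
    ("http://example.org/id/alice", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/id/alice", FOAF + "name", '"Alice"'),
    ("http://example.org/id/bob", RDF_TYPE, FOAF + "Person"),
    ("http://example.org/id/alice", FOAF + "knows",
     "http://example.org/id/bob"),
}

# A void:uriRegexPattern for this dataset's entity URIs (hypothetical).
uri_pattern = re.compile(r"^http://example\.org/id/.+")


def void_stats(triples, uri_pattern):
    """Compute the figures shown on the "Stats" page, as VoID terms."""
    return {
        "triples": len(triples),                              # void:triples
        "entities": len({s for s, _, _ in triples             # void:entities
                         if uri_pattern.match(s)}),
        "classes": len({o for _, p, o in triples              # void:classes
                        if p == RDF_TYPE}),
        "properties": len({p for _, p, _ in triples}),        # void:properties
        "distinctSubjects": len({s for s, _, _ in triples}),  # void:distinctSubjects
        "distinctObjects": len({o for _, _, o in triples}),   # void:distinctObjects
    }


stats = void_stats(triples, uri_pattern)

# The numbers feed straight into the dataset's VoID description:
void_ttl = (
    "@prefix void: <http://rdfs.org/ns/void#> .\n"
    "<http://example.org/datasets/my-dataset> a void:Dataset ;\n"
    f"    void:triples {stats['triples']} ;\n"
    f"    void:entities {stats['entities']} ;\n"
    f"    void:distinctSubjects {stats['distinctSubjects']} ;\n"
    f"    void:distinctObjects {stats['distinctObjects']} .\n"
)
print(void_ttl)
```

Note that void:entities is approximated here as the number of distinct subject URIs that fall inside the dataset's URI space; the platform's exact definition may differ.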