When we publish anything on the web, it should go without saying that the distributed information is only usable if we adhere to standards when we publish it. If everyone had their own way of marking text in a web page as bold, browsers would face an impossible task displaying that text the way it was meant to be shown. That’s why we have the standard markup language HTML, which enables everyone to publish web pages.
Similarly, when we publish data on the web we should look to standards to ensure consistency and interoperability between distributed data publishers and any applications that need to pull that data into their systems. DataDock currently makes use of two main recommended approaches to publishing data on the web: “CSV on the Web” (CSVW) and Linked Data. This blog post concentrates on CSVW; a follow-up will cover Linked Data.
The first stage in DataDock’s processing pipeline is to take a CSV file of tabular data and open it, giving the DataDock user some basic control over how that data should be processed. While that control is currently limited to selecting a datatype for each column, we will be rapidly expanding DataDock’s abilities so users can define more of the structure of the CSV they are publishing.
The W3C’s CSV on the Web Working Group developed specifications for describing tabular data, which makes CSVW the obvious standard for us to adopt when we pull a CSV into the DataDock processing pipeline. There is now a W3C CSV on the Web Community Group, open to all, which focuses on discussion and support, allowing implementors, publishers and spec developers to share experience with CSVW and related ideas.
Using the CSVW specification, DataDock takes the information entered by the user (e.g. title, description, selected license) and combines it with the list of columns to produce a metadata JSON file. An example of one of the metadata JSON files produced during the import of a CSV into DataDock is shown below (only the first four columns are included for brevity):
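The exact contents of the file depend on the uploaded CSV; the sketch below is purely illustrative (the dataset title, license URL, column names and property URLs are all invented for this example) but follows the general shape of a CSVW metadata document:

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "example.csv",
  "dc:title": "Example Dataset",
  "dc:description": "An illustrative dataset description",
  "dc:license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "tableSchema": {
    "columns": [
      {
        "name": "station_name",
        "titles": ["Station Name"],
        "propertyUrl": "http://example.org/id/definition/station_name",
        "datatype": "string"
      },
      {
        "name": "latitude",
        "titles": ["Latitude"],
        "propertyUrl": "http://example.org/id/definition/latitude",
        "datatype": "decimal"
      },
      {
        "name": "longitude",
        "titles": ["Longitude"],
        "propertyUrl": "http://example.org/id/definition/longitude",
        "datatype": "decimal"
      },
      {
        "name": "opened",
        "titles": ["Opened"],
        "propertyUrl": "http://example.org/id/definition/opened",
        "datatype": "date"
      }
    ]
  }
}
```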
In its current release, DataDock auto-creates the column definitions, creating the machine-friendly property_url from each column’s header value, which is also set as the initial value in titles. DataDock currently asks for user input only for the column’s datatype, although it makes a simple guess at the datatype based on the values of the first non-header row of the selected CSV. Future upgrades to the DataDock platform will make use of the more extended definitions of the CSVW standard.
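To illustrate the idea (this is a minimal sketch, not DataDock’s actual code), deriving a machine-friendly identifier from a header and guessing a datatype from the first data row might look like this in Python:

```python
import csv
import io

def header_to_identifier(header: str) -> str:
    """A simplified slug: lower-case the header and replace spaces."""
    return header.strip().lower().replace(" ", "_")

def guess_datatype(value: str) -> str:
    """Naively guess a CSVW datatype from a single cell value."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "decimal"
    except ValueError:
        pass
    if value.lower() in ("true", "false"):
        return "boolean"
    return "string"

def guess_column_datatypes(csv_text: str) -> dict:
    """Guess a datatype for each column from the first non-header row."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    first_row = next(reader)
    return {h: guess_datatype(v) for h, v in zip(headers, first_row)}
```

For example, `guess_column_datatypes("Station Name,Latitude\nEuston,51.5")` would guess a string column and a decimal column. A real implementation would need to handle dates, empty cells and inconsistent rows, which is exactly why DataDock still asks the user to confirm.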
After the CSVW metadata file has been created, DataDock’s conversion process uses it to generate RDF linked data. DataDock then uses the linked data to generate the pages for the data portal, and publishes all this to the user’s target GitHub repository.
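The core of that conversion can be sketched very roughly: each row is given a subject URL, and each cell becomes a statement using the column’s property URL as the predicate. The following Python is an illustration only (the URL patterns are hypothetical, and DataDock’s real converter also handles datatypes, identifier columns and much more):

```python
import csv
import io

def csv_to_ntriples(csv_text: str, row_url_template: str,
                    property_urls: dict) -> list:
    """Convert CSV rows into simple N-Triples statements.

    row_url_template: a pattern for minting one subject URL per row,
      e.g. "http://example.org/data/row-{n}" (hypothetical).
    property_urls: maps each column header to its predicate URL.
    """
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    triples = []
    for n, row in enumerate(reader):
        subject = row_url_template.format(n=n)
        for header, value in zip(headers, row):
            predicate = property_urls[header]
            # All values become plain string literals here; a real
            # converter would apply the CSVW datatype from the metadata.
            triples.append(f'<{subject}> <{predicate}> "{value}" .')
    return triples
```

So a one-row CSV with a single "name" column would yield one triple linking the minted row URL, the column’s property URL and the literal value.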
Each DataDock repository has a “csv” directory containing a sub-directory for each dataset, which holds the original CSV upload along with its CSVW metadata JSON file. You can find the CSV and CSVW metadata relating to the examples from the previous blog post in my GitHub repo.
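For a dataset called, say, “my-dataset” (a hypothetical name for illustration), the layout looks roughly like this:

```
csv/
  my-dataset/
    my-dataset.csv      (the original upload)
    my-dataset.json     (the CSVW metadata file)
```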
For those who want to learn a bit more about CSVW, Gregg Kellogg has written a more in-depth blog post, or if you’re a documentation geek, feel free to dive into the recommendations and notes published by the CSVW working group.
It’s possible that future versions of DataDock will include an API layer, which could handle more complex CSV processing using supplied metadata files. If that’s something you would be interested in, let us know via a feature request on our support site.
The next blog post will dive into the linked data part of the publishing pipeline. That side of things is the most exciting for me personally, because as we enhance user control over the CSVW property_url definitions, we can add features to publish and reuse standard vocabulary terms to define the data published from disparate sources, whilst keeping this standard of data publishing free and as easy as possible for non-technical people to use.
This article is part of the Data Dock Intro series