
The problem

We are generating an ever-increasing volume of data as a society. An unstated goal of OpenStreetMap that many contributors subscribe to is “completeness” or “accuracy”, which works fine when your dataset is small, local and highly detailed, but less so when scaled up to, say, determining whether every traffic-light crossing in the world has tactile paving.
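As a rough illustration of the scale of that question, you can ask Overpass how many crossings in an area carry no tactile_paving tag at all. This is a sketch, not part of any proposal: the endpoint is the public Overpass instance and the bounding box (roughly Greater London) is an arbitrary choice.

```python
# Count pedestrian crossings with no tactile_paving tag in a bounding box.
# Endpoint and bbox are arbitrary illustrative choices.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

query = """
[out:json][timeout:60];
node["highway"="crossing"][!"tactile_paving"](51.28,-0.51,51.69,0.33);
out count;
"""

response = requests.post(OVERPASS_URL, data={"data": query})
response.raise_for_status()
# "out count;" returns a single count element with totals in its tags
count = response.json()["elements"][0]["tags"]["total"]
print(f"Crossings with no tactile_paving tag: {count}")
```

Run that over the whole planet and the answer is far more nodes than any local community can survey by hand.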

So naturally, automation and data imports are where people start to look; and very sensibly there’s a process to propose, review and ingest large datasets.

However, this relies on:

  • Expertise and peer review
  • Honesty and diligence of the importer to have and execute a QA plan
  • A second line of QA tools and mappers to verify and maintain the data

What could we do differently?

In the semantic web/linked data world, two big concepts emerged. The first is the semantic web layer cake, which describes a progression from “machine readable” to “schemas” to “query” to “proof” to “trust”. In OSM terms these map to POIs, tags, Overpass, QA tools like KeepRight or Osmose, and, at the moment, human boots-on-the-ground survey.

The second is 5-star open data, which is focused on the idea that we have a lot of data locked up in silos - and while it would be ideal to align it to every standard and have the highest-quality data possible, 95% of the time it’s better to publish something rather than wait until it’s perfect. So long as data consumers have an idea of the limitations, they can apply judgement when attempting to use it.

What is the current state?

A number of open data portals provide basic indicators of “5-star open data” quality.

In our wiki, we maintain documentation which describes the OSM community’s view on data quality of an external dataset.

We have tags on changesets describing the source.
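For example, import changesets conventionally carry tags declaring their provenance; a minimal sketch, with placeholder values:

```python
# Changeset tags conventionally used to declare an import's provenance.
# The values here are invented placeholders.
changeset_tags = {
    "comment": "Import of example municipal address points",
    "source": "https://example.gov/opendata/addresses",
    "import": "yes",
}
```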

What specifically would we change?

I’m proposing a set of tools, or standard metadata, for annotating external datasets and proposed/approved exports, so that editing and conflation tools can reason about the quality of the data.

For example, if you have a dataset that is derived from OSM, corrects wrong tags, and has been human-verified against a random sample of 5% of the data? That’s a good candidate for letting a maintenance bot operate on it with minimal oversight, and is potentially 5-star quality.

Have a stream of AI-generated shop names from street-level imagery? Tag that as 2/5 and add flags requiring human verification, even if it is one-click approval.
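As a sketch of what such an annotation could look like for that second, lower-trust scenario - every field name here is invented for illustration; no such standard exists yet:

```python
# Hypothetical dataset annotation. None of these field names are an
# existing standard; they sketch the metadata the proposal calls for.
dataset_metadata = {
    "name": "AI-generated shop names from street-level imagery",
    "licence": "CC0-1.0",              # placeholder licence
    "derived_from_osm": False,
    "open_data_stars": 2,              # per the 5-star open data model
    "verification": {
        "method": "ai_generated",      # vs. "survey", "official_register", ...
        "human_sample_rate": 0.0,      # fraction of records human-checked
    },
    "requires_human_review": True,     # even if it is one-click approval
}
```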

What would be the impact?

By having these standards in place, tools that are typically used for bulk imports or conflation can add extra guard rails around the process; and from a community review/import-approval perspective, the discussion becomes one about the higher-risk aspects of an import.

It also then greenlights a degree of automated maintenance activity - after data is imported and mappers are prompted to confirm accuracy on the ground, it becomes lower risk to trust that data source for bots updating existing attributes.
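A conflation or maintenance tool could then gate its behaviour on that metadata. A minimal sketch, reusing the hypothetical fields above - the thresholds are arbitrary illustrations, not a worked-out standard:

```python
# Guard-rail sketch: decide how much automation a dataset's (hypothetical)
# quality metadata justifies. Field names and thresholds are invented.

def bot_edit_policy(meta: dict) -> str:
    """Return how much automation this dataset's metadata justifies."""
    stars = meta.get("open_data_stars", 1)
    sampled = meta.get("verification", {}).get("human_sample_rate", 0.0)

    if stars >= 5 and sampled >= 0.05:
        # OSM-derived, human-verified data: candidate for bot maintenance
        return "automatic"
    if stars >= 3:
        # Decent provenance: surface each change for one-click approval
        return "one_click_review"
    # Everything else goes through the full import review process
    return "manual_import_only"


ai_shop_names = {"open_data_stars": 2, "verification": {"human_sample_rate": 0.0}}
print(bot_edit_policy(ai_shop_names))  # -> manual_import_only
```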

Discussion

Comment from fititnt on 12 August 2024 at 01:55

I like the general idea of what you are talking about. However, it’s a long journey to implement: sometimes the problem is not merely a lack of tools, but that information in different data silos has different ontologies describing its data (making automation not viable at all). On the good side: focusing on one area (such as what can be added to OpenStreetMap) may reduce a lot of the more philosophical problems (like how to categorise abstract concepts such as humans or organisations), but at the same time, plotting the data onto a map somewhat “allows seeing” errors that are not obvious in more abstract categorisations.

If you are not yet, please join groups both inside OSM and Wikidata related to terms like ontology, schema, RDF, and “OSM tagging”.

  • https://m.wikidata.org/wiki/Wikidata:WikiProject_Ontology (click the invite link “Telegram Group”) - this is more focused on Wikidata ontology, but worth reading; structuring information is really challenging
  • https://meta.m.wikimedia.org/wiki/Wikimaps_User_Group (click the link “Wikimaps Telegram Group”) - this one has people from both Wikipedia and OpenStreetMap
  • On OpenStreetMap, I’m unsure if there’s any group focused more on ontology/schema, but the people interested in “OSM Data Items”, who talk about RDF, or who do data integration with Wikidata are good starting points.

(Also feel free to add me on Telegram; it’s the same username I use on OpenStreetMap, just mention that you came from this post)

Comment from CloCkWeRX on 12 August 2024 at 23:24

@fititnt, thanks will think about joining those groups!

You’ve already found some of the other discussions around ingesting alltheplaces.xyz data into OSM with matkoniecz.

But if you haven’t done so already, wander over to the GitHub repo for ATP, which I think has solved a lot of the “what is importable to OSM?” question in a practical way (“schema.org terms/scraped data to OSM tagging schema in GeoJSON”).
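For the curious, ATP publishes its scraper output as GeoJSON, so skimming a file for features that already carry OSM-style tags is straightforward. A rough sketch - the filename is hypothetical and property names vary by spider, so the keys below are illustrative rather than a guaranteed schema:

```python
# Skim an All the Places output file for features with OSM-style tags.
# "example_spider.geojson" is a placeholder; keys vary per spider.
import json

with open("example_spider.geojson") as fh:
    collection = json.load(fh)

osm_like_keys = {"name", "brand", "brand:wikidata", "shop", "amenity", "opening_hours"}

for feature in collection["features"]:
    props = feature.get("properties", {})
    tags = {k: v for k, v in props.items() if k in osm_like_keys}
    lon, lat = feature["geometry"]["coordinates"]  # ATP features are points
    if tags:
        print(f"{lat:.5f},{lon:.5f}: {tags}")
```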

There’s about to be “one click write me an importer/scraper” levels of tooling: https://github.com/alltheplaces/alltheplaces/pull/6526

And then, following that, potentially mashing up the Web Data Commons (approx. 35 GB of N-Quads): https://github.com/alltheplaces/alltheplaces/issues/6344

Both of those end up being a step-change in the level of data gathering; and All the Places is already “too big” to just import recklessly.
