More data musings
The only guarantee about user-entered data is that, given enough entries, it'll be inconsistent :-(
Take, for example, an OpenStreetMap XAPI query to pull out '/api/0.6/*[amenity=post_box]',
which is a nice dataset of ~85k entries that I'll use for some simple analysis.
So, the UK has ~40k postboxes and, according to draco, the entries behind that count are sourced as follows:
13.5k - osm, 26.7k - website.
So, of those 13504 UK postboxes in OSM, how many are run by Royal Mail? (hint - most of them!)
Does the data match?
$ grep "operator" ~/Downloads/data.osm | sort | uniq -c | grep -i royal
1 <tag k='operator' v='Post Office: Royal Mail'/>
1 <tag k='operator' v='royal mail'/>
1 <tag k='operator' v='Royal mail'/>
5065 <tag k='operator' v='Royal Mail'/>
1 <tag k='operator' v='RoyalMail'/>
1 <tag k='operator' v='Royal MAil'/>
1 <tag k='operator' v='Royal Mail Warwick'/>
2 <tag k='operator' v='Royal York'/>
not bad - only a few CaSe sEnsiTive issues to sort out
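One way to sort out the case issues is to normalise the value before counting. This is only a rough sketch: the sample lines are copied from the output above into a temporary file (swap in ~/Downloads/data.osm for the real thing), and lowercasing plus stripping spaces is my own assumption about what "the same operator" means. It folds 'RoyalMail' into 'Royal Mail', but it won't rescue 'Royal Mail Warwick' or tell you that 'Royal York' isn't a postal operator at all.

```shell
# Sample lines taken from the grep output above; use the real data.osm in practice.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tag k='operator' v='royal mail'/>
<tag k='operator' v='Royal mail'/>
<tag k='operator' v='Royal Mail'/>
<tag k='operator' v='RoyalMail'/>
<tag k='operator' v='Royal MAil'/>
<tag k='operator' v='Royal York'/>
EOF

# Extract the operator value, lowercase it, drop spaces, then count each variant.
# (Lowercasing and removing spaces is an assumed normalisation, not an OSM rule.)
normalised=$(sed -n "s/.*k='operator' v='\([^']*\)'.*/\1/p" "$sample" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -d ' ' \
  | sort | uniq -c | sort -rn)
echo "$normalised"
rm -f "$sample"
```

On the sample this collapses the five Royal Mail spellings into a single 'royalmail' row with a count of 5, leaving 'royalyork' on its own.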
What about other operators, say La Poste?
$ grep "operator" ~/Downloads/data.osm | sort | uniq -c | grep -i poste
1 <tag k='operator' v='Bureau de poste'/>
1 <tag k='operator' v='De Post - La Poste'/>
7 <tag k='operator' v='la poste'/>
21 <tag k='operator' v='la Poste'/>
12 <tag k='operator' v='La poste'/>
917 <tag k='operator' v='La Poste'/>
1 <tag k='operator' v='La Poste Belgique'/>
6 <tag k='operator' v='La Poste - De Post'/>
1 <tag k='operator' v='La Poste Suisse'/>
1 <tag k='operator' v='Le Poste'/>
1 <tag k='operator' v='poste'/>
5 <tag k='operator' v='Poste'/>
again - it's the 'long tail' problem. So, out of the ~85k entries, how many unique operators are there?
404 (how apt for a web service)
and of those how many are singles? 222 - OVER HALF!
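For the record, counts like these can be pulled out with the same grep/sort/uniq pipeline plus a little awk. A minimal sketch, run here against a made-up six-line sample rather than the real file (the sample contents are purely illustrative, and this is not necessarily the exact command behind the 404 and 222 figures):

```shell
# Illustrative sample only; point this at ~/Downloads/data.osm for real numbers.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tag k='operator' v='La Poste'/>
<tag k='operator' v='La Poste'/>
<tag k='operator' v='la poste'/>
<tag k='operator' v='Le Poste'/>
<tag k='operator' v='Royal Mail'/>
<tag k='operator' v='Royal Mail'/>
EOF

# One row per distinct operator line, prefixed with how often it occurs.
counts=$(grep "operator" "$sample" | sort | uniq -c)

# Unique operators = number of rows; singles = rows whose count is exactly 1.
unique=$(echo "$counts" | wc -l | tr -d ' ')
singles=$(echo "$counts" | awk '$1 == 1' | wc -l | tr -d ' ')
echo "unique operators: $unique, singles: $singles"
rm -f "$sample"
```

On the sample this prints "unique operators: 4, singles: 2", since 'la poste' and 'Le Poste' each appear only once.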