[Editor's note: The age-old rule for cloropleth mapping that suggests aggregation by multi-scale areal units based on the map's zoom level is slowly seeping into "clustering" for the point-based mashup geo community. This overview from Travellr published on the Google GeoDevelopers blog includes two illustrations that show the power of this technique. I used such a technique (different implementation) on The Washington Post's recent swine flu mapping.]
Republished from Google GeoDevelopers Blog.
Monday, June 22, 2009
Recently, there has been a lot of interest in clustering algorithms. The client-side grid-based MarkerClusterer was released in the open source library this year, and various server-side algorithms were discussed in the Performance Tips I/O talk. We’ve invited the Travellr development team to give us insight on their unique regional clustering technique.
Travellr is a location aware answers service where people can ask travel-related questions about anywhere in the world. One of its features is a map-based interface to questions on the site using Google Maps.
Figure 1. An example of the Travellr Map, showing question markers for Australia.
Clustering for usability
We learned that the best way to display markers without cluttering our map was to cluster our questions depending on how far you zoom in. If the user was looking at a map of the continents, we would cluster our questions into a marker for each continent. If the user zoomed-in to France we would then cluster our questions into a marker for each region or city that had questions. By clustering our data into cities, regions/states, countries, and continents, we could display relevant markers on the map depending on what zoom level the user was looking at.
Optimizing for Clustering
Our next challenge was how to extract clustered data from our database without causing excessive server load. Every time the user pans and zooms on the map, we need to query and fetch new clustered data in order to display the markers on the map. We also might have to limit the data if the user has selected a tag, as we’re only interested in a questions related to a topic (ie: “surfing”). To execute this in real-time would be painstakingly slow, as you would need to to cluster thousands of questions in thousands of locations with hundreds of tags on the fly. The answer? Pre-cluster your data of course!
Step 1. Structure your location data
When a question is asked about a city on Travellr, we also know its region/state, country and continent. We store more than 55,000 location points as a hierarchy, with each location “owning” its descendent nodes (and all of their data). Our locations are stored in a Modified Preorder Tree (also called Nested Sets). Modified Preorder Trees are a popular method of storing hierarchical data in a flat database table, having a focus on efficient data retrieval, and easy handling of sub trees. For each location we also keep a record of its depth within the tree, its location type (continent, country, region/state, or city), and its co-ordinates (retrieved using the Google Maps geocoder).
Step 2. Aggregate your data
We calculate aggregate data for every branch of our locations tree ahead of time. By storing aggregate data for cities, regions/states, countries, and continents, we provide an extremely fast and inexpensive method to query our locations database for any information regarding questions asked about a particular location. This data is updated every few minutes by a server-side task.
Our aggregations include:
- Total question count for a location
- Most popular tags for that location
- Number of questions associated with each of those tags.
How we query our structured, aggregate data on the map
Whenever the user zooms or pans the map we fire off a query to our (unpublished API with the tags they are searching for, the current zoom level, and the edge co-ordinates of the map’s bounding box. Based on the zoom level (Figure 2) we work out whether we want to display markers for continents, countries, states, or cities. We then send back the data for these markers and display them on the map.
Figure 2. Clustering at different zoom levels (blue = continents, countries, pink = states, cities)
So what is the result of structuring and aggregating our data in such a way? It means that we have nicely organized, pre-clustered data that can be read from cheaply and easily. This allows us to provide a super-fast map interface for Travellr that puts minimal load on our infrastructure. Everyone is happy!
Comments or Questions?
We’d love to hear from you if you have any questions on how we did things, or suggestions or comments about Travellr’s map. This article was written by Travellr’s performance and scalability expert Michael Shaw (from Insight4) and our client-side scripting aficionado Jaidev Soin.
Tags: aggregation, algorithm, Australia, cities, clustering, continents, country, edge, geodeveloper, google, grid, group, grouping, markerclusterer, Mashup, neighborhoods, network topology, node, states, structured data, subgroup, topology, travellr