To add some context, there were a couple of things floating around in my head when I put this together. First, you’ll see the influence of Data Driven Journalism (DDJ). Tapping into DDJ is useful because it’s still an emerging discipline, and I feel there is a rich vein of innovation and exploration around distilling data into stories, so there is a lot to learn and borrow from this area. Second, David Sherlock (CETIS) kindly blogged about the conference session, including some questions he would like answered (most of these questions were asked at Tony Hirst’s Visualisation session at Dev8D). Third, Tony’s session also highlighted that the type of visualisation used is very dependent on the audience: you may want something shiny to impress, which strips out a lot of the data, or something more practical.
The presentation is in two parts: communication and data.
Slides 4-9 are designed to show how the same base data can be rendered in different ways:
- It starts with the snowflake video (slide 4), which is visually impressive but at the cost of removing detailed information. For example, it’s hard to see which subjects are getting the most deposits, or to get a sense of who is making the most deposits.
- Slide 5 tries to address the ‘who is making deposits’ question with an interactive map of Jorum submissions reconciled to location.
- Next, slide 6 removes the element of time, condensing over 9,000 individual records into a single image (as this was generated in NodeXL, there is an opportunity to highlight that an interactive version exists).
- In slide 7, the dimension of time is reintroduced with an interactive bubble diagram. There’s an opportunity to highlight some ‘glance-ability’ features of this graph, but also raise concerns about information overload.
- Slide 8 is a heatmap of the data in Google Spreadsheets. At this point we are moving away from the visually impressive and focusing on situated, practical information.
- Slide 9 shows how this information can be shown in a different way. This slide is also an opportunity to highlight the continual struggle to represent the information without diluting it.
- Slide 10 is the New York Times’ ‘The Jobless Rate for People Like You’, included as an example of a way to win this battle. The caveat is that visuals like this are still very bespoke, needing a fair amount of focused development (though it’s worth highlighting that a number of libraries can help get the job done).
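The thread running through slides 4–9, that the same base data can answer different questions depending on how you slice it, can be sketched in a few lines. The records, field names and values below are invented for illustration; the real Jorum dataset has over 9,000 records with richer metadata:

```python
from collections import Counter
from datetime import date

# Invented sample of deposit records: (subject, depositor, date).
deposits = [
    ("Engineering", "inst_a", date(2010, 1, 5)),
    ("Engineering", "inst_b", date(2010, 1, 20)),
    ("History",     "inst_a", date(2010, 2, 3)),
    ("Engineering", "inst_c", date(2010, 2, 14)),
    ("History",     "inst_b", date(2010, 3, 1)),
]

# View 1: which subjects get the most deposits (what the snowflake video hides)
by_subject = Counter(subject for subject, _, _ in deposits)

# View 2: who is making the deposits (the question the map on slide 5 addresses)
by_depositor = Counter(depositor for _, depositor, _ in deposits)

# View 3: collapse individual records into monthly totals, trading
# detail for overview much as slides 6-8 do
by_month = Counter(d.strftime("%Y-%m") for _, _, d in deposits)

print(by_subject.most_common())
print(by_depositor.most_common())
print(sorted(by_month.items()))
```

Each `Counter` is a different rendering of the same base records; everything from the heatmap to the bubble diagram is, underneath, an aggregation like one of these.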
Slide 11 – Whilst I recognise there are ‘rules’ and guidance around graphics and typesetting, and perhaps I would be a better person if I took the time to understand them, for now I am happy in my ignorance, the payback hopefully being that I think differently about the problems.
Damn that dirty data
The second part of the presentation is the story behind how the data used in part one was compiled, cleaned, contextualised and combined. It’ll be a whistle-stop tour of consuming OAI-PMH data in Google Refine, the issues of reconciling records to institutions, and the delights of using Google Refine to clean and combine data.
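In the talk the harvesting happens inside Google Refine, but for a feel of what consuming OAI-PMH actually involves, here is a rough Python sketch using only the standard library. The XML is a trimmed, made-up sample response; a real harvest would fetch something like `?verb=ListRecords&metadataPrefix=oai_dc` from the repository’s endpoint and follow `resumptionToken` elements until none is returned:

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH ListRecords responses carrying Dublin Core
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A trimmed, invented sample of a ListRecords response
sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample resource</dc:title>
          <dc:publisher>Example University</dc:publisher>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(sample)
records = []
for rec in root.findall(".//oai:record", NS):
    # Pull out the Dublin Core fields you want to reconcile/clean later
    records.append({
        "title": rec.findtext(".//dc:title", namespaces=NS),
        "publisher": rec.findtext(".//dc:publisher", namespaces=NS),
    })

print(records)
```

The `publisher` field is roughly where the fun starts: it’s free text, so reconciling it to an actual institution is exactly the kind of messy matching problem Google Refine is good at.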
An opportunity to flag other things done as part of the project.
Mainly just to highlight the range of free/open source tools and libraries out there; I may mention some others. Finally, a reminder that using data is a great way to validate it. There were a number of examples from the project where using the data turned up unexpected results which were traced back to issues with the datasets.
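The ‘using data validates data’ point can be made concrete with a toy sanity check. The records and the specific checks below are invented, but they are the kind of thing that only surfaces once you actually try to aggregate by institution:

```python
# Invented records reconciled to institutions
records = [
    {"institution": "University of Example", "deposits": 120},
    {"institution": "university of example", "deposits": 3},  # case variant
    {"institution": "", "deposits": 41},                      # missing name
]

issues = []
seen = {}  # lower-cased name -> first spelling encountered
for r in records:
    name = r["institution"].strip()
    if not name:
        issues.append(("missing institution", r))
        continue
    key = name.lower()
    if key in seen and seen[key] != name:
        # Two spellings of the same institution would silently split
        # its deposit counts in any per-institution chart
        issues.append(("inconsistent spelling", name, seen[key]))
    seen.setdefault(key, name)

print(issues)
```

Neither problem is visible in the raw dump; both jump out the moment you try to draw deposits-per-institution, which is exactly how several dataset issues were found in the project.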
What do you think?
So this is your opportunity to feedback on the presentation. Do you want a different focus? Are there other tools used in the project you’d like more information on? Should I have included ‘101 ways Google Spreadsheets can annoy you’?