Back in July, I wrote about Big (Location) Data vs. My (Location) Data, which was the theme for a talk I gave at the AGI Northern Conference. The TL;DR premise behind the talk was that the location trail we generate on today’s interweb is part of our own digital history and that there’s a very one sided relationship between the people who generate this digital stuff and the organisations that aim to make money out of our digital stuff.
Once I’d given that talk, done the usual blog write up and posted it, I considered the topic done and dusted and I moved onto the next theme. But as it turns out, the topic was neither done, nor dusted.
Firstly Eric van Rees from Geoinformatics magazine mailed me to say he’d liked the write up and would I consider crunching down 60 odd slides and 3000 odd words into a 750 word maximum column for the next issue of the magazine.
And then a conversation on Twitter ensued where some people immediately saw the inherent value in their personal location history whilst some people … didn’t.
That conversation was enough to make me go back and revisit the theme and the talk morphed and expanded considerably. Fast forward to this week and I’ve given the talk in its new form twice, once at Nottingham University’s GeoSpatial faculty and once at the Edinburgh Earth Observatory EOO-AGI(S) seminar series at Edinburgh University.
Maybe now this topic and this talk are finished and it’s time to move on. But somehow, I think this will be a recurring theme in talks to come over the next few years.
The slides from the talk are below and the notes accompanying those slides are after the break.
So, hello, I’m Gary and I’m from the Internet. I’m a self-confessed map addict, a geo-technologist and a geographer. I’m Director of Web & Community for Nokia’s Location and Commerce group. Prior to Nokia I led Yahoo’s Geotechnologies group in the United Kingdom. I’m a founder of the Location Forum, a co-founder of WhereCamp EU, I sit on the Council for the AGI, the UK’s Association for Geographic Information, I’m the chair of the W3G conference and I’m also a Fellow of the Royal Geographical Society.
There are URLs in this talk but this is the only URL in the entirety of this talk you might want to take a note of. If you go there right now, it’ll 404 on you, but later today or tomorrow this is where this slide deck, my notes and all the links you’ll be seeing will appear on my blog. That’s an upper case “I” and a zero at the end of the URL by the way …
This is not a talk about GIS. This isn’t even a talk about GI or geographical information in the usual sense of the words. Nor is this the talk I sat down and started to write. That talk was going to be about how maps are now mainstream and how we’ve managed to find ourselves in the middle of something that could be called a ‘map war’, with Nokia, TomTom, Google, Apple and OpenStreetMap battling it out for overall geospatial supremacy. But I didn’t write that talk. The topic reeked of far too much schadenfreude for me to be comfortable with it. So I stopped writing that talk and started to think about another suitable theme. Then something happened.
A while back I’d written a talk about the digital history that we are currently creating on the internet. The talk was called ‘Big Data vs. My Data’. I gave the talk at two conferences and it seemed to go down well, which is always gratifying.
So I filed the talk away, wrote a blog post on it, and considered the topic pretty much finished. It wasn’t.
Then Eric van Rees, the editor of Geoinformatics Magazine, got in touch. He said that he’d liked the blog post I’d written and the slide deck notes and would I be willing to convert the talk into a magazine column. So I sat down and tried to condense a 3000 odd word talk, spread over around half an hour, into a 750 word printed column. Eventually I succeeded, it got published and people seemed to like it. This was also gratifying.
So now I really considered the topic pretty much finished. It still wasn’t.
The topic ended up spawning one of those long conversations on Twitter, where some people agreed with me and some … didn’t.
So I went back and revisited the topic and decided it really wasn’t finished. Hopefully this version is the final definitive finished version.
This is a talk that goes off in lots of different directions but fundamentally it’s about these two sets of geographical coordinates. Most people here should recognise them as two sets of latitude and longitude. Some of the frighteningly scary people I’ve worked with could probably tell you what country they’re in, just by looking at them. A few really frighteningly scary people that I know could probably even tell you what city they’re in. But I won’t make you do that. The first coordinate is where I live, near Twickenham Rugby Stadium in West London. The second is pretty much where we are now, in the Old Library in the University of Edinburgh. Why this talk is about these two sets of coordinates, and quite a few other coordinates besides, will, I hope, become clearer over the next half an hour or so.
One of the things I love about writing a talk is how the things I hear and the things I read and write get mentally stored away and then, somehow, they start to draw together to form a semi-coherent narrative around the talk title that I inevitably gave to the conference organisers around 3 months prior. So it is with this talk, which in Sesame Street fashion, has been unknowingly brought to you by …
Kellan Elliott-McCrea, previously at Flickr and Yahoo! and now at Etsy …
Aaron Straup Cope, previously at Flickr and Stamen Design and now doing stuff at the Smithsonian …
… and my children. No, really. This isn’t just an excuse to put a photo of my family up on the screen behind me so you can all, hopefully, go “awww”.
But before I get into anything to do with making history, big data, my data or anything interweb or social network related I want to try and frame the context of my thoughts by talking about communication, or to be more precise, the way in which we communicate. We are, politics and warfare aside, a social species and communicating with each other is something we do a lot of, although the manner in which we communicate has changed a lot.
A lot of our communication is both verbal and non-verbal and relies on face to face, person to person, proximity so that the verbal and non verbal approach comes together to express what we intend to say.
Some of our communication is written, the old fashioned way, using pen and paper, although a lot of commentators have called out the “death of the letter”. Whether that’s true or just good headline making hyperbole remains to be seen, but to be fair, I can’t remember the last time I actually sat down and wrote a letter.
A lot of our communication is still verbal but via a phone, be that a land line or a mobile. We call and we text. A lot.
But be it talking face to face, texting someone or even writing an email, the intended audience is still narrow, person to person, or person to small audience.
But the interwebs have added to this sphere of communications and now we broadcast our thoughts, feelings and experiences, sometimes regardless of whether we think anyone will see this, let alone empathise or communicate back.
While we still talk, meet, engage and sometimes broadcast, like I’m doing right now, this human-to-human interaction has been augmented, maybe complemented, by electronic communications.
We’re as likely to post a Tweet on Twitter or a status on Facebook or Google+ or another social network as we are to speak face to face.
And because this type of communique is electronic, that means it generates data as we go. Today we generate lots of data, big data, on a daily basis. It’s probably not unfair to say that there’s data being generated in this very auditorium, right now, as I’m saying this.
We all seem to be doing this, though ‘all’ is a sweeping over-generalisation, but enough of us are making digital ‘stuff’ for it to start to matter and for it to start to be significant.
Some of this data is implicit. A by-product of what we’re doing. Whether it’s our cell phones loosely mapping out where we are (not a privacy invasion I hasten to add, but the simple way in which cellular networks work, and that’s a topic for another talk on another day) or our GPS navigation, be it built into our car or our smartphone, providing anonymised traffic data probes to show where freeway congestion is, we don’t consciously set out to generate this data. It’s a by-product of what we’re doing.
But a lot of this data is very much explicit. We type out a status update on our phone, our tablet, our laptop and we tap or click on the button that says “go” or “submit” or we take a photo, maybe add an image filter or a comment and tap or click the button that says “share” or “upload”.
By doing this we’re explicitly communicating, explicitly broadcasting and sharing with our friends, family, followers and the interwebs in general … and in doing so, we’re playing our part in generating more and more data.
And generate it we do. Lots of it. We call it big data, but massive data would be a more accurate definition of it. Whilst our own individual contributions to big data may not be that big, when you put it all together it’s part of an ever growing corpus of big data and there are companies that both provide the means for us to broadcast and share this data and, hopefully, find in it a means of revenue that enables them to keep doing this. The amount that gets generated each day is almost too much for us to think about and comprehend. Once a number gets that big, we can’t really deal with it. We know it’s a big number but what that actually represents is hard for us to get our heads around.
So let’s look at just a small sample of what gets generated on a daily basis from the social big data, communicating, sharing and broadcasting services I tend to use, if not on a daily basis then at least on a weekly basis. I Tweet and update my Facebook status at least once a day, sometimes up to 20 times a day. I check-in to places on Foursquare at least 10 times a day and take and upload photos to Instagram and Facebook around 3 times a week. That’s just my contribution, think how many people are doing the same thing to get to the sort of volumes you can see on the slide behind me.
As a specific example, I post a single Tweet on Twitter. Weighing in at 72 characters, including spaces and punctuation, it’s only just over half of Twitter’s 140 character maximum. That Tweet is assigned a unique identifier by Twitter, which forms part of the unique URL to that single Tweet. From visiting that URL I can see that Twitter has added who I am, when I posted that Tweet and because I geotagged the Tweet, also where I was when I wrote it. So that’s a little more additional metadata than the 72 characters of the Tweet itself.
But if I then take that unique identifier and fire it back at Twitter’s API, I start to see just how much metadata has been added.
115 lines of JSON come back to me from that API call, making up 3,338 characters. There’s metadata on the Tweet itself, when it was created, the text of the Tweet, what app I was using to Tweet with, there’s information on my Twitter account, my name, my Twitter name, my account’s unique identifier, my general location, my biography, all the stuff that’s in my Twitter profile. There’s how many Tweets I’ve posted (14,811 at the time), how many followers I have, how many favourites I’ve flagged, how many Twitter lists I appear in. There’s the details of my profile on Twitter’s web site, HTML colours, profile image URL and the like. And because I’ve geotagged the Tweet, there’s the full geographic information about where I was including a bounding box of the locality.
All of a sudden I can see just how Big Data got its name.
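To put a rough number on that, here is a minimal sketch in Python that compares the size of a Tweet's text with the size of the whole payload that comes back. The sample payload is a made-up, heavily trimmed stand-in and its field names and placeholder values are assumptions for illustration only; it is not Twitter's actual response.

```python
import json

# Hypothetical, heavily trimmed stand-in for an API response. The field
# names and values are illustrative assumptions, not Twitter's real schema.
SAMPLE_RESPONSE = """
{
  "id_str": "0000000000",
  "created_at": "(timestamp)",
  "text": "An example status update, standing in for the real Tweet text.",
  "source": "(client app)",
  "user": {
    "screen_name": "(account name)",
    "location": "(profile location)",
    "statuses_count": 0,
    "followers_count": 0
  },
  "place": {
    "bounding_box": {"type": "Polygon", "coordinates": []}
  }
}
"""


def metadata_overhead(raw_json: str) -> dict:
    """Compare the size of the Tweet text with the size of the whole payload."""
    payload = raw_json.strip()
    tweet = json.loads(payload)
    text_chars = len(tweet["text"])
    total_chars = len(payload)
    return {
        "text_chars": text_chars,
        "total_chars": total_chars,
        "lines": len(payload.splitlines()),
        # Characters of payload returned per character actually typed.
        "overhead_ratio": round(total_chars / text_chars, 1),
    }


if __name__ == "__main__":
    print(metadata_overhead(SAMPLE_RESPONSE))
```

Run against the figures from my own Tweet, 3,338 characters of JSON for 72 characters of text, the same ratio works out at roughly 46 characters of payload for every character I typed.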
But how long will all of this continue? Remember the people I spoke about right at the start of this talk, some 16 slides back? It’s time to bring them into the picture. Firstly, my children, although this applies equally to pretty much all children. Remember when you were a child? The summer holiday was endless. The skies were always blue and the sun was always out (remember, I’m from the UK where Summer and sun do not always go together, in fact it was pouring down with rain as I wrote this at home last week). And just like the summer holiday was endless, so were your parents and the people around you, they were eternal and would always be there. Remember feeling like that? But then the inevitable happened. We grew up and we discovered, often the hard way, that the summer wasn’t endless and that almost everything is finite.
Social networks aren’t infinite either. They get born, if they’re lucky they grow and then at some time or other they … stop. If it’s a social network we don’t use then it doesn’t really bother us much.
But if it’s a network you’ve shared a lot of content through, what happens then? A lot of people, myself included, immediately get into “I want my data back” mode.
But is it your data? Of course it is. You made it. You composed that Tweet. You shared that link. You took that photo. You were at that place you checked-in at. Of course it’s your data.
But there’s a point to be made here. You may have created that data, you may own that data, but the copy of that data in that social network is just that. It’s a copy. It’s not necessarily “your” data and because most of us don’t preserve what we send up into the cloud on its way to our social networks, you may have created it, but the copy in the cloud isn’t necessarily yours.
It’s an easy mistake to make. I may be a geo-technologist and many more things besides, but I am not a lawyer, and apart from the lawyers in the room, most of you aren’t and most of the people who use social networks aren’t lawyers either, unless it’s DeferoLaw, which is a social network for the legal profession.
… we see phrases like “you retain your rights” …
… another favourite is “you own the content you posted”
… and “you always own your information” and immediately the subtleties and complexities of data ownership, licensing, copyright and intellectual property are cast aside. We say to ourselves, “it’s my data dammit, I own it, I want it”.
And it’s this belief that we really are lawyers in our spare time that makes people think that somehow the data they’ve shared via a social network is physically theirs, rather than a bit for bit perfect copy that we’ve licensed to that social network. We forget for a moment that we’re using that social network as a cloud based backup, in some cases the only backup, of our creations and we mutter darkly about “holding my data hostage”.
The blunt, and often harsh reality, is the age old adage that “you get what you pay for”. If you pay, you’re probably a customer. If you’re using something for “free” (and I say free in very large italics and inverted commas here), then you’re probably, unknowingly or unwittingly, the product. Harsh. But fair. It’s our content that the social networks monetize and that allows them to keep their servers and disk storage up and running. You might have seen that previous slide with the Tech Crunch post and be thinking “ah, but Flickr Pro is chargeable and if my subscription lapses I can’t get my photos back”. That’s actually not really true, if not particularly simple, but bear with me for a few more slides.
Now let’s forget “big data” for a moment and think about “your data” instead. Actually, let’s think about “my data” for a moment. As of last week, my social media footprint on Twitter, Foursquare, Instagram and Flickr looked something like this. Facebook’s numbers would be up there too, but I’ll get to that in a moment.
Now in the grand scheme of things, in the massive numbers thrown about around about “big data” this is but a drop in the ocean. But …
I created these check-ins, status updates, tweets and photos. They’re important to me. Very important to me.
And as Aaron Cope pointed out earlier this year, my small, insignificant contribution to big data is part of my own, very subjective, very personal, history.
As I may have mentioned before, I’m a geo-technologist and a high percentage of my explicit big data contribution has a geo or location component to it. I’d like to map out where I checked-in, I’d like to see where I was when I Tweeted or what photos I took at a particular location. Some of this “mappyness” already exists in some of the big data stores where my contributions live, but not all of it, it’s far too niche and personal for that. But it’s still important to me.
Remember, in 99% of the social networks I use, I’m not the customer, I’m contributing to the product. But how do my regularly used social networks fare here? Regardless of whether I own the data I put up there, how easy is it to get a copy of?
Firstly, what about a one click solution? Can I go to a particular page on the web and click the big button which says “give me a copy of my data”?
Facebook is the only one of my 5 social networks that does this. Well, it almost does this. At least I’m sure I used to be able to do this.
I can still request a download of my information. But it now only seems to give me my status updates since I enabled Timeline on my account, though I can still get all of my photos and messages since 2008. Rather than say that this doesn’t work, I’ll just file this under “needs further investigation” and move on.
Sometimes this lack of a one button download of contributed data is a deliberate decision on the part of a given social network. Sometimes, it’s a hope that with an API, some enterprising developer will do this, but that doesn’t always happen.
So talking of APIs, surely the remaining social networks will have an API and let me knock up some code to get a copy of my data contributions. Surely?
Not all social networks do. An API tends to come after a social network’s launch, if it comes at all, and often it doesn’t let me do all that I want to do.
Thankfully, all the networks I use, with the exception of Twitter, not only provide an API, but let me use that API to get my data. All of my data.
This is a good thing and satisfies what Kellan Elliott-McCrea calls “minimal competence” for an API. He went on to say:
“The ability to get out the data you put in is the bare minimum. All of it, at high fidelity, in a reasonable amount of time.
The bare minimum that you should be building, bare minimum that you should be using, and absolutely the bare minimum you should be looking for in tools you allow and encourage people who aren’t builders to use.”
Kellan was behind Flickr’s API and his sentiments are, to my mind, admirable.
Sadly, Twitter doesn’t let me do this and fails the minimal competence test miserably. Deep in their API documentation I found the justification for this as being essential to ensure Twitter’s stability and performance and leave it as an exercise to you the audience to work out what I think of this excuse.
The sad truth here is that when it comes to our own individual online data history, there’s not always a willingness to make it easy for us to get copies of our history, if it’s even on the radar at all.
But if we can’t always get our data history back, maybe the solution is to make an archive of it before it goes in or keep that archive up to date as you go … a personal digital archive or PDA (and not to be confused with personal electronic organisers, or PDAs, such as the Palm Pilot).
Thanks to web APIs and another social network, admittedly one for people who know how to code, a lot of this is already possible and the scope, range and functionality is growing by the day. The irony that I can build my own personal digital archive out of code found on another social network, which itself is built around a source code archival system is not lost on me either.
So, firstly, there’s my own Instagram (and no, I’m not going to share the URL of where this lives I’m afraid. The idea here is that this is a personal archive, not a clone of a social network).
My own Instagram is called parallel-ogram. It’s on GitHub; you can download it, configure it, run it. For free.
Parallel-ogram works as well on my phone as it does on my laptop, showing me exactly what I’ve uploaded to Instagram. Indeed, it goes one step further than Instagram as currently there’s no way to see what you’ve uploaded other than through their mobile app. Parallel-ogram doesn’t allow me to take photos or upload them, at least not yet, but it does allow me to go back to the day I first uploaded a photo, it grabs copies for me and twice a day it uses the Instagram API to see what I may have uploaded, quietly grabs a copy and stashes it away for me.
There’s also my own archive of Foursquare …
It’s called privatesquare and it’s also on GitHub.
Like parallel-ogram, privatesquare quietly uses the Foursquare API to go back to my first check-in and twice a day quietly synchs my check-ins for me. I can go back and look at them, see maps of them and browse my check-in history. Unlike parallel-ogram, privatesquare also allows me to check-in, even if I don’t want to share this with Foursquare. In short it allows me to use it both as an archive and also as a check-in tool, and if I want to use Foursquare’s official mobile app, I can do that, safe and secure in the knowledge that privatesquare will keep itself up to date.
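The pattern both of these tools follow, as I've described it, is a full walk back to the very first item followed by small, regular top-ups. Here's a minimal sketch of that idea in Python. It is not the actual code of parallel-ogram or privatesquare; `fetch_page` is a hypothetical stand-in for a real API client and the "newest first, in pages" behaviour is an assumption.

```python
from typing import Callable, Dict, List

Item = Dict[str, str]


def sync_archive(
    archive: Dict[str, Item],
    fetch_page: Callable[[int], List[Item]],
    max_pages: int = 100,
) -> int:
    """Copy items not yet in the archive and return how many were added.

    fetch_page(n) is a hypothetical stand-in for a real API client. It is
    assumed to return page n of my items, newest first, and an empty list
    once there is nothing left.
    """
    added = 0
    for page in range(max_pages):
        items = fetch_page(page)
        if not items:
            break  # walked all the way back to the first item
        new_items = [item for item in items if item["id"] not in archive]
        for item in new_items:
            archive[item["id"]] = item
            added += 1
        if len(new_items) < len(items):
            break  # hit something already archived, so we're caught up
    return added
```

On the first run the archive is empty, so the loop walks every page back to the first check-in or photo. On later runs, scheduled twice a day with something like cron, it stops as soon as it meets an item it already holds.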
My photos also end up on Flickr and there’s a private archive of that too.
It’s called parallel-flickr, it also lives on GitHub and it’s also filed under “something I really must install, configure and get running when I have some spare time”.
So I have my own archives of Instagram, Flickr and Foursquare. I sort of have my own archive of Facebook. But what about my Tweets?
Well until Twitter decides that their site is stable enough to let me grab my Tweets through their archive, the next best solution is to archive by another means. I’ve put the RSS feed to my Tweet-stream into Google Reader, which helpfully never throws anything away. I did this a long time ago and I have almost all, but not 100%, of my Tweets. Now all I need to do is write some code to read them from Google Reader and then get the Tweet data from Twitter, which they do allow via their API. Sadly, this is also filed under “something I must do when I have the time”. It’s not perfect, but then again, none of what I’ve discussed is, but it’s a start and that’s good enough for the time being.
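The first half of that unwritten code might look something like the sketch below, in Python. It assumes the archived feed can be exported as plain RSS and that each item's link ends in the Tweet's numeric identifier after `/status/` or `/statuses/`; both are assumptions about the feed rather than anything I've checked, and the function name is mine.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Assumed link shape: .../status/<digits> or .../statuses/<digits>
STATUS_ID = re.compile(r"/status(?:es)?/(\d+)")


def tweet_ids_from_rss(rss_xml: str) -> List[str]:
    """Pull unique Tweet identifiers out of the <link> elements of a feed."""
    root = ET.fromstring(rss_xml)
    ids: List[str] = []
    for link in root.iter("link"):
        match = STATUS_ID.search(link.text or "")
        if match and match.group(1) not in ids:
            ids.append(match.group(1))  # keep feed order, skip duplicates
    return ids
```

Each identifier that comes out could then be fired back at the API, one Tweet at a time, to recover the full metadata.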
Finally, you might have noticed the links in my slides look sort of like bitly links, only on the vtny.org domain. That’s because I’ve been archiving my short links for a few years now
Using my own short URL archive and my own, self hosted, URL shortener. I just thought I’d mention that.
So, my big data contribution, my personal online history, is important to me. Yours might be important to you too. We’re often told that we can’t have our cake and eat it, but with the advent of the personal digital archive, maybe we can thanks to the enterprising people who create APIs in the first place and those who not only use these APIs but also put their code up for all the world to use, free of charge. Your online history may not be that important in the grand scheme of things, but it’s your online history, it’s personal, you made it. When social networks go to the place where software goes to die, you might just want to preserve that personal history before the servers get powered off forever. Maybe the geeks will inherit the Earth after all.
I want to wrap up with a slightly cautionary tale, which highlights why our digital stuff and interweb history might just be important in ways you might not immediately think of. A friend of a friend, called Claudio, received a call from the British Transport Police in June of last year. There’d been an assault at Leicester Square Tube station in which an unfortunate individual ended up with broken ribs. The Police had evidence that placed Claudio at the Tube station at the time the assault took place. Could he explain what he was doing at that place and time? It’s worth noting here that the assault had taken place in December of 2010, almost 7 months prior.
I wonder how many of us could say with certainty where we were, what we were doing and whether there was anyone to corroborate this without recourse to some form of aide memoire.
For Claudio, it was entirely feasible that he was at Leicester Square on the night of the assault but worryingly, there were large gaps in his recollection and that of his friends.
Thankfully, by mining his web history and that of his friends he was able to piece together the events of the night, with some additional proof in the form of geotagged photographs.
As the cliche goes, Claudio was eliminated from the enquiries but what I find particularly telling about this anecdote is the strong web history and Big Data elements to it. The initial accusation was built on Big Data, namely Claudio was one of those people who used his Oyster Card to enter the Tube station, which left a date and time stamped record. In fact, the date and time that he entered the station was precisely the same time that the person, captured on CCTV, entered the station. Once the full picture was in place, it could be seen that Claudio was not the suspect that the Police were looking for. But not only was the potential accusation built on Big Data but the defence, the alibi and the proof of his innocence were built on Big Data and people’s web histories as well.
It wouldn’t be an outrageous prediction to say that this sequence of events might start playing itself out a lot more in the not too distant future as we grow ever more reliant on web based services and Big Data stores, and as those data stores start to be interlinked.
The whole tale is worth a read; you’ll find it at the end of the URL on the screen behind me.
Thank you for listening.