Of Digital "Stuff" And Making Your Personal Interweb History • Gary Gale

Back in July, I wrote about Big (Location) Data vs. My (Location) Data, which was the theme for a talk I gave at the AGI Northern Conference. The TL;DR premise behind the talk was that the location trail we generate on today's interweb is part of our own digital history and that there's a very one sided relationship between the people who generate this digital stuff and the organisations that aim to make money out of our digital stuff.

Once I'd given that talk, done the usual blog write up and posted it, I considered the topic done and dusted and I moved onto the next theme. But as it turns out, the topic was neither done, nor dusted.

Firstly Eric van Rees from Geoinformatics magazine mailed me to say he'd liked the write up and would I consider crunching down 60 odd slides and 3000 odd words into a 750 word maximum column for the next issue of the magazine.

"Still Waiting To See The Value ..."

And then a conversation on Twitter ensued where some people immediately saw the inherent value in their personal location history whilst some people ... didn't.

That conversation was enough to make me go back and revisit the theme and the talk morphed and expanded considerably. Fast forward to this week and I've given the talk in its new form twice, once at Nottingham University's GeoSpatial faculty and once at the Edinburgh Earth Observatory EEO-AGI(S) seminar series at Edinburgh University.

Maybe now this topic and this talk are finished and it's time to move on. But somehow, I think this will be a recurring theme in talks to come over the next few years.

The slides from the talk are below and the notes accompanying those slides are after the break.

[scribd id=111913058 key=key-15vmdecagp3xopiyihgt mode=list]

Slide 2

So, hello, I’m Gary and I'm from the Internet. I’m a self-confessed map addict, a geo-technologist and a geographer. I’m Director of Web & Community for Nokia’s Location and Commerce group. Prior to Nokia I led Yahoo’s Geotechnologies group in the United Kingdom. I’m a founder of the Location Forum, a co-founder of WhereCamp EU, I sit on the Council for the AGI, the UK’s Association for Geographic Information, I’m the chair of the W3G conference and I’m also a Fellow of the Royal Geographical Society. Slide 3

There are URLs in this talk but this is the only URL in the entirety of this talk you might want to take a note of. If you go there right now, it'll 404 on you, but later today or tomorrow this is where this slide deck, my notes and all the links you'll be seeing will appear on my blog. That’s an upper case “I” and a zero at the end of the URL by the way …

Slide 4

This is not a talk about GIS. This isn't even a talk about GI or geographical information in the usual sense of the words. Nor is this the talk I sat down and started to write. That talk was going to be about how maps are now mainstream and how we’ve managed to find ourselves in the middle of something that could be called a ‘map war’, with Nokia, TomTom, Google, Apple and OpenStreetMap battling it out for overall geospatial supremacy. But I didn’t write that talk. The topic reeked of far too much schadenfreude for me to be comfortable with the topic. So I stopped writing that talk and started to think about another suitable theme. Then something happened. Slide 5

A while back I’d written a talk about the digital history that we are currently creating on the internet. The talk was called ‘Big Data vs. My Data’. I gave the talk at two conferences and it seemed to go down well, which is always gratifying. Slide 6

So I filed the talk away, wrote a blog post on it, and considered the topic pretty much finished. It wasn’t. Slide 7

Then Eric van Rees, the editor of Geoinformatics Magazine, got in touch. He said that he’d liked the blog post I’d written and the slide deck notes and would I be willing to convert the talk into a magazine column. So I sat down and tried to condense a 3000 odd word talk, spread over around half an hour, into a 750 word printed column. Eventually I succeeded, it got published and people seemed to like it. This was also gratifying.

Slide 8

So now I really considered the topic pretty much finished. It still wasn’t. Slide 9

The topic ended up spawning one of those long conversations on Twitter, where some people agreed with me and some …. didn’t. Slide 10

So I went back and revisited the topic and decided it really wasn’t finished. Hopefully this version is the final definitive finished version. Slide 11

This is a talk that goes off in lots of different directions but fundamentally it’s about these two sets of geographical coordinates. Most people here should recognise them as two sets of latitude and longitude. Some of the frighteningly scary people I’ve worked with could probably tell you what country they’re in, just by looking at them. A few really frighteningly scary people that I know could probably even tell you what city they’re in. But I won’t make you do that. The first coordinate is where I live, near Twickenham Rugby Stadium in West London. The second is pretty much where we are now, in the Old Library in the University of Edinburgh. Why this talk is about these two sets of coordinates, and quite a few other coordinates besides, will, I hope, become clearer over the next half an hour or so.

Slide 12

One of the things I love about writing a talk is how the things I hear and the things I read and write get mentally stored away and then, somehow, they start to draw together to form a semi-coherent narrative around the talk title that I inevitably gave to the conference organisers around 3 months prior. So it is with this talk, which in Sesame Street fashion, has been unknowingly brought to you by ... Slide 13

Kellan Elliott-McCrea, previously at Flickr and Yahoo! and now at Etsy ...

Aaron Straup Cope, previously at Flickr and Stamen Design and now doing stuff at the Smithsonian ...

... and my children. No, really. This isn't just an excuse to put a photo of my family up on the screen behind me so you can all, hopefully, go "awww". Slide 14

But before I get into anything to do with making history, big data, my data or anything interweb or social network related I want to try and frame the context of my thoughts by talking about communication, or to be more precise, the way in which we communicate. We are, politics and warfare aside, a social species and communicating with each other is something we do a lot of, although the manner in which we communicate has changed a lot.

A lot of our communication is both verbal and non-verbal and relies on face to face, person to person, proximity so that the verbal and non verbal approach comes together to express what we intend to say. Slide 15

Some of our communication is written, the old fashioned way, using pen and paper, although a lot of commentators have called out the "death of the letter". Whether that's true or just good headline making hyperbole remains to be seen, but to be fair, I can't remember the last time I actually sat down and wrote a letter. Slide 16

A lot of our communication is still verbal but via a phone, be that a land line or a mobile. We call and we text. A lot.

Slide 17

But be it talking face to face, texting someone or even writing an email, the intended audience is still narrow, person to person, or person to small audience.

But the interwebs have added to this sphere of communications and now we broadcast our thoughts, feelings and experiences, sometimes regardless of whether we think anyone will see this, let alone empathise or communicate back. Slide 18

While we still talk, meet, engage and sometimes broadcast, like I'm doing right now, this human-to-human interaction has been augmented, maybe complemented, by electronic communications.

Slide 19

We're as likely to post a Tweet on Twitter or a status on Facebook or Google+ or another social network as we are to speak face to face. Slide 20

And because this type of communique is electronic, that means it generates data as we go. Today we generate lots of data, big data, on a daily basis. It's probably not unfair to say that there's data being generated in this very auditorium, right now, as I'm saying this. Slide 21

We all seem to be doing this, though ‘all’ is a sweeping over generalisation, but enough of us are making digital ‘stuff’ for it to start to matter and for it to start to be significant. Slide 22

Some of this data is implicit, a by-product of what we're doing. Our cell phones loosely map out where we are; that's not a privacy invasion, I hasten to add, but simply the way in which cellular networks work, and a topic for another talk on another day. Our GPS navigation, be it built into our car or our smartphone, provides anonymised traffic data probes to show where freeway congestion is. In neither case do we consciously set out to generate this data. It's a by-product of what we're doing.

Slide 23

But a lot of this data is very much explicit. We type out a status update on our phone, our tablet, our laptop and we tap or click on the button that says "go" or "submit" or we take a photo, maybe add an image filter or a comment and tap or click the button that says "share" or "upload". Slide 24

By doing this we're explicitly communicating, explicitly broadcasting and sharing with our friends, family, followers and the interwebs in general ... and in doing so, we're playing our part in generating more and more data.

Slide 25

And generate it we do. Lots of it. We call it big data, but massive data would be a more accurate definition of it. Whilst our own individual contributions to big data may not be that big, when you put it all together it's part of an ever growing corpus of big data, and there are companies that provide the means for us to broadcast and share this data and that, hopefully, earn revenue from it to enable them to keep doing this. The amount that gets generated each day is almost too much for us to think about and comprehend. Once a number gets that big, we can't really deal with it. We know it's a big number but what that actually represents is hard for us to get our head around.

Slide 26

So let's look at just a small sample of what gets generated on a daily basis from the social big data, communicating, sharing and broadcasting services I tend to use, if not on a daily basis then at least on a weekly basis. I Tweet and update my Facebook status at least once a day, sometimes up to 20 times a day. I check-in to places on Foursquare at least 10 times a day and take and upload photos to Instagram and Facebook around 3 times a week. That's just my contribution, think how many people are doing the same thing to get to the sort of volumes you can see on the slide behind me. Slide 27

As a specific example, I post a single Tweet on Twitter. Weighing in at 72 characters, including spaces and punctuation, it’s only just over half of Twitter’s 140 character maximum. That Tweet is assigned a unique identifier by Twitter, which forms part of the unique URL to that single Tweet. From visiting that URL I can see that Twitter has added who I am, when I posted that Tweet and because I geotagged the Tweet, also where I was when I wrote it. So that’s a little more additional metadata than the 72 characters of the Tweet itself. Slide 28

But if I then take that unique identifier and fire it back at Twitter’s API, I start to see just how much metadata has been added. Slide 29

115 lines of JSON come back to me from that API call, making up 3,338 characters. There’s metadata on the Tweet itself, when it was created, the text of the Tweet, what app I was using to Tweet with, there’s information on my Twitter account, my name, my Twitter name, my account’s unique identifier, my general location, my biography, all the stuff that’s in my Twitter profile. There’s how many Tweets I’ve posted (14,811 at the time), how many followers I have, how many favourites I’ve flagged, how many Twitter lists I appear in. There’s the details of my profile on Twitter’s web site, HTML colours, profile image URL and the like. And because I’ve geotagged the Tweet, there’s the full geographic information about where I was including a bounding box of the locality.
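
To get a rough feel for that ratio of message to metadata, here is a minimal Python sketch. The payload is a small, made-up stand-in for the kind of JSON that comes back for a single Tweet; its field names and values are illustrative assumptions, not Twitter's actual response format, and `metadata_overhead` is a hypothetical helper.

```python
import json

# Hypothetical, heavily trimmed payload: field names and values are
# illustrative assumptions, not Twitter's real response format.
sample_payload = {
    "id_str": "0000000000",
    "created_at": "an example timestamp",
    "text": "An example status update, standing in for the real Tweet text.",
    "source": "an example client app",
    "user": {
        "screen_name": "example_user",
        "statuses_count": 14811,
        "followers_count": 0,
    },
    "place": {"bounding_box": [[0.0, 0.0], [0.0, 0.0]]},
}


def metadata_overhead(payload, text_key="text"):
    """Return (text_chars, total_chars, ratio) for a Tweet-like payload."""
    text_chars = len(payload[text_key])
    # Pretty-print so the size is comparable to reading the JSON line by line.
    total_chars = len(json.dumps(payload, indent=2))
    return text_chars, total_chars, total_chars / text_chars


text_chars, total_chars, ratio = metadata_overhead(sample_payload)
print(f"{text_chars} characters of text inside {total_chars} characters of JSON "
      f"(roughly {ratio:.1f}x)")
```

Applied to the figures above, 3,338 characters of JSON for a 72 character Tweet means the response is roughly 46 times the size of the message itself.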

All of a sudden I can see just how Big Data got its name. Slide 30

But how long will all of this continue? Remember the people I spoke about right at the start of this talk, some 16 slides back? It's time to bring them into the picture. Firstly, my children, although this applies equally to pretty much all children. Remember when you were a child? The summer holiday was endless. The skies were always blue and the sun was always out (remember, I'm from the UK where Summer and sun do not always go together, in fact it was pouring down with rain as I wrote this at home last week). And just like the summer holiday was endless, so were your parents and the people around you, they were eternal and would always be there. Remember feeling like that? But then the inevitable happened. We grew up and we discovered, often the hard way, that the summer wasn't endless and that almost everything is finite.

Slide 31

Social networks aren't endless either. They get born, if they're lucky they grow and then at some time or other they ... stop. If it's a social network you don't use then it doesn't really bother you much.

Slide 32

But if it's a network you've shared a lot of content through, what happens then? A lot of people, myself included, immediately get into "I want my data back" mode. Slide 33

But is it your data? Of course it is. You made it. You composed that Tweet. You shared that link. You took that photo. You were at that place you checked-in at. Of course it's your data.

But there's a point to be made here. You may have created that data, you may own that data, but the copy of that data in that social network is just that. It's a copy. It's not necessarily "your" data and because most of us don't preserve what we send up into the cloud on its way to our social networks, you may have created it, but the copy in the cloud isn't necessarily yours. Slide 34

It's an easy mistake to make. I may be a geo-technologist and many more things besides, but I am not a lawyer. Apart from the lawyers in the room, most of you aren't either, and nor are most of the people who use social networks, unless it's DeferoLaw, which is a social network for the legal profession.

Slide 35

... we see phrases like "you retain your rights" ... Slide 36

… another favourite is “you own the content you posted” Slide 37

... and "you always own your information" and immediately the subtleties and complexities of data ownership, licensing, copyright and intellectual property are cast aside. We say to ourselves, "it's my data dammit, I own it, I want it". Slide 38

And it's this belief that we really are lawyers in our spare time that makes people think that somehow the data they've shared via a social network is physically theirs, rather than a bit for bit perfect copy that we've licensed to that social network. We forget for a moment that we're using that social network as a cloud based backup, in some cases the only backup, of our creations and we mutter darkly about "holding my data hostage". Slide 39

The blunt, and often harsh, reality is the age old adage that "you get what you pay for". If you pay, you're probably a customer. If you're using something for "free" (and I say free in very large italics and inverted commas here), then you're probably, unknowingly or unwittingly, the product. Harsh. But fair. It's our content that the social networks monetize and that allows them to keep their servers and disk storage up and running. You might have seen that previous slide with the TechCrunch post and be thinking "ah, but Flickr Pro is chargeable and if my subscription lapses I can't get my photos back". That's not actually true, even if getting them back isn't particularly simple, but bear with me for a few more slides.

Slide 40

Now let's forget "big data" for a moment and think about "your data" instead. Actually, let's think about "my data" for a moment. As of last week, my social media footprint on Twitter, Foursquare, Instagram and Flickr looked something like this. Facebook's numbers would be up there too, but I'll get to that in a moment.

Now in the grand scheme of things, in the massive numbers thrown around about "big data", this is but a drop in the ocean. But ...

Slide 41

I created these check-ins, status updates, tweets and photos. They're important to me. Very important to me. Slide 42

And as Aaron Cope pointed out earlier this year, my small, insignificant contribution to big data is part of my own, very subjective, very personal, history.

Slide 43

As I may have mentioned before, I'm a geo-technologist and a high percentage of my explicit big data contribution has a geo or location component to it. I'd like to map out where I checked-in, I'd like to see where I was when I Tweeted or what photos I took at a particular location. Some of this "mappyness" already exists in some of the big data stores where my contributions live, but not all of it, it's far too niche and personal for that. But it's still important to me.

Slide 44

Remember, in 99% of the social networks I use, I'm not the customer, I'm contributing to the product. But how do my regularly used social networks fare here? Regardless of whether I own the data I put up there, how easy is it to get a copy of?

Slide 45

Firstly, what about a one click solution? Can I go to a particular page on the web and click the big button which says "give me a copy of my data"?

Slide 46

Facebook is the only one of my 5 social networks that does this. Well, it almost does this. At least I'm sure I used to be able to do this. Slide 47

I can still request a download of my information. But it now only seems to give me my status updates since I enabled Timeline on my account, though I can still get all of my photos and messages since 2008. Rather than say that this doesn't work, I'll just file this under "needs further investigation" and move on. Slide 48

Sometimes this lack of a one button download of contributed data is a deliberate decision on the part of a given social network. Sometimes, it's a hope that with an API, some enterprising developer will do this, but that doesn't always happen.

Slide 49

So talking of APIs, surely the remaining social networks will have an API and let me knock up some code to get a copy of my data contributions. Surely? Slide 50

Not all social networks do. An API tends to come after a social network's launch, if it comes at all, and often it doesn't let me do all that I want to do. Slide 51

Thankfully, all the networks I use, with the exception of Twitter, not only provide an API, but let me use that API to get my data. All of my data.

Slide 52

This is a good thing and meets what Kellan Elliott-McCrea calls "minimal competence" for an API. He went on to say:

"The ability to get out the data you put in is the bare minimum. All of it, at high fidelity, in a reasonable amount of time.

The bare minimum that you should be building, bare minimum that you should be using, and absolutely the bare minimum you should be looking for in tools you allow and encourage people who aren’t builders to use." Slide 53

Kellan was behind Flickr's API and his sentiments are, to my mind, admirable.

Slide 54

Sadly, Twitter doesn't let me do this and fails the minimal competence test miserably. Deep in their API documentation I found the justification for this as being essential to ensure Twitter's stability and performance and leave it as an exercise to you the audience to work out what I think of this excuse. Slide 55

The sad truth here is that when it comes to our own individual online data history, there's not always a willingness to make it easy for us to get copies of our history, if it's even on the radar at all. Slide 56

But if we can't always get our data history back, maybe the solution is to make an archive of it before it goes in or keep that archive up to date as you go ... a personal digital archive or PDA (and not to be confused with personal electronic organisers, or PDAs, such as the Palm Pilot). Slide 57

Thanks to web APIs and another social network, admittedly one for people who know how to code, a lot of this is already possible and the scope, range and functionality is growing by the day. The irony that I can build my own personal digital archive out of code found on another social network, which itself is built around a source code archival system is not lost on me either. Slide 58

So, firstly, there's my own Instagram (and no, I'm not going to share the URL of where this lives I'm afraid. The idea here is that this is a personal archive, not a clone of a social network). Slide 59

My own Instagram is called parallel-ogram. It's on GitHub; you can download it, configure it, run it. For free. Slide 60

Parallel-ogram works as well on my phone as it does on my laptop, showing me exactly what I've uploaded to Instagram. Indeed, it goes one step further than Instagram as currently there's no way to see what you've uploaded other than through their mobile app. Parallel-ogram doesn't allow me to take photos or upload them, at least not yet, but it does allow me to go back to the day I first uploaded a photo, grabs copies for me and twice a day it uses the Instagram API to see what I may have uploaded and quietly grabs a copy and stashes it away for me. Slide 61

There's also my own archive of Foursquare ...

Slide 62

It's called privatesquare and it's also on GitHub Slide 63

Like parallel-ogram, privatesquare quietly uses the Foursquare API to go back to my first check-in and twice a day quietly synchs my check-ins for me. I can go back and look at them, see maps of them and browse my check-in history. Unlike parallel-ogram, privatesquare also allows me to check-in, even if I don't want to share this with Foursquare. In short it allows me to use it both as an archive and also as a check-in tool, and if I want to use Foursquare's official mobile app, I can do that, safe and secure in the knowledge that privatesquare will keep itself up to date. Slide 64
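
The backfill-then-top-up pattern that both parallel-ogram and privatesquare follow can be sketched in a few lines of Python. This is not how either project is actually implemented; `sync_archive`, `fetch_since` and the fake service below are hypothetical names invented for illustration, and the real API calls are deliberately left out.

```python
import json
import tempfile
from pathlib import Path


def sync_archive(fetch_since, archive_path):
    """Append any items newer than the last archived one to a local JSON file.

    fetch_since(last_id) is a hypothetical callable wrapping whatever API the
    service offers. It is assumed to return a list of dicts, oldest first,
    each with an integer "id", containing only items with an id greater than
    last_id (or everything when last_id is None).
    """
    path = Path(archive_path)
    archive = json.loads(path.read_text()) if path.exists() else []
    last_id = archive[-1]["id"] if archive else None

    new_items = fetch_since(last_id)
    if new_items:
        archive.extend(new_items)
        path.write_text(json.dumps(archive, indent=2))
    return len(new_items)


# Stand-in for a real API client, so the sketch runs without network access.
FAKE_SERVICE = [{"id": 1, "venue": "Home"}, {"id": 2, "venue": "Office"}]


def fake_fetch_since(last_id):
    return [item for item in FAKE_SERVICE
            if last_id is None or item["id"] > last_id]


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "checkins.json"
    # The first run goes back to the very first item and archives everything.
    print("archived", sync_archive(fake_fetch_since, target), "items")
    # A later run only picks up what has appeared since the last sync.
    FAKE_SERVICE.append({"id": 3, "venue": "Cafe"})
    print("archived", sync_archive(fake_fetch_since, target), "items")
```

Running something like this twice a day would be a job for a scheduler such as cron; the sketch only covers the "what's new since last time" part.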

My photos also end up on Flickr and there’s a private archive of that too Slide 65

It's called parallel-flickr, it also lives on GitHub and it's also filed under "something I really must install, configure and get running when I have some spare time". Slide 66

So I have my own archives of Instagram, Flickr and Foursquare. I sort of have my own archive of Facebook. But what about my Tweets? Slide 67

Well until Twitter decides that their site is stable enough to let me grab my Tweets through their archive, the next best solution is to archive by another means. I've put the RSS feed to my Tweet-stream into Google Reader, which helpfully never throws anything away. I did this a long time ago and I have almost all, but not 100%, of my Tweets. Now all I need to do is write some code to read them from Google Reader and then get the Tweet data from Twitter, which they do allow via their API. Sadly, this is also filed under "something I must do when I have the time". It's not perfect, but then again, none of what I've discussed is, but it's a start and that's good enough for the time being.

Slide 68
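
As a rough idea of what the first half of that code might look like, here is a minimal Python sketch that pulls the unique Tweet identifiers out of a list of archived feed links, ready to be fired back at an API one by one. It assumes each archived feed entry carries a permalink of the form `twitter.com/<account>/status/<id>`; the account name and identifiers are made up, and reading the entries out of Google Reader is not shown.

```python
import re

# Matches the numeric identifier at the end of a Tweet's permalink.
STATUS_URL = re.compile(r"twitter\.com/[^/]+/status(?:es)?/(\d+)")


def tweet_ids_from_links(links):
    """Return the unique Tweet identifiers found in the links, in order."""
    seen = set()
    ids = []
    for link in links:
        match = STATUS_URL.search(link)
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            ids.append(match.group(1))
    return ids


# Hypothetical feed entries: the account name and identifiers are made up.
archived_links = [
    "https://twitter.com/example_user/status/1234567890",
    "https://twitter.com/example_user/statuses/1234567891",
    "https://twitter.com/example_user/status/1234567890",  # duplicate entry
    "https://example.com/not-a-tweet",
]

print(tweet_ids_from_links(archived_links))
```

Keeping the identifiers as strings avoids any trouble with very large numbers when they are passed between tools.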

Finally, you might have noticed the links in my slides look sort of like bitly links, only on the vtny.org domain. That's because I've been archiving my short links for a few years now Slide 69

Using my own short URL archive and my own, self hosted, URL shortener. I just thought I'd mention that. Slide 70

So, my big data contribution, my personal online history, is important to me. Yours might be important to you too. We're often told that we can't have our cake and eat it, but with the advent of the personal digital archive, maybe we can, thanks to the enterprising people who create APIs in the first place and those who not only use these APIs but also put their code up for all the world to use, free of charge. Your online history may not be that important in the grand scheme of things, but it's your online history, it's personal, you made it. When social networks go to the place where software goes to die, you might just want to preserve that personal history before the servers get powered off forever. Maybe the geeks will inherit the Earth after all.

Slide 71

I want to wrap up with a slightly cautionary tale, which highlights why our digital stuff and interweb history might just be important in ways you might not immediately think of. A friend of a friend, called Claudio, received a call from the British Transport Police in June of last year. There'd been an assault at Leicester Square Tube station in which an unfortunate individual ended up with broken ribs. The Police had evidence that placed Claudio at the Tube station at the time the assault took place. Could he explain what he was doing at that place and time? It's worth noting here that the assault had taken place in December of 2010, almost 7 months prior.

I wonder how many of us could say with certainty where we were, what we were doing and whether there was anyone to corroborate this without recourse to some form of aide memoire.

For Claudio, it was entirely feasible that he was at Leicester Square on the night of the assault but, worryingly, there were large gaps in his recollection and that of his friends.

Thankfully, by mining his web history and that of his friends he was able to piece together the events of the night, with some additional proof in the form of geotagged photographs.

As the cliche goes, Claudio was eliminated from the enquiries but what I find particularly telling about this anecdote is the strong web history and Big Data elements to it. The initial accusation was built on Big Data, namely Claudio was one of those people who used his Oyster Card to enter the Tube station, which left a date and time stamped record. In fact, the date and time that he entered the station was precisely the same time that the person, captured on CCTV, entered the station. Once the full picture was in place, it could be seen that Claudio was not the suspect that the Police were looking for. But not only was the potential accusation built on Big Data but the defence, the alibi and the proof of his innocence were built on Big Data and people's web histories as well. Slide 72

It wouldn't be an outrageous prediction to say that this sequence of events might start playing itself out a lot more in the not too distant future as we grow ever more reliant on web based services and Big Data stores, and as those data stores start to be interlinked.

The whole tale is worth a read; you'll find it at the end of the URL on the screen behind me. Slide 73

Thank you for listening.