I Know Where You Were Last Summer: London’s public bike data is telling everyone where you’ve been [vartree.blogspot.co.uk]
This article is about a publicly available dataset of bicycle journey data that contains enough information to track the movements of individual cyclists across London, for a six-month period just over a year ago.
I’ll also explore how this dataset could be linked with other datasets to identify the actual people who made each of these journeys, and the privacy concerns this kind of linking raises.
It probably won’t surprise you to learn that there is a publicly available Transport for London dataset that contains records of bike journeys for London’s bicycle hire scheme. What may surprise you is that this record includes unique customer identifiers, as well as the location and date/time for the start and end of each journey. The public dataset currently covers a period of six months between 2012 and 2013.
What are the consequences of this? It means that someone who has access to the data can extract and analyse the journeys made by individual cyclists within London during that time, and with a little effort, it’s possible to find the actual people who have made the journeys.
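To make the point concrete, here is a minimal sketch of how records that share a customer identifier can be grouped into per-person travel histories. The rows are synthetic, and the field names (`customer_id`, `start_station`, `end_station`, `start_time`) and the helper functions are hypothetical; they are not the actual schema of the Transport for London dataset.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Synthetic example rows. Field names are assumptions for illustration,
# not the real column names of the public dataset.
JOURNEYS = [
    {"customer_id": "A1", "start_station": "Holborn", "end_station": "Waterloo",
     "start_time": "2012-09-03 17:40"},
    {"customer_id": "B2", "start_station": "Soho", "end_station": "Bank",
     "start_time": "2012-09-03 09:10"},
    {"customer_id": "A1", "start_station": "Waterloo", "end_station": "Holborn",
     "start_time": "2012-09-03 08:05"},
    {"customer_id": "A1", "start_station": "Waterloo", "end_station": "Holborn",
     "start_time": "2012-09-04 08:07"},
]


def journeys_by_customer(rows):
    """Group journey rows by customer identifier, ordered by start time."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["customer_id"]].append(row)
    for trips in grouped.values():
        trips.sort(key=lambda r: datetime.strptime(r["start_time"], "%Y-%m-%d %H:%M"))
    return dict(grouped)


def most_common_route(trips):
    """Return ((start, end), count) for the route a customer rides most often."""
    counts = Counter((t["start_station"], t["end_station"]) for t in trips)
    return counts.most_common(1)[0]


if __name__ == "__main__":
    for customer, trips in journeys_by_customer(JOURNEYS).items():
        print(customer, len(trips), "journeys; most common route:", most_common_route(trips))
```

Even this small amount of grouping shows why a persistent identifier matters: a repeated morning route between the same two stations is exactly the kind of pattern that hints at where someone lives and works.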
Five years ago, a team of researchers from Google announced a remarkable achievement in one of the world’s top scientific journals, Nature. Without needing the results of a single medical check-up, they were nevertheless able to track the spread of influenza across the US. What’s more, they could do it more quickly than the Centers for Disease Control and Prevention (CDC). Google’s tracking had only a day’s delay, compared with the week or more it took for the CDC to assemble a picture based on reports from doctors’ surgeries. Google was faster because it was tracking the outbreak by finding a correlation between what people searched for online and whether they had flu symptoms.
As researchers contemplate mining the students’ details, however, the university is grappling with ethical issues raised by the collection and analysis of these huge data sets, known familiarly as Big Data, said L. Rafael Reif, the president of M.I.T.
For instance, he said, serious privacy breaches could hypothetically occur if someone were to correlate the personal forum postings of online students with institutional records that the university had de-identified for research purposes.
It wasn’t so long ago that the excitement surrounding online education reached fever pitch. Various researchers offering free online versions of their university classes found they could attract vast audiences of high-quality students from all over the world. The obvious next step was to offer far more of these online classes.
That started a rapid trend and various organisations sprang up to offer online versions of university-level courses that anyone with an Internet connection could sign up for. The highest profile of these are organisations such as Coursera, Udacity, and edX.
But this new golden age of education has rapidly lost its lustre.
German Chancellor Angela Merkel is proposing building up a European communications network to help improve data protection.
It would avoid emails and other data automatically passing through the United States.
In her weekly podcast, she said she would raise the issue on Wednesday with French President Francois Hollande.
Revelations of mass surveillance by the US National Security Agency (NSA) have prompted huge concern in Europe.
Disclosures by the US whistleblower Edward Snowden suggested even the mobile phones of US allies, such as Mrs Merkel, had been monitored by American spies.
Our home computer console will be used to send and receive messages—like telegrams. We could check to see whether the local department store has the advertised sports shirt in stock in the desired color and size. We could ask when delivery would be guaranteed, if we ordered. The information would be up-to-the-minute and accurate. We could pay our bills and compute our taxes via the console. We would ask questions and receive answers from “information banks”—automated versions of today’s libraries. We would obtain up-to-the-minute listing of all television and radio programs … The computer could, itself, send a message to remind us of an impending anniversary and save us from the disastrous consequences of forgetfulness.
It took decades for cloud computing to fulfill Baran’s vision.
Today’s big data is noisy, unstructured, and dynamic rather than static. It may also be corrupted or incomplete. “We think of data as being comprised of vectors – a string of numbers and coordinates,” said Jesse Johnson, a mathematician at Oklahoma State University. But data from Twitter or Facebook, or the trial archives of the Old Bailey, look nothing like that, which means researchers need new mathematical tools in order to glean useful information from the data sets. “Either you need a more sophisticated way to translate it into vectors, or you need to come up with a more generalized way of analyzing it,” Johnson said.
In their parents’ attic, in boxes in the garage, or stored on now-defunct floppy disks — these are just some of the inaccessible places in which scientists have admitted to keeping their old research data. Such practices mean that data are being lost to science at a rapid rate, a study has now found.
The authors of the study, which is published today in Current Biology [1], looked for the data behind 516 ecology papers published between 1991 and 2011. The researchers selected studies that involved measuring characteristics associated with the size and form of plants and animals, something that has been done in the same way for decades. By contacting the authors of the papers, they found that, whereas data for almost all studies published just two years ago were still accessible, the chance of the data remaining accessible fell by 17% per year. Availability dropped to as little as 20% for research from the early 1990s.