新科技大数据遭遇数据净化难题
日期:2014-07-18 13:38


Karim Keshavjee, a Toronto physician and digital health consultant, crunches mountains of data from 500 doctors to figure out how to improve patient treatment. But it’s a frustrating slog to get a computer to decipher all the misspellings, abbreviations, and notes written in unintelligible medical shorthand.
卡里姆·科夏瓦杰是多伦多的一名医生和网络健康顾问,他要从500名医生那里反馈的海量数据中总结出怎样才能更好地治疗病人。但是众所周知,医生的“书法”本来就堪比天书,要想让电脑识别出其中的拼写错误和缩写更是难于登天。
For example, “smoking information is very hard to parse,” Keshavjee said. “If you read the records, you understand right away what the doctor meant. But good luck trying to make a computer understand. There’s ‘never smoked’ and ‘smoking = 0.’ How many cigarettes does a patient smoke? That’s impossible to figure out.”
比如科夏瓦杰指出:“患者是否吸烟是个很重要的信息。如果你直接阅读病历,你马上就能明白医生是什么意思。但是要想让电脑去理解它,那就只能祝你好运了。虽然你也可以在电脑上设置‘从不吸烟’或‘吸烟=0’的选项。但是一个患者每天吸多少支烟?这几乎是电脑不可能搞明白的问题。”
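To make the smoking example concrete, here is a minimal Python sketch of how such free-text entries might be normalized. The patterns and the `parse_smoking` function are hypothetical illustrations, not anything Keshavjee or InfoClin describes; a real system would need a far larger, clinically reviewed rule set, and many notes would still come back as unknown.

```python
import re

# Hypothetical patterns for entries meaning "does not smoke".
NON_SMOKER_PATTERNS = [
    r"\bnever\s+smoked\b",
    r"\bnon[-\s]?smoker\b",
    r"\bsmoking\s*=\s*0\b",
]

# Hypothetical pattern for an explicit daily cigarette count,
# e.g. "10 cigarettes per day" or "5 cigs/day".
CIGS_PER_DAY = re.compile(
    r"(\d+)\s*(?:cigarettes?|cigs?)\s*(?:/|per|a)\s*day", re.IGNORECASE
)

def parse_smoking(note: str) -> dict:
    """Map a free-text note to a structured smoking record."""
    text = note.strip().lower()
    if any(re.search(p, text) for p in NON_SMOKER_PATTERNS):
        return {"smoker": False, "cigarettes_per_day": 0}
    match = CIGS_PER_DAY.search(text)
    if match:
        return {"smoker": True, "cigarettes_per_day": int(match.group(1))}
    # Anything else is left for a human to review.
    return {"smoker": None, "cigarettes_per_day": None}
```

Entries the rules do not recognize return `None` values rather than a guess, which reflects the article's point that much of this information cannot be recovered automatically.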
The hype around slicing and dicing massive amounts of data, or big data, makes it sound so easy: Just plug a library’s worth of information into a computer and wait for valuable insights to pour out about how to speed up an auto assembly line, get online shoppers to buy more sneakers, or fight cancer. The reality is much more complicated. Data is inevitably “dirty” thanks to obsolete, inaccurate, and missing information. Cleaning it up is an increasingly important and overlooked job that can help prevent costly mistakes.
由于宣传报道把大数据吹得神乎其神,因此很多人可能觉得大数据用起来特别简单:只要把相当于一整个图书馆的信息插到电脑上,然后就可以坐在一边,等着电脑给出精辟见解,告诉你如何提高自动生产线的生产效率,如何让网购者在网上购买更多的运动鞋,或是如何治疗癌症。但事实远远比想象复杂得多。由于信息会过时、不准确和缺失,因此数据不可避免地也有“不干净”的时候。如何把数据变“干净”是一个越来越重要但又经常被人忽略的工作,但它可以防止你犯下代价高昂的错误。
Although techniques are improving all the time, scrubbing data can only accomplish so much. Even when dealing with a relatively tidy set of information, getting useful results can be arduous and time-consuming.
虽然科技一直都在进步,但是人们在净化数据上能想到的法子并不多。即便是处理一些相对较“干净”的数据,要想获得有用的结果往往也是件费时费力的事情。
“I tell my clients that the world is messy and dirty,” said Josh Sullivan, a vice president at business consulting firm Booz Allen who handles data crunching for clients. “There are no clean data sets.”
博思艾伦咨询公司(Booz Allen)副总裁约什·沙利文说:“我对我的客户说,这是个混乱肮脏的世界,没有完全干净的数据集。”
Data analysts start by looking for information that’s out of the norm. Because the volume of data is so huge, they typically hand the job over to software that automatically sifts through numbers and text to look for anything unusual that needs further review. Over time, computers can improve their accuracy in spotting what belongs and what doesn’t. They can also better understand what words and phrases mean by clustering similar examples together and then grading their interpretations for accuracy.
数据分析师一般喜欢先寻找非常态的信息。由于数据量太巨大,他们一般都会把筛选数据的工作交给软件来完成,来寻找是否有些反常的东西需要进一步检查。随着时间的推移,电脑筛选数据的精确性也会提高。通过对类似案例进行分类,它们也会更好地了解一些词语和句子的含义,然后提高筛选的精确性。
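As a rough illustration of the first step, flagging values that are out of the norm, the sketch below uses a median-based score. The article does not say which method analysts use; the modified z-score, the 0.6745 scaling constant, and the 3.5 cutoff are a common convention assumed here, and `flag_outliers` is a hypothetical name.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that one extreme
    # value cannot distort the way it distorts a standard deviation.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical systolic blood pressure readings with one entry error.
readings = [118, 122, 120, 125, 119, 121, 1200]
suspicious = flag_outliers(readings)
```

Flagged values are candidates for further review, not automatic deletions: the value may be a typo, or it may be correct and merely unusual.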
“The approach is easy and straightforward, but training your models can take weeks and weeks,” Sullivan said.
沙利文说:“这种方法简单直接,但‘训练’你的模型可以需要一周又一周的时间。”
A constellation of companies offer software and services for cleaning data. They range from technology giants like IBM and SAP to big data and analytics specialists like Cloudera and Talend Open Studio. A legion of start-ups are also trying to get a toehold as data janitors, including Trifacta, Tamr, and Paxata.
有些公司也提供了用来净化数据的软件和服务,其中既包括像IBM和SAP一样的科技巨头,也包括Cloudera和Talend Open Studio等从事大数据和分析的专门机构。一大批创业公司也想争当大数据的“清洁工”,其中有代表性的包括Trifacta、Tamr和Paxata等。
Healthcare, with all its dirty data, is one of the toughest industries for big data technology. Electronic health records make medical information increasingly easy to dump into computers, but there’s still a lot of room for improvement before researchers, pharmaceutical companies and hospital business analysts can slice and dice all the information they want.
由于“不干净”的数据太多,医疗业被认为是大数据技术最难搞定的行业之一。虽然随着电子病历的普及,将医疗信息输入电脑的难度已经变得越来越低,但是研究人员、制药公司和医疗业分析人士要想把他们需要的数据尽情地拿来分析,在数据上要提高的地方还有很多。
Keshavjee, the doctor and CEO of InfoClin, a health data consulting firm, spends his days trying to tease out ways to improve patient treatment by sifting through tens of thousands of electronic medical records. Obstacles pop up all the time.
健康数据咨询公司InfoClin的医生兼CEO科夏瓦杰花了很多时间,希望从数以万计的电子医疗病历中筛选出有用的数据,以提高对病人的诊疗水平。但他们在筛选的过程中却不断遇到阻碍。
Many doctors neglect to note a patient’s blood pressure in their medical records, something that no amount of data cleaning can fix. Simply determining what ails patients—based on what’s in their files—is surprisingly difficult for computers. Doctors may enter the proper code for diabetes without clearly indicating whether it’s the patient who has the disease or a family member. Or they may just enter “insulin” without mentioning the underlying diagnosis because, to them, it’s obvious.
很多医生在病历中没有记录病人的血压,这个问题是无论哪种数据净化方法都修复不了的。光凭借现有病历的信息去判断病人得了什么病对电脑来说就已经是一项极其困难的任务。医生在输入糖尿病编号的时候,可能忘了清楚地标注究竟是患者本人得了糖尿病,还是他的某个家人得了糖尿病。又或许他们光是输入了“胰岛素”三个字,而没有提到患者得了什么病,因为这对他们来说是再明显不过的事情。
Physicians also use a lot of idiosyncratic shorthand for medications, illnesses and basic patient details. Deciphering it takes a lot of head scratching for humans and is nearly impossible for a computer. For example, Keshavjee came across one doctor who used the abbreviation “gpa.” Only after coming across a variation, “gma,” did he finally solve the puzzle: they were shorthand for “grandpa” and “grandma.”
医生用来诊断、开药和填写病人基本信息时会大量用到一套独特的速记字体。即使让人类来破解它也要大为头痛,而对于电脑基本上是不可能完成的任务。比如科夏瓦杰提到有个医生在病历中写下“gpa”三个字母,让他百思不得其解。好在他发现后面不远处又写着“gma”三字,他才恍然大悟——原来它们是爷爷(grandpa)和奶奶(grandma)的缩写。
“It took a while to figure that one out,” he said.
科夏瓦杰说:“我花了好半天才明白它们到底是什么意思。”
Ultimately, Keshavjee said one of the only ways to solve the problem of dirty data in medical records is “data discipline.” Doctors need to be trained to enter information correctly so that cleaning up after them is less of a chore. Incorporating something like Google’s helpful tool that suggests how to spell words as users type them would be a great addition for electronic medical records, he said. Computers can learn to pick out spelling errors, but minimizing the need is a step in the right direction.
科夏瓦杰认为,解决数据“不干净”的终极方法之一是要给病历制定一套“数据纪律”。要训练医生养成正确录入信息的习惯,这样事后净化数据时才不至于乱得一团糟。科夏瓦杰表示,谷歌有一个很有用的工具,可以在用户进行输入时告诉他们如何拼写生僻字,这样的工具完全可以添加到电子病历工具中。电脑虽然可以挑出拼写错误,但是让医生摒弃不良习惯才是朝着正确的方向迈出了一步。
Another of Keshavjee’s suggestions is to create medical records with more standardized fields. A computer would then know where to look for specific information, reducing the chance of error. Of course, doing so is not as easy as it sounds because many patients suffer from multiple illnesses, he said. A standard form would have to be flexible enough to take such complications into account.
科夏瓦杰的另一个建议是,在电子病历中设置更多标准化的域。这样电脑就会知道到哪里去找特定的信息,从而减少出错率。当然,实际操作起来并没有这么简单,因为很多病人同时身患好几种疾病。因此,一个标准的表格必须拥有足够的灵活性,把这些复杂情况全部考虑进去。
Still, doctors would need to be able to jot down more free-form electronic notes that could never fit in a small box. Nuance like why a patient fell, for example, and not just the injury suffered, is critical for research. But software is hit and miss in understanding free-form writing without context. Humans searching by keyword may do a better job, but they still inevitably miss many relevant records.
但是出于诊疗的需要,医生有时需要在病历上记下一些自由行文的东西,这些内容肯定不是一个小格子能装得下的。比如一个患者为什么会摔倒,而不仅仅是受了什么伤,这类细节对研究至关重要。但是在没有上下文的条件下,软件对于自由行文的理解只能用撞大运来形容。筛选数据的时候,如果人们用关键词搜索的话可能会做得更好些,但这样也难免会漏掉很多有关的记录。
Of course, in some cases, what appears to be dirty data really isn’t. Sullivan, from Booz Allen, gave the example of the time his team was analyzing demographic information about customers for a luxury hotel chain and came across data showing that teens from a wealthy Middle Eastern country were frequent guests.
当然,在有些案例中,有些看起来不干净的数并不是真的不干净。博思艾伦咨询公司副总裁沙利文举例说,有一次他的团队为一家豪华连锁酒店分析顾客的人口统计数据,突然发现,数据显示一个富有的中东国家的青少年群体是这家酒店的常客。
“There were a whole group of 17-year-olds staying at the properties worldwide,” Sullivan said. “We thought, ‘That can’t be true.’”
沙利文回忆道:“有一大群17岁的青少年在世界各地都住这家酒店,我们以为:‘这肯定不是真的。’”
But after some digging, they found that the information was, in fact, correct. The hotel had legions of young customers that it didn’t even realize were there, and had never done anything to market to them. All guests under 22 were automatically logged as “low-income” in the company’s computers. Hotel executives had never considered the possibility of teens with deep pockets.
但做了一些挖掘工作后,他们发现这个信息其实是正确的。这家酒店有大量的青少年顾客,甚至连酒店自己也没有意识到,而且酒店也没有针对这部分顾客做过任何促销和宣传。所有22岁以下的顾客都被这家公司的电脑自动列入“低收入”群体,酒店的高管们也从来没有考虑过这些孩子的腰包有多鼓。
“I think it’s harder to build models if you don’t have outliers,” Sullivan said.
沙利文说:“我认为如果没有离群值的话,构建模型会更难。”
Even when data is clearly dirty, it can sometimes be put to good use. Take the example, again, of Google’s spelling suggestion technology. It automatically recognizes misspelled words and offers alternative spellings. It’s only possible because Google has collected millions and perhaps billions of misspelled queries over the years. Instead of garbage, the dirty data is an opportunity.
即便有时数据明显不干净,它有时依然能派上大用场。比如上文提到的谷歌(Google)的拼写纠正技术。它可以自动识别拼写错误的单词,然后提供替代拼写。这个工具之所以有这样神奇的功用,是因为谷歌在过去几年中已经收集了几亿甚至几十亿个拼写错误的词条。因此不干净的数据也可以变废为宝。
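The idea that a log of misspelled queries can drive spelling suggestions can be sketched in a few lines of Python. This is not Google’s method, which the article does not describe; it is a toy illustration that assumes the most frequent similar-looking query in the log is the intended spelling. The function name `suggest` and the 0.8 similarity cutoff are assumptions made for the example.

```python
from collections import Counter
from difflib import get_close_matches

def suggest(query, query_log):
    """Suggest the most frequent logged query that closely resembles `query`."""
    counts = Counter(query_log)
    # Candidates are logged queries that look similar to the input.
    candidates = get_close_matches(query, list(counts), n=5, cutoff=0.8)
    if not candidates:
        return None
    # Rare variants are treated as likely misspellings of the common one.
    best = max(candidates, key=lambda word: counts[word])
    return best if best != query else None

# Hypothetical query log: the misspelling appears, but far less often.
log = ["diabetes"] * 50 + ["diabets"] * 3 + ["insulin"] * 20
correction = suggest("diabets", log)
```

The misspelled entries are what make the sketch work: without them in the log, there would be no evidence of which rare strings are variants of which common ones.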
Ultimately, humans, and not machines, draw conclusions from the data they crunch. Computers can sort through millions of documents, but they can’t interpret the findings. Cleaning data is just one step in a long trial and error process to get to that point. Big data, for all its hype about its ability to lift business profits and help humanity, is a big headache.
最终,从大数据中获得结论的是人而不是机器。电脑虽然可以整理几百万份文件,但它并不能真的解读它。数据净化就是为了方便人们从数据中获取结论而反复试错的过程。尽管大数据已被奉为能提高商业利润、能造福全人类的神器,但它也是个很让人头痛的东西。
“The idea of failure is completely different in data science,” Sullivan said. “If they don’t fail 10 or 12 times a day to get to where they should be, they’re not doing it right.”
沙利文指出:“失败的概念在数据科学中完全是另一回事。如果他们不是每天失败10次或12次才走到该到的地方,那就说明他们的做法不对。”

重点单词
  • understand vt. 理解,懂,听说,获悉,将 ... 理解为,认为
  • improvement n. 改进,改善
  • solve v. 解决,解答
  • accomplish vt. 完成
  • diabetes n. 糖尿病
  • impossible adj. 不可能的,做不到的 adj. 无法忍受的
  • physician n. 内科医生
  • sap n. 半穿甲的(烧结铝粉); the liquid wi
  • costly adj. 昂贵的,代价高的
  • code n. 码,密码,法规,准则 vt. 把 ... 编码,制