translator

I was pleased to see David Pogue’s positive review of the new Windows Phone, Nokia’s Lumia 900, a couple of weeks ago in the New York Times. Windows Phone has made great progress these past couple of years, and has advanced a beautiful and fresh design language, Metro, which we see being adopted all around Microsoft. (I’ve been a big advocate for Metro language and principles in my own part of the company, Online Services.) Pogue’s only real complaint is that apps for Windows Phone are still thinner on the ground than on iPhone and Android, though as he points out, what really matters is whether the important, great and useful apps are there, not whether the total number is 50,000 or 500,000. Having many apps doesn’t necessarily mean having many quality apps, and most of us have gotten over last decade’s “app mania” that inspired one to fill screen after screen with run-once wonders.

What really made me smile was Pogue’s characterization of what those important apps are, in his view. After reeling off a few of the usual suspects (Yelp, Twitter, Pandora, Facebook, etc.), he added:

Plenty of my less famous favorites are also unavailable: Line2, Hipmunk, Nest, Word Lens, iStopMotion, Glee, Ocarina, Songify This.

Even Microsoft’s own amazing iPhone app, Photosynth, isn’t available for the Lumia 900.

I’ve also been asked (a number of times) about Photosynth for Windows Phone... hang in there. A nice piece of news we’ve just announced, however, is a new app for Windows Phone that I hope will join Pogue’s pantheon, and that is considerably more advanced than its counterparts on other devices: Translator. Technically this isn’t a new app but an update, though the updated version is far more functional than its predecessor.

Translator has offline language support, meaning that if you install the right language pack you can use it abroad without a data connection (essential for now, though I wish international data roaming were a problem of the past). It also has a nice speech translation mode, but what’s perhaps most interesting is the visual mode. Visual translation is really helpful when you’re encountering menus, signs, forms, etc., and is especially important when you need to deal with character sets that you not only can’t pronounce, but can’t even write or type (for me, that would be Chinese).

Word Lens, mentioned by Pogue, was one of our inspirations in developing the new Translator. What’s impressive about Word Lens is its ability to process frames from the camera at near-video speed, reading text, generating word-by-word translations, and overlaying those onto the video feed in place of the original text. This is quite a feat, near the edge of achievability on current mobile phone hardware. In my view it’s also one of the first convincing applications of augmented reality on a phone. However, the approach suffers from some inherent drawbacks. First, the translation is word-by-word, which often results in nonsensical translated texts. Second, there isn’t quite enough compute time to do the job properly within a single frame, which yields a somewhat sluggish feel. At the same time, processing each frame independently is wasteful and often makes words flicker in and out of their correct translations, just a bit too fast to follow. For me, these things make Word Lens a good idea, and better than nothing in a pinch, but imperfect.

The visual translation in Translator takes a different approach. It exploits the fact that the text one is aiming at is printed on a surface and is generally constant. What needs to be done frame-by-frame, then, is to lock onto that surface and track it. This is done using Photosynth-like computer vision techniques, but in real time, a bit like the video tracking in our TED 2010 demo. Selected, stabilized frames from that video can then be rectified, and the optical character recognition (OCR) can be done on them asynchronously, that is, on a timescale not coupled to the video frame rate. Freed from the per-frame time budget, we can do a better job of OCR and translation, using a language model that understands grammar and multi-word phrases. Then, the translated text can be rendered onto the video feed in a way that still tracks the original in 3D. This solves a number of problems at once: improving the translation quality, avoiding flicker, improving the frame rate, and avoiding superfluous repeated OCR. It’s a small step toward building a persistent and meaningful model of the world seen in the video feed and tracking against it, instead of doing a weaker form of frame-by-frame augmented reality. The team has done a really beautiful job of implementing this approach, and the benefits are palpable in the experience.
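
For readers who think in code, here is a minimal Python sketch of the decoupling idea described above. It is not the Translator implementation: every name in it (`Frame`, `track_surface`, `TrackedTranslator`, `toy_ocr_and_translate`) is hypothetical, the "pose" is a toy 2D offset standing in for real 3D surface tracking, and the OCR and translation step is replaced by a dictionary lookup. The only point it tries to illustrate is that the cheap tracking step runs on every frame, while the expensive text step runs once in the background and its result is reused.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Toy stand-in for a tracked surface pose (a real tracker would estimate a 3D pose).
Pose = Tuple[float, float]


@dataclass
class Frame:
    index: int
    pose: Pose   # where the printed surface appears in this frame (assumed known)
    text: str    # the text printed on the surface (assumed readable by OCR)


def track_surface(frame: Frame) -> Pose:
    """Cheap per-frame step: locate the surface. Here it just reads the toy pose."""
    return frame.pose


class TrackedTranslator:
    """Tracks on every frame; runs the slow OCR/translation step once, off the frame loop."""

    def __init__(self, ocr_and_translate: Callable[[Frame], str]) -> None:
        self._ocr_and_translate = ocr_and_translate
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None
        self._translation: Optional[str] = None
        self.ocr_runs = 0  # counts how many times the slow step was started

    def process(self, frame: Frame) -> Tuple[Pose, Optional[str]]:
        pose = track_surface(frame)  # always done, at video rate
        if self._translation is None:
            if self._pending is None:
                # First usable frame: start the slow job without blocking the video loop.
                self.ocr_runs += 1
                self._pending = self._pool.submit(self._ocr_and_translate, frame)
            elif self._pending.done():
                # The background job finished: cache the result for all later frames.
                self._translation = self._pending.result()
        # The caller draws the cached translation (if any) at the current pose.
        return pose, self._translation

    def close(self) -> None:
        self._pool.shutdown(wait=True)


# Hypothetical phrase table standing in for real OCR plus a phrase-aware translator.
PHRASES = {"出口": "EXIT"}


def toy_ocr_and_translate(frame: Frame) -> str:
    return PHRASES.get(frame.text, frame.text)
```

Calling `process` once per camera frame returns the current pose immediately, along with the translation once it is available. Because the translation is cached against the tracked surface rather than recomputed per frame, it cannot flicker, and the OCR step is never repeated for the same text.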

Use this app on your next visit to China! I’d love to read comments and suggestions from anyone trying Translator out in the field.

