Re: [SOLUTION] Re: lug-bg: utf,ansi,unicode etc...
- Subject: Re: [SOLUTION] Re: lug-bg: utf,ansi,unicode etc...
- From: raptor <raptor@xxxxxxxxxx>
- Date: Mon, 11 Aug 2003 16:02:11 +0300
|and that's good ;-) but as I said, there is no 100% strict method for guessing
|the encoding of what you pass in ... there is something doubtful here about
|the detection, i.e. we have no 100% guarantee that we will hit the input
|encoding:
|http://search.cpan.org/author/JNEYSTADT/cyrillic-1.05/Lingua/DetectCharset.pm
|This routine is implemented using algorithm of statistical analysis of text,
|which was proved to be very efficient and showed around 99.98% accuracy in
|tests.
|
|If we know the input encoding, the conversion afterwards is more or less easy,
|keeping in mind the exceptions for "symbols-out-of-range" ;-)
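To make the "symbols-out-of-range" point concrete, here is a minimal sketch in Python (not the Perl of the module linked above). The helper name `convert` is made up for illustration; it only assumes the source encoding is already known and relies on Python's standard codecs.

```python
def convert(data, src, dst, errors="strict"):
    """Re-encode bytes from a known source charset into a target charset.

    With errors="strict" a character missing from the target charset
    raises UnicodeEncodeError; with errors="replace" it becomes "?".
    """
    text = data.decode(src)          # bytes -> Unicode, source charset known
    return text.encode(dst, errors)  # Unicode -> bytes in the target charset


if __name__ == "__main__":
    # Plain Cyrillic letters exist in both charsets, so this just works.
    print(convert("текст".encode("cp1251"), "cp1251", "koi8_r"))

    # The numero sign exists in cp1251 but not in koi8-r.
    try:
        convert("№".encode("cp1251"), "cp1251", "koi8_r")
    except UnicodeEncodeError as exc:
        print("out of range:", exc.reason)
```

Whether to fail, replace, or drop such characters is a policy choice; the sketch just exposes it through the `errors` argument.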
]- I looked at the module; the guy detects the encoding in a very clever way... most likely he ran a statistical analysis on some texts (probably Russian, though who knows, he may have tried all the Cyrillic languages :") ) and every possible two-letter sequence gets a weight... the more often two letters (one right after the other) occur together, the "heavier" they are..
Then, when it checks a text afterwards, it picks whichever charset yields the higher probability/weight...
I suppose that if the same thing were done for BG text, it would guess the bg-encoding better... but as far as I know there are no comparable word "corpora" for the Bulgarian language... (we have read a little about linguistics too :") )
If the "corpus" is large enough and covers more domains, it really could reach 99.98% accuracy..
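The scheme guessed at above can be sketched roughly like this. It is a Python illustration of the idea, not the actual Lingua::DetectCharset code. The tiny Bulgarian sample, the candidate list and all function names are assumptions made up for the example; a real detector would need a far larger corpus. Letter case is kept on purpose, since decoding cp1251 text as koi8-r (or the other way round) tends to flip the case of the letters.

```python
from collections import Counter

# Hypothetical stand-in for a real corpus; far too small for serious use.
SAMPLE_CORPUS = (
    "това е примерен текст на български език "
    "за статистически анализ на двойки букви"
)

# Assumed candidate charsets (Python codec names).
CANDIDATES = ("cp1251", "koi8_r", "iso8859_5")


def letter_pairs(text):
    """Yield adjacent character pairs in which both characters are letters."""
    for a, b in zip(text, text[1:]):
        if a.isalpha() and b.isalpha():
            yield a + b


def train_weights(corpus):
    """Map each two-letter sequence to its relative frequency in the corpus."""
    counts = Counter(letter_pairs(corpus))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def score(text, weights):
    """Average weight per adjacent pair; unseen or non-letter pairs add 0."""
    if len(text) < 2:
        return 0.0
    return sum(weights.get(p, 0.0) for p in letter_pairs(text)) / (len(text) - 1)


def detect_charset(data, weights, candidates=CANDIDATES):
    """Decode the bytes with every candidate and keep the 'heaviest' result."""
    scores = {
        name: score(data.decode(name, errors="replace"), weights)
        for name in candidates
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    weights = train_weights(SAMPLE_CORPUS)
    raw = "текст на български език".encode("koi8_r")
    print(detect_charset(raw, weights))
```

With such a small sample the guess is only reliable for text that shares letter pairs with the corpus, which is exactly why the size and coverage of the corpus matter.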
raptor
============================================================================
A mail-list of Linux Users Group - Bulgaria (bulgarian linuxers).
http://www.linux-bulgaria.org - Hosted by Internet Group Ltd. - Stara Zagora
To unsubscribe: http://www.linux-bulgaria.org/public/mail_list.html
============================================================================