Updated: Thu Jul 10 22:06:30 PDT 2003

 jChardet

  1. What is jchardet?
    jchardet is a Java port of Mozilla's automatic charset detection algorithm, originally written in C++ by Frank Tang. The original C++ source is at http://lxr.mozilla.org/mozilla/source/intl/chardet/ and more information is available at http://www.mozilla.org/projects/intl/chardet.html

  3. Where can I download the files?

    From the SourceForge project page: http://sourceforge.net/project/showfiles.php?group_id=85452

  5. How do I build the library?
    There is a build.xml file in the root directory. If you have Apache Ant installed, just type "ant".
    Note: A prebuilt chardet.jar is already supplied at dist/lib/chardet.jar in case you don't want to compile.

  7. How do I try it out? I want to test some web pages.
    The package includes a sample class, HtmlCharsetDetector.
    It fetches the given HTML page, passes it to the detection engine, and prints the detected charset.
    To run the sample, pass the URL of the page as the first argument (the sample code reads it from argv[0]; an optional second argument is an integer language hint):
    cd dist/lib
    java -classpath chardet.jar org.mozilla.intl.chardet.HtmlCharsetDetector <url> [<languageHint>]
    

  9. How do I integrate this code with my project?
    The procedure is simple:

    First, implement the interface nsICharsetDetectionObserver in the class that should be notified of the detected charset. The interface declares a single method, Notify(). The engine calls it with the result as soon as it positively identifies a charset.
    
    package org.mozilla.intl.chardet ;
    
    import java.lang.* ;
    
    public interface nsICharsetDetectionObserver {
    
            public void Notify(String charset) ;
    }
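
    A class implementing the observer might look like the sketch below. RecordingObserver is a hypothetical name, and the small interface declared at the top is only a stand-in so that the snippet compiles without chardet.jar; with the jar on the classpath you would import org.mozilla.intl.chardet.nsICharsetDetectionObserver instead. Keeping only the first reported charset is a design choice of this sketch, not documented engine behaviour.

```java
// Stand-in for org.mozilla.intl.chardet.nsICharsetDetectionObserver (assumption:
// declared here only so the sketch compiles on its own; drop it when using the jar).
interface nsICharsetDetectionObserver {
    void Notify(String charset);
}

// Hypothetical observer that remembers the first charset reported to it.
class RecordingObserver implements nsICharsetDetectionObserver {

    private String detected; // stays null until Notify() has been called

    @Override
    public void Notify(String charset) {
        // Keep the first result and ignore any later calls.
        if (detected == null) {
            detected = charset;
        }
    }

    public boolean hasResult() {
        return detected != null;
    }

    public String getDetected() {
        return detected;
    }
}
```

    After detection has finished, the caller can check hasResult() and read the name with getDetected() instead of printing it from inside Notify().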
    

    Second, create an nsDetector and register your observer with Init(). Once you find a non-ASCII character in your stream, start feeding data to the DoIt() member function.

    Finally, once you are done with the input stream, call DataEnd(). By this time the engine should have notified the observer of the detected charset. See src/HtmlCharsetDetector.java for a sample implementation.
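
    The non-ASCII test in the second step can be illustrated without the library. The sample in HtmlCharsetDetector.java uses the detector's own isAscii(buf, len) for this; the helper below is a hypothetical stand-in (AsciiCheck is not part of jchardet) that shows the idea under the assumption that "ASCII" means every byte has its high bit clear.

```java
// Hypothetical helper, not part of jchardet: a sketch of a 7-bit ASCII check.
class AsciiCheck {

    // Returns true when the first len bytes of buf are all 7-bit ASCII.
    static boolean isAscii(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            // A set high bit means the byte is outside the ASCII range.
            if ((buf[i] & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }
}
```

    Only the first len bytes are examined, so the helper can be called directly on a partially filled read buffer.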
    
    

  11. Show me a sample implementation.
    Code from HtmlCharsetDetector.java:
            // Initialize the nsDetector with the optional language hint (argv[1]).
            int lang = (argv.length == 2)? Integer.parseInt(argv[1])
                                             : nsPSMDetector.ALL ;
            nsDetector det = new nsDetector(lang) ;
    
            // Set an observer...
            // The Notify() will be called when a matching charset is found.
    
            det.Init(new nsICharsetDetectionObserver() {
                    public void Notify(String charset) {
                        HtmlCharsetDetector.found = true ;
                        System.out.println("CHARSET = " + charset);
                    }
            });
    
            URL url = new URL(argv[0]);
            BufferedInputStream imp = new BufferedInputStream(url.openStream());
    
            byte[] buf = new byte[1024] ;
            int len;
            boolean done = false ;
            boolean isAscii = true ;
    
            while( (len=imp.read(buf,0,buf.length)) != -1) {
    
                    // Check if the stream is only ascii.
                    if (isAscii)
                        isAscii = det.isAscii(buf,len);
    
                    // Feed the detector once non-ASCII data has been seen, until it reports done.
                    if (!isAscii && !done)
                        done = det.DoIt(buf,len, false);
            }
            det.DataEnd();
    
            if (isAscii) {
               System.out.println("CHARSET = ASCII");
               found = true ;
            }
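
    Once a charset name has been reported, a likely next step is to decode the bytes with it. The sketch below uses only the standard java.nio.charset API. DetectedCharsetDecoder is a hypothetical helper, and falling back to ISO-8859-1 when the JVM does not recognize the reported name is an assumption of this sketch, not something jchardet prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jchardet.
class DetectedCharsetDecoder {

    // Maps a reported charset name to a Charset, or returns the fallback
    // when the name is null, malformed, or not supported by this JVM.
    static Charset resolve(String detectedName, Charset fallback) {
        try {
            return Charset.forName(detectedName);
        } catch (IllegalArgumentException e) {
            // Charset.forName signals all three cases with subclasses of
            // (or directly with) IllegalArgumentException.
            return fallback;
        }
    }

    // Decodes data using the reported charset, or ISO-8859-1 as a fallback.
    static String decode(byte[] data, String detectedName) {
        return new String(data, resolve(detectedName, StandardCharsets.ISO_8859_1));
    }
}
```

    ISO-8859-1 maps every byte to a character, so the fallback never fails, although the resulting text may be wrong for the actual encoding.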
    




Disclaimer: Even though we try our best to keep this website useful and up to date, we do not give any warranty, expressed or implied, as to the accuracy, correctness, relevance, or suitability of the contents.