<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>UTF 8 BOM Detection in Java</title>
<link rel="stylesheet" href="/theme/css/main.css" />
<!--[if IE]>
<script src="https://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body id="index" class="home">
<a href="https://github.com/we-taper">
<img style="position: absolute; top: 0; right: 0; border: 0;" src="https://s3.amazonaws.com/github/ribbons/forkme_right_red_aa0000.png" alt="Fork me on GitHub" />
</a>
<header id="banner" class="body">
<h1><a href="/">A Blog <strong>about coding & physics</strong></a></h1>
<nav><ul>
<li><a href="/pages/who-am-i.html">Who am I?</a></li>
<li><a href="/category/blogging.html">Blogging</a></li>
<li><a href="/category/many-boby-physics.html">Many Body Physics</a></li>
<li><a href="/category/misc-course-notes.html">Misc Course Notes</a></li>
<li class="active"><a href="/category/programming.html">Programming</a></li>
</ul></nav>
</header><!-- /#banner -->
<section id="content" class="body">
<article>
<header>
<h1 class="entry-title">
<a href="/utf-8-bom-detection-in-java.html" rel="bookmark"
title="Permalink to UTF 8 BOM Detection in Java">UTF 8 BOM Detection in Java</a></h1>
</header>
<div class="entry-content">
<footer class="post-info">
<abbr class="published" title="2014-03-06T23:00:00+08:00">
Published: Thu 06 March 2014
</abbr>
<address class="vcard author">
By <a class="url fn" href="/author/wetaper.html">we.taper</a>
</address>
<p>In <a href="/category/programming.html">Programming</a>.</p>
<p>tags: <a href="/tag/java.html">Java</a> <a href="/tag/utf-8.html">UTF-8</a> </p>
</footer><!-- /.post-info --> <h1>UTF 8 BOM Detection in Java</h1>
<p>Have you ever run into this? You read a file encoded in UTF-8, and the text always starts with a mysterious character that may be printed as "?" on screen but is not visible in any text editor. This is caused by the BOM (byte order mark) at the start of the UTF-8 file.</p>
<h2>What is BOM?</h2>
<p>See the <a href="http://en.wikipedia.org/wiki/Byte_order_mark">Wikipedia article on the byte order mark</a>.</p>
<p>Put simply, the BOM is a mark at the start of a text stream that identifies its encoding. UTF-8 does not require it, as the Unicode Standard explains:</p>
<blockquote>
<p>While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence &lt;EF BB BF&gt;. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the &lt;EF BB BF&gt; byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme.</p>
</blockquote>
<p><a href="http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf">Link To Document</a></p>
<p>Plus: <a href="http://www.zhihu.com/question/20167122">A discussion on ZhiHu</a></p>
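<p>To make the byte sequence from the quote concrete, here is a minimal sketch of checking raw bytes for a UTF-8 BOM in Java. The class name <code>BomDetector</code> and the method <code>startsWithUtf8Bom</code> are made up for this illustration and are not part of any library.</p>

```java
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class BomDetector {
    // The UTF-8 encoding of U+FEFF: the bytes EF BB BF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns true when the data begins with the three BOM bytes.
    // Arrays.copyOf pads short input with zero bytes, so input shorter
    // than three bytes can never compare equal to the BOM.
    static boolean startsWithUtf8Bom(byte[] data) {
        byte[] head = Arrays.copyOf(data, UTF8_BOM.length);
        return Arrays.equals(head, UTF8_BOM);
    }
}
```

<p>For a file, the first bytes could be obtained with something like <code>Files.readAllBytes</code> before calling the check.</p>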
<h2>How to Deal with BOM</h2>
<p>There may be many ways to do it, but I found a simple solution. The BOM is the Unicode character U+FEFF, written <code>\uFEFF</code> in Java. Therefore, if the text read from a UTF-8 file starts with the character <code>\uFEFF</code>, removing that first character solves the problem.</p>
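<p>The idea of dropping a leading <code>\uFEFF</code> can be sketched as follows. This is only one possible way to write it; the names <code>BomStripper</code>, <code>stripBom</code> and <code>readUtf8WithoutBom</code> are hypothetical. It relies on Java's UTF-8 decoder keeping the BOM as an ordinary character rather than removing it.</p>

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class BomStripper {
    // U+FEFF is the character that the UTF-8 bytes EF BB BF decode to.
    static final String BOM = "\uFEFF";

    // Returns the text without a leading BOM; text without one is returned unchanged.
    static String stripBom(String text) {
        return text.startsWith(BOM) ? text.substring(1) : text;
    }

    // Reads a whole file as UTF-8 and drops a leading BOM, if there is one.
    static String readUtf8WithoutBom(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return stripBom(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

<p>Only the first character is checked, so a <code>\uFEFF</code> that appears later in the text is left alone.</p>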
<h2>How to Write files without BOM</h2>
<p>Well, many text editors on Windows automatically add a BOM to your UTF-8 files, because this is favoured by Microsoft. The only exception I know of is Notepad++, a great text editor for programmers (other exceptions? Feel free to inform me by e-mail). So basically you have to live with it on Windows.</p>
<p>Things are much better on Linux. With UTF-8 everywhere, Linux does not use the BOM to tell a UTF-8 file from an ANSI file. Maybe I will never have to worry about the BOM on Linux. Thanks to Microsoft's stupid idea for reminding me of BOMs.</p>
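<p>When the file is written from Java rather than from an editor, the situation is simpler: encoding a <code>String</code> with <code>StandardCharsets.UTF_8</code> does not add a BOM by itself. A minimal sketch, with the hypothetical names <code>BomFreeWriter</code>, <code>encode</code> and <code>write</code>:</p>

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class BomFreeWriter {
    // Encodes the text as UTF-8; Java's encoder emits no BOM on its own.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Writes the text to the given file as UTF-8 without a BOM.
    static void write(Path path, String text) throws IOException {
        Files.write(path, encode(text));
    }
}
```

<p>If the text itself already starts with <code>\uFEFF</code>, that character is encoded like any other, so it should be removed first.</p>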
</div><!-- /.entry-content -->
</article>
</section>
<section id="extras" class="body">
<div class="blogroll">
<h2>blogroll</h2>
<ul>
<li><a href="http://getpelican.com/">Pelican</a></li>
<li><a href="http://ovelinux.blog.sohu.com/">My Previous Blog</a></li>
</ul>
</div><!-- /.blogroll -->
<div class="social">
<h2>social</h2>
<ul>
<li><a href="we.taper[WhateverInsideIsUseless]gmail.com">My E-Mail (Don't click, copy link and alter it to correct form, thx)</a></li>
</ul>
</div><!-- /.social -->
</section><!-- /#extras -->
<footer id="contentinfo" class="body">
<address id="about" class="vcard body">
Proudly powered by <a href="http://getpelican.com/">Pelican</a>, which takes great advantage of <a href="http://python.org">Python</a>.
</address><!-- /#about -->
<p>The theme is by <a href="http://coding.smashingmagazine.com/2009/08/04/designing-a-html-5-layout-from-scratch/">Smashing Magazine</a>, thanks!</p>
</footer><!-- /#contentinfo -->
<script type="text/javascript">
var disqus_shortname = 'taper';
(function () {
var s = document.createElement('script'); s.async = true;
s.type = 'text/javascript';
s.src = 'https://' + disqus_shortname + '.disqus.com/count.js';
(document.getElementsByTagName('HEAD')[0] || document.getElementsByTagName('BODY')[0]).appendChild(s);
}());
</script>
</body>
</html>