I am crawling web pages into a MySQL database using Java.
These pages use various encodings (e.g. GBK, UTF-8, ...) and may contain non-ASCII characters. I have managed to detect each page's encoding and get a readable string (by "readable" I mean it displays the same in the Eclipse console as in a web browser).
I take the page encoding from the <meta> tag, defaulting to UTF-8 if none is found.
See the following snippet:
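// Read the whole response body into memory.
InputStream is = hconn.getInputStream();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
int b = -1;
while (-1 != (b = is.read())) {
    baos.write((byte) b);
}
// Default charset when no <meta> declaration is found.
String charset = "UTF-8";
// Note: baos.toString() decodes with the platform default charset;
// this first parse is only used to locate the <meta> tag.
Document doc = Jsoup.parse(baos.toString());
Elements metas = doc.select("meta[http-equiv=Content-Type]");
Pattern p = Pattern.compile("charset=([0-9a-zA-Z_\\-]+)");
Matcher m;
for (Element meta : metas) {
    m = p.matcher(meta.toString());
    if (m.find())
        charset = m.group(1);
}
// Decode the raw bytes again with the detected charset.
String str = new String(baos.toByteArray(), charset);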
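For comparison, here is a minimal sketch of the same detection step that does not depend on the platform default charset. It is not the code from the question: the class name CharsetSniffer and its methods are hypothetical, it uses a plain regex instead of Jsoup, and it matches the first charset=... it finds anywhere in the page, which is an assumption that may be too loose for real pages.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class CharsetSniffer {
    // Matches charset=GBK, charset="utf-8", charset = 'gb2312', ...
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([0-9a-zA-Z_\\-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or UTF-8 if none is declared or it is unknown.
    static Charset sniff(byte[] raw) {
        // ISO-8859-1 maps every byte to exactly one char, so this probe decode
        // never fails and keeps the ASCII part of the markup readable.
        String probe = new String(raw, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(probe);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default.
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the raw page bytes with the sniffed charset.
    static String decode(byte[] raw) {
        return new String(raw, sniff(raw));
    }

    public static void main(String[] args) {
        byte[] page = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=GBK\"><p>\u4e2d\u6587</p>"
                .getBytes(Charset.forName("GBK"));
        System.out.println(sniff(page));
        System.out.println(decode(page));
    }
}
```

The probe string is used only to find the declaration; the actual text always comes from decoding the original bytes with the charset that was found.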
Then I store it in MySQL. The connection URL is jdbc:mysql://localhost:3306/db?characterEncoding=gbk, and the column that stores the text uses the GBK character set.
The problem is that strings which display correctly in the Eclipse console turn into unrecognizable sequences in MySQL, and sometimes the insert raises an SQLException. From what I have observed, it is the non-GBK strings that go wrong.
I think converting the non-GBK strings to GBK would work, but how do I do that?
Are there any workarounds? My final goal is to construct an inverted index.
Answers about encoding conversion are preferred.
Any help would be appreciated. Thanks in advance.
Added:
Create table SQL:
CREATE TABLE `indexer`.`pages` (
`content` TEXT CHARACTER SET gbk COLLATE gbk_chinese_ci,
`url` VARCHAR(512) NOT NULL,
`id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
)
ENGINE = InnoDB;
Error Message:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'é”??μ¢Wé”??μ?é”??μ—é”??–¤??·DPIyé”????é”??–¤??·é”????0")Sé”????<é”????cé”??–¤??' at line 1
Java represents the string correctly internally, as the Eclipse console output shows. You should be able to connect to the database using UTF8 and store the data in a UTF8-encoded column. Even if you want the column to be GBK, I would still connect using UTF8. If this doesn't work, it would help if you could post your CREATE TABLE statement and the error messages you were getting before.
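To illustrate why a UTF8 column is the safer target, here is a small sketch (not part of the original answer; the class name GbkCheck and its methods are hypothetical). A Java String is Unicode and has no "page encoding" of its own, so there is nothing to convert; what matters is whether each character exists in the target charset. The sketch assumes the JDK ships the GBK charset, which full JDK installs normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class GbkCheck {
    private static final Charset GBK = Charset.forName("GBK");

    // True if every character of the text can be represented in GBK.
    static boolean fitsInGbk(String text) {
        return GBK.newEncoder().canEncode(text);
    }

    // Encodes to the given charset and decodes back again.
    // Unmappable characters are replaced during encoding, so the result can differ.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // Chinese text, present in GBK
        String korean = "\ud55c\uae00";  // Hangul, not present in GBK

        System.out.println(fitsInGbk(chinese));
        System.out.println(fitsInGbk(korean));
        // The Hangul text survives UTF-8 but not GBK.
        System.out.println(korean.equals(roundTrip(korean, StandardCharsets.UTF_8)));
        System.out.println(korean.equals(roundTrip(korean, GBK)));
    }
}
```

Text from pages in other scripts cannot be stored losslessly in a GBK column no matter how it is converted, whereas a UTF8 column and connection can hold all of it.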