
Java: store crawled pages in MySQL in a unified encoding

开发者 https://www.devze.com 2023-03-09 09:55 Source: web

I am crawling webpages into a MySQL database using Java.

These webpages use various encodings (e.g. GBK, UTF-8, ...) and may contain non-ASCII characters. However, I managed to detect each page's encoding and get a readable string (by "readable" I mean it displays the same in the Eclipse console as in a web browser).

I get the page encoding from the <meta> tag, defaulting to UTF-8 if none is found. See the following snippet:

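import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Read the whole response body into memory (hconn is the open HTTP connection).
InputStream is = hconn.getInputStream();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buf = new byte[8192];
int n;
while (-1 != (n = is.read(buf))) {
    baos.write(buf, 0, n);
}

// Default to UTF-8 when the page declares no charset.
String charset = "UTF-8";
// First pass: decode as ISO-8859-1 only to locate the ASCII <meta> tag,
// instead of relying on the platform default charset.
Document doc = Jsoup.parse(baos.toString("ISO-8859-1"));
Elements metas = doc.select("meta[http-equiv=Content-Type]");

Pattern p = Pattern.compile("charset=([0-9a-zA-Z_\\-]+)");
Matcher m;

for (Element meta : metas) {
    m = p.matcher(meta.toString());
    if (m.find())
        charset = m.group(1);
}

// Second pass: decode the raw bytes with the detected charset.
String str = new String(baos.toByteArray(), charset);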

Then I store it in MySQL. The MySQL connection URL is jdbc:mysql://localhost:3306/db?characterEncoding=gbk, and the column that stores the text uses GBK encoding.

However, strings that displayed correctly in the Eclipse console end up as unrecognizable character sequences in MySQL, and sometimes the insert raises an SQLException. From what I have observed, it is the non-GBK strings that go wrong.

I think converting the non-GBK strings to GBK would work, but how do I do that? Are there any workarounds? My final goal is to construct an inverted index.

Answers about encoding conversion are preferred.

Any help would be appreciated. Thanks in advance.


Addendum:

CREATE TABLE statement:

CREATE TABLE `indexer`.`pages` (
  `content` TEXT CHARACTER SET gbk COLLATE gbk_chinese_ci,
  `url` VARCHAR(512) NOT NULL,
  `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
)
ENGINE = InnoDB;

Error Message:

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'é”??μ¢Wé”??μ?é”??μ—é”??–¤??·DPIyé”????é”??–¤??·é”????0")Sé”????<é”????cé”??–¤??' at line 1


Java represents the string correctly internally (as Unicode), which is what the Eclipse console shows. You should be able to connect to the database using UTF-8 and store the data in a UTF-8 encoded column. Even if you want the column to be GBK, I would still connect using UTF-8. If this doesn't work, it would help if you posted your CREATE TABLE statement and the error messages you were getting.
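
A minimal sketch of that suggestion, under some assumptions: the class name PageStore, its method names, and the connection URL parameters are made up for illustration, and the question does not show how the insert is actually performed. The syntax error quoted in the question would be consistent with page text being spliced directly into the SQL statement, so the sketch binds the text through a PreparedStatement instead. It also includes a small check for whether a string can be represented in GBK at all, since characters outside GBK cannot be stored losslessly in a GBK column.

```java
import java.nio.charset.Charset;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PageStore {
    // Assumed URL: same host and schema as in the question, but with the
    // connection encoding switched from gbk to UTF-8.
    static final String URL =
        "jdbc:mysql://localhost:3306/db?useUnicode=true&characterEncoding=UTF-8";

    /** Decodes raw page bytes with the detected charset into a Java string. */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    /** Returns true if every character of text has a GBK representation. */
    static boolean fitsInGbk(String text) {
        return Charset.forName("GBK").newEncoder().canEncode(text);
    }

    /** Inserts one page; the driver encodes the bound strings for the server. */
    static void store(Connection conn, String url, String content) throws SQLException {
        // Placeholders keep the page text out of the SQL text itself, so stray
        // quotes or backslashes in the content cannot break the statement.
        String sql = "INSERT INTO pages (url, content) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, url);
            ps.setString(2, content);
            ps.executeUpdate();
        }
    }
}
```

With a MySQL driver on the classpath, a connection could be opened with DriverManager.getConnection(PageStore.URL, user, password), where user and password are placeholders, and passed to store. If pages may contain characters outside GBK, declaring the content column as utf8mb4 rather than gbk avoids the loss; MySQL's older utf8 charset stores at most three bytes per character, so it still rejects some characters such as emoji.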
