When fetching HTML with WWW::Mechanize, part of the page often comes out garbled. The cause:
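- WWW::Mechanize's content() automatically calls decoded_content() to decode the page, and only falls back to the raw content() of the underlying HTTP::Response when decoding fails. For details, see leeym's post "踩到 WWW::Mechanize 的地雷" ("Stepping on a WWW::Mechanize landmine").
- Even when decoded_content() succeeds, parts of the decoded text are sometimes still garbled.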
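The effect can be reproduced offline. The following is a minimal sketch, not taken from the original post: it builds a hypothetical HTTP::Response by hand whose body is GBK-encoded but whose Content-Type header claims UTF-8 (an assumed mismatch, one possible reason for garbled output), then compares automatic decoding with decoded_content(charset => 'none') followed by an explicit decode.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTTP::Response;

# Hypothetical page: the body holds the GBK bytes for U+4E2D U+6587,
# but the header (wrongly) declares UTF-8. This mismatch is an assumption
# made for illustration only.
my $gbk_body = "<p>\xD6\xD0\xCE\xC4</p>";
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $gbk_body,
);

# Automatic decoding trusts the declared charset, so the GBK bytes are
# treated as malformed UTF-8.
my $auto = $response->decoded_content;

# charset => 'none' skips charset decoding and hands back the raw bytes.
my $raw = $response->decoded_content( charset => 'none' );

# Decoding the raw bytes with the real charset recovers the text.
my $text = decode( 'gbk', $raw );

binmode STDOUT, ':encoding(UTF-8)';
print "automatic decode is damaged: ", ( $auto =~ /\x{FFFD}/ ? "yes" : "no" ), "\n";
print "raw bytes untouched: ",         ( $raw eq $gbk_body   ? "yes" : "no" ), "\n";
print "explicit gbk decode: $text\n";
```

In this sketch the charset is hard-coded as 'gbk'; on a real page it would have to be detected from the raw bytes.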
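One solution is:
- Call decoded_content(charset => 'none') to skip automatic charset decoding and get back the raw HTML bytes.
- Use Encode::Detect::CJK::detect() to detect the encoding of those bytes.
- Decode the raw HTML into Unicode using the detected encoding, then process it.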
    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use Carp;
    use Encode::Detect::CJK qw/detect/;
    use Encode;
    use WWW::Mechanize;

    # Initialize $browser
    my $browser = WWW::Mechanize->new(
        onerror => undef,
        onwarn  => \&Carp::carp,
    );
    $browser->add_header( 'Accept-Language' => 'zh-cn,zh-tw;q=0.7,en-us,*;q=0.3' );
    $browser->add_header( 'Accept-Charset'  => 'gb2312,gbk,utf-8,gb18030,big5;q=0.7,*;q=0.3' );
    $browser->add_header( 'Keep-Alive'      => 300 );
    $browser->add_header( 'Connection'      => 'keep-alive' );

    # Fetch the page
    my $url = 'http://www.jjwxc.net/onebook.php?novelid=404264&chapterid=4';
    #my $url = 'http://www.dddbbb.net/93345_5194303.html';
    $browser->get($url);
    exit unless $browser->success();

    # Decode the HTML into Unicode: take the raw bytes, detect their
    # encoding, then decode with the detected charset
    my $html    = $browser->response->decoded_content( charset => 'none' );
    my $charset = detect($html);
    $html = decode( $charset, $html, Encode::FB_XMLCREF );

    # ... process $html ...
    $html =~ s[<font color='#.{6}'>.*?</font>][]g;

    # Write the modified HTML to a file, UTF-8 encoded
    open my $fh, '>:utf8', 'test.html' or die "Cannot open test.html: $!";
    print $fh $html;
    close $fh;