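When I fetch HTML with WWW::Mechanize, part of the page always comes out garbled. The cause:
- WWW::Mechanize's content() automatically calls decoded_content() to decode the page, and only falls back to the raw, undecoded content() of the response (the HTTP::Response object obtained through LWP::UserAgent) if that fails. For details see leeym's post "踩到 WWW::Mechanize 的地雷".
- Even after decoded_content() has decoded the page, parts of the result are sometimes still garbled.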
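The difference between the two ways of reading a response can be reproduced without a network connection. The sketch below is only an illustration under an assumption of mine: it builds a hypothetical HTTP::Response whose body is GBK while its header claims ISO-8859-1, which is one plausible way for automatic decoding to go wrong (the original post does not say which pages are mislabelled, or how). The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use HTTP::Response;

# Hypothetical page: the body bytes are GBK, but the header claims ISO-8859-1.
my $text  = "\x{4E2D}\x{6587}";      # the two characters U+4E2D U+6587
my $bytes = encode( 'gbk', $text );
my $res   = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=ISO-8859-1' ],
    $bytes,
);

# Automatic decoding trusts the declared charset, so the result is wrong.
my $auto = $res->decoded_content();

# charset => 'none' skips charset decoding and returns the bytes untouched.
my $raw = $res->decoded_content( charset => 'none' );

# Decoding those bytes with the real encoding recovers the text.
my $good = decode( 'gbk', $raw );

print "auto matches original:   ", ( $auto eq $text ? "yes" : "no" ), "\n";
print "manual matches original: ", ( $good eq $text ? "yes" : "no" ), "\n";
```

Here the real encoding is known in advance because the example created the bytes itself; for a live page it has to be guessed from the content.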
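One solution:
- Call decoded_content(charset => 'none') to skip the automatic charset decoding and get back the raw HTML bytes (a Content-Encoding such as gzip is still undone).
- Detect the encoding of those bytes with Encode::Detect::CJK::detect().
- Decode the raw HTML into Unicode using the detected encoding, then process it.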
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Carp;
use Encode::Detect::CJK qw/detect/;
use Encode;
use WWW::Mechanize;

# Initialize $browser
my $browser =
  WWW::Mechanize->new( onerror => undef, onwarn => \&Carp::carp, );
$browser->add_header(
    'Accept-Language' => "zh-cn,zh-tw;q=0.7,en-us,*;q=0.3" );
$browser->add_header( 'Accept-Charset' =>
      'gb2312,gbk,utf-8,gb18030,big5;q=0.7,*;q=0.3' );
$browser->add_header( 'Keep-Alive' => 300 );
$browser->add_header( 'Connection' => 'keep-alive' );

# Fetch the page
my $url = 'http://www.jjwxc.net/onebook.php?novelid=404264&chapterid=4';
#my $url = 'http://www.dddbbb.net/93345_5194303.html';
$browser->get($url);
die "Failed to fetch $url\n" unless $browser->success();

# Decode the HTML into Unicode
my $html = $browser->response->decoded_content( charset => 'none' );
my $charset = detect($html);
die "Could not detect the encoding of $url\n" unless $charset;
$html = decode( $charset, $html, Encode::FB_XMLCREF );

# ... process $html ...
# (here: strip every <font color='#xxxxxx'>...</font> span)
$html =~ s[<font color='#.{6}'>.*?</font>][]g;

# Write the modified HTML to a file, encoded as UTF-8
open my $fh, '>:utf8', 'test.html' or die "Cannot open test.html: $!\n";
print $fh $html;
close $fh or die "Cannot close test.html: $!\n";