Decoding - 源码之家

Question:

I have been looking at a piece of text that represents HTML, apparently as a combination of Windows-1252 and quoted-printable encoding:

<html>\r\n<head>\r\n<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; charset=3DWindows-1=\r\n252\">\r\n<style type=3D\"text/css\" style=3D\"display:none;\"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D\"ltr\">This should be a pound sign: =A3 and this should be a long dash: =96 \r\n</body>\r\n</html>\r\n

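From the HTML <meta> tag I can see that this piece of HTML is supposed to be encoded as Windows-1252.

I am using node.js to parse this text with cheerio. But decoding it with https://github.com/mathiasbynens/windows-1252 does not help: windows1252.decode(myString); returns the same string as the input.

I think the reason is that the input string is already encoded in the standard node.js charset, while it actually represents part of a windows-1252 encoded HTML document (if that makes sense?).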
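A minimal sketch of why the decode call looks like a no-op, assuming the body text still contains the literal characters "=", "A", "3" rather than the byte 0xA3. It uses Node's built-in TextDecoder (not part of the original post; the 'windows-1252' label is assumed to be available, which is the case for Node builds with full ICU data):

```javascript
// The body text still holds the escape sequence "=A3" as three ASCII characters.
const stillEscaped = 'This should be a pound sign: =A3';
const escapedBytes = Buffer.from(stillEscaped, 'latin1');

// ASCII bytes (0x00-0x7F) mean the same thing in Windows-1252,
// so decoding them changes nothing.
const decodedEscaped = new TextDecoder('windows-1252').decode(escapedBytes);

console.log(decodedEscaped === stillEscaped); // true
```

In other words, the Windows-1252 step only has something to do once the =XX sequences have been turned back into raw bytes.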

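Looking at those odd hexadecimal numbers prefixed with =, I can see valid windows-1252 codes, for example:

this =\r\n and this \r\n should somehow represent carriage returns in the Windows world,
=3D: HEX 3D is DEC 61, which is an equals sign: =,
=96: HEX 96 is DEC 150, which is an "en dash" sign: – (some sort of "long minus"),
=A3: HEX A3 is DEC 163, which is a pound sign: £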
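The byte values listed above can be checked without a hand-written table. This is only a sketch using Node's built-in TextDecoder, which is an assumption on my side (the post itself uses the windows-1252 npm package); byteToChar is a hypothetical helper name:

```javascript
// Decoder for single Windows-1252 bytes (assumes a Node build with full ICU data).
const cp1252 = new TextDecoder('windows-1252');

// Hypothetical helper: map one byte value to the character it stands for.
function byteToChar(byte) {
  return cp1252.decode(Uint8Array.of(byte));
}

console.log(byteToChar(0x3d)); // "="
console.log(byteToChar(0x96)); // "–" (U+2013, en dash)
console.log(byteToChar(0xa3)); // "£" (U+00A3, pound sign)
```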

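I have no control over how this piece of HTML is generated, but I am supposed to parse it and clean it up, giving £ (instead of =A3) and so on.

Now, I know I could keep an in-memory map with the conversions, but I was wondering whether there is already a programmatic solution that covers the whole windows-1252 charset?

See this for the whole conversion table: https://www.w3schools.com/charsets/ref_html_ansi.asp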

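Edit:

The input HTML comes from an IMAP session, so it seems there is a 7-bit/8-bit "quoted-printable encoding" applied upstream that I cannot control (see https://en.wikipedia.org/wiki/Quoted-printable).

In the meantime, having become aware of this extra encoding, I tried the quoted-printable library (see https://github.com/mathiasbynens/quoted-printable), with no luck.

Here is an MCV (as requested):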

var cheerio = require('cheerio'); 
var windows1252 = require('windows-1252'); 
var quotedPrintable = require('quoted-printable'); 

const inputString = '<html>\r\n<head>\r\n<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; charset=3DWindows-1=\r\n252\">\r\n<style type=3D\"text/css\" style=3D\"display:none;\"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D\"ltr\">This should be a pound sign: =A3 and this should be a long dash: =96 \r\n</body>\r\n</html>\r\n' 
const $ = cheerio.load(inputString, {decodeEntities: true}); 
const bodyContent = $('html body').text().trim(); 
const decodedBodyContent = windows1252.decode(bodyContent); 

console.log(`The input string: "${bodyContent}"`); 
console.log(`The output string: "${decodedBodyContent}"`); 

if (bodyContent === decodedBodyContent) { 
    console.log('The windows1252 output seems the same of as the input'); 
} 

const decodedQp = quotedPrintable.decode(bodyContent) 
console.log(`The decoded QP string: "${decodedQp}"`);

The script above produces the following output:

The input string: "This should be a pound sign: =A3 and this should be a long dash: =96" 
The output string: "This should be a pound sign: =A3 and this should be a long dash: =96" 
The windows1252 output seems the same of as the input 
The decoded QP string: "This should be a pound sign: £ and this should be a long dash: "

On my command line I cannot see the long dash, and I do not know how to correctly decode all of these =<something> encoded characters.

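It looks like you are really out of luck here. – awd

I think you need to provide a more complete [mcve]. First of all, how does the text get into your program? – Quentin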

Answer

It seems that the message received via IMAP comes with a combination of 2 different encodings:

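the actual string is encoded as "quoted-printable" (https://en.wikipedia.org/wiki/Quoted-printable), I think because of a 7-bit/8-bit mapping issue when the information is transferred over the IMAP channel (a TCP socket connection)
the logical representation of the content (the email body) is HTML with a <meta> tag declaring the Windows-1252 charset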
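The two layers can be peeled off in order: quoted-printable first (back to raw bytes), then Windows-1252 (bytes to characters). The sketch below uses only Node built-ins rather than the npm packages from this answer; both function names are hypothetical, TextDecoder with the 'windows-1252' label is assumed to be available (Node with full ICU data), and the quoted-printable handling is deliberately simplified (it does not strip trailing whitespace as a full decoder would):

```javascript
// Step 1: undo quoted-printable, producing raw bytes.
function decodeQuotedPrintableToBytes(qp) {
  // A "=" at the end of a line is a soft line break: drop it and the line ending.
  const unfolded = qp.replace(/=\r?\n/g, '');
  const bytes = [];
  for (let i = 0; i < unfolded.length; i++) {
    const hex = unfolded.slice(i + 1, i + 3);
    if (unfolded[i] === '=' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // "=A3" becomes the byte 0xA3
      i += 2;
    } else {
      bytes.push(unfolded.charCodeAt(i) & 0xff); // literal character
    }
  }
  return Uint8Array.from(bytes);
}

// Step 2: interpret those bytes with the charset named in the <meta> tag.
function decodeQpWindows1252(qp) {
  return new TextDecoder('windows-1252').decode(decodeQuotedPrintableToBytes(qp));
}

const sample =
  '<body dir=3D"ltr">pound: =A3, dash: =96, charset=3DWindows-1=\r\n252</body>';
console.log(decodeQpWindows1252(sample));
// <body dir="ltr">pound: £, dash: –, charset=Windows-1252</body>
```

Working on bytes in the middle step keeps the two encodings separate, so the same first step would work for a body declared with a different charset.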

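There is also an "issue" with these HTML chunks containing lots of Windows-style carriage returns (\r\n). In my case I had to pre-process the string to deal with this: removing those carriage returns.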
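One caveat, as far as I understand the format: in quoted-printable a line ending preceded by "=" is a soft line break (pure wrapping), while a bare \r\n is a real line break in the content. The following sketch, with made-up variable names, compares removing only the soft breaks with removing every \r\n:

```javascript
// One soft line break ("=\r\n") and one real line break ("\r\n").
const wrapped = 'margi=\r\nn-bottom:0;\r\nnext line';

// Removing only soft line breaks rejoins the wrapped word and keeps the real break.
const softOnly = wrapped.replace(/=\r\n/g, '');

// Removing every \r\n leaves a stray "=" behind and merges the two lines.
const allRemoved = wrapped.replace(/\r\n/g, '');

console.log(JSON.stringify(softOnly));   // "margin-bottom:0;\r\nnext line"
console.log(JSON.stringify(allRemoved)); // "margi=n-bottom:0;next line"
```

For the short body text in this answer both approaches give the same result, because that string contains no soft line breaks.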

The following MCV example should show the cleaning process and verify the string that represents the email body content:

var quotedPrintable = require('quoted-printable'); 
var windows1252 = require('windows-1252'); 

const inputStr = 'This should be a pound sign: =A3 \r\nand this should be a long dash: =96\r\n'; 
console.log(`The original string: "${inputStr}"`); 

// 1. clean the "Windows carriage returns" (\r\n) 
const cleandStr = inputStr.replace(/\r\n/g, ''); 
console.log(`The string without carriage returns: "${cleandStr}"`); 

// 2. decode using the "quoted printable protocol" 
const decodedQp = quotedPrintable.decode(cleandStr) 
console.log(`The decoded QP string: "${decodedQp}"`); 

// 3. decode using the "windows-1252" 
const windows1252DecodedQp = windows1252.decode(decodedQp); 
console.log(`The windows1252 decoded QP string: "${windows1252DecodedQp}"`);

Which gives this output:

The original string: "This should be a pound sign: =A3 
and this should be a long dash: =96 
" 
The string without carriage returns: "This should be a pound sign: =A3 and this should be a long dash: =96" 
The decoded QP string: "This should be a pound sign: £ and this should be a long dash: " 
The windows1252 decoded QP string: "This should be a pound sign: £ and this should be a long dash: –"

Note that the "long dash" is rendered differently before and after the Windows-1252 decoding stage.
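The difference becomes visible when looking at code points instead of rendered glyphs. This sketch assumes that the quoted-printable step yields a "byte string" (one character per byte, so =96 becomes U+0096), which matches the output shown above; TextDecoder is a Node built-in used here only for illustration:

```javascript
// After the quoted-printable step: "=96" has become the character U+0096,
// an invisible C1 control character, which is why nothing shows in the terminal.
const afterQp = String.fromCharCode(0x96);

// After the Windows-1252 step: byte 0x96 is mapped to U+2013, the en dash.
const afterCp1252 = new TextDecoder('windows-1252').decode(Uint8Array.of(0x96));

console.log(afterQp.codePointAt(0).toString(16));     // "96"
console.log(afterCp1252.codePointAt(0).toString(16)); // "2013"
```

The pound sign already looks right after the first step because byte 0xA3 and the Unicode code point U+00A3 happen to coincide.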

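Afaik, this has nothing to do with UTF-8 encoding/decoding. I was able to figure out the "decoding order" of this process from: https://github.com/mathiasbynens/quoted-printable/issues/5

One thing I am not sure about is whether the operating system I am running this code on has some influence on the charset/encoding of files or string streams.

I used these npm packages: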

https://github.com/mathiasbynens/quoted-printable
https://github.com/mathiasbynens/windows-1252
