Batch convert files of unknown encoding to UTF-8

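Question:

I need to convert some files to UTF-8 because they are being output on an otherwise UTF-8 site, and the content looks a little fugly at times.

I can either do this now, or I can do it as the files are read in (through PHP, just using fopen, nothing fancy). Any suggestions welcome.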

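Are you sure that it is the wrong encoding and not just some missing glyphs? – Gumbo 2009-06-02 14:13:01

Fairly sure it's written in a non-UTF-8 charset. Multiple files show the same results with the same errant characters (e-acute, etc.). – Oli 2009-06-02 15:23:42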

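Answer

I don't have a clear solution for PHP, but for Python I have personally used the Universal Encoding Detector library (chardet), which does a pretty good job of guessing which encoding a file was written in.

Just to get you started, here is a Python script I used to do the conversion. The original purpose was to convert a Japanese code base from a mixture of UTF-16 and Shift-JIS, where I made a default guess whenever chardet was not confident about the detected encoding: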

import sys
import codecs
from chardet.universaldetector import UniversalDetector


def DetectEncoding(fileHdl):
    """Detect the encoding of a file opened in binary mode.

    Returns the chardet result dict ('encoding', 'confidence', ...)."""
    detector = UniversalDetector()
    for line in fileHdl:
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    return detector.result


def ReencodeFileToUtf8(fileName, encoding):
    """Re-encode a file to UTF-8 in place."""
    # TODO: This is dangerous ^^||, would need a backup option :)
    # NOTE: Use 'replace' option which tolerates erroneous characters
    with codecs.open(fileName, 'rb', encoding, 'replace') as src:
        data = src.read()
    with open(fileName, 'wb') as dst:
        dst.write(data.encode('utf-8', 'replace'))


if __name__ == '__main__':
    # Check for arguments first
    if len(sys.argv) != 2:
        sys.exit("Invalid arguments supplied")

    fileName = sys.argv[1]
    try:
        # Open file and detect encoding
        with open(fileName, 'rb') as fileHdl:
            encResult = DetectEncoding(fileHdl)

        # Was it an empty file?
        if encResult['confidence'] == 0 and encResult['encoding'] is None:
            sys.exit("Possible empty file")

        # Only attempt to reencode file if we are confident about the
        # encoding and if it's not UTF-8
        encoding = encResult['encoding'].lower()
        if encResult['confidence'] >= 0.7:
            if encoding != 'utf-8':
                ReencodeFileToUtf8(fileName, encoding)
        else:
            # TODO: Probably you could make a default guess and try to encode, or
            #  just simply make it fail (which is what happens here)
            sys.exit("Not confident about the detected encoding; file left unchanged")

    except IOError:
        sys.exit('An IOError occurred')

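Answer

Doing the conversion once, up front, rather than every time the files are read will only improve performance and reduce the potential for future errors. But if you don't know the encoding, you cannot do a correct conversion at all.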
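To illustrate why the encoding has to be known, here is a minimal sketch showing the same bytes decoding to different text under different assumptions. The helper name `decode_candidates` and the sample bytes are made up for this illustration.

```python
def decode_candidates(raw, encodings):
    """Decode the same bytes under several candidate encodings.

    Returns a dict mapping each encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding.
    (Hypothetical helper, for illustration only.)
    """
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


if __name__ == "__main__":
    # One byte sequence of unknown origin: "caf" followed by byte 0xE9.
    raw = b"caf\xe9"
    for name, text in decode_candidates(raw, ["utf-8", "iso-8859-1", "cp1251"]).items():
        print(name, "->", repr(text))
```

Both single-byte decodings succeed without any error, so nothing in the bytes themselves says which reading is the intended one.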

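Answer

My first attempt at this would be:

If it is syntactically valid UTF-8, assume it is UTF-8.
If it contains only bytes that correspond to valid characters in ISO 8859-1 (Latin-1), assume that.
Otherwise, fail.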
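A rough Python sketch of those three steps might look like the following. The function name `decode_utf8_or_latin1` is made up, and treating bytes 0x80-0x9F (control codes with no printable character in ISO 8859-1) as "not valid" is my reading of the second step, not something the answer spells out.

```python
def decode_utf8_or_latin1(raw):
    """Decode bytes as UTF-8 if valid, else as ISO 8859-1, else fail.

    Hypothetical helper; the 0x80-0x9F check is an assumption about
    what counts as a "valid character" in Latin-1.
    """
    # Step 1: syntactically valid UTF-8 is assumed to be UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Step 2: accept Latin-1 only if no byte falls in the C1 control
    # range, which holds no printable characters in ISO 8859-1.
    if any(0x80 <= byte <= 0x9F for byte in raw):
        # Step 3: otherwise, fail.
        raise ValueError("neither valid UTF-8 nor plain ISO 8859-1")
    return raw.decode("iso-8859-1")
```

Bytes in the 0x80-0x9F range are often a hint that the file is really Windows-1252 (curly quotes and the like), which is one reason failing there seems safer than guessing.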

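Answer

Can the files contain data from different codepages?

If yes, then you cannot do a batch conversion at all. You would have to know the codepage of every single substring in the file.

If no, you can batch convert one file at a time, but only if you know which codepage that file uses. So we are more or less back to the same situation as above; we have just moved the problem from substring scope to file scope.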
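For the file-scope case, a batch conversion could be sketched roughly like this in Python, assuming every matching file under a directory shares one known source encoding. The function names, the `*.txt` pattern, and the `.bak` backup suffix are all placeholders I picked for the example.

```python
import shutil
from pathlib import Path


def convert_file_to_utf8(path, source_encoding):
    """Rewrite one file as UTF-8, keeping a .bak copy of the original."""
    raw = path.read_bytes()
    # Strict decoding: raises UnicodeDecodeError if the bytes do not
    # fit the assumed encoding, before anything is written.
    text = raw.decode(source_encoding)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_bytes(text.encode("utf-8"))


def convert_tree_to_utf8(root, source_encoding, pattern="*.txt"):
    """Convert every file under root matching pattern; return the paths."""
    converted = []
    # sorted() collects the matches before any .bak files are created.
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            convert_file_to_utf8(path, source_encoding)
            converted.append(path)
    return converted
```

Calling `convert_tree_to_utf8("site/content", "iso-8859-1")` would then convert each `.txt` file in place. Note that running it twice on the same tree would double-encode the text, so it only makes sense as a one-off step.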

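So the question you need to ask yourself is: do you have information about which codepage a given piece of data belongs to? If not, it will still look fugly.

You can always analyse your data and guess the codepage. That might make it a bit less fugly, but you are still guessing, and therefore it will still be fugly :)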
