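Unicode Problem with SQLAlchemy

Question:

I know I have a problem with a conversion from Unicode, but I'm not sure where it's happening.

I'm extracting data about a recent European trip from a directory of HTML files. Some of the location names have non-ASCII characters (such as é, ô, ü). I'm getting the data from a string representation of each file using regular expressions.

If I print the locations as I find them, they print with the correct characters, so the encoding must be OK: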

Le Pré-Saint-Gervais, France 
Hôtel-de-Ville, France

I'm storing the data in a SQLite table using SQLAlchemy:

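from sqlalchemy import Column, Integer, Date, Time, Unicode, String, Float, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()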
class Point(Base): 
    __tablename__ = 'points' 

    id = Column(Integer, primary_key=True) 
    pdate = Column(Date) 
    ptime = Column(Time) 
    location = Column(Unicode(32)) 
    weather = Column(String(16)) 
    high = Column(Float) 
    low = Column(Float) 
    lat = Column(String(16)) 
    lon = Column(String(16)) 
    image = Column(String(64)) 
    caption = Column(String(64)) 

    def __init__(self, filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption):
        self.filename = filename
        self.pdate = pdate
        self.ptime = ptime
        self.location = location
        self.weather = weather
        self.high = high
        self.low = low
        self.lat = lat
        self.lon = lon
        self.image = image
        self.caption = caption

    def __repr__(self):
        return "<Point('%s','%s','%s')>" % (self.filename, self.pdate, self.ptime)

engine = create_engine('sqlite:///:memory:', echo=False) 
Base.metadata.create_all(engine) 
Session = sessionmaker(bind = engine) 
session = Session()

I loop through the files and insert the data from each one into the database:

for filename in filelist: 

    # open the file and extract the information using regex such as: 
    location_re = re.compile("<h2>(.*)</h2>",re.M) 
    # extract other data 

    newpoint = Point(filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption) 
    session.add(newpoint) 
    session.commit()

I see the following warning for each insert:

/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/default.py:230: SAWarning: Unicode type received non-unicode bind param value 'Spitalfields, United Kingdom' 
    param.append(processors[key](compiled_params[key]))

And when I try to do anything with the table, such as:

session.query(Point).all()

I get:

Traceback (most recent call last): 
    File "./extract_trips.py", line 131, in <module> 
    session.query(Point).all() 
    File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1193, in all 
    return list(self) 
    File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1341, in instances 
    fetch = cursor.fetchall() 
    File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 1642, in fetchall 
    self.connection._handle_dbapi_exception(e, None, None, self.cursor, self.context) 
    File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception 
    raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect) 
sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'points_location' with text 'Le Pré-Saint-Gervais, France' None None

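I'd like to be able to store the location names correctly and then get them back with the original characters intact. Any help would be much appreciated.

Answer

I found this article, which helped explain my troubles: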

http://www.amk.ca/python/howto/unicode#reading-and-writing-unicode-data

I was able to get the desired results by using the 'codecs' module and then changing my program as follows:

When opening the file:

infile = codecs.open(filename, 'r', encoding='iso-8859-1')

When printing the location:

print location.encode('ISO-8859-1')

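I can now query and manipulate the data from the table without the earlier error. I just have to specify the encoding when I output the text.

(I still don't entirely understand how this works, so I guess it's time to learn more about Python's Unicode handling...)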
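
As a rough illustration of why this fix works, here is a minimal sketch of the "decode on input, encode on output" pattern. It is written for Python 3, where `str` plays the role of Python 2's `unicode`, and it uses `io.open` with an `encoding` argument in place of `codecs.open`. The sample file and its ISO-8859-1 encoding are assumptions made for the example, not details confirmed by the question.

```python
import io
import os
import tempfile

# Hypothetical stand-in for one of the HTML files, written as ISO-8859-1 bytes.
sample = "<h2>Le Pré-Saint-Gervais, France</h2>"
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with io.open(path, "wb") as f:
    f.write(sample.encode("iso-8859-1"))

# Reading in binary mode gives raw bytes. They are not valid UTF-8: the byte
# 0xE9 ("é" in ISO-8859-1) would need UTF-8 continuation bytes after it.
with io.open(path, "rb") as f:
    raw = f.read()
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decode at the input boundary: with an encoding given, read() returns text.
with io.open(path, "r", encoding="iso-8859-1") as infile:
    text = infile.read()

# Encode only at the output boundary, choosing the encoding explicitly.
encoded_for_output = text.encode("iso-8859-1")
```

Under these assumptions, the database layer only ever sees decoded text, which is likely why the "Could not decode to UTF-8" error no longer appears.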

I would try 'cp1252' before 'iso-8859-1'. I don't know if the following helps: http://*.com/questions/368805/python-unicodedecodeerror-am-i-misunderstanding-encode/370199#370199 – tzot 2009-06-10 22:10:56
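
One possible reason behind this suggestion: the two encodings agree on characters such as é, but differ in the byte range 0x80 to 0x9F, where cp1252 defines printable characters (curly quotes, the euro sign) and ISO-8859-1 defines control characters. The sketch below uses Python 3 and a made-up byte string to show the difference.

```python
# Hypothetical bytes: a location name wrapped in Windows-style curly quotes.
raw = b"\x93Le Pr\xe9\x94"

# cp1252 maps 0x93 and 0x94 to the curly quotes U+201C and U+201D.
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps the same bytes to the control characters U+0093 and U+0094.
as_latin1 = raw.decode("iso-8859-1")

# Both codecs agree on 0xE9, which is "é" in each of them.
same_e_acute = b"\xe9".decode("cp1252") == b"\xe9".decode("iso-8859-1")
```

If the source pages were produced on Windows, decoding as ISO-8859-1 would succeed silently but could leave invisible control characters in the text.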

Answer

Try using a column type of Unicode rather than String for the unicode columns:

Base = declarative_base() 
class Point(Base): 
    __tablename__ = 'points' 

    id = Column(Integer, primary_key=True) 
    pdate = Column(Date) 
    ptime = Column(Time) 
    location = Column(Unicode(32)) 
    weather = Column(String(16)) 
    high = Column(Float) 
    low = Column(Float) 
    lat = Column(String(16)) 
    lon = Column(String(16)) 
    image = Column(String(64)) 
    caption = Column(String(64))

Edit: In response to the comment:

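If you're getting warnings about unicode encodings, there are two things you can try:

Convert your location to unicode. This would mean creating your Point like this:

newpoint = Point(filename, pdate, ptime, unicode(location), weather, high, low, lat, lon, image, caption)

The unicode conversion will produce a unicode string when passed either a string or a unicode string, so you don't need to worry about what you pass in.
If that doesn't solve the encoding issues, try calling encode on your unicode objects. That would mean using code like:

newpoint = Point(filename, pdate, ptime, unicode(location).encode('utf-8'), weather, high, low, lat, lon, image, caption)

This step probably won't be necessary, but what it essentially does is convert a unicode object from unicode code points to a specific byte representation (in this case, utf-8). I'd expect SQLAlchemy to do this for you when you pass in unicode objects, but it may not.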
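
To make the distinction between code points and bytes concrete, here is a small Python 3 sketch (`str` corresponds to Python 2's `unicode`, `bytes` to Python 2's `str`). One caveat worth noting: in Python 2, `unicode(location)` with a single argument decodes with the ASCII codec by default, so it fails on a byte string that contains non-ASCII bytes unless an encoding is passed explicitly. The variable names below are illustrative only.

```python
location = "Hôtel-de-Ville, France"  # text: a sequence of code points

# Encoding turns code points into a specific byte representation.
utf8_bytes = location.encode("utf-8")         # "ô" becomes two bytes, C3 B4
latin1_bytes = location.encode("iso-8859-1")  # "ô" becomes one byte, F4

# Decoding with the matching codec recovers the original text.
roundtrip = latin1_bytes.decode("iso-8859-1")

# Decoding with the ASCII codec fails on the non-ASCII byte, which mirrors
# what Python 2's one-argument unicode(bytestring) does by default.
try:
    latin1_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

The practical consequence is that the decoding step needs to name the encoding the bytes were actually written in.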

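Thanks for the suggestion. I think it's a step in the right direction. I'm now getting a warning about the encoding of the data I'm inserting, but I'm not sure how to fix it. I've updated my question to reflect your suggestion. – 2009-06-08 19:39:57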

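Answer

From sqlalchemy.org

See section 0.4.2

Added new flag to String and create_engine(), assert_unicode=(True|False|'warn'|None). Defaults to False or None on create_engine() and String, 'warn' on the Unicode type. When True, results in all unicode conversion operations raising an exception when a non-unicode bytestring is passed as a bind parameter. 'warn' results in a warning. It is strongly advised that all unicode-aware applications make proper use of Python unicode objects (i.e. u'hello' and not 'hello') so that data round trips accurately.

I think you are trying to input a non-unicode bytestring. Perhaps this might lead you onto the right track? Some form of conversion is needed; compare 'hello' and u'hello'.

Cheers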
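
The comparison between 'hello' and u'hello' can be sketched in Python 3, where the roles are written b'hello' (bytes) and 'hello' (text). The example uses the standard-library sqlite3 module directly rather than SQLAlchemy, purely to stay dependency-free; the table and column names are borrowed from the question, and everything else is an assumption for illustration.

```python
import sqlite3

# A byte string and a text string are different types, even for ASCII content.
byte_value = b"hello"   # Python 2 spelling: 'hello'
text_value = "hello"    # Python 2 spelling: u'hello'
types_differ = type(byte_value) is not type(text_value)
converted = byte_value.decode("ascii")  # explicit conversion to text

# Passing text as the bind parameter lets the value round-trip intact.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, location TEXT)")
location = "Le Pré-Saint-Gervais, France"
conn.execute("INSERT INTO points (location) VALUES (?)", (location,))
conn.commit()
(stored,) = conn.execute("SELECT location FROM points").fetchone()
conn.close()
```

The same principle should carry over to the SQLAlchemy model in the question: the value handed to the Unicode column needs to be decoded text, not the raw bytes read from the file.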
