Open Source Fundamentals: Review Notes

Python Basics

Drawing a Tree with Python Turtle

to tree :len :n
    if :n < 1 [stop]
    forward :len
    left 45
    tree (:len / 2) (:n - 1)
    right 90
    tree (:len / 2) (:n - 1)
    left 45
    back :len
end
# Python version of the Logo procedure above
import turtle

def tree(length, n):
    """Draw one branch of a tree, recursing n levels deep."""
    if n < 1:
        return  # base case: no levels left
    turtle.forward(length)  # trunk
    turtle.left(45)  # turn left for the left branch
    tree(length * 0.5, n - 1)  # left subtree
    turtle.right(90)  # turn right for the right branch
    tree(length * 0.5, n - 1)  # right subtree
    turtle.left(45)  # restore the original heading
    turtle.backward(length)  # go back to the starting point

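Numbers

• Division with / always returns a float
• Use // for integer (floor) division
• Use % for the remainder (modulo)
• Use ** for exponentiation
• Use = for assignment
• Using a variable that has not been assigned raises a NameError
• Mixing integers and floats promotes the result to float

• In interactive mode, _ is a special variable holding the result of the last evaluated expression (loosely comparable to $? in bash or %errorlevel% in cmd, which hold the last exit status)
• It is not available in non-interactive mode
• In the default interactive shell (not IPython), assigning to _ creates an ordinary variable that hides this behaviour
• Besides integers and floats, complex numbers are also supported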
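
The arithmetic rules above can be checked with a short script. This is a minimal sketch; all variable names are illustrative and not part of the original notes.

```python
# Division, floor division, modulo and exponentiation
quotient = 7 / 2          # true division always yields a float
floor_q = 7 // 2          # floor division yields an int for int operands
remainder = 7 % 2         # modulo
power = 2 ** 10           # exponentiation

# Mixing int and float promotes the result to float
mixed = 3 + 0.5

# Complex numbers are built in; the suffix j marks the imaginary part
z = (1 + 2j) * (3 - 1j)

# Using a name before assigning to it raises NameError
try:
    undefined_name
except NameError as exc:
    error_text = str(exc)

print(quotient, floor_q, remainder, power, mixed, z)
```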

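Strings

• + concatenates strings
• * repeats a string
• Adjacent string literals are concatenated automatically
• Strings can be indexed
• Indices start at 0; -1 is the last character
• : is used for slicing; the range is closed on the left and open on the right

>>> 3 * 'un' + 'ium'
'unununium'

>>> word = 'Python'
>>> word[2:5]  # characters from position 2 (included) to 5 (excluded)
'tho'

• Python strings are immutable, as in Java
• len returns the length, analogous to strlen in C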

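Lists

• Python supports several compound data types
• The most common is the list
• It is similar to an array
• Elements may have different types
• Like strings, lists can be indexed and sliced
• + and * also behave as they do for strings
• len returns the length
• Unlike strings, lists are mutable
• The append method adds an element at the end
• Multi-dimensional arrays are implemented as nested lists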

>>> squares = [1, 4, 9, 16, 25]
>>> squares[-3:] # slicing returns a new list
[9, 16, 25]
>>> squares + [36, 49, 64, 81, 100]
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

>>> a = ['a', 'b', 'c']
>>> n = [1, 2, 3]
>>> x = [a, n]
>>> x
[['a', 'b', 'c'], [1, 2, 3]]
>>> x[0]
['a', 'b', 'c']
>>> x[0][1]
'b'
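
Mutability and append are listed above but not shown in the examples. The following is a small sketch with illustrative names; it also shows that assigning a list to a second name does not copy it, while slicing does.

```python
squares = [1, 4, 9]
squares.append(16)        # append modifies the list in place
squares[0] = 0            # items can be reassigned

alias = squares           # a second name for the same list object
alias.append(25)          # the change is visible through both names

copy = squares[:]         # slicing creates a new (shallow) copy
copy.append(36)           # does not affect squares

text = "abc"
string_is_immutable = False
try:
    text[0] = "x"         # strings do not support item assignment
except TypeError:
    string_is_immutable = True

print(squares, copy, len(squares))
```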

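Flow Control

• None, False, 0, '', (), [] and {} all evaluate as false
• An if statement can have any number of elif branches

• range is closed on the left and open on the right
• In Python 3, range does not return a list; use the list(range()) idiom when a list is needed

for i in range(5):
    print(i)
# prints 0, 1, 2, 3, 4 on separate lines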
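
The truthiness rules, elif chains and range behaviour described above can be exercised together. This is a minimal sketch; the describe function is a hypothetical helper written for illustration.

```python
# Every value in this list is treated as false in a condition
falsy_values = [None, False, 0, "", (), [], {}]
all_falsy = not any(falsy_values)

def describe(n):
    """Classify a number using an if / elif / else chain."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    elif n < 10:
        return "small"
    else:
        return "large"

lazy = range(0, 10, 2)     # a range object, not a list
evens = list(lazy)         # materialise it when a list is needed

print(all_falsy, describe(5), evens)
```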

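Functions

• A definition starts with def
• def and the function name are separated by a space
• The function name is chosen by the programmer; pick one that is convenient and descriptive
• Parentheses () follow the name and hold the parameter list, which may be empty
• A colon : after the parentheses is required
• The body is a block of statements and must be indented
• return passes a value back to the caller; a function without a return statement returns None
• Call a function as name(arguments), e.g. the tree function in the turtle example above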
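
As a short illustration of the points above, the sketch below defines one function that returns a value and one that does not. The names greet and log are made up for this example.

```python
def greet(name, greeting="Hello"):
    """Return a greeting; greeting has a default value."""
    return f"{greeting}, {name}!"

def log(message):
    """Print a message; there is no return statement here."""
    print(message)

result = greet("Ada")                    # positional argument
custom = greet("Ada", greeting="Hi")     # keyword argument
nothing = log("side effect only")        # the call evaluates to None
```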

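Classes and Objects

• Attributes initialised through self. inside __init__ are instance attributes
• Attributes initialised in the class body, where C++ would declare members, are class attributes (similar to static variables)
• Instance attributes can be added and removed dynamically
• Use hasattr when you need to check whether an object has a given attribute

• There is no equivalent of C++ private members
• By convention, names with a leading underscore are implementation details
• IPython/Jupyter do not tab-complete them by default
• This is similar to dotfiles on *nix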
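
The difference between class attributes and instance attributes is easiest to see in code. The Counter class below is a hypothetical example written for these notes, not taken from the original material.

```python
class Counter:
    instances = 0                  # class attribute, shared by all instances

    def __init__(self, start=0):
        self.value = start         # instance attribute, one per object
        self._step = 1             # leading underscore: implementation detail by convention
        Counter.instances += 1

a = Counter()
b = Counter(10)

a.label = "first"                  # instance attributes can be added dynamically
has_label_a = hasattr(a, "label")
has_label_b = hasattr(b, "label")
del a.label                        # and removed again
has_label_after_del = hasattr(a, "label")

print(Counter.instances, a.value, b.value)
```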

Encapsulation

>>> class Complex:
...     def __init__(self, realpart, imagpart):
...         self.r = realpart
...         self.i = imagpart
...
>>> x = Complex(3.0, -4.5)
>>> x.r, x.i
(3.0, -4.5)

Inheritance

class A:
    def foo(self):
        print("in A::foo")
class B(A):
    pass
b = B()
b.foo()  # B inherits foo from A

Polymorphism

class A:
	def foo(self):
		print("in A::foo")
class B(A):
	def foo(self):
		print("in B::foo")
class C:
	def foo(self):
		print("not subclass of A/B")
b = B()
b.foo()
isinstance(b, B)
isinstance(b, A)

Lambda Expressions

• Note: the body of a Python lambda is a single expression; it cannot contain statements

inc = lambda x: x + 1
add = lambda x, y: x + y
inc(5)
add(10, 20)
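
A common use of lambda is as a short key function. The sketch below uses illustrative names only and also shows where the single-expression restriction forces a switch to def.

```python
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=lambda w: len(w))   # sort by word length

# Several statements do not fit in a lambda, so use def instead
def clamp(x, low=0, high=10):
    if x < low:
        return low
    if x > high:
        return high
    return x

# A conditional expression is still a single expression, so this is allowed
clamp_lambda = lambda x: 0 if x < 0 else (10 if x > 10 else x)

print(by_length, clamp(15), clamp_lambda(-3))
```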

List Comprehensions

[i*2 for i in range(10)]
[i*2 for i in range(10) if i % 2 == 0]
colours = ["red", "green", "yellow", "blue"]
things = ["house", "car", "tree"]
combined = [(x,y) for x in colours for y in things]

Map/Filter/Reduce

from functools import reduce  # reduce is not a builtin in Python 3
map(lambda x: x**2, range(10))  # returns a lazy iterator
list(map(lambda x: x**2, range(10)))
filter(lambda x: x % 2 == 0, map(lambda x: x**2, range(10)))
reduce(lambda x, y: x + y, filter(lambda x: x % 2 == 0, map(lambda x: x**2, range(10))))
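
For comparison, the same computation can be written with a generator expression. This is a sketch with made-up variable names; it also shows that a map object can only be consumed once.

```python
from functools import reduce

# Sum of the even squares below 100, written as a map/filter/reduce pipeline
pipeline = reduce(
    lambda x, y: x + y,
    filter(lambda x: x % 2 == 0, map(lambda x: x ** 2, range(10))),
)

# The same result with a generator expression and the builtin sum
comprehension = sum(x ** 2 for x in range(10) if x ** 2 % 2 == 0)

# map returns a one-shot iterator: the second pass finds it exhausted
lazy = map(lambda x: x ** 2, range(4))
first_pass = list(lazy)
second_pass = list(lazy)

print(pipeline, comprehension, first_pass, second_pass)
```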

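Tuples

• Tuples are similar to lists; both are linear data structures
• A tuple consists of values separated by commas
• The contents of a tuple cannot be changed
• Tuples can be nested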

>>> t = 12345, 54321, 'hello!'
>>> t[0]
12345
>>> t
(12345, 54321, 'hello!')
>>> # Tuples may be nested:
... u = t, (1, 2, 3, 4, 5)
>>> u
((12345, 54321, 'hello!'), (1, 2, 3, 4, 5))
>>> # Tuples are immutable:
... t[0] = 88888
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> # but they can contain mutable objects:
... v = ([1, 2, 3], [3, 2, 1])
>>> v
([1, 2, 3], [3, 2, 1])

• () constructs an empty tuple
• (a,) constructs a single-element tuple; the trailing comma is required
• If an element of a tuple is itself mutable, that element can still be modified (a common pitfall)
• Tuples can also appear on the left-hand side of an assignment (unpacking)

>>> empty = ()
>>> singleton = 'hello',  # <-- note trailing comma
>>> len(empty)
0
>>> len(singleton)
1
>>> singleton
('hello',)

>>> x, y, z = t
>>> x, y = y, x
>>> x = (1, 2, 3, [4, 5])
>>> x[3] += [6]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-1b7fadc4e686> in <module>()
----> 1 x[3] += [6]
TypeError: 'tuple' object does not support item assignment
>>> x  # the list was extended in place before the assignment failed
(1, 2, 3, [4, 5, 6])

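Dictionaries

• A dictionary is like a hash table or associative array in other languages; it stores key-value pairs
• Unlike lists and tuples, it is indexed by key rather than by position
• Any immutable object can be a key (tuples too, provided they contain no mutable elements)
• {} constructs an empty dictionary; a non-empty one is written as comma-separated key: value pairs inside {}
• list(d.keys()) returns all keys, in insertion order since Python 3.7; use sorted(d.keys()) for sorted order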

>>> tel = {'jack': 4098, 'sape': 4139}
>>> tel['guido'] = 4127
>>> tel
{'jack': 4098, 'sape': 4139, 'guido': 4127}
>>> tel['jack']
4098
>>> del tel['sape']
>>> tel['irv'] = 4127
>>> tel
{'jack': 4098, 'guido': 4127, 'irv': 4127}
>>> list(tel.keys())
['jack', 'guido', 'irv']
>>> sorted(tel.keys())
['guido', 'irv', 'jack']
>>> 'guido' in tel
True
>>> 'jack' not in tel
False

>>> tel[(1, 2, 3, [4, 5])] = 6  # a tuple containing a list is not hashable
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

• A dictionary can also be initialised from a list of 2-tuples
• Dictionary comprehensions work like list comprehensions
• When the keys are strings, they can be given as keyword arguments

>>> dict([('sape', 4139), ('guido', 4127), ('jack', 4098)])
{'sape': 4139, 'guido': 4127, 'jack': 4098}
>>> {x: x ** 2 for x in (2, 4, 6)}
{2: 4, 4: 16, 6: 36}
>>> dict(sape=4139, guido=4127, jack=4098)
{'sape': 4139, 'guido': 4127, 'jack': 4098}

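Sets

• A set is an unordered collection with no duplicate elements
• It supports membership testing, adding and removing elements
• It supports intersection, union, difference and other set operations
• A set is written as comma-separated elements inside {}
• Use set() for an empty set, not {} (which creates an empty dictionary)
• Set comprehensions are also supported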

>>> basket = {'apple', 'orange', 'apple', 'pear', 'orange', 'banana'}
>>> print(basket)  # show that duplicates have been removed
{'orange', 'banana', 'pear', 'apple'}
>>> 'orange' in basket  # fast membership testing
True
>>> 'crabgrass' in basket
False
>>> # Demonstrate set operations on unique letters from two words
...
>>> a = set('abracadabra')
>>> b = set('alacazam')
>>> a  # unique letters in a
{'a', 'r', 'b', 'c', 'd'}
>>> a - b  # letters in a but not in b
{'r', 'd', 'b'}
>>> a | b  # letters in a or b or both
{'a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'}
>>> a & b  # letters in both a and b
{'a', 'c'}
>>> a ^ b  # letters in a or b but not both
{'r', 'd', 'b', 'm', 'z', 'l'}
>>> a = {x for x in 'abracadabra' if x not in 'abc'}
>>> a
{'r', 'd'}

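HTTP Basics

• GET requests a representation of the specified resource. GET should only be used to read data, never for operations with side effects (for example, state changes in a web application). One reason is that GET requests may be issued freely by web crawlers and the like.
• HEAD, like GET, requests the specified resource, except that the server does not return the response body. Its advantage is that information about the resource (its metadata) can be obtained without transferring the full content.
• POST submits data to the specified resource and asks the server to process it (for example, submitting a form or uploading a file). The data is carried in the request body. The request may create a new resource, modify an existing one, or both.
• PUT uploads the latest content to the specified resource location.
• DELETE asks the server to delete the resource identified by the Request-URI.
• TRACE echoes back the request the server received; it is mainly used for testing or diagnostics.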
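
The difference between GET and HEAD can be observed locally. The sketch below is not part of the original notes: it starts a throwaway standard-library server on a free local port and issues both requests with http.client. Handler, BODY and fetch are hypothetical names chosen for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"hello"

class Handler(BaseHTTPRequestHandler):
    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)      # GET returns headers and body

    def do_HEAD(self):
        self._send_headers()        # HEAD returns the same headers, no body

    def log_message(self, format, *args):
        pass                        # keep the example quiet

def fetch(method):
    """Send one request to a temporary local server; return (status, Content-Length, body)."""
    server = HTTPServer(("127.0.0.1", 0), Handler)   # port 0: let the OS pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request(method, "/")
        resp = conn.getresponse()
        result = (resp.status, resp.getheader("Content-Length"), resp.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()

get_result = fetch("GET")
head_result = fetch("HEAD")
print(get_result, head_result)
```

Both responses carry the same Content-Length header, but only the GET response has a body.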

Robots Exclusion Protocol

Deny all robots access to specific paths

User-agent: *
Disallow: /*.php$
Disallow: /images/

• Read robots.txt manually
• Or use urllib.robotparser

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://127.0.0.1:8000/robots.txt")
>>> rp.read()
>>> rp.can_fetch("Baiduspider", "http://127.0.0.1:8000/")
False
>>> rp.can_fetch("Baiduspider", "http://127.0.0.1:8000/article/")
True
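
The parser can also be tried without a running server by feeding it the rules directly. This is a minimal sketch: the rules, the crawler name MyCrawler and the example.com URLs are made up, and it sticks to plain path-prefix rules rather than wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())   # parse() takes an iterable of lines; no network access

blocked = rp.can_fetch("MyCrawler", "http://example.com/images/logo.png")
allowed = rp.can_fetch("MyCrawler", "http://example.com/article/1")
print(blocked, allowed)
```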

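Crawlers

Third-party dependencies

requests: a convenient, concise, efficient and user-friendly HTTP library
BeautifulSoup: an HTML parsing library
pymongo: the Python driver for MongoDB
selenium: a web automation testing framework, used to simulate logins and fetch data rendered by JavaScript
pytesseract: an OCR module, used to recognise CAPTCHAs
Pillow: a Python imaging library

A small problem

The pillow package shipped by conda may be outdated or broken; in that case use the latest pillow from pip instead.

Solution:

conda uninstall pillow
conda update pip
pip install pillow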

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()
{'private_gists': 419, 'total_private_repos': 77, ...}

Making a Request

import requests
r = requests.get("http://127.0.0.1:8000")
print(r.text)
print(r.encoding)
print(r.url)
print(r.status_code)

GET parameters

import requests
import json
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=payload)
r.encoding = "UTF-8"  # set the encoding explicitly to avoid decoding problems
result = json.loads(r.text)
print(result['args'])

POST parameters

r = requests.post('http://httpbin.org/post', data={'key': 'value'})

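Regular Expressions

>>> import re
>>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']
>>> re.sub(r'(\b[a-z]+) \1', r'\1', 'cat in the the hat')
'cat in the hat'
>>> # find all links in an HTML string (html is assumed to hold the page source)
>>> re.findall(r"href='(.*?)'", html)
['a.html', 'b.html']

• [a-zA-Z] matches one character from the set
• ^ and $ match the start and the end
• \^ and \$ match the literal characters
• . matches any character (except a newline by default); \. matches a literal dot
• \b matches a word boundary
• \w matches a word character (letter, digit or underscore)
• \s matches a whitespace character

• * means any number of repetitions; \* is the literal character
• + means one or more repetitions; \+ is the literal character
• *? and +? request the shortest (non-greedy) match
• {3} and {4,6} mean exactly 3 and 4 to 6 repetitions

>>> import re
>>> re.match(r'ab*', 'abbbb').group()
'abbbb'
>>> re.match(r'ab*?', 'abbbb').group()
'a'
>>> re.match(r'a\d{3,4}', 'a1234567').group()
'a1234'
>>> # use | for alternation; \| is the literal character
>>> re.match(r'(abc)|(def)', 'abc').group()
'abc'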

match vs. findall

• match only matches at the beginning of the string
• search is similar to match but looks for the first match anywhere in the string
• findall returns all non-overlapping matches

>>> import re
>>> p = re.compile(r'\d+')
>>> txt = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
>>> p.findall(txt)
['12', '11', '10']
>>> p.match(txt).group()
'12'
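
The example above does not show search, nor a case where match fails. The following sketch, with an illustrative pattern and text, puts the three functions side by side.

```python
import re

txt = "12 drummers drumming, 11 pipers piping"
pattern = re.compile(r"[a-z]+")

at_start = pattern.match(txt)          # the text starts with a digit, so match finds nothing
first = pattern.search(txt).group()    # search scans forward to the first match
everything = pattern.findall(txt)      # findall collects every non-overlapping match

print(at_start, first, everything)
```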

Grouping and Capturing

• () defines a group
• \N (a backslash followed by a number) refers back to group N

>>> re.match(r'x(\d+) = \1', 'x123 = 123').group()
'x123 = 123'
>>> re.match(r'x(\d+) = \1', 'x123 = 123').groups()
('123',)
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
>>> re.findall(r'(.)\1', '明明亮亮蛋蛋')
['明', '亮', '蛋']

BeautifulSoup

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc holds the "Alice" sample document
>>> print(soup.prettify())
# (prints the indented HTML document)
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.title.parent.name
'head'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find(id="link3")
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Extract the URLs of all <a> tags in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

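Beautiful Soup Objects

Beautiful Soup converts a complex HTML document into a tree of Python objects. Every node is one of four kinds: Tag, NavigableString, BeautifulSoup and Comment.

A Tag has many methods and attributes, covered under navigating and searching the tree. The most important ones are its name and its attributes.

• Every tag has a name, available as .name
• A tag can have any number of attributes. The tag below has a "class" attribute with the value "boldest". Attributes are accessed the same way as dictionary items
• The attribute dictionary itself is available as .attrs
• Strings are usually contained inside tags. Beautiful Soup wraps the text inside a tag in the NavigableString class

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name
# 'b'
tag['class']
# ['boldest']
tag.attrs
# {'class': ['boldest']}
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
# The BeautifulSoup object is not a real HTML or XML tag; its .name is the special value "[document]"
soup.name
# '[document]'
# Comments
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>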

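Navigating the Tree

• Navigate with dot notation, e.g. doc.body.a is the first link under body
• .descendants iterates over all descendants (children, grandchildren and so on)
• .next_element is the next node in document order
• .next_sibling is the next sibling
• .name returns the tag name
• .get_text() returns the text
• Plural attribute names (.children, .parents, ...) are generators
• find_all is more convenient
• select is also handy and supports CSS selectors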

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>

Attribute-style access only returns the first tag with that name:
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get all <a> tags, or anything more than the first tag with a given name, use one of the search methods such as find_all():
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

A tag's .contents attribute returns its children as a list:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]

The BeautifulSoup object itself has children; here the <html> tag is its only child:
len(soup.contents)
# 1
soup.contents[0].name
# 'html'

A tag's .children generator lets you loop over its direct children:
for child in title_tag.children:
    print(child)
# The Dormouse's story

Continuing up the tree: every tag and every string has a parent, the tag that contains it. The .parent attribute returns it. In the "Alice" sample document, the <head> tag is the parent of the <title> tag:
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
The title string also has a parent, the <title> tag:
title_tag.string.parent
# <title>The Dormouse's story</title>
The parent of a top-level tag such as <html> is the BeautifulSoup object:
html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>
The .parent of the BeautifulSoup object is None:
print(soup.parent)
# None
An element's .parents attribute iterates over all of its ancestors. The example below walks from an <a> tag up to the root of the document:
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]

last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
last_a_tag.next_sibling
# '; and they lived at the bottom of a well.'

Searching the Tree

Beautiful Soup defines many search methods; the two covered here are find() and find_all().

If you pass a regular expression, Beautiful Soup matches tag names against it.

import re
# find all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
# find all tags whose names contain "t"
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

If you pass a list, Beautiful Soup returns anything that matches any item in the list. The code below finds all <a> tags and all <b> tags:
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True matches every tag; the code below finds all tags in the document but none of the text strings:
for tag in soup.find_all(True):
    print(tag.name)

find_all(name, attrs, recursive, text, **kwargs)

find_all() searches all descendant tags of the current tag and returns those that match the filters. Some examples:

soup.find_all("title")
# [<title>The Dormouse's story</title>]
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
import re
soup.find(text=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

The name argument finds all tags with the given name; text strings are ignored automatically.
soup.find_all("title")
# [<title>The Dormouse's story</title>]
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

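If you pass a keyword argument such as href, Beautiful Soup filters on the "href" attribute of each tag:
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Several keyword arguments can be combined to filter on multiple attributes at once:
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Tags with special attribute names can be searched by passing a dictionary as the attrs argument of find_all():
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

Searching by CSS class is very useful, but class is a reserved word in Python, so using it as a keyword argument is a syntax error. Since Beautiful Soup 4.1.1 you can search by CSS class with the class_ argument: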
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The class attribute is multi-valued. When you search by CSS class, you can match against any one of a tag's classes:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]
css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

The text argument searches the strings in the document instead of the tags. Like name, it accepts a string, a regular expression, a list or True. For example:
soup.find_all(text="Elsie")
# ['Elsie']
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']
soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

Although text is meant for finding strings, it can be combined with arguments that find tags: Beautiful Soup then returns the tags whose .string matches the text value. The code below finds the <a> tags whose .string is "Elsie":
soup.find_all("a", text="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

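find_all() and find() work their way down the tree, looking at a tag's children, grandchildren and so on. find_parents() and find_parent() do the opposite and search a node's ancestors, accepting the same filters. Starting from a string deep in the document:
a_string = soup.find(text="Lacie")
a_string
# 'Lacie'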
a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or of the BeautifulSoup object:
soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Descend through tags level by level:
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse’s story</title>]

Find tags by CSS class:
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by the presence of an attribute:
soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:
soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Start Crawling

import requests
import bs4

url = "http://oscar-lab.org/puzzle.html"
txt = requests.get(url).text
soup = bs4.BeautifulSoup(txt, "lxml")
tbody = soup.find("tbody")
for tr in tbody.find_all("tr"):
    for td in tr.find_all("td"):
        # traverse all cells, e.g. print the text of each one
        print(td.get_text())

A Crawler Example

#!/usr/bin/python3

import requests, re, bs4
import subprocess
import glob

prefix = "https://lkml.org"
seed = "%s/lkml/last100/" % prefix

req = requests.get(seed)

soup = bs4.BeautifulSoup(req.text, "lxml")
interested = soup.find_all(class_="mh")
assert len(interested) == 1

hrefid = 0
for tr in interested[0].find_all("tr"):
    class_ = tr.get("class")
    if class_ is None or ("c0" not in class_ and "c1" not in class_):
        continue
    tds = tr.find_all('td')
    assert(len(tds) == 3)
    href = tds[1].a['href']
    target = "%d.html" % hrefid
    #subprocess.call(['wget', "%s/%s" % (prefix, href), "-O", target])
    tds[1].a['href'] = target
    tds[2].a['href'] = target
    hrefid += 1

#modify css location
css = soup.find(href=r'/css/message.css')
assert(css != None)
css['href'] = "." + css['href']


handle = open("index.html", "w")
handle.write(soup.prettify())
handle.close()

for html in glob.glob("*.html"):
    handle = open(html)
    soup = bs4.BeautifulSoup(handle.read(), "lxml")
    for script in soup.find_all("script"):
        script.decompose()
    handle.close()
    handle = open(html, "w")
    handle.write(soup.prettify())
    handle.close()

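Database: sqlite3

Opening and closing files

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

'r'  read (default)
'w'  write (truncates the file first)
'x'  exclusive creation (fails if the file already exists)
'a'  write (append to the end)
'b'  binary mode
't'  text mode (default)
'+'  update (read and write)
'U'  universal newlines mode (deprecated; removed in Python 3.11)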

handle = open("input.txt", "r")
content = handle.read()
handle.close()
#playing around with content
handle = open("output.txt", "w")
handle.write(content)
handle.close()
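
The mode flags can be tried safely inside a temporary directory. The sketch below uses the with statement, which closes the file automatically; the file name notes.txt and the variable names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")

    with open(path, "w", encoding="utf-8") as handle:   # 'w' creates or truncates
        handle.write("first\n")

    with open(path, "a", encoding="utf-8") as handle:   # 'a' appends at the end
        handle.write("second\n")

    exclusive_failed = False
    try:
        open(path, "x")                                 # 'x' refuses to overwrite
    except FileExistsError:
        exclusive_failed = True

    with open(path, "r", encoding="utf-8") as handle:   # text mode returns str
        lines = handle.read().splitlines()

    with open(path, "rb") as handle:                    # binary mode returns bytes
        raw = handle.read()

print(lines, exclusive_failed, type(raw).__name__)
```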

csv

Reading and writing
• Use reader() and writer() to construct the reader and writer objects
• Both take a file handle as their argument
• A reader object can be iterated over (or converted to a list)
• A writer object writes rows with writerow/writerows

>>> import csv
>>> students = [['name', 'gender', 'age'],
...             ['zhangsan', 'male', 13],
...             ['lisi', 'female', 14]]
>>> handle = open('students.csv', 'w', newline='')
>>> writer = csv.writer(handle)
>>> writer.writerows(students)
>>> handle.close()
>>> handle = open('students.csv', 'r', newline='')
>>> reader = csv.reader(handle)
>>> for row in reader:
...     print(row)
...
>>> handle.close()

sqlite3

import sqlite3
conn = sqlite3.connect('example.db')
# or use :memory:
conn_memory = sqlite3.connect(':memory:')

c = conn.cursor()
# Create table
c.execute('''CREATE TABLE stocks
(date text, trans text, symbol text, qty real, price real)''')
# Insert a row of data
c.execute("INSERT INTO stocks VALUES "
          "('2006-01-05','BUY','RHAT',100,35.14)")
# Save (commit) the changes
conn.commit()
# We can also close the connection if we are done with it.
# Just be sure any changes have been committed or they will be lost.
conn.close()

Preventing SQL Injection

# Never do this -- insecure!
symbol = 'RHAT'
c.execute("SELECT * FROM stocks WHERE symbol = '%s'" % symbol)

# Do this instead
t = ('RHAT',)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)
print(c.fetchone())
# Larger example that inserts many records at a time
purchases = [('2006-03-28', 'BUY', 'IBM', 1000, 45.00),
             ('2006-04-05', 'BUY', 'MSFT', 1000, 72.00),
             ('2006-04-06', 'SELL', 'IBM', 500, 53.00),]
c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', purchases)
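
To see why string formatting is dangerous, the sketch below runs the same hostile input through both styles against an in-memory database. The table contents and the malicious string are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE stocks (symbol text, qty real)")
c.executemany("INSERT INTO stocks VALUES (?, ?)", [("RHAT", 100), ("IBM", 500)])

# Input crafted so that the WHERE clause becomes always true
malicious = "nothing' OR '1'='1"

# Insecure: the input is pasted into the SQL text and changes the query
unsafe_rows = c.execute(
    "SELECT * FROM stocks WHERE symbol = '%s'" % malicious
).fetchall()

# Safe: the ? placeholder passes the input as data, never as SQL
safe_rows = c.execute(
    "SELECT * FROM stocks WHERE symbol = ?", (malicious,)
).fetchall()

conn.close()
print(len(unsafe_rows), len(safe_rows))
```

The formatted query leaks every row, while the parameterised query correctly finds no symbol with that odd name.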

Using shortcut methods

$ cat a.sql
create table news (id integer, score integer, title text, href text);
insert into news values (1, 8, 'hello world', 'http://oscar-lab.org');
insert into news values (2, 2, 'hello charlie', 'http://www.dlut.edu.cn');
$ python3
>>> import sqlite3
>>> conn = sqlite3.connect('a.db')
>>> c = conn.cursor()
>>> c.executescript(open('a.sql').read())
>>> list(c.execute('select * from news'))
[(1, 8, 'hello world', 'http://oscar-lab.org'),
(2, 2, 'hello charlie', 'http://www.dlut.edu.cn')]

Row Objects

>>> conn.row_factory = sqlite3.Row
>>> c = conn.cursor()
>>> c.execute('select * from stocks')
<sqlite3.Cursor object at 0x7f4e7dd8fa80>
>>> r = c.fetchone()
>>> type(r)
<class 'sqlite3.Row'>
>>> tuple(r)
('2006-01-05', 'BUY', 'RHAT', 100.0, 35.14)
>>> len(r)
5
>>> r[2]
'RHAT'

>>> for row in c.execute('SELECT * FROM stocks ORDER BY price'):
...     print(tuple(row))
...
('2006-01-05', 'BUY', 'RHAT', 100.0, 35.14)
('2006-03-28', 'BUY', 'IBM', 1000.0, 45.0)
('2006-04-06', 'SELL', 'IBM', 500.0, 53.0)
('2006-04-05', 'BUY', 'MSFT', 1000.0, 72.0)

Python type    SQLite type
None           NULL
int            INTEGER
float          REAL
str            TEXT
bytes          BLOB
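
The mapping in the table can be confirmed by asking SQLite for the storage class of a bound parameter with its typeof() function. This is a minimal sketch; the sample values are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One sample value per Python type from the table
samples = [None, 42, 3.14, "text", b"\x00\x01"]

# typeof() reports the storage class SQLite used for each bound value
storage = [
    conn.execute("SELECT typeof(?)", (value,)).fetchone()[0]
    for value in samples
]
conn.close()
print(storage)
```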

JSON

{
    "firstName": "John",
    "lastName": "Smith",
    "isAlive": true,
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
    },
    "phoneNumbers": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "office",
            "number": "646 555-4567"
        }
    ],
    "children": [],
    "spouse": null
}

>>> import json
>>> json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
'["foo", {"bar": ["baz", null, 1.0, 2]}]'
>>> print(json.dumps({"c": 0, "b": 0, "a": 0}, sort_keys=True))
{"a": 0, "b": 0, "c": 0}
>>> handle = open("output.json", "w")
>>> json.dump({'foo': 123, 'bar': 'buz'}, handle)
>>> handle.close()

>>> json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')
['foo', {'bar': ['baz', None, 1.0, 2]}]
>>> handle = open('output.json', 'r')
>>> b = json.load(handle)
>>> handle.close()
>>> print(b)
{'foo': 123, 'bar': 'buz'}

vim

h j k l: left, down, up, right

Motions
• gg: goto first line
• G: goto last line
• 0: goto first column
• ^: goto first non-blank column
• $: goto last column
• w: next word
• W: next Word
• e: end of word
• E: end of Word
• fa: find a in current line
• Fa: find a in current line backward
• ta: find till before a
• Ta: find till before a backward
Standalone commands
• i: insert
• I: insert at front
• a: append
• A: append after end
• o: open a new line below
• O: open a new line above
Operators
• c: Change
• d: Delete
• y: Yank
• gU: Upper case
• gu: Lower case ...
Text objects
• aw: a word
• iw: inner word
• ab: a bracket
• ib: inner bracket
• at: a tag
Operator + motion/text object
• dj: delete current and the next lines
• y0: copy to the line head
• ciw: change inner word
• dab: delete a bracket
• gUit: change inner tag to upper case
More advanced features
• :%s/re.pattern/replace/g replace throughout the file
• :g/re.pattern/d delete matching lines
• :h! E478 Don't panic
• :h 42

git

#initialization
$ git init
#configuration
$ git config --global user.name "Your Name"
$ git config --global user.email "you@example.com"
# First commit
$ git add example.py
# adds file to the staging area
$ git commit -m "First commit"
# Second commit
$ git add another.py
# adds file to the staging area
$ git commit -m "Second commit"

• By default you are on the "master" branch
• To create a new branch and switch to it:
    $ git branch branchName
    $ git checkout branchName
• Merge changes back to master branch
    $ git checkout master
    $ git merge branchName
• To delete branch
    $ git branch -d branchName

How to Use Branches

• Master branch is always clean and working
• When I want to try a new idea, create a branch
• Work and commit to that branch until code is clean and working
• Switch back to master and merge branch into master
• Delete branch

• Any branch can be merged to any other branch(Careful, this might get confusing)
• Commit (or stash) your changes before switching branches
• Unless you add it, a file will not be tracked.
• If you want to ignore a file, add it to the .git/info/exclude file.
• To revert a file back to its state at the last commit:
git checkout fileName
• To revert the entire repo to its state at the last commit:
git reset --hard HEAD
• CAREFUL! You'll lose all your changes since your last commit!

Unit Testing

import unittest

class TestStringMethods(unittest.TestCase):
    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

    def test_isupper(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    def test_split(self):
        s = 'hello world'
        self.assertEqual(s.split(), ['hello', 'world'])
        # check that s.split fails when the separator is not a string
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == '__main__':
    unittest.main()

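• Test cases are instances of subclasses of unittest.TestCase
• The name of every test method starts with test
• Use the assert* methods inherited from TestCase to check results
• Use setUp and tearDown to prepare and clean up before and after each test (similar to a constructor and destructor)
• Use unittest.main() to run the test cases
• Pass -v for verbose output
• Test methods run in alphabetical order of their names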

test_isupper (__main__.TestStringMethods) ... ok
test_split (__main__.TestStringMethods) ... ok
test_upper (__main__.TestStringMethods) ... ok

Method                        Checks that
assertAlmostEqual(a, b)       round(a-b, 7) == 0
assertNotAlmostEqual(a, b)    round(a-b, 7) != 0
assertGreater(a, b)           a > b
assertGreaterEqual(a, b)      a >= b
assertLess(a, b)              a < b
assertLessEqual(a, b)         a <= b
assertRegex(s, r)             r.search(s)
assertNotRegex(s, r)          not r.search(s)
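
A few of these assertions in use, as a minimal sketch: the TestNumbers class and its data are invented for illustration, and the suite is run programmatically so the outcome can be inspected instead of exiting the interpreter.

```python
import io
import unittest

class TestNumbers(unittest.TestCase):
    def setUp(self):
        # Runs before every test method
        self.values = [0.1, 0.2]

    def test_almost_equal(self):
        # Floating-point rounding makes the exact comparison fail
        self.assertNotEqual(sum(self.values), 0.3)
        self.assertAlmostEqual(sum(self.values), 0.3)

    def test_comparisons(self):
        self.assertGreater(3, 2)
        self.assertLessEqual(2, 2)

    def test_regex(self):
        self.assertRegex("build-2024", r"\d{4}")
        self.assertNotRegex("build", r"\d")

# Load and run the tests without calling unittest.main()
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNumbers)
result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=2).run(suite)
print(result.testsRun, result.wasSuccessful())
```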

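• Besides being run from within a script, tests can also be run from the command line
• To run the tests in a particular file, drop the .py suffix and replace path separators with "." (as in Java)
• -v likewise gives verbose output
• Running python -m unittest with no arguments starts test discovery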

python -m unittest test_module1 test_module2
python -m unittest test_module.TestClass
python -m unittest test_module.TestClass.test_method
python -m unittest -v test_module

Using coverage

coverage run --branch foo.py
coverage html

Flask

Installing Flask

• pip install flask
• conda install flask

Example

from flask import Flask
app = Flask(__name__)
@app.route('/')
def hello_world():
    return 'Hello, World!'

Starting the server

• Save the file as hello.py
• export FLASK_APP=hello.py (Linux/macOS)
• set FLASK_APP=hello.py (Windows)
• flask run
• Use --host=0.0.0.0 to make the server reachable from other machines

$ export FLASK_APP=hello.py   # on Windows: set FLASK_APP=hello.py
$ flask run
 * Running on http://127.0.0.1:5000/

Routing

• Use the @app.route() decorator to bind a URL path to a function
• The function's return value is the response to GET/POST requests
• This establishes the binding between URLs and functions
• Use <variable> parts in the path, passed as function arguments, to express dynamic URLs

@app.route('/')
def index():
    return 'Index Page'
@app.route('/hello')
def hello():
    return 'Hello, World'

@app.route('/user/<username>')
def show_user_profile(username):
    # show the user profile for that user
    return 'User %s' % username
@app.route('/post/<int:post_id>')
def show_post(post_id):
    # show the post with the given id, the id is an integer
    return 'Post %d' % post_id

from flask import request
@app.route('/login', methods=['GET', 'POST'])
def login():
    if request.method == 'POST':
        return do_the_login()
    else:
        return show_the_login_form()

Rendering Templates

from flask import render_template
@app.route('/hello/')
@app.route('/hello/<name>')
def hello(name=None):
    return render_template('hello.html', name=name)
	
/application.py
/templates
	/hello.html

Accessing Request Data

@app.route('/login', methods=['POST', 'GET'])
def login():
    error = None
    if request.method == 'POST':
        if valid_login(request.form['username'],
                       request.form['password']):
            return log_the_user_in(request.form['username'])
        else:
            error = 'Invalid username/password'
    # the code below is executed if the request method
    # was GET or the credentials were invalid
    return render_template('login.html', error=error)

# query-string parameters (?key=value) are read from request.args
searchword = request.args.get('key', '')

Redirects and Errors

from flask import abort, redirect, url_for
@app.route('/')
def index():
    return redirect(url_for('login'))
@app.route('/login')
def login():
    abort(401)
    this_is_never_executed()

Sessions

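from flask import Flask, session, redirect, url_for, request
from markupsafe import escape  # escape is no longer exported by recent Flask versions

app = Flask(__name__)
# sessions are signed with this key; replace the placeholder with a random value (see below)
app.secret_key = b'replace-with-random-bytes'

@app.route('/')
def index():
    if 'username' in session:
        return 'Logged in as %s' % escape(session['username'])
    return 'You are not logged in'

@app.route('/login', methods=['GET', 'POST'])
def login():
    if request.method == 'POST':
        session['username'] = request.form['username']
        return redirect(url_for('index'))
    return '<form method="post"><input name="username"><input type="submit" value="Login"></form>'

>>> import os
>>> os.urandom(24)
b'\xfd{H\xe5<\x95\xf9\xe3\x96.5\xd1\x01O<!\xd5\xa2\xa0\x9fR"\xa1\xa8'
Generate a value like this and paste it into your code as the secret key.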

• Data acquisition (requests)
• Data filtering (bs4)
• Data storage (sqlite3)
• Data presentation (flask)

A Small Flask Example

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from flask import Flask, request, render_template, url_for, redirect
import sqlite3

app = Flask(__name__)

@app.route("/")
def root():
    return redirect("index")
@app.route("/index")
def index():
    return """\
            <html>
                <body>
                    <p> This is the index page</p>
                </body>
            </html>"""

@app.route("/hello/")
@app.route("/hello/<name>")
def hello(name="anonymous"):
    conn = sqlite3.connect("application.db")
    cursor = conn.cursor()
    results = list(cursor.execute("select * from students"))
    conn.close()
    return render_template("hello.html", name=name, results=results)

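Visualization

Plotting

• Import NumPy with import numpy
• Prepare the data with numpy.linspace, numpy.arange, or a plain list
• Import the plotting package with import matplotlib.pyplot as plt
• Draw with plt.plot(x, y) and plt.show() (Jupyter recommended)
• Set the axis labels and title with plt.xlabel(''), plt.ylabel(''), plt.title('')
• Save the figure with plt.savefig(fname, dpi=dpi)
• fig, ax = plt.subplots() returns the figure handle and the axes
• fig.savefig() saves the figure
• ax.plot() draws on the axes
• ax.set(xlabel="x", ylabel="y", title="title") sets the labels and title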
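The two NumPy helpers mentioned in the list differ in how they treat the end value, which is easy to get wrong when preparing x data. A short sketch, with values chosen only for illustration:

```python
import numpy as np

# linspace: ask for a number of points; the end value is included
x_lin = np.linspace(0.0, 1.0, 5)

# arange: ask for a step; the stop value is excluded, like range()
x_ar = np.arange(0.0, 1.0, 0.25)

print(x_lin.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(x_ar.tolist())   # [0.0, 0.25, 0.5, 0.75]
```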

import matplotlib.pyplot as plt
import numpy as np
# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)
# Note that using plt.subplots below is equivalent to using
# fig = plt.figure and then ax = fig.add_subplot(111)
fig, ax = plt.subplots()
ax.plot(t, s)
ax.set(xlabel='time (s)', ylabel='voltage (mV)',
       title='About as simple as it gets, folks')
ax.grid()
fig.savefig("test.png")
plt.show()

Histogram (hist) function

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(19680801)
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)
# the histogram of the data
# density=True normalizes the bars to unit area (it replaces the removed normed=1)
n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()
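The y axis reads "Probability" because the histogram is normalized (the normed flag in older Matplotlib, density in current releases): bar heights are scaled so that the total bar area is 1. The sketch below checks this with np.histogram, which computes the same kind of bin heights without drawing anything. The variable names heights, edges, and area are illustrative.

```python
import numpy as np

np.random.seed(19680801)
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

# density=True rescales the counts so that the bars integrate to 1
heights, edges = np.histogram(x, bins=50, density=True)

# area = sum of (bar height * bar width)
area = float(np.sum(heights * np.diff(edges)))
print(round(area, 6))  # 1.0
```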

Basic pie chart

import matplotlib.pyplot as plt
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'Frogs', 'Hogs', 'Dogs', 'Logs'
sizes = [15, 30, 45, 10]
explode = (0, 0.1, 0, 0)  # only "explode" the 2nd slice (i.e. 'Hogs')
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
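The autopct argument is an ordinary printf-style format string applied to each slice's percentage of the total. The sketch below reproduces the labels by hand for the same sizes, as a way to read the '%1.1f%%' pattern. The variable names total and texts are illustrative.

```python
sizes = [15, 30, 45, 10]
labels = ["Frogs", "Hogs", "Dogs", "Logs"]

total = sum(sizes)
# Same format string as autopct: one decimal place plus a literal % sign
texts = ["%1.1f%%" % (100.0 * s / total) for s in sizes]

for label, text in zip(labels, texts):
    print(label, text)
# Frogs 15.0%
# Hogs 30.0%
# Dogs 45.0%
# Logs 10.0%
```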

Fill plot demo

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 1, 500)
y = np.sin(4 * np.pi * x) * np.exp(-5 * x)
fig, ax = plt.subplots()
ax.fill(x, y, zorder=10)
ax.grid(True, zorder=5)
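# second figure: two overlapping, semi-transparent fills
x = np.linspace(0, 2 * np.pi, 500)
y1 = np.sin(x)
y2 = np.sin(3 * x)
# the original snippet is cut off here; the lines below are an assumed completion
fig, ax = plt.subplots()
ax.fill(x, y1, 'b', x, y2, 'r', alpha=0.3)
plt.show()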

Log Demo

import numpy as np
import matplotlib.pyplot as plt
# Data for plotting
t = np.arange(0.01, 20.0, 0.01)
# Create figure
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
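# log y axis
ax1.semilogy(t, np.exp(-t / 5.0))
ax1.set(title='semilogy')
ax1.grid()
# log x axis
ax2.semilogx(t, np.sin(2 * np.pi * t))
ax2.set(title='semilogx')
ax2.grid()
# log x and y axis, with base 2 on the x axis
# (the basex keyword is no longer accepted by recent Matplotlib)
ax3.loglog(t, 20 * np.exp(-t / 10.0))
ax3.set_xscale('log', base=2)
# ax4 stays empty: the original snippet is cut off here
plt.show()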

Legend using pre-defined labels

import numpy as np
import matplotlib.pyplot as plt
# Make some fake data.
a = b = np.arange(0, 3, .02)
c = np.exp(a)
d = c[::-1]
# Create plots with pre-defined labels.
fig, ax = plt.subplots()
ax.plot(a, c, 'k--', label='Model length')
ax.plot(a, d, 'k:', label='Data length')
ax.plot(a, c + d, 'k', label='Total message length')
legend = ax.legend(loc='upper center', shadow=True, fontsize='x-large')
# Put a nicer background color on the legend.
legend.get_frame().set_facecolor('#00FFCC')
plt.show()

XKCD-style sketch plots

import matplotlib.pyplot as plt
import numpy as np
with plt.xkcd():
    # Based on "Stove Ownership" from XKCD by Randall Munroe
    # http://xkcd.com/418/
    fig = plt.figure()
    ax = fig.add_axes((0.1, 0.2, 0.8, 0.7))
    ax.spines['right'].set_color('none')
    ax.spines['top'].set_color('none')
    plt.xticks([])
    plt.yticks([])
    ax.set_ylim([-30, 10])
    data = np.ones(100)
    data[70:] -= np.arange(30)
    plt.annotate(
        'THE DAY I REALIZED\nI COULD COOK BACON\nWHENEVER I WANTED',
        xy=(70, 1), arrowprops=dict(arrowstyle='->'), xytext=(15, -10))
    # assumed completion: the original stops before the data is drawn
    plt.plot(data)
plt.show()

Pandas (lecture 14, 10 minutes to pandas)

MongoDB

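NoSQL is an umbrella term for database management systems that differ from traditional relational databases. The two differ in many significant ways, the most important being that NoSQL systems do not use SQL as their query language. Their data storage does not require a fixed table schema, they usually avoid SQL JOIN operations, and they generally scale horizontally.
NoSQL architectures typically offer weak consistency guarantees, such as eventual consistency, or transactions limited to a single data item.
• Graph stores: Neo4j, ArangoDB
• Key-value stores: Redis, Memcached, Berkeley DB
• Column stores: HBase, Cassandra
• Document stores: MongoDB, CouchDB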

Installation

• Debian/Ubuntu: sudo apt-get install mongodb
• Windows: download the installer from https://www.mongodb.com/download-center and click Next through the wizard, or download the zip archive and extract it
• OSX: brew install mongodb

Starting the server

• Debian/Ubuntu: service mongodb start
• Windows: mongod --dbpath path\to\dbpath
• OSX: mongod --dbpath path/to/dbpath

Options

--dbpath arg   directory for datafiles - defaults to /data/db
--port arg     specify port number - 27017 by default
--replSet arg  arg is <setname>
--configsvr    declare this is a config db of a cluster;
               default port 27019; default dir /data/configdb
--journal      enable journaling
--nojournal    disable journaling (journaling is on by default for 64 bit)
--logpath arg  log file to send write to instead of stdout - has to be a file, not
               directory

MongoDB data structures

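JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is based on a subset of ECMAScript (the JavaScript standard published by Ecma International) and uses a text format that is completely independent of any programming language to store and represent data. Its concise, clear hierarchical structure makes JSON an ideal data-interchange language: it is easy for people to read and write, easy for machines to parse and generate, and efficient to send over the network.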
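As a small illustration of JSON being language-independent text, the sketch below converts a Python dict to a JSON string and back with the standard json module. The sample record and its courses field are made up for this example.

```python
import json

# Hypothetical record; nested lists and dicts map directly onto JSON
student = {"id": 123, "name": "zhangsan", "age": 19, "courses": ["oss", "python"]}

text = json.dumps(student)   # Python object -> JSON text
restored = json.loads(text)  # JSON text -> Python object

print(type(text).__name__)  # str
print(restored == student)  # True
```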

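BSON
BSON (Binary JSON) is a binary serialized document format. It represents data as name-value pairs, much like the fields of a C struct, and supports embedded documents and arrays. It is lightweight, traversable, and efficient, and can describe both unstructured and structured data.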
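To make the name-value layout concrete, the sketch below hand-encodes a flat document following the public BSON layout: a little-endian int32 total size, then one typed element per field, then a terminating zero byte. It handles only strings and 32-bit integers and is a teaching aid, not a replacement for the bson package that ships with pymongo. The function names are made up.

```python
import struct


def encode_element(name, value):
    """Encode one name-value pair (only str and 32-bit int are handled)."""
    cname = name.encode("utf-8") + b"\x00"  # field name as a C string
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # type 0x02, name, int32 byte length (incl. trailing NUL), bytes
        return b"\x02" + cname + struct.pack("<i", len(data)) + data
    if isinstance(value, int):
        # type 0x10, name, little-endian int32
        return b"\x10" + cname + struct.pack("<i", value)
    raise TypeError("unsupported type in this sketch: %r" % type(value))


def encode_document(doc):
    """Encode a flat dict: int32 total size, elements, terminating NUL."""
    body = b"".join(encode_element(k, v) for k, v in doc.items())
    size = 4 + len(body) + 1
    return struct.pack("<i", size) + body + b"\x00"


encoded = encode_document({"hello": "world"})
print(encoded)
# b'\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00'
```

The length prefixes are what make the format traversable: a reader can skip a whole string or document without scanning its contents.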

CRUD operations

CRUD (1): Create
#choose db
>use test
#insert json object into collection
>db.students.insert({"id": 123, "name": "zhangsan", "age": 19})
WriteResult({ "nInserted" : 1 })

CRUD (2): Retrieve
#choose db
>use test
> db.students.find({"id": 123})
{ "_id" : ObjectId("..."), "id" : 123, "name" : "zhangsan", "age" : 19 }

CRUD (3): Update
> db.students.update({"id":123}, {"$inc": {"age":1}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.students.find()
{ "_id" : ObjectId("..."), "id" : 123, "name" : "zhangsan", "age" : 20 }

CRUD (4): Delete
> db.students.deleteMany({"id":123})
{ "acknowledged" : true, "deletedCount" : 1 }

Installing pymongo

• via pip
$ pip install pymongo
• via distro (e.g., apt)
$ sudo apt-get install python3-pymongo

Using the pymongo API

from pymongo import MongoClient

# connect to the default host and port (localhost:27017)
client = MongoClient()
# Get the sampleDB database
db = client.sampleDB
# equivalently, use db = client['sampleDB']
coll = db.sampleCollection
# equivalently, use coll = db['sampleCollection']

# CRUD: Create
coll.insert_one({"id": 123, "name": "zhangsan", "age": 18})
coll.insert_one({"id": 124, "name": "lisi", "age": 17})

# CRUD: Retrieve
for entry in coll.find():
    print("id: %d, name: %s, age: %d" %
          (entry['id'], entry['name'], entry['age']))

# CRUD: Update
# update() and remove() were removed in PyMongo 4; use the *_one/*_many methods
# (the counts below assume the collection was empty before the inserts)
result = coll.update_one({"age": {"$lt": 20}}, {"$inc": {"age": 1}})
print(result.modified_count)  # 1: only the first match is updated
result = coll.update_many({"age": {"$lt": 20}}, {"$inc": {"age": 1}})
print(result.modified_count)  # 2: every match is updated

# CRUD: Delete
result = coll.delete_many({"age": {"$gt": 10}})
print(result.deleted_count)  # 2
coll.insert_one({"id": 123, "name": "zhangsan", "age": 18})
coll.insert_one({"id": 124, "name": "lisi", "age": 17})
result = coll.delete_one({"age": {"$gt": 10}})
print(result.deleted_count)  # 1
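The filter and update documents used above ($lt, $gt, $inc) can be hard to read at first. The toy model below applies the same operators to a plain list of dicts, with no MongoDB server involved. It only mimics the semantics for these three operators and says nothing about how MongoDB implements them. The function names matches and update_many are made up for this sketch.

```python
def matches(doc, query):
    """Return True if doc satisfies an equality / $lt / $gt filter."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):
            if "$lt" in cond and not (value is not None and value < cond["$lt"]):
                return False
            if "$gt" in cond and not (value is not None and value > cond["$gt"]):
                return False
        elif value != cond:
            return False
    return True


def update_many(docs, query, update):
    """Apply a $inc update to every matching doc; return how many changed."""
    modified = 0
    for doc in docs:
        if matches(doc, query):
            for field, amount in update.get("$inc", {}).items():
                doc[field] = doc.get(field, 0) + amount
            modified += 1
    return modified


students = [
    {"id": 123, "name": "zhangsan", "age": 18},
    {"id": 124, "name": "lisi", "age": 17},
]
count = update_many(students, {"age": {"$lt": 20}}, {"$inc": {"age": 1}})
ages = [s["age"] for s in students]
older = [s["name"] for s in students if matches(s, {"age": {"$gt": 18}})]
print(count)  # 2
print(ages)   # [19, 18]
print(older)  # ['zhangsan']
```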

Author: 独归远. Licensed under CC BY-NC-SA 3.0 (Attribution-NonCommercial-ShareAlike).
Title: 开源基础复习整理 (Open Source Fundamentals Review Notes)
Link: https://blog.syzhou.site/2018/2018.10.24开源基础复习整理/
