Regex and multiple pipes on a JSON file
I have the following command to grab a JSON file on UNIX:
wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json
which (obviously with different results each time) gives me output in the following format:
{
"kind": "...",
"data": {
"modhash": "",
"whitelist_status": "...",
"children": [
e1,
e2,
e3,
...
],
"after": "...",
"before": "..."
}
}
where each element of the children array is an object structured as follows:
{
"kind": "...",
"data": {
...
}
}
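Here is a complete example of the output from a .json get (the body is too long to post directly): https://pastebin.com/20p4kk3u
I need to print the complete data object present in each element of the children array. I know I need to pipe at least twice, first to get children [...] and then data {...}, and this is what I have so far: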
wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)'
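I'm new to regex, so I'm not sure how to handle the brackets or braces inside the elements I'm grepping. The line above prints nothing, and I don't know why. Any help is appreciated.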
Code
wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)'
Some notes about regex
* == zero or more times
+ == one or more times
? == zero or one time (placed after another quantifier, as in .+?, it makes that quantifier lazy, i.e. non-greedy)
\s == a whitespace character: space, tab, carriage return, newline, vertical tab or form feed
\w == a word character: a letter from A to Z (upper or lower case), a digit from 0 to 9, or an underscore (_)
\d == a digit from 0 to 9
\r == carriage return
\n == newline character (line feed)
\ == escapes a special character so it is read as a literal character
[...] == a character class. Example: [abc] matches a or b or c
(?=...) == a positive lookahead, a type of zero-width assertion. It says that the match must be followed by whatever is within the parentheses, but that part is not included in the match.
\K == discards everything matched so far, so the reported match starts at this position.
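To make \K and the lookahead a bit more concrete, here is a small sketch on a made-up input line. The variable names (line, words) and the input text are only for illustration, and it assumes a GNU grep that supports -P.

```shell
# Made-up one-line input, only for illustration.
line='name: [alpha] name: [beta]'

# \K throws away "name: [" from the reported match;
# (?=\]) requires a ] right after the word but leaves it out of the match.
words=$(printf '%s\n' "$line" | grep -oP 'name:\s*\[\K\w+(?=\])')

# One match per line: alpha, then beta.
printf '%s\n' "$words"
```

Without \K the output would also contain the "name: [" prefix, and without the lookahead a word not followed by ] would be accepted too.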
Anyway, you can read more about regex here: Regex Tutorial
Now I can try to explain the code
wget downloads the source.
tr removes all line feeds and carriage returns, so the whole output is on one line and grep can handle it.
The grep -o option prints only the matching part of the line.
The grep -P option enables Perl-compatible regular expressions.
So here
grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])'
we have said:
match the line starting from "children"
zero or more whitespace characters
:
zero or more whitespace characters
\[ escaped, so it is a literal character and not a special one
zero or more whitespace characters
\K forces the submatch to start from here
( open submatch
{.+?} everything in braces (the braces are included because they come after the \K sign. See greedy vs. non-greedy in the regex tutorial to understand how .+? works)
) close submatch
(?=\s*\]) stop the submatch where zero or more whitespace characters followed by a literal ] are found, without including them in the submatch.
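To watch the two grep stages work without hitting the network, here is a minimal sketch that runs the same two patterns on a small hand-written sample. The sample JSON and the variable names (sample, stage1, stage2) are made up for illustration and are far simpler than the real Reddit response; it assumes a GNU grep that supports -P.

```shell
# Hypothetical, heavily simplified stand-in for the Reddit response.
sample='{"kind":"Listing","data":{"modhash":"","children":[{"kind":"t3","data":{"id":"a"}},{"kind":"t3","data":{"id":"b"}}],"after":null}}'

# Stage 1: keep what sits between "children": [ and the closing ].
stage1=$(printf '%s\n' "$sample" | tr -d '\r\n' |
  grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])')

# Stage 2: from that, keep each "data" object that is followed by "},".
stage2=$(printf '%s\n' "$stage1" |
  grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)')

printf 'stage 1: %s\n' "$stage1"
printf 'stage 2: %s\n' "$stage2"
```

On this sample, stage 2 reports only {"id":"a"}: the last element is followed by } and the end of the line rather than by },, so the lookahead rejects it. Lazy matching also cannot keep track of nesting, so a data object that itself contains nested objects can be cut short.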
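Thanks for the detailed explanation, it was very helpful. A follow-up question: what would be the difference if I used egrep instead of the Perl regex syntax?
Take a look here: https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions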
If you want to get the children array, try this, though I'm not sure it is what you are looking for.
wget -O - https://www.reddit.com/r/NetflixBestOf/.json | sed -n '/children/,/],/p'
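As a rough illustration of what that sed address range does, here is a sketch on a made-up, pretty-printed sample. The sample text and the variable names (sample, range) are assumptions for the example only.

```shell
# Hypothetical pretty-printed sample; the real response has many more fields.
sample='{
  "data": {
    "children": [
      {"kind": "t3"},
      {"kind": "t3"}
    ],
    "after": null
  }
}'

# Print from the first line matching "children" through the next line matching "],".
range=$(printf '%s\n' "$sample" | sed -n '/children/,/],/p')

printf '%s\n' "$range"
```

sed works line by line, so this only isolates the array when the JSON is spread over several lines. If the server returns everything on a single line, the range is that whole line.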
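Are you open to using third-party utilities? I usually use the jq binary to parse JSON data easily. For your requirement, you just pass the JSON data to jq, which has its own query language: cat /tmp/data | jq '.data.children | .[]' (here /tmp/data contains the full JSON). With utilities like this you can get the job done with shorter queries and advanced features (raw output, queries, etc.). – akskap
Well, getting the data isn't the only end goal; this time it happens to be in .json format, but I'd like to know how to handle any file with regex.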