Regex with multiple pipes on a JSON file

Problem description:

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json

This command (obviously with different results on each run) gives me output in the following format:

{ 
"kind": "...", 
"data": { 
"modhash": "", 
"whitelist_status": "...", 
"children": [ 
e1, 
e2, 
e3, 
... 
], 
"after": "...", 
"before": "..." 
} 
}

where each element of the children array is an object structured as follows:

{ 
"kind": "...", 
"data": { 
... 
} 
}

Here is a complete sample of the .json response (the body is too long to post directly): https://pastebin.com/20p4kk3u

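I need to print the complete data object of every element in the children array. I know I need to pipe at least twice, first to get to children [...] and then to data {...}. This is what I have so far: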

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)'

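I'm new to regular expressions, so I'm not sure how to handle the brackets or braces inside the elements I'm grepping. The line above prints nothing, and I don't know why. Any help is appreciated.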

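Are you open to using third-party utilities? I usually use the jq binary to parse JSON data easily. For your requirement, you only need to pass the JSON data to jq, which has its own query language: cat /tmp/data | jq '.data.children | .[]' (here /tmp/data contains the full JSON). With utilities like this you can get the job done with shorter queries and advanced features (raw output, queries, etc.). – akskap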

Well, getting the data is not the only goal; this time it happens to be .json format, but I would like to know how to handle any file with regular expressions.

Answer

Code

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)'

A few notes on regular expressions

* == zero or more times
+ == one or more times
? == zero or one time (placed after another quantifier, as in .+?, it makes that quantifier non-greedy)
\s == a whitespace character: space, tab, carriage return, new line, vertical tab, or form feed
\w == a word character: A to Z (upper or lower case), 0 to 9, or underscore (_)
\d == a digit from 0 to 9
\r == carriage return
\n == new line character (line feed)
\ == escapes a special character so it is read as a normal character
[...] == character class. Example: [abc] matches a or b or c
(?=) == a positive lookahead, a type of zero-width assertion. It says that the match must be followed by whatever is within the parentheses, but that part is not included in the match.
\K == resets the start of the reported match to this position; anything matched before it is left out of the output.
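
To make \K and the lookahead more concrete, here is a minimal sketch that runs the same kind of pattern with and without them. The sample string is made up for illustration (it is not real Reddit output), and the sketch assumes a grep built with -P support (GNU grep).

```shell
# Hypothetical sample text, not real Reddit output.
sample='"children": [ {"a": 1} ]'

# Without \K, the reported match still includes the "children": [ prefix.
with_prefix=$(printf '%s\n' "$sample" | grep -oP '"children"\s*:\s*\[\s*\{.+?\}')

# With \K, the prefix must still match but is dropped from the output.
# The lookahead (?=\s*\]) requires a following ] without consuming it.
without_prefix=$(printf '%s\n' "$sample" | grep -oP '"children"\s*:\s*\[\s*\K\{.+?\}(?=\s*\])')

printf '%s\n' "$with_prefix"
printf '%s\n' "$without_prefix"
```

The first line printed keeps the key and the opening bracket; the second is only the braces and their content.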

In any case, you can read more about regular expressions here: Regex Tutorial

Now I can try to explain the code

wget downloads the source.
tr removes all line feeds and carriage returns, so the whole output is on one line and can be handled by grep.
grep's -o option prints only the matching part of the line.
grep's -P option enables Perl-compatible regular expressions.

So here
grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])'
we have said:
match the line starting from "children"
zero or more whitespace characters
:
zero or more whitespace characters
\[ escaped, so it is a literal character and not a special one
zero or more whitespace characters
\K force the reported match to start from here
( open the submatch
{.+?} everything inside braces (the braces are included because they come after the \K marker. See greedy versus non-greedy matching in the regex tutorial to understand how .+? works)
) close the submatch
(?=\s*\]) stop the match where zero or more whitespace characters followed by a literal ] are found, without including them in the match.
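
The two grep stages can be tried offline on a small input. The sketch below uses a hypothetical, heavily simplified stand-in for the Reddit response (the real one has many more fields and nested objects) and again assumes GNU grep with -P.

```shell
# Hypothetical, heavily simplified stand-in for the Reddit response.
sample='{
"kind": "Listing",
"data": {
"children": [
{"kind": "t3", "data": {"id": "a1", "title": "First"}},
{"kind": "t3", "data": {"id": "b2", "title": "Second"}}
],
"after": null
}
}'

# Stage 1: flatten to one line, then keep what sits between "children": [ and ].
children=$(printf '%s\n' "$sample" | tr -d '\r\n' \
  | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])')

# Stage 2: pull out each "data" object that is followed by "},".
data=$(printf '%s\n' "$children" | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)')

printf '%s\n' "$children"
printf '%s\n' "$data"
```

On this sample, stage 2 prints only the first data object: the last element is not followed by `},`, so the lookahead never succeeds for it. Nested braces or brackets inside a data object can cut a match short in a similar way, which is worth keeping in mind when pointing these patterns at real output.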

Thanks for the detailed explanation, very helpful. A follow-up question: what would be the difference if I used egrep instead of the Perl regex syntax?

Take a look here: https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions
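
As a rough illustration of the practical difference: extended regular expressions (egrep, or grep -E) have no \K, no lookahead, and no non-greedy quantifiers, so the usual workaround is to match more than needed and trim afterwards. The sketch below is only one way to do that; the sample line is made up, and the use of sed for trimming is an assumption, not something taken from the answer.

```shell
# Hypothetical sample line.
line='{"id": "a1", "title": "First"}'

# PCRE (-P): \K drops the key from the reported match.
pcre=$(printf '%s\n' "$line" | grep -oP '"id"\s*:\s*"\K[^"]*')

# ERE (-E): no \K or lookaround, so match the key and value together,
# then strip the key and the closing quote with sed.
ere=$(printf '%s\n' "$line" \
  | grep -oE '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
  | sed -E 's/^"id"[[:space:]]*:[[:space:]]*"//; s/"$//')

printf '%s\n' "$pcre"
printf '%s\n' "$ere"
```

Both variables end up holding the same value; the ERE version just needs an extra step to remove what \K would have hidden.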

Answer

If you want to get the children array, try this, but I'm not sure it is what you are looking for.

wget -O - https://www.reddit.com/r/NetflixBestOf/.json | sed -n '/children/,/],/p'
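
A small offline sketch of how that sed range behaves, using a made-up multi-line sample rather than the real response. sed works line by line, so the range only isolates the array when the JSON is spread over several lines; if the response arrives as a single line, the range selects that whole line.

```shell
# Hypothetical multi-line sample.
sample='{
"data": {
"children": [
{"kind": "t3"},
{"kind": "t3"}
],
"after": null
}
}'

# Print from the first line containing "children" through the next line containing "],".
range=$(printf '%s\n' "$sample" | sed -n '/children/,/],/p')

printf '%s\n' "$range"
```

Here the output runs from the `"children": [` line through the `],` line, four lines in total.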
