Parsing strings: extracting words and phrases [JavaScript]
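I need to support exact phrases (enclosed in quotes) in an otherwise space-separated list of terms.
Thus splitting the respective string by the space character is no longer sufficient.
Example: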
| input : 'foo bar "lorem ipsum" baz'
output: ['foo', 'bar', 'lorem ipsum', 'baz'] |
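I was wondering whether this could be achieved with a single RegEx, rather than performing complex parsing or split-and-rejoin operations.
Any help would be greatly appreciated!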
| var str = 'foo bar "lorem ipsum" baz';
var results = str.match(/("[^"]+"|[^"\s]+)/g); |
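... returns the array you're looking for.
Note, however:
- Bounding quotes are included, so they can be removed with replace(/^"([^"]+)"$/,"$1") on the results.
- Spaces between the quotes will stay intact. So, if there are three spaces between lorem and ipsum, they'll be in the result. You can fix this by running replace(/\s+/g," ") on the results.
- If there's no closing " after ipsum (i.e. an incorrectly quoted phrase), you'll end up with: ['foo', 'bar', 'lorem', 'ipsum', 'baz']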
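The caveats above can be folded into one helper. This is only a minimal sketch: the function name `tokenize` is an assumption for illustration and not part of the answer. It runs the same regex, then strips the bounding quotes and collapses runs of whitespace inside each phrase.

```javascript
// Hypothetical helper: the name `tokenize` is an assumption, not part of the answer.
function tokenize(str) {
  // Same pattern as above: a quoted phrase, or a run of non-quote, non-space characters.
  // match() returns null when nothing matches, hence the fallback to [].
  var matches = str.match(/("[^"]+"|[^"\s]+)/g) || [];
  return matches.map(function (term) {
    return term
      .replace(/^"([^"]+)"$/, "$1") // drop the bounding quotes, if present
      .replace(/\s+/g, " ");        // collapse each run of whitespace to one space
  });
}

console.log(tokenize('foo bar "lorem   ipsum" baz'));
// → [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
```

The `g` flag on the whitespace replacement matters: without it only the first run of spaces in a phrase would be collapsed.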
Try this:
| var input = 'foo bar "lorem ipsum" baz';
var R = /(\w|\s)*\w(?=")|\w+/g;
var output = input.match(R);
// output is ["foo", "bar", "lorem ipsum", "baz"] |
Note that there are no extra double quotes around lorem ipsum.
It does assume, though, that the input has its double quotes in the right place:
| var input2 = 'foo bar lorem ipsum" baz'; var output2 = input2.match(R);
var input3 = 'foo bar "lorem ipsum baz'; var output3 = input3.match(R);
// output2 is ["foo bar lorem ipsum", "baz"]
// output3 is ["foo", "bar", "lorem", "ipsum", "baz"] |
And it won't handle escaped double quotes (is that a problem?):
| var input4 = 'foo b"ar bar" "bar"lorem ipsum" baz';
var output4 = input4.match(R);
// output4 is ["foo b", "ar bar", "bar", "lorem ipsum", "baz"] |
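If escaped quotes do matter, one option is to let the quoted alternative accept backslash escapes. The following is a sketch under assumptions, not part of the answer above: the pattern and the name `tokenizeEscaped` are illustrative, and it assumes quotes inside a phrase are escaped with a backslash.

```javascript
// Hypothetical helper: name and pattern are assumptions for illustration.
function tokenizeEscaped(input) {
  // Alternative 1: a quoted phrase whose body may contain backslash escapes.
  // Alternative 2: a run of characters that are neither whitespace nor quotes.
  var re = /"((?:\\.|[^"\\])*)"|([^\s"]+)/g;
  var out = [];
  var m;
  while ((m = re.exec(input)) !== null) {
    var term = m[1] !== undefined ? m[1] : m[2];
    out.push(term.replace(/\\(.)/g, "$1")); // unescape, e.g. \" becomes "
  }
  return out;
}

// In a JS string literal, \\ stands for a single backslash.
console.log(tokenizeEscaped('foo "lorem \\"ipsum\\" dolor" baz'));
// → [ 'foo', 'lorem "ipsum" dolor', 'baz' ]
```

Capturing the phrase body in a group also means the bounding quotes never reach the result, so no separate stripping pass is needed.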
Thanks a lot for the quick responses!
Here's a summary of the options, for posterity:
| var input = 'foo bar "lorem ipsum" baz';
output = input.match(/("[^"]+"|[^"\s]+)/g);
output = input.match(/"[^"]*"|\w+/g);
output = input.match(/("[^"]*")|([^\s"]+)/g);
// Note: exec() returns only one match per call, even with the g flag.
output = /(".+?"|\w+)/g.exec(input);
output = /"(.+?)"|(\w+)/g.exec(input); |
For the record, here's the abomination I had come up with:
| var input = 'foo bar "lorem ipsum" "dolor sit amet" baz';
var terms = input.split(" ");
var items = [];
var buffer = [];
for (var i = 0; i < terms.length; i++) {
    if (terms[i].indexOf('"') != -1) { // outer phrase fragment -- N.B.: assumes quote is either first or last character
        if (buffer.length === 0) { // beginning of phrase
            //console.log("start:", terms[i]);
            buffer.push(terms[i].substr(1));
        } else { // end of phrase
            //console.log("end:", terms[i]);
            buffer.push(terms[i].substr(0, terms[i].length - 1));
            items.push(buffer.join(" "));
            buffer = [];
        }
    } else if (buffer.length != 0) { // inner phrase fragment
        //console.log("cont'd:", terms[i]);
        buffer.push(terms[i]);
    } else { // individual term
        //console.log("standalone:", terms[i]);
        items.push(terms[i]);
    }
    //console.log(items, "\n", buffer);
}
items = items.concat(buffer);
//console.log(items); |
How about,
| output = /(".+?"|\w+)/g.exec(input) |
Then do a pass on the output to lose the quotes.
Alternately,
| output = /"(.+?)"|(\w+)/g.exec(input) |
Then do a pass on the output to lose the empty captures.
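Note that a single exec call returns only one match, so collecting every term takes a loop. A rough sketch, assuming the hypothetical name `collectTerms`: it uses the second pattern and keeps whichever capture group took part in the match, which drops both the quotes and the empty captures in one go.

```javascript
// Hypothetical helper: the name `collectTerms` is an assumption for illustration.
function collectTerms(input) {
  var re = /"(.+?)"|(\w+)/g; // group 1: quoted phrase body, group 2: bare word
  var terms = [];
  var m;
  // With the g flag, each exec() call resumes from re.lastIndex.
  while ((m = re.exec(input)) !== null) {
    terms.push(m[1] !== undefined ? m[1] : m[2]);
  }
  return terms;
}

console.log(collectTerms('foo bar "lorem ipsum" baz'));
// → [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
```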
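ES6 solution supporting:
- Splitting by space, except inside quotes
- Removing quotes, but not backslash-escaped quotes
- Turning escaped quotes into plain quotes
Code: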
| input.match(/\\?.|^$/g).reduce((p, c) => {
    if (c === '"') {
        p.quote ^= 1;
    } else if (!p.quote && c === ' ') {
        p.a.push('');
    } else {
        p.a[p.a.length - 1] += c.replace(/\\(.)/, "$1");
    }
    return p;
}, {a: ['']}).a |
Output:
| [ 'foo', 'bar', 'lorem ipsum', 'baz' ] |
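To see the escaped-quote handling at work, the reducer can be wrapped in a function and fed an input that contains backslashes. This is a sketch: the wrapper name `splitTerms` is an assumption for illustration, while the body is the code above with comments added.

```javascript
// Hypothetical wrapper: the name `splitTerms` is an assumption; the logic is the reducer above.
function splitTerms(input) {
  // The regex yields one token per character, or per backslash-plus-character pair.
  return input.match(/\\?.|^$/g).reduce((p, c) => {
    if (c === '"') {
      p.quote ^= 1;                 // toggle the "inside quotes" flag
    } else if (!p.quote && c === ' ') {
      p.a.push('');                 // an unquoted space starts a new term
    } else {
      p.a[p.a.length - 1] += c.replace(/\\(.)/, "$1"); // append, dropping the escape backslash
    }
    return p;
  }, { a: [''] }).a;
}

// In a JS string literal, \\ stands for a single backslash.
console.log(splitTerms('foo "lorem \\"ipsum\\"" baz'));
// → [ 'foo', 'lorem "ipsum"', 'baz' ]
```

Because an escaped quote arrives as a two-character token, it never equals the bare `"` and therefore does not toggle the flag.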
This might be a very late answer, but I was interested in answering it.
http://regex101.com/r/dZ1vT6/72
Pure JavaScript example:
| 'The rain in "SPAIN stays" mainly in the plain'.match(/[\w]+|"[\w\s]+"/g) |
Output:
| ["The", "rain", "in", "\"SPAIN stays\"", "mainly", "in", "the", "plain"] |
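A simple, easy-to-understand solution. It works for any delimiter and any "join" character, and it also supports "joined" phrases that are more than two words long, e.g.
"hello my name is 'jon delaware smith fred' I have a 'long name'" ....
A bit like the answer by AC, but a bit neater...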
| function split(input, delimiter, joiner) {
    var output = [];
    var joint = [];
    input.split(delimiter).forEach(function (element) {
        // element closes a joined phrase that is currently open
        if (joint.length > 0 && element.indexOf(joiner) === element.length - 1) {
            output.push(joint.join(delimiter) + delimiter + element);
            joint = [];
        }
        // element opens or continues a joined phrase
        if (joint.length > 0 || element.indexOf(joiner) === 0) {
            joint.push(element);
        }
        // standalone element
        if (joint.length === 0 && element.indexOf(joiner) !== element.length - 1) {
            output.push(element);
            joint = [];
        }
    });
    return output;
} |
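A usage sketch for the function above. The function is restated here (same logic, slightly condensed) only so that the snippet runs on its own. The helper `stripJoiner` is a hypothetical addition, since the function as written keeps the join characters in its output; it assumes a single-character joiner.

```javascript
// Restated from the answer above so this snippet is self-contained.
function split(input, delimiter, joiner) {
  var output = [];
  var joint = [];
  input.split(delimiter).forEach(function (element) {
    var endsJoined = element.indexOf(joiner) === element.length - 1;
    if (joint.length > 0 && endsJoined) {
      output.push(joint.join(delimiter) + delimiter + element); // close the phrase
      joint = [];
    }
    if (joint.length > 0 || element.indexOf(joiner) === 0) {
      joint.push(element); // open or continue a phrase
    }
    if (joint.length === 0 && !endsJoined) {
      output.push(element); // standalone element
    }
  });
  return output;
}

// Hypothetical helper (an assumption, not in the answer): drop the bounding join characters.
function stripJoiner(term, joiner) {
  var wrapped = term.length >= 2 && term[0] === joiner && term[term.length - 1] === joiner;
  return wrapped ? term.slice(1, -1) : term;
}

var parts = split("hello my name is 'jon delaware smith fred' I have a 'long name'", " ", "'");
// parts is ["hello", "my", "name", "is", "'jon delaware smith fred'", "I", "have", "a", "'long name'"]
var clean = parts.map(function (t) { return stripJoiner(t, "'"); });
// clean is ["hello", "my", "name", "is", "jon delaware smith fred", "I", "have", "a", "long name"]
console.log(parts, clean);
```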
A simple regular expression will do, but it leaves the quotation marks in. For example:
| 'foo bar "lorem ipsum" baz'.match(/("[^"]*")|([^\s"]+)/g)
output: ['foo', 'bar', '"lorem ipsum"', 'baz'] |
Edit: beaten to it by shyamsundar, sorry for the double answer.
| 'foo bar "lorem ipsum" baz'.match(/"[^"]*"|\w+/g); |
Bounding quotes get included, though.