Language-agnostic: Split a string ignoring quoted sections

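Given a string like this:

a,"string, with",various,"values, and some",quoted

What is a good algorithm to split this on commas while ignoring the commas inside the quoted sections?

The output should be an array:

["a", "string, with", "various", "values, and some", "quoted"]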

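Looks like you've got some good answers here.

For those of you looking to handle your own CSV file parsing, heed the advice from the experts and don't roll your own CSV parser.

Your first thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

It's a road to madness. Don't write your own. Find a library with extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.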

Python:

import csv

# newline="" lets the csv module handle line endings itself
with open("some.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
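
If the data is already in a string rather than a file, the same module can be pointed at it. A minimal sketch, assuming Python 3; `csv.reader` accepts any iterable of lines, so wrapping the string in a one-element list is enough. The second input is a made-up example of doubled quotes inside a quoted field:

```python
import csv

line = 'a,"string, with",various,"values, and some",quoted'

# csv.reader takes any iterable of strings, so wrap the single line in a list
fields = next(csv.reader([line]))
print(fields)

# Doubled quotes inside a quoted field are unescaped by the default dialect
escaped = next(csv.reader(['x,"say ""hi"", ok",y']))
print(escaped)
```

This gives the array asked for in the question without any hand-written quote handling.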

Of course using a CSV parser is better, but just for the fun of it you could:

Loop on the string letter by letter.
    If current_letter == quote:
        toggle inside_quote variable.
    Else if (current_letter == comma and not inside_quote):
        push current_word into array and clear current_word.
    Else
        append the current_letter to current_word
When the loop is done push the current_word into array

Here is a simple Python implementation based on Pat's pseudocode:

def splitIgnoringSingleQuote(string, split_char, remove_quotes=False):
    string_split = []
    current_word = ""
    inside_quote = False
    for letter in string:
        if letter == "'":
            # keep the quote character unless the caller asked to drop it
            if not remove_quotes:
                current_word += letter
            inside_quote = not inside_quote
        elif letter == split_char and not inside_quote:
            string_split.append(current_word)
            current_word = ""
        else:
            current_word += letter
    string_split.append(current_word)
    return string_split

What if an odd number of quotes appear in the original string?
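
One possible answer is to notice when the loop finishes while still inside a quote and let the caller decide what to do about it. A sketch in Python, using double quotes as in the question; the function name and the `strict` flag are made up for illustration:

```python
def split_outside_quotes(text, sep=",", quote='"', strict=True):
    """Split on sep outside quotes; quote characters are dropped from the output."""
    fields, current, inside = [], [], False
    for ch in text:
        if ch == quote:
            inside = not inside
        elif ch == sep and not inside:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    # An odd number of quotes leaves us "inside" once the input runs out
    if inside and strict:
        raise ValueError("odd number of quote characters")
    fields.append("".join(current))
    return fields
```

With `strict=False` the text after the unmatched quote simply becomes the last field, which may or may not be what you want.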

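This looks uncannily like CSV parsing, which has some peculiarities in how it handles quoted fields. A field is only treated as quoted if it is delimited by double quotes, so: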

field1,"field2, field3", field4,"field5, field6" field7

becomes

field1

field2, field3

field4

"field5

field6" field7

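Notice that if a field does not both start and end with a quotation mark, it is not a quoted field, and the double quotes are simply treated as double quotes.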
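
A rough sketch of that rule in Python (the function name is hypothetical, and escaped quotes are not handled): a quote only opens a quoted field when it is the first character of the field and its closing quote is immediately followed by a delimiter or the end of the line. Unlike the listing above, this sketch leaves whitespace around unquoted fields untouched.

```python
def split_csv_like(line, delim=","):
    fields = []
    i, n = 0, len(line)
    while i <= n:
        if i < n and line[i] == '"':
            close = line.find('"', i + 1)
            # Quoted field only if the closing quote ends the field
            if close != -1 and (close + 1 == n or line[close + 1] == delim):
                fields.append(line[i + 1:close])
                i = close + 2  # skip the closing quote and the delimiter
                continue
        # Otherwise read up to the next delimiter; quotes are literal text
        end = line.find(delim, i)
        if end == -1:
            end = n
        fields.append(line[i:end])
        i = end + 1
    return fields

result = split_csv_like('field1,"field2, field3", field4,"field5, field6" field7')
print(result)
```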

Incidentally, if I recall correctly, my code that someone linked to doesn't actually handle this correctly.

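If my language of choice didn't offer a way to do this without thinking, I would initially consider two options as the easy way out:

Pre-parse and replace the commas inside the quoted sections with another control character, split on the remaining commas, then post-parse the array to turn the control character back into commas.

Alternatively, split on the commas, then post-parse the resulting array into another array, checking each entry for a leading quote and concatenating entries until a terminating quote is reached.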
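
The second option could look something like this in Python. It is only a sketch under simple assumptions (no escaped quotes, quotes only at the very start and end of a field), and the function name is invented for the example:

```python
def split_then_rejoin(line, delim=","):
    fields = []
    pending = None  # pieces of a quoted field collected so far
    for piece in line.split(delim):
        if pending is not None:
            # Still inside a quoted field: glue the piece back on
            pending += delim + piece
            if piece.endswith('"'):
                fields.append(pending[1:-1])
                pending = None
        elif piece.startswith('"'):
            if len(piece) > 1 and piece.endswith('"'):
                fields.append(piece[1:-1])  # quoted field without commas
            else:
                pending = piece  # opening quote seen, wait for the closing one
        else:
            fields.append(piece)
    if pending is not None:
        raise ValueError("unterminated quoted field")
    return fields

print(split_then_rejoin('a,"string, with",various,"values, and some",quoted'))
```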

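These are hacks, however, and if this is a purely "mental" exercise then I suspect they will prove unhelpful. If this is a real-world problem, it would help to know the language so that we could offer some specific advice.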

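The author here has dropped in a blob of C# code that handles the scenario you're having a problem with:

CSV File Imports in .Net

Shouldn't be too difficult to translate.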

Since you said language-agnostic, I wrote my algorithm in the language that is as close to pseudocode as possible:

def find_character_indices(s, ch):
    return [i for i, ltr in enumerate(s) if ltr == ch]

def split_text_preserving_quotes(content, include_quotes=False):
    # Note: unquoted text is split on whitespace, not on commas
    quote_indices = find_character_indices(content, '"')
    if not quote_indices:  # nothing is quoted
        return content.split()

    output = content[:quote_indices[0]].split()

    for i in range(1, len(quote_indices)):
        if i % 2 == 1:  # end of quoted sequence
            start = quote_indices[i - 1]
            end = quote_indices[i] + 1
            quoted = content[start:end]
            # drop the surrounding quotes unless the caller wants them
            output.append(quoted if include_quotes else quoted[1:-1])

        else:  # unquoted text between two quoted sequences
            start = quote_indices[i - 1] + 1
            end = quote_indices[i]
            split_section = content[start:end].split()
            output.extend(split_section)

    output += content[quote_indices[-1] + 1:].split()

    return output

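I just couldn't resist seeing if I could make it work as a Python one-liner (with the input in str_to_test):

import re
arr = [i.replace("|", ",") for i in re.sub(r'"([^"]*),([^"]*)"', r"\g<1>|\g<2>", str_to_test).split(",")]

Returns ['a', 'string, with', 'various', 'values, and some', 'quoted']

It works by first replacing the ',' inside quotes with another separator (|), splitting the string on ',' and then turning the | separator back into ','.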
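
The same placeholder idea can be stretched to quoted sections that contain more than one comma by letting `re.sub` call a function for each quoted section. This is a sketch rather than a drop-in replacement: the function name is made up, and it assumes the placeholder character (here `\x00`) never occurs in the input and that quotes are balanced and unescaped.

```python
import re

def split_with_placeholder(s, placeholder="\x00"):
    # Swap every comma inside a quoted section for the placeholder,
    # dropping the surrounding quotes at the same time
    protected = re.sub(
        r'"([^"]*)"',
        lambda m: m.group(1).replace(",", placeholder),
        s,
    )
    # The only commas left are real separators; restore the hidden ones after splitting
    return [field.replace(placeholder, ",") for field in protected.split(",")]

print(split_with_placeholder('a,"string, with",various,"values, and some",quoted'))
```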

Here's one in pseudocode (a.k.a. Python) in one pass :-P

def parsecsv(instr):
    i = 0
    j = 0

    outstrs = []

    # i is fixed until a match occurs, then it advances
    # up to j. j inches forward each time through:

    while i < len(instr):

        if j < len(instr) and instr[j] == '"':
            # skip the opening quote...
            j += 1
            # then iterate until we find a closing quote.
            while j < len(instr) and instr[j] != '"':
                j += 1
            if j == len(instr):
                raise Exception("Unmatched double quote at end of input.")

        if j == len(instr) or instr[j] == ',':
            s = instr[i:j]  # get the substring we've found
            s = s.strip()   # remove extra whitespace

            # remove surrounding quotes if they're there
            if len(s) >= 2 and s[0] == '"' and s[-1] == '"':
                s = s[1:-1]

            # add it to the result
            outstrs.append(s)

            # skip over the comma, move i up (to where
            # j will be at the end of the iteration)
            i = j + 1

        j = j + 1

    return outstrs

def testcase(instr, expected):
    outstr = parsecsv(instr)
    print(outstr)
    assert expected == outstr

# Doesn't handle things like '1, 2,"a, b, c" d, 2' or
# escaped quotes, but those can be added pretty easily.

testcase('a, b,"1, 2, 3", c', ['a', 'b', '1, 2, 3', 'c'])
testcase('a,b,"1, 2, 3" , c', ['a', 'b', '1, 2, 3', 'c'])

# odd number of quotes gives an "unmatched quote" exception
#testcase('a,b,"1, 2, 3" ,"c', ['a', 'b', '1, 2, 3', 'c'])
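
As for the escaped quotes mentioned in the comments, one way they might be added is sketched below. It assumes the CSV convention that a doubled quote (`""`) inside a quoted field stands for one literal quote; the function name is hypothetical and this is a separate small state machine, not a patch to the code above.

```python
def split_quoted(line, delim=",", quote='"'):
    fields, current, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    current.append(quote)  # doubled quote: keep one literal quote
                    i += 1                 # and skip the second one
                else:
                    in_quotes = False      # closing quote
            else:
                current.append(ch)
        elif ch == quote:
            in_quotes = True               # opening quote
        elif ch == delim:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unmatched quote at end of input")
    fields.append("".join(current))
    return fields

print(split_quoted('a,"say ""hi"", ok",b'))
```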

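This is a standard CSV-style parse. A lot of people try to do this with regular expressions. You can get about 90% of the way there with regexes, but you really need a real CSV parser to do it properly. A few months ago I found a fast, excellent C# CSV parser on CodeProject that I highly recommend!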
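
To give an idea of what the regex route tends to look like, here is a sketch in Python (the function name is made up). It splits on every comma that is followed by an even number of quote characters, which is a common trick but not a real CSV parser:

```python
import re

# A comma counts as a separator only if the rest of the string
# contains an even number of double quotes
SEPARATOR = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def regex_split(s):
    parts = SEPARATOR.split(s)
    # Strip one pair of surrounding quotes where present
    return [p[1:-1] if len(p) >= 2 and p[0] == p[-1] == '"' else p for p in parts]

print(regex_split('a,"string, with",various,"values, and some",quoted'))
```

It handles the string from the question, but with an unbalanced quote it quietly splits in the wrong places instead of reporting an error, and escaped quotes are not unescaped. That is roughly the missing 10%.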

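Here is a simple algorithm:

1. Determine whether the string begins with a '"' character.
2. Split the string into an array delimited by the '"' character, dropping any empty items.
3. Mark the quoted commas with a placeholder, #COMMA#:
   - If the input starts with a '"', mark those items in the array where index % 2 == 0.
   - Otherwise, mark those items in the array where index % 2 == 1.
4. Concatenate the items in the array to form a modified input string.
5. Split the string into an array delimited by the ',' character.
6. Replace all instances of the #COMMA# placeholder in the array with the ',' character.
7. The array is your output.

Here is the Python implementation:
(fixed to handle '"a,b",c,"d,e,f,h","i,j,k"')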

def parse_input(text):
    # once empty items are dropped, quoted sections sit at even indices
    # when the input starts with a quote and at odd indices otherwise
    quote_mod = int(not text.startswith('"'))

    parts = [item for item in text.split('"') if item != '']
    for i in range(len(parts)):
        if i % 2 == quote_mod:
            parts[i] = parts[i].replace(",", "#COMMA#")

    # empty fields are dropped here as well
    fields = [item for item in "".join(parts).split(",") if item != '']
    return [item.replace("#COMMA#", ",") for item in fields]

# parse_input('a,"string, with",various,"values, and some",quoted')
# -> ['a', 'string, with', 'various', 'values, and some', 'quoted']
# parse_input('"a,b",c,"d,e,f,h","i,j,k"')
# -> ['a,b', 'c', 'd,e,f,h', 'i,j,k']

I use this to parse strings. Not sure if it helps here, but perhaps with some minor modifications?

function getstringbetween($string, $start, $end) {
    // strpos() returns false when $start is missing; position 0 is a valid match
    $ini = strpos($string, $start);
    if ($ini === false) return "";
    $ini += strlen($start);
    // give up if the closing marker never appears
    $endpos = strpos($string, $end, $ini);
    if ($endpos === false) return "";
    return substr($string, $ini, $endpos - $ini);
}

$fullstring = "this is my [tag]dog[/tag]";
$parsed = getstringbetween($fullstring, "[tag]", "[/tag]");

echo $parsed; // (result = dog)
