[boost-users] tokenizer vs string algorithm split.

Discussion:

chun ping wang

2007-12-13 03:56:58 UTC

Hi, I was wondering which approach is better and faster for splitting a file
of comma-separated numbers and storing them in a container of doubles.
1.) Which option is better?
// method 1.
// flist is presumably a std::string holding the CSV text; operator+= on a
// std::vector needs Boost.Assign (<boost/assign/std/vector.hpp> and
// using namespace boost::assign).
std::vector<std::string> split_string;
boost::algorithm::trim(flist);
boost::algorithm::split(split_string, flist,
                        boost::algorithm::is_any_of(","));
std::vector<double> elements;
BOOST_FOREACH(std::string s, split_string)
{
    elements += boost::lexical_cast<double>(s);
}

// method 2.
boost::char_separator<char> sep(",");
boost::tokenizer<boost::char_separator<char> > tokens(flist,
                                                      sep);
std::vector<double> elements;
BOOST_FOREACH(std::string token, tokens)
{
    elements += boost::lexical_cast<double>(token);
}

2.) When is it better to use the string algorithm split instead of tokenizer,
and vice versa?

Larry

2007-12-13 14:07:51 UTC

My limited experience is that tokenizer is faster. I have tried it several times in different schemes but the tokenizer always seems to come out faster by more than a little. I would prefer the split() scheme but I haven't found the way to make it go faster.

Larry

Bill Buklis

2007-12-13 17:48:42 UTC

This may not matter for the CSV file you're parsing, but at least for a more
general solution for CSV processing, you'd also have to handle fields that
are surrounded by quotes and may even contain embedded commas. I don't know
if split or tokenizer can handle that.

-- Bill --

Edward Diener

2007-12-15 14:44:52 UTC

Post by Bill Buklis
This may not matter for the CSV file you’re parsing, but at least for a
more general solution for CSV processing, you’d also have to handle
fields that are surrounded by quotes and may even contain embedded
commas. I don’t know if split or tokenizer can handle that.

Tokenizer's escaped_list_separator handles quotes and embedded commas
properly.

Larry

2007-12-15 18:32:46 UTC

If your CSV has empty fields (e.g., data,data,,data.....), the only way I
found to handle them was to handle the separators yourself with the
tokenizer; otherwise the tokenizer skips the field (a la strtok()).

For CSVs I tried Spirit and came up with a scheme (with lots of help, I would
add) that seemed to work. Not many lines of code, but it took more time than
I was interested in spending to figure it out.

Larry

Christian Henning

2007-12-15 18:38:05 UTC

Hi Larry, can you share the code which can handle empty fields?

Thanks,
Christian

_______________________________________________
Boost-users mailing list
http://lists.boost.org/mailman/listinfo.cgi/boost-users

Larry

2007-12-15 22:23:23 UTC

This was more of a brute-force approach that I used when I first started using
Boost a few years ago. There are probably better and/or more efficient
ways to do it, but it was sufficient for what I was doing.

//-----------------------------------------------------------------
// Using tokenizer

using namespace boost;

// char_separator is the separator that takes kept delimiters and
// keep_empty_tokens; escaped_list_separator has no such constructor.
typedef char_separator<char> CharTokens;
typedef tokenizer<CharTokens> EscapedTokenizer;
typedef tokenizer<CharTokens>::iterator EscapedIterator;

CharTokens cs(",", ",", boost::keep_empty_tokens); // "," is returned as a token
std::string str; // This has CSV input line
int field_number = 0;
EscapedIterator eti;

EscapedTokenizer et(str, cs);

for (eti = et.begin(); eti != et.end(); ++eti) {
    if (*eti == ",") { // See if this is a separator
        field_number++;
    } else {
        // *eti is a value, which could be an empty field
        // field_number is the field in the list
    }
}

//-----------------------------------------------------------------
// Using Spirit
//
// Result is a vector of items much like split() - including empty strings
// in the vector for empty fields
//
// Probably could be used with any<>

using namespace boost::spirit;

char *plist_csv = new char[4096]; // presumably filled with the CSV line

rule<> list_csv, list_csv_item;
std::vector<std::string> vec_item, vec_list;
parse_info<> result;

list_csv_item =
    confix_p('\"', *c_escape_ch_p, '\"')
    | longest_d[real_p | int_p | *(alnum_p | ch_p('_'))]
    ;

list_csv =
    list_p(
        (!list_csv_item)[append(vec_item)],
        ',')[append(vec_list)]
    ;

result = parse(plist_csv, list_csv);

if (result.hit) { // Got at least part
    if (result.full) {
        // All present
    }
}

Christian Henning

2007-12-15 23:16:07 UTC

Thanks Larry.

Pavol Droba

2007-12-13 20:14:00 UTC

Hi,

I didn't make any speed comparison between split and tokenizer, but
there are ways for significant speed improvements when using split
algorithm.

Most speed problems result from unwanted copying of strings. This is
quite a costly operation, and it should be avoided at all costs if speed
is important.

First, there is an obvious problem in your code. In BOOST_FOREACH, you
are missing a reference in the string parameter. This means, that every
string will be copied in the loop.

You can improve the actual usage of the split algorithm as well.
A quite significant speedup can be achieved if you use
std::vector<boost::iterator_range<std::string::iterator> > to hold the
results instead of a vector of strings.
This way the split algorithm will only store references to tokens in the
original string, avoiding any copying until it is really needed.

Going one step further, you can avoid using the intermediate vector at all.
You can use split_iterator directly.

split_iterator<string::iterator>
    siter = make_split_iterator(
        flist,
        token_finder(is_any_of(","), token_compress_off));
BOOST_FOREACH(
    iterator_range<string::iterator> rngToken,
    make_iterator_range(siter, split_iterator<string::iterator>()))
{
    // Do whatever you want with token here.
    // It is represented by an iterator_range so no copying
    // has been done yet.

    // You can make a copy if necessary
    string strToken = copy_range<string>(rngToken);
}

Best Regards,
Pavol.

chun ping wang

2007-12-14 01:41:00 UTC

Sorry, I'm a bit confused about how your last example helps me store the
values in an STL container of doubles.

thanks.

Pavol Droba

2007-12-15 08:55:32 UTC

Hi,

How you store the values in the STL container is specific to your
implementation; my example did not cover that.

A simple implementation can look like this:

elements.push_back( lexical_cast<double>(rngToken) );

Regards,
Pavol.
