Discussion:
regular expression to extract numbers
Neil Sutton
2012-12-03 00:10:11 UTC
Hi,
I am new to Boost and am trying to learn the basics so I can use its regex
features in my code.
In my test program, I want a function to simply extract some numbers that
follow after a certain string pattern and then to read those numbers into a
vector.

This is an outline of what I would like to do:

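// function definition
#ifndef FINDNUMBERS_H
#define FINDNUMBERS_H

#include <string>
#include <fstream>
#include <iterator>
#include <vector>
#include <boost/regex.hpp>

void findnumbers (std::ifstream& afile)
{
// assume file is opened in main - i have passed a reference
boost::regex expression ( ....having trouble with this);
// a stream has no begin( )/end( ), so read the file into a string first
std::string contents((std::istreambuf_iterator<char>(afile)),
                     std::istreambuf_iterator<char>());
// search for matches - there will be a fixed number of numbers e.g 10
std::string::const_iterator start = contents.begin( );
std::string::const_iterator end = contents.end( );
std::vector<int> results;
boost::match_results<std::string::const_iterator> what;
for(int count = 0; count !=10; count++)
{
if(boost::regex_search(start,end,what,expression))
{
//if a number is found place in the vector
//(assumes the expression captures the digits in group 1)
results.push_back(std::stoi(what[1].str()));
//continue searching after this match
start = what[0].second;
}
}
}

#endif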

So the first problem I have is defining a correct expression. The pattern
has the format "sometext>NUMBERS<sometext" newline. The NUMBERS may be one
or two digits combined and the pattern is repeated 10 times. I have
searched the archives and tried to build something based on the examples I
have read, but cannot seem to find a working solution. Can anyone help
please in defining an expression?

I do not need a solution to my function overall as I should be able to
write it correctly once I know how to define the pattern.

Kind regards
John Maddock
2012-12-03 09:23:13 UTC
Post by Neil Sutton
So the first problem I have is defining a correct expression. The pattern
has the format "sometext>NUMBERS<sometext" newline. The NUMBERS may be one
or two digits combined and the pattern is repeated 10 times. I have
searched the archives and tried to build something based on the examples I
have read, but cannot seem to find a working solution. Can anyone help
please in defining an expression?
What's wrong with just:

"sometext\\>(\\d{1,2})\\<sometext"

then use a regex_iterator to iterate over all occurrences, extracting $1 from
each match.
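
A minimal sketch of the regex_iterator approach follows. It is written against std::regex (C++11) so that it compiles without Boost; the boost::regex, boost::sregex_iterator and boost::smatch names have the same shape and should drop in. The helper name find_numbers and the sample markup are assumptions for illustration, and the "sometext" parts of the pattern are omitted. As far as I know, Boost.Regex's default Perl syntax treats \< and \> as word-boundary assertions rather than literal brackets, so the sketch leaves > and < unescaped.

```cpp
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read the whole stream, then collect the number
// captured by each match of ">(one or two digits)<".
std::vector<int> find_numbers(std::istream& in)
{
    // A stream has no begin()/end(), so slurp it into a string first.
    std::ostringstream buffer;
    buffer << in.rdbuf();
    const std::string text = buffer.str();

    // The angle brackets are literal characters and need no escaping.
    static const std::regex pattern(">(\\d{1,2})<");

    std::vector<int> numbers;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern); it != end; ++it)
        numbers.push_back(std::stoi((*it)[1].str()));  // (*it)[1] is $1
    return numbers;
}
```

Because the function takes a std::istream&, an already opened std::ifstream can be passed to it directly.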

Even better use regex_token_iterator to spit out each number directly - take
a look at the last example on the bottom of this page:
http://www.boost.org/doc/libs/1_52_0/libs/regex/doc/html/boost_regex/ref/regex_token_iterator.html
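
A rough sketch of the regex_token_iterator variant, again using std::regex as a stand-in for boost::regex so that it builds without Boost (boost::sregex_token_iterator takes the same arguments). The function name extract_numbers and the simplified pattern are assumptions; the final constructor argument selects which submatch the iterator yields.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every one- or two-digit number that sits between '>' and '<'.
std::vector<int> extract_numbers(const std::string& text)
{
    static const std::regex pattern(">(\\d{1,2})<");

    std::vector<int> numbers;
    // Submatch index 1 makes the iterator yield capture group 1 only,
    // so each dereference is the digits themselves.
    std::sregex_token_iterator it(text.begin(), text.end(), pattern, 1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it)
        numbers.push_back(std::stoi(it->str()));
    return numbers;
}
```

Compared with regex_iterator, there is no match object to index: the iterator hands back the requested submatch directly.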

HTH, John.
Neil Sutton
2013-01-29 19:24:36 UTC
I am writing a very simple program that extracts numbers from a string. The
numbers are actually lottery numbers.
So far, my program connects to a certain url and downloads a file that
contains the latest lottery results. I have managed to reach the point
where the barest amount of relevant data is contained in a std::string.

The data is in the following format - though of course the date and numbers
vary:

26-Jan-2013,2,6,21,29,34,47,11,X,X

Note I am not interested - at this stage - in the last two numbers
represented by X,X. I am only interested in the first seven numbers
following the date.

So I figured that it should be easy to write a regular expression to match
this pattern:

boost::regex pattern("\d\d\d\d,\\>(\\d{1,2})\\<,");

I have written a function called getnumbers, declared as
void getnumbers(std::string data);
This function should eventually be passed the string (in the format above)
and extract the first seven numbers after the four-digit year in my
pattern. It is not complete yet, as I am stuck. I asked for help before and
it was suggested that I use a sregex_token_iterator, so the code below is
just an experiment with it, based on what I found in the documentation.

--------- function definition ---------
#include <string>
#include <iostream>
#include <boost/regex.hpp>

using namespace std;

void getnumbers(string data)
{
boost::regex pattern("\d\d\d\d,\\>(\\d{1,2})\\<,");
boost::sregex_token_iterator i(data.begin(), data.end(), pattern, -1); // not sure what -1 does?
boost::sregex_token_iterator j;
unsigned count = 0;
while(i != j)
{
cout << *i++ << endl;
count++;
}
cout << "There were " << count << " tokens found." << endl;
return;
}

If passed the data string in the format above, count should be 9 -
representing 9 integers following the date.

How can I even get this to compile? I get 4 warnings about the pattern
("unknown escape sequence '\d'") and finally an error that says the linker
failed with exit code 1.
Also, what are the correct header guards to include in this function source
file?
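
For reference, a hedged sketch of how getnumbers might look once these points are addressed. In a C++ string literal a regex \d has to be written "\\d", which is what the "unknown escape sequence" warning is about. A submatch index of -1 makes the token iterator yield the text between matches (field splitting), whereas 1 yields capture group 1. The sketch uses std::regex so it builds without Boost; with boost::regex the code has the same shape, and a linker failure there commonly means the Boost.Regex library was not linked (for example -lboost_regex with GCC), though that is a guess since the linker output is not shown. Include guards belong in a header rather than a source file, and the macro name must be a plain identifier such as GETNUMBERS_H (a hypothetical name). The return type, the wanted parameter and the pattern are all assumptions.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Sketch: return up to `wanted` one- or two-digit numbers that each follow
// a comma and are themselves followed by a comma or the end of the string.
std::vector<int> getnumbers(const std::string& data, std::size_t wanted = 7)
{
    // "\\d" in the source becomes \d in the regex; a bare "\d" is what
    // triggers the "unknown escape sequence" warning.
    static const std::regex pattern(",(\\d{1,2})(?=,|$)");

    std::vector<int> numbers;
    // 1 selects capture group 1; -1 would yield the text between matches.
    std::sregex_token_iterator it(data.begin(), data.end(), pattern, 1);
    const std::sregex_token_iterator end;
    for (; it != end && numbers.size() < wanted; ++it)
        numbers.push_back(std::stoi(it->str()));
    return numbers;
}
```

The date field is skipped because it is not preceded by a comma, and the loop stops after seven numbers so the trailing two fields are ignored.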

Regards
Ted Byers
2013-01-29 19:57:38 UTC
Post by Neil Sutton
I am writing a very simple program that extracts numbers from a string.
The numbers are actually lottery numbers.
So far, my program connects to a certain url and downloads a file that
contains the latest lottery results. I have managed to reach the point
where the barest amount of relevant data is contained in a std::string.
The data is in the following format - though of course the date and numbers vary:
26-Jan-2013,2,6,21,29,34,47,11,X,X
Note I am not interested - at this stage - in the last two numbers
represented by X,X. I am only interested in the first seven numbers
following the date.
So I figured that it should be easy to write a regular expression to match this pattern:
boost::regex pattern("\d\d\d\d,\\>(\\d{1,2})\\<,");
I do not know regex well enough to say whether it can provide the basis
for the 'fastest' implementation. (In some of my experiments I have seen an
order of magnitude difference in performance between the fastest and
slowest algorithms for the same task, subject to the caveat that they all
satisfy the functional requirements correctly.) But if the only
consideration right now is to get it working, why not examine Boost more
thoroughly? It already has a tokenizer (
http://www.boost.org/doc/libs/1_52_0/libs/tokenizer/) that, once you know
how to use it, may eliminate the need for you to roll your own. It also
has a split function in the string algorithms library (
http://www.boost.org/doc/libs/1_52_0/doc/html/string_algo.html). In both
cases, you'd just split your example string on the comma. The first
element so extracted would be your date, and the rest would be your numbers.
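
The split-on-comma idea can be sketched with the standard library alone. This uses std::getline as a stand-in for the Boost facilities, not their API; with Boost's string algorithms the equivalent call would be along the lines of boost::split(fields, line, boost::is_any_of(",")). The name split_fields is an assumption.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one line into fields on a single delimiter character.
std::vector<std::string> split_fields(const std::string& line, char delimiter = ',')
{
    std::vector<std::string> fields;
    std::istringstream stream(line);
    std::string field;
    // std::getline with a delimiter reads up to, and discards, each comma.
    while (std::getline(stream, field, delimiter))
        fields.push_back(field);
    return fields;
}
```

For the sample line, field 0 is the date and fields 1 to 7 are the seven numbers of interest, still as strings.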

HTH

Ted
Ted Byers
2013-01-29 20:10:42 UTC
Post by Neil Sutton
I am writing a very simple program that extracts numbers from a string.
The numbers are actually lottery numbers.
So far, my program connects to a certain url and downloads a file that
contains the latest lottery results. I have managed to reach the point
where the barest amount of relevant data is contained in a std::string.
I almost forgot: over a decade ago I wrote my own string splitter,
similar to:


#include <string>
#include <vector>

void split(const std::string& str, const std::string& delimiters,
           std::vector<std::string>& tokens)
{
    // Skip delimiters at beginning.
    std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter after the start of the token.
    std::string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters. Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the next delimiter.
        pos = str.find_first_of(delimiters, lastPos);
    }
}
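
The tokens such a splitter produces are still strings, so a further step is needed to get integers. A small sketch of that step follows; the helper name to_numbers and its parameters are hypothetical, and it simply skips any token that is not made up entirely of digits.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: convert tokens[first .. first+count) to ints,
// skipping any token that is not entirely decimal digits (such as "X").
std::vector<int> to_numbers(const std::vector<std::string>& tokens,
                            std::size_t first, std::size_t count)
{
    std::vector<int> numbers;
    for (std::size_t i = first; i < tokens.size() && i < first + count; ++i)
    {
        const std::string& token = tokens[i];
        if (!token.empty() &&
            token.find_first_not_of("0123456789") == std::string::npos)
            numbers.push_back(std::atoi(token.c_str()));
    }
    return numbers;
}
```

For the lottery line, splitting on "," and then calling to_numbers(tokens, 1, 7) would skip the date in token 0 and convert the next seven fields.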

If you search, you will find this simple string splitter algorithm has been
posted by several different people. I do not know whether they copied
material others had posted or developed it themselves; the algorithm is so
simple and obvious that it would not surprise me if many who considered the
problem developed it independently. Back when I served as an educator, it
is something I would have assigned a second year programming class to
implement as one of the course's exercises, as a help in understanding the
resources of the STL and how to apply them to a common problem.

Cheers

Ted
