CSE 250: Data Structures in C++

Assignment 2, Due at 11:59pm, Sunday Sep 14

Objectives

Getting slightly more comfortable with C++: make use of vector and set data structures, and learn how to sort elements of a vector.
Get a sense of how different data structure (and algorithm) choices can have a huge effect on the runtime of your program. Later on in the course, you will see that we could have known "in advance" which of the two algorithms (sba and vba below) to choose without even implementing them. We will have much more to say about analyzing data structures and algorithms for problems more complicated than this one.
Get a sense of the asymptotic behavior of your program once the input gets big. This is one of the key points of this course: proper use of data structures and algorithms will make programs much more robust and scalable.
Learn how to work within the constraints of a code base that is already given to you. (In the wild, programmers rarely write code from scratch: we improve, modify, and maintain a code base that other people have written.)

Overview of what to do

You are to write a C++ program that does roughly the following. The program reads a file -- in a specific format to be described below -- that stores the set of edges of a graph. The file might (or might not) store some edges more than once. The task of the program is to count the number of distinct edges. There are many ways to do this task. You will implement two different algorithms for performing the task.

Algorithm 1, called the vector-based algorithm (or vba for short), does the following:
- It reads the set of edges stored in the file into a vector.
- It sorts the vector.
- It then loops through the elements of the vector one by one. If an element is equal to the next element, i.e. a duplicate edge is found, then the element is removed.
- When all elements of the vector have been traversed, it reports the vector's size.
This algorithm uses the built-in sort algorithm and the erase function.

Algorithm 2, called the set-based algorithm (or sba for short), does the following: it reads the edges stored in the file one by one, and for each edge read it inserts the edge into a std::set. Since std::set is a data structure that stores unique elements, duplicated edges will not be inserted. In the end, the algorithm reports the size of the resulting set.

(Can you guess which algorithm is faster for large input graphs?)

Details on what to do

You are to write a C++ program that does the following.

It keeps reading the user's input, line by line. Each line the user types is supposed to be in one of the following three forms:
```
vba filename
sba filename
exit
```
where
- exit tells your program to quit
- vba and sba do the task described above using the aforementioned algorithms vba and sba.
- filename is the name of a file that stores edges of a graph in Stanford SNAP file format. The file looks something like this:
```
# Graph : p2p-Gnutella04.txt 
# Directed Gnutella P2P network from August 4 2002
# Nodes: 6 Edges: 11
# FromNodeId	ToNodeId
1	3
1	4
2	4
3	1
6	2
3	5
3	6
4	2
1	3
4	4
6	2
 
```
  The lines that start with # are comment lines; you will ignore those. The edges are stored using the format a b, where a and b are two integers (two vertices) separated by a tab character. Note that the edge between a and b might occur several times, in the form a b or b a; both forms count as the same edge. In the example file above, the output is 7: even though there are 11 edges stored in the file, the edge (1,3) is stored three times, the edge (2,4) is stored twice, and the edge (2,6) is stored twice.
The code base: to burden you less with parsing command lines, I have written a skeleton of the above program, leaving exactly the two functions vba and sba empty. You are to download the code base by typing:
```
wget http://www.cse.buffalo.edu/~hungngo/classes/2014/Fall/250/assignments/A2.tar
tar -xvf A2.tar
cd A2
```
Please read all the code in the code base, but you may modify only one file, algos.cpp, to implement the two functions that were left empty there. You can compile the program by typing make. The Makefile is already written for you.

The test data: you can download the test data by obtaining real graphs from the Stanford SNAP data set. For example, here are some of the smaller data sets for you to test your implementation on:

wget http://snap.stanford.edu/data/p2p-Gnutella04.txt.gz
gunzip p2p-Gnutella04.txt.gz
wget http://snap.stanford.edu/data/wiki-Vote.txt.gz
gunzip wiki-Vote.txt.gz
wget http://snap.stanford.edu/data/email-EuAll.txt.gz
gunzip email-EuAll.txt.gz

Please feel free to explore other data sets from SNAP. Some of them are very large, which makes it fun to run your program on them and see how long it takes.

My implementation

I have written a program called edgecount following the above specification and compiled it on timberlake. You can download and run it (on timberlake) to see how it works.
```
wget http://www.cse.buffalo.edu/~hungngo/classes/2014/Fall/250/assignments/edgecount
```
If needed, change its permission so that it's executable:
```
chmod 700 edgecount
```

How to submit

Submit only the algos.cpp file. We will put your submission into a directory that has all the other files in the code base and compile using make.

submit_cse250 algos.cpp

Note again that the submission only works if you are logged in to your CSE account and the cpp file is there. Everything else can be done at home, as long as you remember to upload the final file to your CSE account and run the submit script from there.

Grading

You'll get 0 points if the program doesn't compile using /usr/bin/g++ on timberlake. We grade mostly with an automatic script, and due to an extreme lack of personnel we don't have the resources to read partial solutions.
10 points if the exit command works. (It already works in the code base I provided, so these 10 points are free. If you do nothing and just submit the algos.cpp file as is, you'll get 10 points.)
45 points if the vba command works and runs the vba algorithm as described above.
45 points if the sba command works and runs the sba algorithm as described above.

Supporting materials

Converting string to int: to convert a string to an integer in C++ (before C++11), there are two typical ways:

                                              
#include <iostream>
#include <sstream>
#include <string>  // for std::string
#include <cstdlib> // for atoi()

int main()
{
    std::string s = "1234";
    std::string t = "4567";

    int i = atoi(s.c_str());
    std::cout << "i = " << i << std::endl;

    std::istringstream iss(s);
    int j;
    iss >> j;
    std::cout << "j = " << j << std::endl;

    iss.clear(); // clear previous stream
    iss.str(t);  // set t to be characters in the new stream
    int k;
    iss >> k;
    std::cout << "k = " << k << std::endl;
    return 0;
}

Edges as pairs of integers: the best way to store edges of a graph is to treat each edge as a pair of integers. C++ has a pair type (std::pair, declared in <utility>) that you can use. (Here are some examples of generic usage of pair.)

                                              
// from this example, you can see that pairs are compared lexicographically
#include <iostream>
#include <utility> // for std::pair and std::make_pair

int main()
{
    std::pair<int, int> p1; // p1 is a pair of ints
    std::pair<int, int> p2; // p2 is also a pair of ints
    std::pair<int, int> p3; // p3 is also a pair of ints

    p1 = std::make_pair(1, 5);
    p2 = std::make_pair(5, 1);
    p3 = std::make_pair(1, 5);

    std::cout << "p1 " << (p1 == p2? "=" : "not =") << " p2" << std::endl;
    std::cout << "p1 " << (p1 < p2? "<" : "not <") << " p2" << std::endl;
    std::cout << "p1 " << (p1 == p3? "=" : "not =") << " p3" << std::endl;
    std::cout << "p1 " << (p1 < p3? "<" : "not <") << " p3" << std::endl;

    return 0;
}

Inserting elements into a set. set is one of the most straightforward data structures to use.

                                              
#include <iostream>
#include <set>
#include <utility> // for std::pair and std::make_pair

int main()
{
    std::pair<int, int> p1;
    std::pair<int, int> p2;
    std::pair<int, int> p3;
    std::pair<int, int> p4;

    p1 = std::make_pair(1, 5);
    p2 = std::make_pair(5, 1);
    p3 = std::make_pair(1, 5);
    p4 = std::make_pair(2, 3);

    std::set<std::pair<int, int> > edgeSet;
    edgeSet.insert(p1);
    edgeSet.insert(p2);
    edgeSet.insert(p3);
    edgeSet.insert(p4);

    // the following prints "# of inserted edges = 3", do you see why?
    std::cout << "# of inserted edges = " << edgeSet.size() << std::endl;
    
    return 0;
}

Sort, traverse a vector, and the erase function:

                                              
#include <iostream>
#include <vector>
#include <algorithm> // for sort()

using namespace std; // I'm lazy now, so let's get rid of all the std::

void printVector(vector<pair<int, int> > & myVec)
{
    // traverse the vector
    vector<pair<int, int> >::iterator i;
    for (i = myVec.begin(); i != myVec.end(); ++i) {
        cout << "(" << i->first << ", " <<  i->second << ") ";
    }
    cout << endl;
}

int main()
{
    pair<int, int> p1;
    pair<int, int> p2;
    pair<int, int> p3;
    pair<int, int> p4;

    p1 = make_pair(2, 4);
    p2 = make_pair(5, 1);
    p3 = make_pair(2, 4);
    p4 = make_pair(4, 2);

    vector<pair<int, int> > pairVector;
    pairVector.push_back(p1);
    pairVector.push_back(p2);
    pairVector.push_back(p3);
    pairVector.push_back(p4);

    cout << "# of inserted pairs = " << pairVector.size() << endl;
    printVector(pairVector);

    sort(pairVector.begin(), pairVector.end());
    printVector(pairVector);

    // finally, remove the first duplicate pair found
    vector<pair<int, int> >::iterator i = pairVector.begin();
    while (i != pairVector.end()) {
        vector<pair<int, int> >::iterator j = i+1;
        if (j != pairVector.end() && *j == *i) {
            // remove the pair pointed to by i as a duplicate was found
            i = pairVector.erase(i);
            break;
        }
        ++i; // no duplicate here: move on, otherwise the loop never terminates
    }
    printVector(pairVector);
    
    return 0;
}