GIDForums  

Go Back   GIDForums > Computer Programming Forums > C Programming Language
User Name
Password
Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 
 
Thread Tools Search this Thread Rate Thread
  #1  
Old 04-May-2004, 09:18
dsmith's Avatar
dsmith dsmith is offline
Senior Member
 
Join Date: Jan 2004
Location: Utah, USA
Posts: 1,351
dsmith is a glorious beacon of lightdsmith is a glorious beacon of lightdsmith is a glorious beacon of lightdsmith is a glorious beacon of lightdsmith is a glorious beacon of light

[CONTEST?]Data Structure Test


Data Structure Comparison


Overview
I have written a test program to test the viability and performance of different data structures. I have always been intrigued by data structures and a recent post got me to thinking about the performance of distinct type of data structures. I wanted to have some numbers to back statements about different data structures. I have based this upon a quasi-realworld example that actually does something with the numbers. I would like to test several data structures and would like to invite anyone interested in this to submit a data structure to be reviewed.

Real World Example
Let's say that I have gathered a lot of flow data from a drainage stream. This is a long list of unordered data that I want to be able to do some statistics on. The input of this data is to be simulated by the random number generator routine that I have included below. I want to be able to have easy access to the data as well as being completely seperated from the data structure. The required functions are discussed further below. There are four possible sizes of data sets that will be tested. For fairness, these numbers and orders will be identical for each test. There is a small (20,000 numbers), a moderate (100,000 numbers), a large (500,000 numbers), and an extreme (1,000,0000 numbers) test. All structures will be started at the small level and proceed upwards until test results become unfeasible.

Requirements/Goals
  • Generic data type
    This data structure needs to be usable with any type of data. For this reason the entire module should work on with data that is simply pointed to by a (void *).
  • Interface
    The data structure must be interfaced with the test program. As mentioned, the required calls are given below. Even if you have an existing data structure class, you can easily adapt it using inheritance.
  • Hidden structure to user
    Once again, the end-user of the data structure should have no idea of what type of structure is being used. So there can be absolutely no changes to the test program. There are two calls that I have included in the test program that can be used if needed. One is the finalize routine that allows for all data to be inserted into the structure and then sorted. The other is a balance call that will allow for balancing of a sorted list if necessary.
  • Proper results
    Obviously, the fastest data structure in the world isn't worth alot if it is buggy or doesn't produce the proper results. The storage structure must past the basic tests.
  • Speed
    Okay, if the structure meets all of the first rquirements then the speed of the structure will be measeured by the following list of tests.

Summary of Tests
  1. Insert unsorted numbers into a sorted structure.
    This is a list of randomly generated numbers. For consistency, the same list will be used for each test.
  2. Verification that list is sorted properly.
    This is more an accuracy test than speed test, but this simply makes sure that the numbers are properly sorted.
  3. Sorted Listing of numbers.
    Just a quick ouput to the screen. This test is mostly dependant upon the speed of the printf command, but it still assures that the controls are in place to print out the list.
  4. Calculation of Median (100 times)
    The median is the number that is in the exact middle of the data set. This may not a number in the dataset if the dataset count is odd. This is done 100 times in order to get some processing time for the lower count lists.
  5. Calculation of Mean (100 times)
    The mean(average) is the sum of all the numbers, divided by the count of all the numbers. This function performs its own count. On perfectly balanced data sets, the median and the mean are identical. Once again this is done 100 times.
  6. Calculation of Mode (100 times)
    The mode is the most reoccuring number. Datasets can be multi-modal, meaning that more than one number can be the most reoccuring. Once again this is done 100 times.
  7. Calculation of Frequency Distribution (100 times)
    A Frequency distribution is the count of items divided into a determined range of catagories. A graph is created at the end of the run to show this distribution. Again, done 100 times.
  8. Writing sorted list to file (and deletion)
    This is just the writing of the list to a file in a sorted fashion. This does not test anything that hasn't been tested, but it is a setup for the next few tests.
  9. Reading of sorted list
    This is done especially for tree like structures. A lot of times once a list is sorted it may be written out in the sorted form for future use. This makes a list more feasible once the initial large sort is done
  10. Finding of 1000 values + median
    This finds 1000 not so random values in the list. The numbers represent a repeating oscillation from one end of the dataset to the other. Also, the median must be found properly.
  11. Insertion of 1000 Random numbers
    These are truly random numbers that are generated at run time. There may be some variation for identical tests because of the luck of the draw. I have choosen to do this here because of the potential unbalanced nature of a tree-like structure at this point.
  12. Printing list in reverse order
    This is just to be cruel to a singly-linked list. This is a simple output of the numbers in reverse order.

Needed public functions
  • constructor
    All initialization of the data structure should occur here.
  • destructor
    This should do the clean up as well as attempting to delete all data.
  • CPP / C++ / C Code:
    void add_sort(void* new_data, int (*compare)(void*,void*));
    This function should add the data into the list in the proper position. Alternatively, this function can just throw the data into a list (by calling add_data) to be sorted later with a call to finalize. new_data is a memory pointer to the allocated data. compare is a pointer to a user written compare function that will compare the data pointed to by the two parameters. It will return -1 if the first is less than the second, a 1 if the first is greater than the second and 0 if they are equal.
  • CPP / C++ / C Code:
    void find_data(void* search_data, int (*compare)(void*,void*));
    This searches the list for the data pointed to by search_data. If there is not an exact match, the current pointer will be at the data position that is the closest but still less than search_data. compare is identical to that explained in the add_sort function.
  • CPP / C++ / C Code:
    int home();
    This places the "current" pointer to the first (sorted) item of the structure. It returns 0 if the list is empty, otherwise it returns 1.
  • CPP / C++ / C Code:
    void end();
    This places the "current" pointer at the end (sorted) item of the structure.
  • CPP / C++ / C Code:
    void* get_data();
    This returns a pointer to the data at the "current" position.
  • CPP / C++ / C Code:
    int next();
    This goes to the next (sorted) position of the structure. It returns 0 if the list is empty or at the end and returns 1 upon success.
  • CPP / C++ / C Code:
    int prev();
    This goes to the previous (sorted) position of the structure. It returns 0 if the list is empty or at the beginning and returns 1 upon sucess.
  • CPP / C++ / C Code:
    int is_empty();
    This is a simple query that returns a 1 if the structure is empty and a 0 if the structure contains data.
  • CPP / C++ / C Code:
    void* get_atpos( long index );
    This places the current position at the index (0 based) that is pointed to by index
  • CPP / C++ / C Code:
    void add_data(void* new_data);
    This adds data pointed to by new_data at the end of the list.
  • CPP / C++ / C Code:
    long cur_index();
    This returns a long that is the current index (0 based) that is pointed to by the current pointer.
  • CPP / C++ / C Code:
    void finalize(int (*compare)(void*,void*));
    This is only necessary to implement if your data structure sorts the data after being inserted into the list. The paramater compare is explained in the add_sort function.
  • CPP / C++ / C Code:
    void balance( );
    This is only necessary to implement if you want to be able to adjust your structure after reading a known sorted list. This is something you may want to consider with a tree like structure.

Baseline
For a baseline, I have used my singly linked list. This was not originally created for this type of usage, but I thought it would be a good measuring gauge. I had to use inheritance to define a new class based on this list that would "plug in" to the test program. The results and source are given in the next post.

Testing Machine
Dell Inspiron 8500 Laptop
Intel Pentium 4 2.2 Ghz with speedstep
1 Gigabyte of Ram

Platform
Linux: Slackware Linux running 2.6.5 kernel & KDE 3.2.1
Windows: Windows XP Pro
All unnecessary programs stopped on both.

Summary
This is a good exercise to use, just for testing a data structure. I found a single line that I removed in my list implementation that improved the data insertion on a 20000 item structure from 13.38 seconds to 0.02 seconds. I just have never done any extensive testing on it and never realized that I had such a time-consuming implementation.

I really hope that I get some participation on this. If you have an existing data structure algorithm, you should be able to code the "adapter" header file fairly quickly. If not, it is a really good exercise (IMHO) to write and understand these different structures. And they shouldn't take that long. The longest part of doing this whole exercise for me is writing this post.

I have compiled these using gcc under linux and dev-cpp under windows. They should compile under about anything with a few minor tweaks.

Attachments
  • random.c - Random Number Generator Source Code
  • test.cpp - Data Structure Testing Source Code
  • template.h - Header Template File for data structures
Attached Files
File Type: zip test.zip (4.6 KB, 609 views)
Last edited by dsmith : 04-May-2004 at 10:02.
  #2  
Old 04-May-2004, 09:22
dsmith's Avatar
dsmith dsmith is offline
Senior Member
 
Join Date: Jan 2004
Location: Utah, USA
Posts: 1,351
dsmith is a glorious beacon of lightdsmith is a glorious beacon of lightdsmith is a glorious beacon of lightdsmith is a glorious beacon of lightdsmith is a glorious beacon of light
Single Linked List Results
Submitted by dsmith

Okay, here are the results of my single linked list. I only ran this through the small (20,000 numbers) and the moderate (100,000 numbers) cycles. As can be seen, there is a major difference when going up to 100,000. I ran these both in Linux and Windows with similar results. The Windows version edged out the Linux version in both cases. Here are the listings from the Windows versions for the small and moderate size resepectively.

Small Test Printout
Code:
DATA STRUCTURE TEST ------------------- STRUCTURE: Singly Linked List AUTHOR: dsmith CPU CYCLE: 1000 TEST #1 - Insert original data to sorted structure Processing cycles: 4115 Processing time: 4.115 seconds TEST #2 - Verification of proper sort There were 0 mis-sorts Processing cycles: 10 Processing time: 0.010 seconds TEST #3 - Listing of sorted list to stdout Processing cycles: 1372 Processing time: 1.372 seconds TEST #4 - Calculation of Median Value(100 Times) Processing cycles: 181 Processing time: 0.181 seconds TEST #5 - Calculation of Mean Value(100 Times) Processing cycles: 160 Processing time: 0.160 seconds TEST #6 - Calculation of Mode(100 Times) Processing cycles: 411 Processing time: 0.411 seconds TEST #7 - Frequency Distribution(100 Times) Processing cycles: 170 Processing time: 0.170 seconds TEST #8 - Writing sorted list to file (and deletion of all data) Processing cycles: 30 Processing time: 0.030 seconds TEST #9 - Reading of a known sorted list Processing cycles: 60 Processing time: 0.060 seconds TEST #10 - Finding 1000 values + median Processing cycles: 1132 Processing time: 1.132 seconds TEST #11 - Insertion of 1000 Random numbers Processing cycles: 901 Processing time: 0.901 seconds TEST #12 - Printing of numbers in reverse order Processing cycles: 25667 Processing time: 25.667 seconds Entire operation took: 0 minutes, 34.05 seconds


Moderate Test Printout
Code:
DATA STRUCTURE TEST ------------------- STRUCTURE: Singly Linked List AUTHOR: dsmith CPU CYCLE: 1000 TEST #1 - Insert original data to sorted structure Processing cycles: 307592 Processing time: 307.592 seconds TEST #2 - Verification of proper sort There were 0 mis-sorts Processing cycles: 20 Processing time: 0.020 seconds TEST #3 - Listing of sorted list to stdout Processing cycles: 6639 Processing time: 6.639 seconds TEST #4 - Calculation of Median Value(100 Times) Processing cycles: 2514 Processing time: 2.514 seconds TEST #5 - Calculation of Mean Value(100 Times) Processing cycles: 1873 Processing time: 1.873 seconds TEST #6 - Calculation of Mode(100 Times) Processing cycles: 3865 Processing time: 3.865 seconds TEST #7 - Frequency Distribution(100 Times) Processing cycles: 1983 Processing time: 1.983 seconds TEST #8 - Writing sorted list to file (and deletion of all data) Processing cycles: 140 Processing time: 0.140 seconds TEST #9 - Reading of a known sorted list Processing cycles: 311 Processing time: 0.311 seconds TEST #10 - Finding 1000 values + median Processing cycles: 10054 Processing time: 10.054 seconds TEST #11 - Insertion of 1000 Random numbers Processing cycles: 7942 Processing time: 7.942 seconds TEST #12 - Printing of numbers in reverse order Processing cycles: 898762 Processing time: 898.762 seconds Entire operation took: 20 minutes, 39.82 seconds


Two things, just kill the single-linked list. Those are insertion of unsorted numbers into a sorted list and the reverse print-out. This is a true single-linked list, so the implimentation of the call to prev is a killer.

I have made a summary sheet that summarizes all of the tests and results for easy viewing. It is attached below.

My single linked list is done as an in-line class. This is not a requirement for this test.
Attached Files
File Type: zip summary.zip (3.2 KB, 42591 views)
File Type: zip list.zip (4.1 KB, 154 views)
  #3  
Old 06-Jun-2004, 15:13
dsmith's Avatar
dsmith dsmith is offline
Senior Member
 
Join Date: Jan 2004
Location: Utah, USA
Posts: 1,351
dsmith is a glorious beacon of lightdsmith is a glorious beacon of lightdsmith is a glorious beacon of lightdsmith is a glorious beacon of lightdsmith is a glorious beacon of light
Changes
I have made some slight changes to my original test. The changes that I have made are as follows:
  1. I have added a function to the test program that can help in insert sort type structures. This function simply returns a float value indicating where a certain value is located based on two other values. This function is as follows:
    CPP / C++ / C Code:
    float range(void* begin, void* end, void* eval)
    {
    	float	value1;
    	float	value2;
    	
    	value1 = *( (int*) eval ) - *( (int*) begin);
    	value2 = *( (int*) end ) - *( (int*) begin);
    
    	if(value2 == 0)
    		return 0.0;
    	else
    		return (value1/value2);	
    }
    
  2. Also, I have added another test that deletes 1000 random (oscillating-random) values from the structure. I felt that this could be a telling test, as there are some structures that I can think of that may struggle to be efficient at data deletion.

This will cause some changes to the original list of required functions for this test, simply because of the added possible use of the range function. The changed functions are as follows:
  • CPP / C++ / C Code:
    int find_data(void* search_data, int (*compare)(void*,void*), float (*range)(void*,void*,void*))
  • CPP / C++ / C Code:
    void add_sort(void* new_data, int (*compare)(void*,void*), float (*range)(void*,void*,void*))
  • CPP / C++ / C Code:
    void finalize(int (*compare)(void*,void*), float (*range)(void*,void*,void*))

Doubly Linked List Results
I have actually tested two more data structures that I made, although both are similar. The first is a routine doubly linked list. I was hoping that it would yeild some better results than the singly linked list due to its ability to search from the current position either forwards or backwards. Unfortunately it gave very similar results (even worse) in most of the search/sort functions. The only place that it pays real dividends is the descending print-out test for obvious reasons. Other than this one fairly useless test, there is not any advantage to a standard doubly-linked list. However, I tried it with a little twist as detailed below and got more favorable results.

The second data structure is a slightly modified doubly linked list that uses the new range function that I have defined above. In the find function, a single call is made to this range function to determine if the search should be started at the first of the list, the end of the list or the current position in the list. Other than that it is identical. This one call almost halfs the time for the insertion of the numbers. In addition, the finding, inserting and deleting of 1000 numbers are all about a quarter of the time of the first two tests. These are the most time consuming items. While this is outstanding, it still does not qualify this modified doubly linked list to take the large test.

Results
Here is the current total time for the moderate sized list run under Linux:
  • Singly Linked List - 21 Minutes, 21.86 Seconds
  • Doubly Linked List - 6 Minutes, 2.8 Seconds
  • Modified Doubly Linked List - 3 Minutes, 27 Seconds

Graphs
Here are some graphs of the most telling tests:





Future Possible Structures
I am now contemplating a tree structure as well as a dynamic array. The tree structure should work good at insertions, but I am wary about its performance in other tests due to the amount of comparisons that need to be made for trivial tasks (next & prev). The dynamic array is going to be unbeatable in the search test I think, but I am worried about the data insertion and deletion penalties that it will incur.

Attachments
  • test.zip - Contains the new test with the additional function and test as well as the summary file.
  • dlist.zip - The header files for the interface & the structures for both doubly linked lists.
Attached Files
File Type: zip test.zip (10.1 KB, 241 views)
File Type: zip dlist.zip (5.3 KB, 187 views)
Last edited by dsmith : 07-Jun-2004 at 08:46.
 
 

Recent GIDBlogConfigure sshd Securely by gidnetwork

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Rate This Thread
Rate This Thread:

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Repetition Structure ewallo C++ Forum 4 25-Mar-2004 17:46
CSS Layout question oihjk Web Design Forum 3 28-May-2003 11:36
Review/Beta Test Program - Free Ticketing/Project management software phantomlight Websites Reviewed Forum 0 28-Apr-2003 22:52
Could you test the speed of this domain for me jrobbio Websites Reviewed Forum 8 21-Jan-2003 14:24

Network Sites: GIDNetwork · GIDApp · GIDSearch · Learning Journal by J de Silva, The

All times are GMT -6. The time now is 01:55.


vBulletin, Copyright © 2000 - 2013, Jelsoft Enterprises Ltd.