![]() |
|
|||||||
|
|
Thread Tools | Search this Thread | Rate Thread |
|
#1
|
|||
|
|||
malloc/free usage, and general good programming behaviorHi,
First, apologies for a long post. I'm not a very good C programmer, having done it only a bit in the past. But I'm going to have to use it for my new job, and this means I need to make my code somewhat standardized, so others can read it. Hence, in the code I post below, please comment not only on incorrect usages, but also on correct, but poor usages (or, if I did something well, please point that out too). Any general tips or advice is very welcome, no matter how picky (i.e. if you don't think I use curly braces well, let me know). I'd rather learn these things now. About the code I'm posting: it's just a quick routine based on the data that netflix provided for their netflix prize. I wanted to see what kind of result you get if you just provide the average ranking of a movie as the prediction for a users's preference for the movie, as a means to give myself practice working with large datasets. For those of you unfamiliar with the data, it comes in 17,770 files, about 2 GB when uncompressed, and each file corresponds to a movie and each line is a triplet: user, their rating, and the date. I processed the files a little bit in a shell to get rid of the date and change the formatting. With such large amounts of data, I needed to use "short ints" wherever possible so that it fit into my 1 GB of RAM. I use calloc to dynamically allocate the data. First question: when is it ESSENTIAL to use "free" ? Why do you use it with calloc'ed/malloc'ed variables only? (or, is that true?). I didn't use it, and my program looks OK. I check the RAM with a system monitor, and it goes up to about 800 or 900 MB, and then back down to the baseline once I quit the program, so it doesn't seem like there's a memory leak. But when I use "free", I get this error: *** glibc detected *** free(): invalid pointer: 0xbfe914d0 *** Second question: I'd like to get used to using a makeful, but for some reason these things seem like the most complicated electronic entities ever. Can someone post a very simple makefile? I'm using a makefile that I modified from a friend, but I don't understand half of it. Third problem: the code doesn't work. SOMETIMES I get a seg fault, but usually it works. So I must be messing up. Anyone know why? Fourth problem: in my loop (I mention it in the comments) from m = 1 to 17770, it seems like "m" should end at the value 17770, but instead it ends at 17771, and if I watch the loop through a debugger (I used ddd), after m reaches 17770, it goes through just a part of the loop once more. So, I'm messing up again. Anyone see the error? Fifth question: I use this on different computers, and the files are in different locations. What's an easy way to be able to have the program automatically know where to find the data (assuming it's only on a handful of computers, and I can do a "case" or something)? Can you define a string in the preprocessor? If it's useful, I'm using redhat with gcc 3.4.6 and using "-O3" as the optimization level. Ok, thanks to anyone who answers. It's much appreciated. The code is in two files: read_data.c and netflix.h (is this good practice to separate the files? I'm assuming that you might want to expand the code later) Netflix.h CPP / C++ / C Code:
load_data.c CPP / C++ / C Code:
Thanks in advance! |
|||
|
#2
|
|||
|
|||
Re: malloc/free usage, and general good programming behaviorQuote:
CPP / C++ / C Code:
freeing movieUsers and movieRatings should be done in a loop freeing each cell. |
|
#3
|
||||||
|
||||||
Re: malloc/free usage, and general good programming behaviorQuote:
You use free() whenever you are through with the memory block. If you have a function that calls malloc() and the function is called multiple times in your program and you don't free() stuff when you are finished with it, that is a memory leak. When your program terminates, any modern operating system (Linux, Windows XP, etc.) knows what resources were allocated to that process and frees them up for you. It's probably a good idea to get into the habit of calling free before leaving a program for a number of reasons. In particular, the things you do as a standalone program for a "quick routine" may end up being a function of a bigger more meaningful program. The habit of cleaning up after yourself is never wrong, and might save you some grief later on. Also, there are some tools that test programs for memory leaks and they sometimes give messages about memory leaks that are due to leaving the program without deallocating memory. You get into the habit of ignoring the warnings like this, and some day you may find that you have glossed over a real problem due to the clutter of irrelevant warning messages. Try to make things clean and you won't be sorry. Leave them messy, and... Quote:
Variables declared like movieUsers are called "automatic" variables: CPP / C++ / C Code:
Quote:
Code:
When projects have a lot of source files, makefiles are usually created to compile individual source files into object files. A rule is made to combine objects into the target (executable) file. If only one source file is changed, then only that one is recompiled and its new object file is combined with the other, unchanged, ones to make the target. Quote:
My biggest quarrel is that you never test input to see if it's valid. Too many things in your program depend on input being exactly right. Always (yes, always) test for valid input as much as is humanly possible (or programatically possible). Also you have several places where it depends on the very large number of input files. Why not make it so that you can start and test with smaller files and then go for the big stuff? For example CPP / C++ / C Code:
Your kludgy correction at the end points out that it always goes through the loop one time extra. (That's because the eof flag is set only after you have tried to read past the end of the file.) I would recommend something like this: CPP / C++ / C Code:
This gives the correct value for the number of items that can be read correctly from the file. Then later you can do the same thing again, and even test for consistency: CPP / C++ / C Code:
The following is particularly onerous to me, since a program manifest constant, MAXUSERS, depends on your entering its value manually after running an external program. To me, that's just wrong-headed. CPP / C++ / C Code:
I suggest that you find MAXUSERS in the program itself, instead of things like CPP / C++ / C Code:
Always test the return value of scanf to make sure it was able to find what it was expecting. Quote:
How does the loop work? It initializes the value to 1. It sees if the value is less than or equal to 17770. If it isn't then executes the loop and increments the value of m. Then it tests to see if m is less than or equal to 17770. If it isn't, it executes the loop and increments the value of m. It continues this way until m is no longer less than or equal to 17770. At this time (that is when n is equal to 17771) it exits the loop. Since this is a nicely structured loop, there is no other way that it can exit the loop, so what the heck value could it have other than 17771? Quote:
Yes. You can define it in the program (with a "#define" statement) or you can define it on the compiler command line (in the makefile, for example) or, probably the best choice, you can use a command line argument when you invoke the program. Ok, thanks to anyone who answers. It's much appreciated. The code is in two files: read_data.c and netflix.h (is this good practice to separate the files? I'm assuming that you might want to expand the code later). In general, most people put various definitions (typedefs, struct definitions, things with #define) into a common file. Function prototypes and external variable definitions are also typically put into header files. It's good to be thinking about future needs when you start, and you can always change the details as the project grows. Regards, Dave Footnote: It really doesn't make sense to me to create a program that can't be tested with a small input data base. Writing a program, no matter how simple, that only works with 17,700 files is a recipe for madness. Writing a program that requires it to be recompiled when the size of the data base changes also doesn't make any sense at all. These comments are only my opinions, of course, but you did ask. |
|
#4
|
|||
|
|||
Re: malloc/free usage, and general good programming behaviorDave, I did ask, so thanks a lot for the lengthy response. I will take most of your suggestions. When I started the code, I did actually test on fewer than 17770 files -- I changed the NMOVIES (by hand) to something like 20.
One of your suggestions is that I find the value of a variable like "NMOVIES" in the program, and then dynamically allocate. The reason I was hesitant is that for in code dealing with large datasets, it might take a while (everytime I run the program) to find the value of the variable. But you're right, for robustness, I should change this. I did mean "makefile". I think I have a good understanding of their purpose and usefulness, but it's the syntax that confuses me. Here's the one I'm using: Python Code:
|
|
#5
|
|||
|
|||
Re: malloc/free usage, and general good programming behaviorQuote:
There are two issues: 1. How can I create and use a helpful Makefile? Answer: Look for examples 2. How can I understand the examples? Answer: Look at the GNU documentation. Start here: http://www.gnu.org/software/make/. Click here: www.gnu.org/software/make/manual/ Follow links to download or browse. It's really good! Code:
Code:
Code:
Code:
Code:
$(CC) is the value of the CC variable set somewhere (either in this makefile or an environment variable named CC, or for GNU, the default is gcc if it is not set anywhere else. $(INCDIR) is the value of INCDIR set somewhere. $(CFLAGS) is the value of CFLAGS set somewhere. $*.c means c source files. So the rule given by this expression to make "something.o" is gcc -c -g something.c The rule will be invoked whenever "something.c" is changed or whenever one of the dependencies is changed (in this case the dependency is netflix.h). As a side note: I usually turn on compiler warnings so that the compiler can not only check for fatal errors, but can find certain "common programmer missteps" that might signify something amiss. So I would set cflags to -Wall -W -pedantic. the -g means to put debugging information into the object file also. Since you have a target that depents on OBJECTS and OBJECTS is set to read_data.o, then every time that netflix.h is changed or read_data.c is changed, this rule will be executed. . Code:
The first target will be processed is something named "all". But it doesn't create anything named "all", so its rule will be executed each and every time you enter "make" from the command line (assuming that this is called "makefile"). Even if nothing is changed, it will execute this rule (thus kind of missing the point of having a makefile in the first place.). I would expect something like the following, since the target is "netflix" Code:
So "make" will be the same as "make all" from the command line, but this will be the same as if you had entered "make netflix", since netflix is a dependancy. (You don't really need the all:, but I like it as a matter of style. Also, it is possible to have more than one target that you want to make by default, and it's handy to have a default target named all) Now the rule (the thing that links all of the object files together to create the executable) will be executed only of one of the dependencies changes. That is, if you change the makefile iteslf, or if you change netflix.h, this rule will be invoked. If something happend to make any of the OBJECTS change (like a change in its source file), then the OBJECT will be made first and then this rule will be invoked. I'll leave the rest for your document-browsing pleasure. If you have any questions, then ask. Now, there are a few verrrry important things about makefiles: 1. Rules are always introduced with a tab character, not spaces. You can't always tell in your editor, but GNU make will complain if you forget. (vim and gvim users can enter ":set list" and they will see the difference beteween spaces and tabs. Then ":set nolist" returns to the "normal" view). 2. Unlike C and C++ The makefile is processed as a whole, not sequentially. That is, you can define something anywhere in the file and it will be defined everywhere. The only thing that depends on its order of apperance is that the first target will be the default. (That is, just entering "make" on the command line will make the first target that appears in the file.) 3. GNU make has a lot of defaults that may make it unnecessary to have everything that you see in every make file, but these defaults have changed from version to version over the years, and they may change again, so many people put a lot of stuff that you may not really need. It's good to try to understand what is there, and I appreciate your inquisitiveness. 4. If you have a file named "something.c" and you don't have a makefile, then you can enter "make something" and GNU make will execute the following rule: $(CC) $(CFLAGS) something.c -o something If you have environment variables CC and CFLAGS set, it will use their values. If you con't have CC set, it will use the default cc (which is probably a symbolic link to gcc; at least it is on my various linux systems). Note that using the "no makefile" make doesn't allow you to combine more than one source file and it doesn't allow for dependencies on header files, but it's handy. In my .bashrc, I have an alias for setcflags that looks like this: Code:
Then if I first execute "setcflags", I can make C or C++ programs just by typing "make hello.c" or "make hello.cpp" or whatever. Since GNU make, by default, echos the rules it is executing, I can tell what is being done, whether I use a makefile or I use the "no makefile" make. Regards, Dave |
|
#6
|
||||
|
||||
Re: malloc/free usage, and general good programming behaviorSince Dave helped on the code very well, I'll mention a few things about formatting (see this link) Since formatting is a very personal style, take what you like, ignore the rest...
My preference is 4 space indent. I set up my editor to tab 4 spaces and convert all TABs into SPACEs. This makes the code easiest for me to follow, and we keep the multiples of 4 for all indentation. CPP / C++ / C Code:
CPP / C++ / C Code:
[/EDIT] CPP / C++ / C Code:
CPP / C++ / C Code:
CPP / C++ / C Code:
CPP / C++ / C Code:
CPP / C++ / C Code:
CPP / C++ / C Code:
CPP / C++ / C Code:
Plus, you can't hit any key using scanf() -- you must hit ENTER. Hope this helps... __________________
Cow: You're a lawyer too? Mooseblood (mosquito): Ma'am, I was already a bloodsucking parasite. All I needed was a briefcase! |
|
#7
|
|||
|
|||
Re: malloc/free usage, and general good programming behaviorQuote:
|
|
#8
|
||||
|
||||
Re: malloc/free usage, and general good programming behaviorQuote:
__________________
Cow: You're a lawyer too? Mooseblood (mosquito): Ma'am, I was already a bloodsucking parasite. All I needed was a briefcase! |
Recent GIDBlog
Updates On The All New Toyota VIOS - Part III by Nihal
| Thread Tools | Search this Thread |
| Rate This Thread | |
|
|
Network Sites: GIDNetwork · GIDWebHosts · GIDSearch · Learning Journal by J de Silva, The