What's new?
April 20, 2003
- It looks like just about everything is working now, except for what I believe
to be a bug in MySQL. The bug is consistent across version 3.23.56 and 4.0.12.
For some reason, when running a webstone test that takes longer than an hour or so,
as the event table in the database approaches 600 megabytes in size, mysql dies.
The system locks up. When I use gdb to attach to my net_client process, it is blocked
in a "mysql_real_insert" call (and, at the lowest level, in a read system call). I believe
that it is attempting to read data from a pipe to the mysqld daemon, and never
receiving any. Generally after this happens, I have to manually delete the database in
order to make MySQL work properly again - I can't even connect to the database
without the mysql program freezing up on me. Weird.
-
So, with this in mind, I ran some webstone tests that took about 48 minutes. The
results are here.
- I have also generated timing data for kernel compiles. Here are the results.
April 9, 2003
-
I've discovered that the "final bug" I thought I'd fixed yesterday, well,
I didn't actually fix it. And it's also not my bug. There was
a problem with the memory cleanup code - but it wasn't what was causing
my strange lmbench problems. The actual source of the problem was that
lmbench, when running the lat_sig test, calculates two values: "How long
did it take to execute this system call" and "What is the overhead associated
with doing system calls in general". Except that sometimes, if the kernel blocks
the system call, the overhead comes back as bigger than the actual
system call itself. Then you end up with negative values in bad places, and they
don't get checked, and loops go on for Very Long Times. At least, I think that's
what's happening. So: No more lmbench!
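A minimal sketch of that failure mode, assuming the timings are held in unsigned integers. This is not lmbench's code (I have not checked how lmbench stores these values); the function names are hypothetical and only illustrate what can happen when an overhead larger than the measurement is subtracted without a check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: subtract the per-call overhead from a measured
 * time without checking which one is larger. With unsigned integers
 * the result wraps around instead of going negative. */
uint64_t net_cost_unchecked(uint64_t measured_us, uint64_t overhead_us)
{
    return measured_us - overhead_us;
}

/* Hypothetical fix: clamp the result at zero when the overhead
 * estimate exceeds the measurement. */
uint64_t net_cost_clamped(uint64_t measured_us, uint64_t overhead_us)
{
    uint64_t net = measured_us > overhead_us ? measured_us - overhead_us : 0;
    assert(net <= measured_us); /* the net cost can never exceed the raw time */
    return net;
}
```

With a measurement of 5 and an overhead of 8, the unchecked version returns a value close to 2^64, and any loop that counts toward such a value runs for a very long time. The clamped version returns 0 instead.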
April 8, 2003
-
I fixed the final bug that affects operation of the front-end today.
At least, the final one I know about. There was an error in the memory
cleanup code in the module, code that gets executed when the daemon closes
the file, or when the module is unloaded. Anyway, it's fixed now.
-
Ash figured out my compiler problem. Hurray for Ash! It was an order of
library include sort of thing.
-
I'm running a test script right now... Just a bit of a trial run.
We'll see how it ended up in the morning.
April 7, 2003
-
Today I started writing scripts to do testing in the lab. I also installed
vanilla 2.4.18 kernels on two of the machines in there, so that I can use
them with the auditmodule. I am currently fighting with MySQL, trying to
get it installed on one of them. I am very angry with MySQL; I keep
getting linker errors when I try to compile my programs. I think I am
going to give up for tonight and make Ash figure it out tomorrow.
-
I ran some preliminary benchmarking tests on the IBM ThinkPad that I've
been using all term to test the front-end code.
Check out the results. I believe that
the network-enabled results are excessively slow due to a flaky network
card in the ThinkPad. We'll find out for sure when I do the tests in the
lab.
-
I've fixed the problem with the OOM killer. That was just me being dumb.
I was allocating huge chunks of memory and then never freeing them.
I am so smart. S-M-R-T.
April 3, 2003
-
Today I fixed a bug that had been hiding in the auditmodule code, confounding me for
two days. I didn't discover it until I tried to start running lmbench. It seems that when
a whole lot of concurrent events are going on, stuff can get screwed up. The size of the
event that gets stuck in the queue to be read by the userspace daemon is wrong. After much
head-scratching (two days' worth, to be exact), I discovered that it was because there were
some variables in the "audit_event" routine (which gets called by every syscall
hijack routine) that were declared as static, and were being manipulated
outside of the lockdown. Helloooo, race condition! So that's fixed.
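The shape of that bug can be sketched in userspace C, with a pthread mutex standing in for whatever lock the module uses. Everything here is an assumption made for illustration: the function names, the queue, and the size calculation are hypothetical and are not taken from the auditmodule source.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_SLOTS 1024

static size_t queued_sizes[QUEUE_SLOTS]; /* hypothetical event queue */
static size_t queue_len;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Racy version: `size` is static, so every caller shares one copy, and
 * it is written before the lock is taken. A second thread can overwrite
 * it between the assignment and the enqueue. */
size_t record_event_racy(const char *payload)
{
    static size_t size;
    size_t stored;

    size = strlen(payload) + 1;          /* shared write, outside the lock */
    pthread_mutex_lock(&queue_lock);
    assert(queue_len < QUEUE_SLOTS);
    stored = size;                       /* may be another caller's size */
    queued_sizes[queue_len++] = stored;
    pthread_mutex_unlock(&queue_lock);
    return stored;
}

/* Fixed version: `size` is an automatic variable, so each caller has
 * its own copy and only the shared queue is touched under the lock. */
size_t record_event_fixed(const char *payload)
{
    size_t size = strlen(payload) + 1;

    pthread_mutex_lock(&queue_lock);
    assert(queue_len < QUEUE_SLOTS);
    queued_sizes[queue_len++] = size;
    pthread_mutex_unlock(&queue_lock);
    return size;
}
```

Run from a single thread, both versions behave the same, which is why this kind of bug tends to hide until a heavily concurrent workload such as lmbench comes along.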
-
There are two other issues (also known as "bugs"):
-
The audit daemon sometimes gets killed by the OOM killer, when memory is low. This
should not be hard to fix (just tell it to ignore the "SIGKILL" signal).
-
lmbench's "lat_sig" program does not seem to work properly. For some reason, it
repeatedly receives the "SIGSEGV" signal (presumably from the kernel), on a line of
code in which it is trying to access a memory-mapped address. Why? I have no idea.
We don't deal with memory-mapping at all. This implies to me that some code, somewhere
in my module, is munging a memory address that it should not be munging. Ugh.
The lat_sig program works perfectly well when my module isn't loaded, so I'm pretty
sure it's my problem.
Update: lmbench now seems to be working perfectly. I believe that the segfault
weirdness was a result of the fact that things get screwed up when you unload the
kernel module (there's no way around this - the module unload race condition is a
fact of life). But it's all good now. Hurray! Time for testing scripts soon!
-
My brain-dump to Ash has been completed, pretty much. He now knows basically how this system
works. He had some good ideas for me to use, especially in regard to database design. He has
also introduced me to his testing scripts, which I will be adapting when I start writing my own.