March
March 31, 2003
- I've been a bit lax about updating lately. So...
- I finished the new get_io program. It now maintains a list of open file
descriptors while it runs, rather than rebuilding the list every time
it encounters a write. The one thing it does not handle (that I'm aware
of, at any rate) is interleaved opens and closes from cloned processes,
which can screw things up.
- I've started writing my own version of strace. This is somewhat tedious. Easy,
but tedious.
- I noticed that I'm not monitoring the sendmsg and recvmsg family
of system calls. I will have to add those. That shouldn't be too hard.
- Here is a link to the source code for the new get_io
program, written in C. It is much more efficient than the old "get_io.sh".
- I've begun trying to test the system using lmbench.
Unfortunately, I am encountering some snags. For some reason, the benchmark sometimes enters an
infinite loop when it tries to measure how long it takes to handle a protection fault. I am unsure
why this is happening; more investigation is required. Also, it appears that I am handling
something improperly when I fail to allocate memory in the kernel and bail out: the
data received by the client sometimes gets munged, and the data stream right now is very brittle.
If it gets munged somewhere, it's not very good at getting back in sync. Rather than coding in
fault-tolerance, however, I would like to figure out where the fault is. Hopefully I can
figure this out tomorrow.
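Here's a rough sketch of the kind of per-process descriptor table that get_io's new approach implies: open, close, dup and fork events update the table as they stream by, and each write is resolved with a single lookup instead of rebuilding the list. The function names, the 256-descriptor cap and the one-table-per-process layout are assumptions for illustration, not the actual get_io code. Like get_io, it does nothing about cloned processes that share one descriptor table.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_FDS 256   /* assumption: higher descriptors are ignored */

/* Open-descriptor table for ONE process.  Slot i holds the name of the
 * file behind descriptor i, or NULL if descriptor i is not open. */
struct fd_table {
    char *name[MAX_FDS];
};

static int fd_valid(int fd) { return fd >= 0 && fd < MAX_FDS; }

static char *copy_name(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p)
        memcpy(p, s, n);
    return p;
}

void fd_table_init(struct fd_table *t)
{
    for (int i = 0; i < MAX_FDS; i++)
        t->name[i] = NULL;
}

void fd_table_free(struct fd_table *t)
{
    for (int i = 0; i < MAX_FDS; i++) {
        free(t->name[i]);
        t->name[i] = NULL;
    }
}

/* open() returned FD for file NAME. */
void fd_on_open(struct fd_table *t, int fd, const char *name)
{
    if (!fd_valid(fd))
        return;
    free(t->name[fd]);              /* in case we missed a close */
    t->name[fd] = copy_name(name);
}

/* close(FD) succeeded. */
void fd_on_close(struct fd_table *t, int fd)
{
    if (!fd_valid(fd))
        return;
    free(t->name[fd]);
    t->name[fd] = NULL;
}

/* dup/dup2/fcntl(F_DUPFD) made NEWFD refer to the same file as OLDFD. */
void fd_on_dup(struct fd_table *t, int oldfd, int newfd)
{
    if (!fd_valid(oldfd) || !fd_valid(newfd) || oldfd == newfd)
        return;
    fd_on_close(t, newfd);          /* dup2 closes NEWFD first if it is open */
    if (t->name[oldfd])
        t->name[newfd] = copy_name(t->name[oldfd]);
}

/* fork(): the (empty) child table starts as a copy of the parent's. */
void fd_on_fork(const struct fd_table *parent, struct fd_table *child)
{
    for (int i = 0; i < MAX_FDS; i++)
        child->name[i] = parent->name[i] ? copy_name(parent->name[i]) : NULL;
}

/* write(FD, ...): which file did it go to?  NULL if unknown. */
const char *fd_lookup(const struct fd_table *t, int fd)
{
    return fd_valid(fd) ? t->name[fd] : NULL;
}
```

Events are fed in log order: after `fd_on_open(&t, 3, "/etc/passwd")` and `fd_on_dup(&t, 3, 5)`, a later write on descriptor 5 resolves to /etc/passwd even if descriptor 3 has since been closed.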
March 25, 2003
- I've been working on re-writing the get_io stuff. I'm doing it in C this time
(rather than in bash), and I'm doing it much more efficiently.
- I upgraded to MySQL 4.0, because it gives me UNIONs. It also breaks a few
things, and I have to fix those. Nothing serious seems broken; mainly minor details.
- I left the client and server running over the weekend. For some reason, after
around 10 hours, the audit daemon on the front-end machine died. It didn't
segfault; it just interpreted a signal that it received as "time to shut down".
It looks like somebody is sending it signal 2, which is SIGINT on Linux.
I will look into this further.
- In the process of rewriting the get_io stuff, I've discovered a way that
someone could fake it out. Suppose the following sequence of events occurs:
- Process A spawns process B with pid X.
- Process B does stuff, and eventually dies.
- -- things happen, pids wrap around --
- Process C spawns Process D with pid X.
- Process D does things (i.e., generates events).
Question: How do we know whether process A or process C is the parent of
process D? We know the pid of the child process at the time of the fork.
However, we do not know its rollcount, so we can't map it to a unique process.
So, I've added a column called child_roll_count to the event table, which
will get filled in at post-processing time (and will be null for everything
that's not a fork event). I could add another table for this, but I'm not sure
that I want to do the joins. Perhaps I will mention this to Dave at the next
meeting.
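One way to find out who is sending signal 2 is to install the SIGINT handler with `sigaction()` and the `SA_SIGINFO` flag, which hands the handler a `siginfo_t` describing the sender. A minimal sketch follows; the variable and function names are hypothetical, and it only records the sender rather than reproducing the daemon's real shutdown path.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Filled in by the handler; the daemon's main loop can check got_sigint,
 * log who sent the signal, and then decide whether to shut down. */
static volatile sig_atomic_t got_sigint = 0;
static volatile sig_atomic_t sigint_sender = 0;  /* si_pid */
static volatile sig_atomic_t sigint_code = 0;    /* si_code */

static void on_sigint(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    /* Only async-signal-safe work here: just record the details.  When
     * si_code is SI_USER the signal came from kill(), and si_pid is the
     * pid of the process that sent it. */
    sigint_sender = info->si_pid;
    sigint_code = info->si_code;
    got_sigint = 1;
}

/* Install on_sigint for SIGINT (signal 2).  Returns 0 on success. */
int install_sigint_logger(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigint;
    sa.sa_flags = SA_SIGINFO;       /* ask for the siginfo_t argument */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGINT, &sa, NULL);
}
```

The main loop would then log `sigint_sender` before exiting; if that process is still alive, `ps` on the pid may show which program is doing the killing.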
March 21, 2003
- Had a meeting with Wu-chang, Sourabh, and Dave. Sourabh is now working on
installing SAP DB, and getting a branch of
4N6 set up to work with it.
- I moved some of the scripts around, and wrote a nice, friendly wrapper
around net_client, called "client_session.sh". It's documented on the utils page.
- I began adding post-processing functionality, and wrote the rollcounting
stuff. I haven't yet modified the scripts to take rollcounts into account.
- I have started the server running, and a client to read from it. I'm gonna
leave it running over the weekend, and hope everything is still working
when I get back. My database right now, after running for about half
an hour, is at about 70 megs.
March 19, 2003
- Bug discovery: Sourabh has found that, because we haven't been monitoring the
fcntl system call, it is possible for us to either miss certain reads
or writes, or get extraneous information. It appears that dup and
dup2 are actually implemented via fcntl. So, I will be
removing logging for the dup calls and adding logging for the fcntl call.
I must check the Linux source code first, though, to make sure this is actually
the case. Update: It's not the case.
- I've started implementing the rollcounting stuff, to handle pid-wrapping.
Check out the text file I wrote about it; it's linked from the Miscellaneous
Documentation section.
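Whatever dup turns out to be built on, `fcntl()` with `F_DUPFD` really is a second way to duplicate a descriptor, which is why ignoring fcntl can cause missed writes: the copy it returns refers to the same open file, so writes through it never mention the original descriptor. A small demonstration (the function names are illustrative):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Duplicate FD without calling dup(): F_DUPFD returns the lowest free
 * descriptor >= MIN_FD that refers to the same open file description. */
int dup_via_fcntl(int fd, int min_fd)
{
    return fcntl(fd, F_DUPFD, min_fd);
}

/* 1 if descriptors A and B refer to the same file (same device and inode). */
int same_file(int a, int b)
{
    struct stat sa, sb;
    if (fstat(a, &sa) < 0 || fstat(b, &sb) < 0)
        return 0;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

An auditor that only logs open and dup would see descriptor `fd` opened, but a later write on the number returned by `dup_via_fcntl(fd, 20)` would look like a write to an unknown descriptor.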
March 18, 2003
- It's alive!
There is now a shell replay utility program. I have named it, imaginatively,
replay_shell. Here's a link to a compiled
binary, which you should have no trouble running on an x86 Linux box. I think.
For details on how to use it, check out the Utilities page. Here's a
link to a file that you can run it on. For other files that you
can run it on, check out the test data page. Right now there's only the one, but
it's pretty cool: it replays a shell session, with timing. Here's a link
to the source code, if you want or need to compile it yourself. To run this thing:
- Download the binary.
- Download the session.
- Start up an xterm (or other console program of your choice).
- Make sure it's 80x24 (it probably already is).
- From the console window, go to the directory you've downloaded the
stuff to, and type "./replay_shell -r -f shell_dump_1".
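The core of replaying "with timing" is sleeping for the gap between consecutive recorded chunks before writing each one. A minimal sketch, assuming a hypothetical in-memory record of (timestamp, bytes) chunks; it does not read replay_shell's actual dump format or implement its command-line flags.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* HYPOTHETICAL record: one chunk of terminal output and the time
 * (seconds since the session started) at which it was produced. */
struct chunk {
    double t;
    const char *data;
    size_t len;
};

/* Sleep for SECS seconds (no-op for SECS <= 0). */
static void sleep_for(double secs)
{
    if (secs <= 0)
        return;
    struct timespec ts;
    ts.tv_sec = (time_t)secs;
    ts.tv_nsec = (long)((secs - (double)ts.tv_sec) * 1e9);
    nanosleep(&ts, NULL);
}

/* Write the chunks to OUT, pausing between them so the gaps match the
 * recording.  SPEED scales the pauses (2.0 = twice as fast); SPEED <= 0
 * disables timing.  Returns the number of bytes written. */
size_t replay_session(const struct chunk *c, size_t n, FILE *out, double speed)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (speed > 0 && i > 0)
            sleep_for((c[i].t - c[i - 1].t) / speed);
        total += fwrite(c[i].data, 1, c[i].len, out);
        fflush(out);   /* show output immediately, like a live terminal */
    }
    return total;
}
```

Calling `replay_session(chunks, n, stdout, 1.0)` plays the session back in real time; a speed of 0 dumps it all at once, which is handy for grepping.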
March 14, 2003
It's almost 8:00 on Friday night, so I'm gonna make this one as short as I can.
- I wrote some more scripts today, but I've only updated the utils page with
one of them. It's a doozy, though: get_io.sh. Check the page.
- Work on the 4N6 webpage frontend is proceeding. Now that I've written
get_io.sh, I should be able to dump the output from a shell session.
I should have a rudimentary demo for this up and running either on Monday
or sometime this weekend, if I decide to come to work.
March 13, 2003
- Wrote several shell scripts to do various database retrieval functions.
These are documented on the new Utilities page.
- Added monitoring of the close system call.
- Changed the monitoring of both close and open so that they
update the dup table when an event occurs. I did this because you
can't do UNIONs in the version of MySQL I'm running. Basically, I wanted to write a query that
would get entries from the dup table and open calls from the
event table, merge them using UNION, and use the result set to
figure out which descriptors were available at a given time. Since I
couldn't do that, I decided to put the information redundantly into the
dup table. I believe that some of the processing that now happens
at runtime is unnecessary... Perhaps this redundant information should
be calculated by a script once the database goes offline. However, that
sounds like a concern for Dave Maier, our database wizard, rather than
for me, the grunt.
March 12, 2003
- Added some links to database information and my TODO list.
- Wrote an interpreter that takes a SQL file as input and replaces
each occurrence of %x with the appropriate command-line
argument. You can then pipe the result to "mysql". This is so that I don't have
to write C programs to do simple tasks.
- Realized that I'm not monitoring the
close system call.
I'll fix that tomorrow.
- Wrote the list_all_dups.sh shell script. I haven't tested this script
yet, but I think that it should accurately (at least, once I've got
close
syscall monitoring implemented) tell you what descriptors a pid has that point
to a given file at any given time.
- I started adding a new table to the database, called "proc_life", but I have
decided that it's a bad idea and, on top of that, it's not possible to implement.
The idea was that this table would get updated whenever a process was spawned or
died, so that you could tell which actual "process" was associated with a pid
at any given time (in case of pid recycling). However, you don't know
when processes die. If a process segfaults, you don't know. If it's sent
SIGABRT, it can choose to handle the signal and keep running, even though the
kill call returned success.
I believe, however, that there are still ways to get this sort of information.
What I am going to do is add a table that has a pid and a count
field. Whenever a process forks, we will update this table: count will become
count + 1 for that pid. If there is no entry in the table for the given pid, we
will insert a new row and set count to 0. I will also add a "count" column to the event
table, which will be filled in with the current count for the given pid whenever an
event occurs. This solves the problem of pid-wrapping: processes will be uniquely
identified by (pid, count) pairs.
- Wrote a test program called dupper (in the testing directory). This program
opens a specified file, then forks itself n times (creating an n-deep
process tree). The deepest process in the tree writes a test string to the file and
then returns. I intend to use this to test whether or not my list_all_dups.sh
script is working properly.
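A minimal in-memory sketch of the (pid, count) scheme above, reading "that pid" as the pid of the newly forked child, which is how child_roll_count on fork events uses it. The fixed array size (32768, the usual Linux pid_max default), the -1 "never seen" marker and the function names are assumptions; the real version lives in database tables.

```c
#include <stddef.h>

#define PID_LIMIT 32768   /* assumption: default Linux pid_max */

/* One counter per pid; -1 means "no row for this pid yet". */
struct rollcount_table {
    int count[PID_LIMIT];
};

void rollcount_init(struct rollcount_table *t)
{
    for (size_t i = 0; i < PID_LIMIT; i++)
        t->count[i] = -1;
}

/* A fork created CHILD_PID: insert its row with count 0, or bump the
 * count if the pid has been used before.  The returned value is what
 * would go into child_roll_count for the fork event. */
int rollcount_on_fork(struct rollcount_table *t, int child_pid)
{
    if (child_pid < 0 || child_pid >= PID_LIMIT)
        return -1;
    if (t->count[child_pid] < 0)
        t->count[child_pid] = 0;      /* no row yet: insert with count 0 */
    else
        t->count[child_pid]++;        /* pid reused: count = count + 1 */
    return t->count[child_pid];
}

/* Any other event from PID is tagged with the pid's current count.
 * Returns -1 for a pid never seen forking (e.g. a process that was
 * already running before auditing started). */
int rollcount_current(const struct rollcount_table *t, int pid)
{
    if (pid < 0 || pid >= PID_LIMIT)
        return -1;
    return t->count[pid];
}
```

In the pid-wrapping example from the March 25 entry, process B becomes (X, 0) and process D becomes (X, 1), so the fork event from C carries child_roll_count 1 and can no longer be confused with A's fork.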
March 11, 2003
- Added "events per second" to throughput test.
- Working on a web-based interface to access the database back-end. It should
be able to answer this sequence of questions:
- Get a list of all pids that wrote to a file (e.g., "/etc/passwd") between
time A and B
- Select a pid from this list. Get the pid of the shell process that
spawned this pid, and the tty that the shell used.
- Dump the session for the shell.
- Dump the syscall activity for any process executed by the shell.
... among others.
- Wrote some utils to help with the web interface:
- db_util/list_dups: Prints a list of all file descriptors that point
to a given filename (in a given process) between time A and B. Also lets
you specify an initial set of file descriptors to include.
Oh yeah, this program is buggy right now: it doesn't take into account
the fact that people can close file descriptors. Hopefully I'll fix that tomorrow.
- db_util/list_ancestors: Prints a list of all ancestors of a given
process.
I think that I can write another utility function that uses these two programs
to figure out which processes have a file open, even if they inherited the
descriptors from their parent. Currently, list_dups only figures it out
if the process opened the file itself: if it inherited a descriptor
referencing /etc/passwd from its parent, that doesn't get caught.
Anyway, the ultimate goal of these utils is to answer question 1 in the
above series of questions (although they may be useful for other questions, too).
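A rough sketch of combining the two tools: walk up the parent chain the way list_ancestors does, and flag a process if it or any ancestor opened the file itself. The in-memory `ppid_of` array, the function names and the 64-entry chain cap are assumptions; rollcounts are ignored here, and a real check would also compare the open time against the fork time.

```c
#include <stddef.h>

#define PID_LIMIT 32768   /* assumption: default Linux pid_max */

/* Store the ancestors of PID in OUT (nearest first), at most MAX of them.
 * PPID_OF[p] is the parent of pid p, or 0 if unknown.  The walk stops at
 * init (pid 1), at an unknown parent, or after MAX steps, so a corrupted
 * map with a cycle cannot loop forever.  Returns the count stored. */
size_t list_ancestors(const int *ppid_of, int pid, int *out, size_t max)
{
    size_t n = 0;
    while (n < max && pid > 1 && pid < PID_LIMIT) {
        int parent = ppid_of[pid];
        if (parent <= 0)
            break;
        out[n++] = parent;
        pid = parent;
    }
    return n;
}

/* 1 if PID or one of its ancestors is in OPENERS (pids that opened the
 * file themselves, as list_dups reports), so PID may hold the file via an
 * inherited descriptor.  This is only a candidate check: the ancestor must
 * also have had the file open at the moment of the fork. */
int may_hold_descriptor(const int *ppid_of, int pid,
                        const int *openers, size_t nopeners)
{
    int chain[64];
    size_t n = list_ancestors(ppid_of, pid, chain + 1, 63);
    chain[0] = pid;
    for (size_t i = 0; i <= n; i++)
        for (size_t j = 0; j < nopeners; j++)
            if (chain[i] == openers[j])
                return 1;
    return 0;
}
```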
March 6, 2003
- Wrote the "db_init.sh" script, to automate database initialization.
Hooray for ease of installation!
- Generated a bunch of numbers. See the pages in the "Links" section for details.
- Discussed stuff with Wu-chang. My next priority is to write a web front-end for
doing queries on a sample database, so you folks can see what this thing can do,
and then I'm gonna write a program to replay a terminal session. Fun stuff!
March 5, 2003
- Added the test data section, and output from one test run.
- Added the "db_users.sql" and "init_db.sh" scripts to make setting up
the database on a new machine easier.
- Verified that the reason for the instability was, indeed, that gaping memory
hole. The system now seems to be rock-solid on my testing platform.
- Fixed a gaping memory hole in the data_buffer stuff. I am hopeful that now the
system will be able to run for long periods (read: overnight) without crashing.
- Added "throttle" code. Basically, whenever a syscall occurs now, it calls
"throttle_wait" before it allocates any memory. If there are too many events in the buffer
(the default limit is 100), the calling process gets blocked on an event queue. This
should help prevent DoS attacks.
- Wrote the "calc_throughput" utility. This program lets you specify a number of
seconds. Then it figures out the number of bytes that are written to the
auditdaemon by the kernel in that time period. It uses the /proc/auditinfo
entry to get the data (I had to add some stuff to the stats code to get the
info about the number of bytes in there, too).
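The throttle is a classic bounded-buffer wait. The real code runs in the kernel and blocks the calling process on a queue; the sketch below is a userspace analogue using a pthread mutex and condition variable, with hypothetical signatures (only the name throttle_wait comes from the real code).

```c
#include <pthread.h>

struct throttle {
    pthread_mutex_t lock;
    pthread_cond_t not_full;
    int in_buffer;   /* events queued but not yet read by the daemon */
    int limit;       /* e.g. 100, the default mentioned above */
};

void throttle_init(struct throttle *t, int limit)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->not_full, NULL);
    t->in_buffer = 0;
    t->limit = limit;
}

/* Called before allocating memory for a new event: block while the
 * buffer is full, then reserve a slot. */
void throttle_wait(struct throttle *t)
{
    pthread_mutex_lock(&t->lock);
    while (t->in_buffer >= t->limit)      /* re-check after every wakeup */
        pthread_cond_wait(&t->not_full, &t->lock);
    t->in_buffer++;
    pthread_mutex_unlock(&t->lock);
}

/* Called when the daemon consumes an event: free a slot, wake a waiter. */
void throttle_release(struct throttle *t)
{
    pthread_mutex_lock(&t->lock);
    if (t->in_buffer > 0)
        t->in_buffer--;
    pthread_cond_signal(&t->not_full);
    pthread_mutex_unlock(&t->lock);
}

int throttle_pending(struct throttle *t)
{
    pthread_mutex_lock(&t->lock);
    int n = t->in_buffer;
    pthread_mutex_unlock(&t->lock);
    return n;
}
```

The `while` loop (rather than `if`) around the wait matters: a woken caller re-checks the count, since wakeups can be spurious and another caller may have taken the freed slot first.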
Mike Shea
Last modified: Thu Apr 10 20:03:15 PDT 2003