From 165125e1e480f9510a5ffcfbfee4e3ee38c05f23 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Tue, 24 Jul 2007 09:28:11 +0200 Subject: [BLOCK] Get rid of request_queue_t typedef Some of the code has been gradually transitioned to using the proper struct request_queue, but there's lots left. So do a full sweep of the kernel and get rid of this typedef and replace its uses with the proper type. Signed-off-by: Jens Axboe --- Documentation/block/barrier.txt | 6 +++--- Documentation/block/biodoc.txt | 10 +++++----- Documentation/block/request.txt | 2 +- Documentation/iostats.txt | 2 +- 4 files changed, 10 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/block/barrier.txt b/Documentation/block/barrier.txt index 7d279f2f5bb2..2c2f24f634e4 100644 --- a/Documentation/block/barrier.txt +++ b/Documentation/block/barrier.txt @@ -79,9 +79,9 @@ and how to prepare flush requests. Note that the term 'ordered' is used to indicate the whole sequence of performing barrier requests including draining and flushing. -typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq); +typedef void (prepare_flush_fn)(struct request_queue *q, struct request *rq); -int blk_queue_ordered(request_queue_t *q, unsigned ordered, +int blk_queue_ordered(struct request_queue *q, unsigned ordered, prepare_flush_fn *prepare_flush_fn); @q : the queue in question @@ -92,7 +92,7 @@ int blk_queue_ordered(request_queue_t *q, unsigned ordered, For example, SCSI disk driver's prepare_flush_fn looks like the following. 
-static void sd_prepare_flush(request_queue_t *q, struct request *rq) +static void sd_prepare_flush(struct request_queue *q, struct request *rq) { memset(rq->cmd, 0, sizeof(rq->cmd)); rq->cmd_type = REQ_TYPE_BLOCK_PC; diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 3adaace328a6..8af392fc6ef0 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -740,12 +740,12 @@ Block now offers some simple generic functionality to help support command queueing (typically known as tagged command queueing), ie manage more than one outstanding command on a queue at any given time. - blk_queue_init_tags(request_queue_t *q, int depth) + blk_queue_init_tags(struct request_queue *q, int depth) Initialize internal command tagging structures for a maximum depth of 'depth'. - blk_queue_free_tags((request_queue_t *q) + blk_queue_free_tags((struct request_queue *q) Teardown tag info associated with the queue. This will be done automatically by block if blk_queue_cleanup() is called on a queue @@ -754,7 +754,7 @@ one outstanding command on a queue at any given time. The above are initialization and exit management, the main helpers during normal operations are: - blk_queue_start_tag(request_queue_t *q, struct request *rq) + blk_queue_start_tag(struct request_queue *q, struct request *rq) Start tagged operation for this request. A free tag number between 0 and 'depth' is assigned to the request (rq->tag holds this number), @@ -762,7 +762,7 @@ normal operations are: for this queue is already achieved (or if the tag wasn't started for some other reason), 1 is returned. Otherwise 0 is returned. - blk_queue_end_tag(request_queue_t *q, struct request *rq) + blk_queue_end_tag(struct request_queue *q, struct request *rq) End tagged operation on this request. 'rq' is removed from the internal book keeping structures. @@ -781,7 +781,7 @@ queue. 
For instance, on IDE any tagged request error needs to clear both the hardware and software block queue and enable the driver to sanely restart all the outstanding requests. There's a third helper to do that: - blk_queue_invalidate_tags(request_queue_t *q) + blk_queue_invalidate_tags(struct request_queue *q) Clear the internal block tag queue and re-add all the pending requests to the request queue. The driver will receive them again on the diff --git a/Documentation/block/request.txt b/Documentation/block/request.txt index 75924e2a6975..fff58acb40a3 100644 --- a/Documentation/block/request.txt +++ b/Documentation/block/request.txt @@ -83,6 +83,6 @@ struct bio *bio DBI First bio in request struct bio *biotail DBI Last bio in request -request_queue_t *q DB Request queue this request belongs to +struct request_queue *q DB Request queue this request belongs to struct request_list *rl B Request list this request came from diff --git a/Documentation/iostats.txt b/Documentation/iostats.txt index 09a1bafe2528..b963c3b4afa5 100644 --- a/Documentation/iostats.txt +++ b/Documentation/iostats.txt @@ -79,7 +79,7 @@ Field 8 -- # of milliseconds spent writing measured from __make_request() to end_that_request_last()). Field 9 -- # of I/Os currently in progress The only field that should go to zero. Incremented as requests are - given to appropriate request_queue_t and decremented as they finish. + given to appropriate struct request_queue and decremented as they finish. Field 10 -- # of milliseconds spent doing I/Os This field is increases so long as field 9 is nonzero. Field 11 -- weighted # of milliseconds spent doing I/Os -- cgit v1.2.3 From 6570c45995a6339597462434a81f358a38941ac4 Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Mon, 23 Jul 2007 18:43:56 -0700 Subject: link lguest example launcher non-static S.Caglar Onur points out that many distributions don't ship a static zlib. 
Unfortunately the launcher currently maps virtual device memory where shared libraries want to go. The solution is to pre-scan the args to figure out how much memory we have, then allocate devices above that, rather than down from the top possible address. This also turns out to be simpler. Signed-off-by: Rusty Russell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/lguest/Makefile | 3 +- Documentation/lguest/lguest.c | 84 ++++++++++++++++++------------------------- 2 files changed, 35 insertions(+), 52 deletions(-) (limited to 'Documentation') diff --git a/Documentation/lguest/Makefile b/Documentation/lguest/Makefile index b9b9427376e9..31e794ef5f98 100644 --- a/Documentation/lguest/Makefile +++ b/Documentation/lguest/Makefile @@ -11,8 +11,7 @@ endif include $(KBUILD_OUTPUT)/.config LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x08000000) -CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 \ - -static -DLGUEST_GUEST_TOP="$(LGUEST_GUEST_TOP)" -Wl,-T,lguest.lds +CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -Wl,-T,lguest.lds LDLIBS:=-lz all: lguest.lds lguest diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index 1432b502a2d9..62a8133393e1 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -47,12 +47,14 @@ static bool verbose; #define verbose(args...) 
\ do { if (verbose) printf(args); } while(0) static int waker_fd; +static u32 top; struct device_list { fd_set infds; int max_infd; + struct lguest_device_desc *descs; struct device *dev; struct device **lastdev; }; @@ -324,8 +326,7 @@ static void concat(char *dst, char *args[]) static int tell_kernel(u32 pgdir, u32 start, u32 page_offset) { u32 args[] = { LHREQ_INITIALIZE, - LGUEST_GUEST_TOP/getpagesize(), /* Just below us */ - pgdir, start, page_offset }; + top/getpagesize(), pgdir, start, page_offset }; int fd; fd = open_or_die("/dev/lguest", O_RDWR); @@ -382,7 +383,7 @@ static int setup_waker(int lguest_fd, struct device_list *device_list) static void *_check_pointer(unsigned long addr, unsigned int size, unsigned int line) { - if (addr >= LGUEST_GUEST_TOP || addr + size >= LGUEST_GUEST_TOP) + if (addr >= top || addr + size >= top) errx(1, "%s:%i: Invalid address %li", __FILE__, line, addr); return (void *)addr; } @@ -629,24 +630,26 @@ static void handle_input(int fd, struct device_list *devices) } } -static struct lguest_device_desc *new_dev_desc(u16 type, u16 features, - u16 num_pages) +static struct lguest_device_desc * +new_dev_desc(struct lguest_device_desc *descs, + u16 type, u16 features, u16 num_pages) { - static unsigned long top = LGUEST_GUEST_TOP; - struct lguest_device_desc *desc; + unsigned int i; - desc = malloc(sizeof(*desc)); - desc->type = type; - desc->num_pages = num_pages; - desc->features = features; - desc->status = 0; - if (num_pages) { - top -= num_pages*getpagesize(); - map_zeroed_pages(top, num_pages); - desc->pfn = top / getpagesize(); - } else - desc->pfn = 0; - return desc; + for (i = 0; i < LGUEST_MAX_DEVICES; i++) { + if (!descs[i].type) { + descs[i].type = type; + descs[i].features = features; + descs[i].num_pages = num_pages; + if (num_pages) { + map_zeroed_pages(top, num_pages); + descs[i].pfn = top/getpagesize(); + top += num_pages*getpagesize(); + } + return &descs[i]; + } + } + errx(1, "too many devices"); } static struct 
device *new_device(struct device_list *devices, @@ -669,7 +672,7 @@ static struct device *new_device(struct device_list *devices, dev->fd = fd; if (handle_input) set_fd(dev->fd, devices); - dev->desc = new_dev_desc(type, features, num_pages); + dev->desc = new_dev_desc(devices->descs, type, features, num_pages); dev->mem = (void *)(dev->desc->pfn * getpagesize()); dev->handle_input = handle_input; dev->watch_key = (unsigned long)dev->mem + watch_off; @@ -866,30 +869,6 @@ static void setup_tun_net(const char *arg, struct device_list *devices) verbose("attached to bridge: %s\n", br_name); } -/* Now we know how much memory we have, we copy in device descriptors */ -static void map_device_descriptors(struct device_list *devs, unsigned long mem) -{ - struct device *i; - unsigned int num; - struct lguest_device_desc *descs; - - /* Device descriptor array sits just above top of normal memory */ - descs = map_zeroed_pages(mem, 1); - - for (i = devs->dev, num = 0; i; i = i->next, num++) { - if (num == LGUEST_MAX_DEVICES) - errx(1, "too many devices"); - verbose("Device %i: %s\n", num, - i->desc->type == LGUEST_DEVICE_T_NET ? "net" - : i->desc->type == LGUEST_DEVICE_T_CONSOLE ? "console" - : i->desc->type == LGUEST_DEVICE_T_BLOCK ? "block" - : "unknown"); - descs[num] = *i->desc; - free(i->desc); - i->desc = &descs[num]; - } -} - static void __attribute__((noreturn)) run_guest(int lguest_fd, struct device_list *device_list) { @@ -934,8 +913,8 @@ static void usage(void) int main(int argc, char *argv[]) { - unsigned long mem, pgdir, start, page_offset, initrd_size = 0; - int c, lguest_fd; + unsigned long mem = 0, pgdir, start, page_offset, initrd_size = 0; + int i, c, lguest_fd; struct device_list device_list; void *boot = (void *)0; const char *initrd_name = NULL; @@ -945,6 +924,15 @@ int main(int argc, char *argv[]) device_list.lastdev = &device_list.dev; FD_ZERO(&device_list.infds); + /* We need to know how much memory so we can allocate devices. 
*/ + for (i = 1; i < argc; i++) { + if (argv[i][0] != '-') { + mem = top = atoi(argv[i]) * 1024 * 1024; + device_list.descs = map_zeroed_pages(top, 1); + top += getpagesize(); + break; + } + } while ((c = getopt_long(argc, argv, "v", opts, NULL)) != EOF) { switch (c) { case 'v': @@ -974,16 +962,12 @@ int main(int argc, char *argv[]) setup_console(&device_list); /* First we map /dev/zero over all of guest-physical memory. */ - mem = atoi(argv[optind]) * 1024 * 1024; map_zeroed_pages(0, mem / getpagesize()); /* Now we load the kernel */ start = load_kernel(open_or_die(argv[optind+1], O_RDONLY), &page_offset); - /* Write the device descriptors into memory. */ - map_device_descriptors(&device_list, mem); - /* Map the initrd image if requested */ if (initrd_name) { initrd_size = load_initrd(initrd_name, mem); -- cgit v1.2.3 From be1ff386e768ee4fc19bb7da48cee4fc4cb4e75b Mon Sep 17 00:00:00 2001 From: David Brownell Date: Mon, 23 Jul 2007 18:43:57 -0700 Subject: minor gpio doc update MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fix doc bug noted by Uwe Kleine-König: gpio_set_direction() is long gone, replaced by gpio_direction_input() and gpio_direction_output(). Signed-off-by: David Brownell Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/gpio.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/gpio.txt b/Documentation/gpio.txt index 218a8650f48d..6bc2ba215df9 100644 --- a/Documentation/gpio.txt +++ b/Documentation/gpio.txt @@ -148,7 +148,7 @@ pin ... that won't always match the specified output value, because of issues including wire-OR and output latencies. The get/set calls have no error returns because "invalid GPIO" should have -been reported earlier in gpio_set_direction(). However, note that not all +been reported earlier from gpio_direction_*(). 
However, note that not all platforms can read the value of output pins; those that can't should always return zero. Also, using these calls for GPIOs that can't safely be accessed without sleeping (see below) is an error. @@ -239,7 +239,7 @@ map between them using calls like: Those return either the corresponding number in the other namespace, or else a negative errno code if the mapping can't be done. (For example, some GPIOs can't used as IRQs.) It is an unchecked error to use a GPIO -number that hasn't been marked as an input using gpio_set_direction(), or +number that wasn't set up as an input using gpio_direction_input(), or to use an IRQ number that didn't originally come from gpio_to_irq(). These two mapping calls are expected to cost on the order of a single -- cgit v1.2.3 From b762f3ffb797c1281a38a1c82194534055fba5ec Mon Sep 17 00:00:00 2001 From: Joachim Deguara Date: Thu, 26 Jul 2007 13:40:43 +0200 Subject: [PATCH] sched: update Documentation/sched-stats.txt While learning about schedstats I found that the documentation in the tree is old. I updated it and found some interesting stuff, like schedstats version 14 is the same as version 12 and version 13 never saw a kernel release! Also there are 6 fields in the current schedstats that are not used anymore. Nick had made them irrelevant in commit 476d139c218e44e045e4bc6d4cc02b010b343939 but never removed them. Thanks to Rick's perl script, from which I borrowed some of the updated descriptions. 
Signed-off-by: Joachim Deguara Acked-by: Nick Piggin Cc: Rick Lindsley Signed-off-by: Andrew Morton Signed-off-by: Ingo Molnar --- Documentation/sched-stats.txt | 195 +++++++++++++++++++++--------------------- 1 file changed, 99 insertions(+), 96 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sched-stats.txt b/Documentation/sched-stats.txt index 6f72021aae51..442e14d35dea 100644 --- a/Documentation/sched-stats.txt +++ b/Documentation/sched-stats.txt @@ -1,10 +1,11 @@ -Version 10 of schedstats includes support for sched_domains, which -hit the mainline kernel in 2.6.7. Some counters make more sense to be -per-runqueue; other to be per-domain. Note that domains (and their associated -information) will only be pertinent and available on machines utilizing -CONFIG_SMP. - -In version 10 of schedstat, there is at least one level of domain +Version 14 of schedstats includes support for sched_domains, which hit the +mainline kernel in 2.6.20 although it is identical to the stats from version +12 which was in the kernel from 2.6.13-2.6.19 (version 13 never saw a kernel +release). Some counters make more sense to be per-runqueue; other to be +per-domain. Note that domains (and their associated information) will only +be pertinent and available on machines utilizing CONFIG_SMP. + +In version 14 of schedstat, there is at least one level of domain statistics for each cpu listed, and there may well be more than one domain. Domains have no particular names in this implementation, but the highest numbered one typically arbitrates balancing across all the @@ -27,7 +28,7 @@ to write their own scripts, the fields are described here. 
CPU statistics -------------- -cpu 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 +cpu 1 2 3 4 5 6 7 8 9 10 11 12 NOTE: In the sched_yield() statistics, the active queue is considered empty if it has only one process in it, since obviously the process calling @@ -39,48 +40,20 @@ First four fields are sched_yield() statistics: 3) # of times just the expired queue was empty 4) # of times sched_yield() was called -Next four are schedule() statistics: - 5) # of times the active queue had at least one other process on it - 6) # of times we switched to the expired queue and reused it - 7) # of times schedule() was called - 8) # of times schedule() left the processor idle - -Next four are active_load_balance() statistics: - 9) # of times active_load_balance() was called - 10) # of times active_load_balance() caused this cpu to gain a task - 11) # of times active_load_balance() caused this cpu to lose a task - 12) # of times active_load_balance() tried to move a task and failed - -Next three are try_to_wake_up() statistics: - 13) # of times try_to_wake_up() was called - 14) # of times try_to_wake_up() successfully moved the awakening task - 15) # of times try_to_wake_up() attempted to move the awakening task - -Next two are wake_up_new_task() statistics: - 16) # of times wake_up_new_task() was called - 17) # of times wake_up_new_task() successfully moved the new task - -Next one is a sched_migrate_task() statistic: - 18) # of times sched_migrate_task() was called +Next three are schedule() statistics: + 5) # of times we switched to the expired queue and reused it + 6) # of times schedule() was called + 7) # of times schedule() left the processor idle -Next one is a sched_balance_exec() statistic: - 19) # of times sched_balance_exec() was called +Next two are try_to_wake_up() statistics: + 8) # of times try_to_wake_up() was called + 9) # of times try_to_wake_up() was called to wake up the local cpu Next three are statistics describing scheduling 
latency: - 20) sum of all time spent running by tasks on this processor (in ms) - 21) sum of all time spent waiting to run by tasks on this processor (in ms) - 22) # of tasks (not necessarily unique) given to the processor - -The last six are statistics dealing with pull_task(): - 23) # of times pull_task() moved a task to this cpu when newly idle - 24) # of times pull_task() stole a task from this cpu when another cpu - was newly idle - 25) # of times pull_task() moved a task to this cpu when idle - 26) # of times pull_task() stole a task from this cpu when another cpu - was idle - 27) # of times pull_task() moved a task to this cpu when busy - 28) # of times pull_task() stole a task from this cpu when another cpu - was busy + 10) sum of all time spent running by tasks on this processor (in jiffies) + 11) sum of all time spent waiting to run by tasks on this processor (in + jiffies) + 12) # of timeslices run on this cpu Domain statistics @@ -89,65 +62,95 @@ One of these is produced per domain for each cpu described. (Note that if CONFIG_SMP is not defined, *no* domains are utilized and these lines will not appear in the output.) -domain 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 +domain 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 The first field is a bit mask indicating what cpus this domain operates over. 
-The next fifteen are a variety of load_balance() statistics: - - 1) # of times in this domain load_balance() was called when the cpu - was idle - 2) # of times in this domain load_balance() was called when the cpu - was busy - 3) # of times in this domain load_balance() was called when the cpu - was just becoming idle - 4) # of times in this domain load_balance() tried to move one or more - tasks and failed, when the cpu was idle - 5) # of times in this domain load_balance() tried to move one or more - tasks and failed, when the cpu was busy - 6) # of times in this domain load_balance() tried to move one or more - tasks and failed, when the cpu was just becoming idle - 7) sum of imbalances discovered (if any) with each call to - load_balance() in this domain when the cpu was idle - 8) sum of imbalances discovered (if any) with each call to - load_balance() in this domain when the cpu was busy - 9) sum of imbalances discovered (if any) with each call to - load_balance() in this domain when the cpu was just becoming idle - 10) # of times in this domain load_balance() was called but did not find - a busier queue while the cpu was idle - 11) # of times in this domain load_balance() was called but did not find - a busier queue while the cpu was busy - 12) # of times in this domain load_balance() was called but did not find - a busier queue while the cpu was just becoming idle - 13) # of times in this domain a busier queue was found while the cpu was - idle but no busier group was found - 14) # of times in this domain a busier queue was found while the cpu was - busy but no busier group was found - 15) # of times in this domain a busier queue was found while the cpu was - just becoming idle but no busier group was found - -Next two are sched_balance_exec() statistics: - 17) # of times in this domain sched_balance_exec() successfully pushed - a task to a new cpu - 18) # of times in this domain sched_balance_exec() tried but failed to - push a task to a new cpu - -Next 
two are try_to_wake_up() statistics: - 19) # of times in this domain try_to_wake_up() tried to move a task based - on affinity and cache warmth - 20) # of times in this domain try_to_wake_up() tried to move a task based - on load balancing - +The next 24 are a variety of load_balance() statistics in grouped into types +of idleness (idle, busy, and newly idle): + + 1) # of times in this domain load_balance() was called when the + cpu was idle + 2) # of times in this domain load_balance() checked but found + the load did not require balancing when the cpu was idle + 3) # of times in this domain load_balance() tried to move one or + more tasks and failed, when the cpu was idle + 4) sum of imbalances discovered (if any) with each call to + load_balance() in this domain when the cpu was idle + 5) # of times in this domain pull_task() was called when the cpu + was idle + 6) # of times in this domain pull_task() was called even though + the target task was cache-hot when idle + 7) # of times in this domain load_balance() was called but did + not find a busier queue while the cpu was idle + 8) # of times in this domain a busier queue was found while the + cpu was idle but no busier group was found + + 9) # of times in this domain load_balance() was called when the + cpu was busy + 10) # of times in this domain load_balance() checked but found the + load did not require balancing when busy + 11) # of times in this domain load_balance() tried to move one or + more tasks and failed, when the cpu was busy + 12) sum of imbalances discovered (if any) with each call to + load_balance() in this domain when the cpu was busy + 13) # of times in this domain pull_task() was called when busy + 14) # of times in this domain pull_task() was called even though the + target task was cache-hot when busy + 15) # of times in this domain load_balance() was called but did not + find a busier queue while the cpu was busy + 16) # of times in this domain a busier queue was found while the cpu + 
was busy but no busier group was found + + 17) # of times in this domain load_balance() was called when the + cpu was just becoming idle + 18) # of times in this domain load_balance() checked but found the + load did not require balancing when the cpu was just becoming idle + 19) # of times in this domain load_balance() tried to move one or more + tasks and failed, when the cpu was just becoming idle + 20) sum of imbalances discovered (if any) with each call to + load_balance() in this domain when the cpu was just becoming idle + 21) # of times in this domain pull_task() was called when newly idle + 22) # of times in this domain pull_task() was called even though the + target task was cache-hot when just becoming idle + 23) # of times in this domain load_balance() was called but did not + find a busier queue while the cpu was just becoming idle + 24) # of times in this domain a busier queue was found while the cpu + was just becoming idle but no busier group was found + + Next three are active_load_balance() statistics: + 25) # of times active_load_balance() was called + 26) # of times active_load_balance() tried to move a task and failed + 27) # of times active_load_balance() successfully moved a task + + Next three are sched_balance_exec() statistics: + 28) sbe_cnt is not used + 29) sbe_balanced is not used + 30) sbe_pushed is not used + + Next three are sched_balance_fork() statistics: + 31) sbf_cnt is not used + 32) sbf_balanced is not used + 33) sbf_pushed is not used + + Next three are try_to_wake_up() statistics: + 34) # of times in this domain try_to_wake_up() awoke a task that + last ran on a different cpu in this domain + 35) # of times in this domain try_to_wake_up() moved a task to the + waking cpu because it was cache-cold on its own cpu anyway + 36) # of times in this domain try_to_wake_up() started passive balancing /proc/<pid>/schedstat ---------------- schedstats also adds a new /proc/ Date: Thu, 26 Jul 2007 10:41:02 -0700 Subject: lguest: documentation 
I: Preparation The netfilter code had very good documentation: the Netfilter Hacking HOWTO. No one ever read it. So this time I'm trying something different, using a bit of Knuthiness. Signed-off-by: Rusty Russell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/lguest/extract | 58 +++++++++++++++++++++++++++++++++++ Documentation/lguest/lguest.c | 9 ++++-- drivers/lguest/Makefile | 12 ++++++++ drivers/lguest/README | 47 ++++++++++++++++++++++++++++ drivers/lguest/core.c | 7 +++-- drivers/lguest/hypercalls.c | 9 ++++-- drivers/lguest/interrupts_and_traps.c | 13 ++++++++ drivers/lguest/io.c | 8 +++-- drivers/lguest/lguest.c | 30 ++++++++++++++++-- drivers/lguest/lguest_bus.c | 3 ++ drivers/lguest/lguest_user.c | 7 ++++- drivers/lguest/page_tables.c | 10 ++++-- drivers/lguest/segments.c | 11 +++++++ drivers/lguest/switcher.S | 13 ++++---- 14 files changed, 218 insertions(+), 19 deletions(-) create mode 100644 Documentation/lguest/extract create mode 100644 drivers/lguest/README (limited to 'Documentation') diff --git a/Documentation/lguest/extract b/Documentation/lguest/extract new file mode 100644 index 000000000000..7730bb6e4b94 --- /dev/null +++ b/Documentation/lguest/extract @@ -0,0 +1,58 @@ +#! 
/bin/sh + +set -e + +PREFIX=$1 +shift + +trap 'rm -r $TMPDIR' 0 +TMPDIR=`mktemp -d` + +exec 3>/dev/null +for f; do + while IFS=" +" read -r LINE; do + case "$LINE" in + *$PREFIX:[0-9]*:\**) + NUM=`echo "$LINE" | sed "s/.*$PREFIX:\([0-9]*\).*/\1/"` + if [ -f $TMPDIR/$NUM ]; then + echo "$TMPDIR/$NUM already exits prior to $f" + exit 1 + fi + exec 3>>$TMPDIR/$NUM + echo $f | sed 's,\.\./,,g' > $TMPDIR/.$NUM + /bin/echo "$LINE" | sed -e "s/$PREFIX:[0-9]*//" -e "s/:\*/*/" >&3 + ;; + *$PREFIX:[0-9]*) + NUM=`echo "$LINE" | sed "s/.*$PREFIX:\([0-9]*\).*/\1/"` + if [ -f $TMPDIR/$NUM ]; then + echo "$TMPDIR/$NUM already exits prior to $f" + exit 1 + fi + exec 3>>$TMPDIR/$NUM + echo $f | sed 's,\.\./,,g' > $TMPDIR/.$NUM + /bin/echo "$LINE" | sed "s/$PREFIX:[0-9]*//" >&3 + ;; + *:\**) + /bin/echo "$LINE" | sed -e "s/:\*/*/" -e "s,/\*\*/,," >&3 + echo >&3 + exec 3>/dev/null + ;; + *) + /bin/echo "$LINE" >&3 + ;; + esac + done < $f + echo >&3 + exec 3>/dev/null +done + +LASTFILE="" +for f in $TMPDIR/*; do + if [ "$LASTFILE" != $(cat $TMPDIR/.$(basename $f) ) ]; then + LASTFILE=$(cat $TMPDIR/.$(basename $f) ) + echo "[ $LASTFILE ]" + fi + cat $f +done + diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index 62a8133393e1..fc1bf70abfb1 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -1,5 +1,10 @@ -/* Simple program to layout "physical" memory for new lguest guest. - * Linked high to avoid likely physical memory. */ +/*P:100 This is the Launcher code, a simple program which lays out the + * "physical" memory for the new Guest by mapping the kernel image and the + * virtual devices, then reads repeatedly from /dev/lguest to run the Guest. + * + * The only trick: the Makefile links it at a high address so it will be clear + * of the guest memory region. It means that each Guest cannot have more than + * about 2.5G of memory on a normally configured Host. 
:*/ #define _LARGEFILE64_SOURCE #define _GNU_SOURCE #include diff --git a/drivers/lguest/Makefile b/drivers/lguest/Makefile index 55382c7d799c..e5047471c334 100644 --- a/drivers/lguest/Makefile +++ b/drivers/lguest/Makefile @@ -5,3 +5,15 @@ obj-$(CONFIG_LGUEST_GUEST) += lguest.o lguest_asm.o lguest_bus.o obj-$(CONFIG_LGUEST) += lg.o lg-y := core.o hypercalls.o page_tables.o interrupts_and_traps.o \ segments.o io.o lguest_user.o switcher.o + +Preparation Preparation!: PREFIX=P +Guest: PREFIX=G +Drivers: PREFIX=D +Launcher: PREFIX=L +Host: PREFIX=H +Switcher: PREFIX=S +Mastery: PREFIX=M +Beer: + @for f in Preparation Guest Drivers Launcher Host Switcher Mastery; do echo "{==- $$f -==}"; make -s $$f; done; echo "{==-==}" +Preparation Preparation! Guest Drivers Launcher Host Switcher Mastery: + @sh ../../Documentation/lguest/extract $(PREFIX) `find ../../* -name '*.[chS]' -wholename '*lguest*'` diff --git a/drivers/lguest/README b/drivers/lguest/README new file mode 100644 index 000000000000..b7db39a64c66 --- /dev/null +++ b/drivers/lguest/README @@ -0,0 +1,47 @@ +Welcome, friend reader, to lguest. + +Lguest is an adventure, with you, the reader, as Hero. I can't think of many +5000-line projects which offer both such capability and glimpses of future +potential; it is an exciting time to be delving into the source! + +But be warned; this is an arduous journey of several hours or more! And as we +know, all true Heroes are driven by a Noble Goal. Thus I offer a Beer (or +equivalent) to anyone I meet who has completed this documentation. + +So get comfortable and keep your wits about you (both quick and humorous). +Along your way to the Noble Goal, you will also gain masterly insight into +lguest, and hypervisors and x86 virtualization in general. + +Our Quest is in seven parts: (best read with C highlighting turned on) + +I) Preparation + - In which our potential hero is flown quickly over the landscape for a + taste of its scope. 
Suitable for the armchair coders and other such + persons of faint constitution. + +II) Guest + - Where we encounter the first tantalising wisps of code, and come to + understand the details of the life of a Guest kernel. + +III) Drivers + - Whereby the Guest finds its voice and become useful, and our + understanding of the Guest is completed. + +IV) Launcher + - Where we trace back to the creation of the Guest, and thus begin our + understanding of the Host. + +V) Host + - Where we master the Host code, through a long and tortuous journey. + Indeed, it is here that our hero is tested in the Bit of Despair. + +VI) Switcher + - Where our understanding of the intertwined nature of Guests and Hosts + is completed. + +VII) Mastery + - Where our fully fledged hero grapples with the Great Question: + "What next?" + +make Preparation! +Rusty Russell. diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c index ce909ec57499..2cea0c80c992 100644 --- a/drivers/lguest/core.c +++ b/drivers/lguest/core.c @@ -1,5 +1,8 @@ -/* World's simplest hypervisor, to test paravirt_ops and show - * unbelievers that virtualization is the future. Plus, it's fun! */ +/*P:400 This contains run_guest() which actually calls into the Host<->Guest + * Switcher and analyzes the return, such as determining if the Guest wants the + * Host to do something. This file also contains useful helper routines, and a + * couple of non-obvious setup and teardown pieces which were implemented after + * days of debugging pain. :*/ #include #include #include diff --git a/drivers/lguest/hypercalls.c b/drivers/lguest/hypercalls.c index ea52ca451f74..fb546b046445 100644 --- a/drivers/lguest/hypercalls.c +++ b/drivers/lguest/hypercalls.c @@ -1,5 +1,10 @@ -/* Actual hypercalls, which allow guests to actually do something. 
- Copyright (C) 2006 Rusty Russell IBM Corporation +/*P:500 Just as userspace programs request kernel operations through a system + * call, the Guest requests Host operations through a "hypercall". You might + * notice this nomenclature doesn't really follow any logic, but the name has + * been around for long enough that we're stuck with it. As you'd expect, this + * code is basically one big switch statement. :*/ + +/* Copyright (C) 2006 Rusty Russell IBM Corporation This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by diff --git a/drivers/lguest/interrupts_and_traps.c b/drivers/lguest/interrupts_and_traps.c index bee029bb2c7b..b2647974e1a7 100644 --- a/drivers/lguest/interrupts_and_traps.c +++ b/drivers/lguest/interrupts_and_traps.c @@ -1,3 +1,16 @@ +/*P:800 Interrupts (traps) are complicated enough to earn their own file. + * There are three classes of interrupts: + * + * 1) Real hardware interrupts which occur while we're running the Guest, + * 2) Interrupts for virtual devices attached to the Guest, and + * 3) Traps and faults from the Guest. + * + * Real hardware interrupts must be delivered to the Host, not the Guest. + * Virtual interrupts must be delivered to the Guest, but we make them look + * just like real hardware would deliver them. Traps from the Guest can be set + * up to go directly back into the Guest, but sometimes the Host wants to see + * them first, so we also have a way of "reflecting" them into the Guest as if + * they had been delivered to it directly. :*/ #include #include "lg.h" diff --git a/drivers/lguest/io.c b/drivers/lguest/io.c index c8eb79266991..d2f02f0653ca 100644 --- a/drivers/lguest/io.c +++ b/drivers/lguest/io.c @@ -1,5 +1,9 @@ -/* Simple I/O model for guests, based on shared memory.
- * Copyright (C) 2006 Rusty Russell IBM Corporation +/*P:300 The I/O mechanism in lguest is simple yet flexible, allowing the Guest + * to talk to the Launcher or directly to another Guest. It uses familiar + * concepts of DMA and interrupts, plus some neat code stolen from + * futexes... :*/ + +/* Copyright (C) 2006 Rusty Russell IBM Corporation * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by diff --git a/drivers/lguest/lguest.c b/drivers/lguest/lguest.c index 18dade06d4a9..e7d128312b23 100644 --- a/drivers/lguest/lguest.c +++ b/drivers/lguest/lguest.c @@ -1,6 +1,32 @@ -/* - * Lguest specific paravirt-ops implementation +/*P:010 + * A hypervisor allows multiple Operating Systems to run on a single machine. + * To quote David Wheeler: "Any problem in computer science can be solved with + * another layer of indirection." + * + * We keep things simple in two ways. First, we start with a normal Linux + * kernel and insert a module (lg.ko) which allows us to run other Linux + * kernels the same way we'd run processes. We call the first kernel the Host, + * and the others the Guests. The program which sets up and configures Guests + * (such as the example in Documentation/lguest/lguest.c) is called the + * Launcher. + * + * Secondly, we only run specially modified Guests, not normal kernels. When + * you set CONFIG_LGUEST to 'y' or 'm', this automatically sets + * CONFIG_LGUEST_GUEST=y, which compiles this file into the kernel so it knows + * how to be a Guest. This means that you can use the same kernel you boot + * normally (ie. as a Host) as a Guest. * + * These Guests know that they cannot do privileged operations, such as disable + * interrupts, and that they have to ask the Host to do such things explicitly. + * This file consists of all the replacements for such low-level native + * hardware operations: these special Guest versions call the Host. 
+ * + * So how does the kernel know it's a Guest? The Guest starts at a special + * entry point marked with a magic string, which sets up a few things then + * calls here. We replace the native functions in "struct paravirt_ops" + * with our Guest versions, then boot like normal. :*/ + +/* * Copyright (C) 2006, Rusty Russell IBM Corporation. * * This program is free software; you can redistribute it and/or modify diff --git a/drivers/lguest/lguest_bus.c b/drivers/lguest/lguest_bus.c index 18d6ab21a43b..9a22d199502e 100644 --- a/drivers/lguest/lguest_bus.c +++ b/drivers/lguest/lguest_bus.c @@ -1,3 +1,6 @@ +/*P:050 Lguest guests use a very simple bus for devices. It's a simple array + * of device descriptors contained just above the top of normal memory. The + * lguest bus is 80% tedious boilerplate code. :*/ #include #include #include diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c index e90d7a783daf..6ae86f20ce3d 100644 --- a/drivers/lguest/lguest_user.c +++ b/drivers/lguest/lguest_user.c @@ -1,4 +1,9 @@ -/* Userspace control of the guest, via /dev/lguest. */ +/*P:200 This contains all the /dev/lguest code, whereby the userspace launcher + * controls and communicates with the Guest. For example, the first write will + * tell us the memory size, pagetable, entry point and kernel address offset. + * A read will run the Guest until a signal is pending (-EINTR), or the Guest + * does a DMA out to the Launcher. Writes are also used to get a DMA buffer + * registered by the Guest and to send the Guest an interrupt. :*/ #include #include #include diff --git a/drivers/lguest/page_tables.c b/drivers/lguest/page_tables.c index 1b0ba09b1269..f9ca50d80466 100644 --- a/drivers/lguest/page_tables.c +++ b/drivers/lguest/page_tables.c @@ -1,5 +1,11 @@ -/* Shadow page table operations. - * Copyright (C) Rusty Russell IBM Corporation 2006. +/*P:700 The pagetable code, on the other hand, still shows the scars of + * previous encounters. 
It's functional, and as neat as it can be in the + * circumstances, but be wary, for these things are subtle and break easily. + * The Guest provides a virtual to physical mapping, but we can neither trust + * it nor use it: we verify and convert it here to point the hardware to the + * actual Guest pages when running the Guest. :*/ + +/* Copyright (C) Rusty Russell IBM Corporation 2006. * GPL v2 and any later version */ #include #include diff --git a/drivers/lguest/segments.c b/drivers/lguest/segments.c index 1b2cfe89dcd5..c4fc7293b84b 100644 --- a/drivers/lguest/segments.c +++ b/drivers/lguest/segments.c @@ -1,3 +1,14 @@ +/*P:600 The x86 architecture has segments, which involve a table of descriptors + * which can be used to do funky things with virtual address interpretation. + * We originally used to use segments so the Guest couldn't alter the + * Guest<->Host Switcher, and then we had to trim Guest segments, and restore + * for userspace per-thread segments, but trim again on userspace->kernel + * transitions... This nightmarish creation was contained within this file, + * where we knew not to tread without heavy armament and a change of underwear. + * + * In these modern times, the segment handling code consists of simple sanity + * checks, and the worst you'll experience reading this code is butterfly-rash + * from frolicking through its parklike serenity. :*/ #include "lg.h" static int desc_ok(const struct desc_struct *gdt) diff --git a/drivers/lguest/switcher.S b/drivers/lguest/switcher.S index eadd4cc299d2..e7cb8c123558 100644 --- a/drivers/lguest/switcher.S +++ b/drivers/lguest/switcher.S @@ -1,10 +1,11 @@ -/* This code sits at 0xFFC00000 to do the low-level guest<->host switch. +/*P:900 This is the Switcher: code which sits at 0xFFC00000 to do the low-level + * Guest<->Host switch. It is as simple as it can be made, but it's naturally + * very specific to x86. + * + * You have now completed Preparation.
If this has whetted your appetite; if you + * are feeling invigorated and refreshed then the next, more challenging stage + * can be found in "make Guest". :*/ - There is are two pages above us for this CPU (struct lguest_pages). - The second page (struct lguest_ro_state) becomes read-only after the - context switch. The first page (the stack for traps) remains writable, - but while we're in here, the guest cannot be running. -*/ #include #include #include "lg.h" -- cgit v1.2.3 From dde797899ac17ebb812b7566044124d785e98dc7 Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Thu, 26 Jul 2007 10:41:03 -0700 Subject: lguest: documentation IV: Launcher Documentation: The Launcher Signed-off-by: Rusty Russell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/lguest/lguest.c | 599 ++++++++++++++++++++++++++++++++++++++---- drivers/lguest/core.c | 24 +- drivers/lguest/io.c | 247 +++++++++++++++-- drivers/lguest/lg.h | 25 ++ drivers/lguest/lguest_user.c | 159 ++++++++++- 5 files changed, 982 insertions(+), 72 deletions(-) (limited to 'Documentation') diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index fc1bf70abfb1..d7e26f025959 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -34,12 +34,20 @@ #include #include #include +/*L:110 We can ignore the 28 include files we need for this program, but I do + * want to draw attention to the use of kernel-style types. + * + * As Linus said, "C is a Spartan language, and so should your naming be." I + * like these abbreviations and the header we need uses them, so we define them + * here.
+ */ typedef unsigned long long u64; typedef uint32_t u32; typedef uint16_t u16; typedef uint8_t u8; #include "../../include/linux/lguest_launcher.h" #include "../../include/asm-i386/e820.h" +/*:*/ #define PAGE_PRESENT 0x7 /* Present, RW, Execute */ #define NET_PEERNUM 1 @@ -48,33 +56,52 @@ typedef uint8_t u8; #define SIOCBRADDIF 0x89a2 /* add interface to bridge */ #endif +/*L:120 verbose is both a global flag and a macro. The C preprocessor allows + * this, and although I wouldn't recommend it, it works quite nicely here. */ static bool verbose; #define verbose(args...) \ do { if (verbose) printf(args); } while(0) +/*:*/ + +/* The pipe to send commands to the waker process */ static int waker_fd; +/* The top of guest physical memory. */ static u32 top; +/* This is our list of devices. */ struct device_list { + /* Summary information about the devices in our list: ready to pass to + * select() to ask which need servicing.*/ fd_set infds; int max_infd; + /* The descriptor page for the devices. */ struct lguest_device_desc *descs; + + /* A single linked list of devices. */ struct device *dev; + /* ... And an end pointer so we can easily append new devices */ struct device **lastdev; }; +/* The device structure describes a single device. */ struct device { + /* The linked-list pointer. */ struct device *next; + /* The descriptor for this device, as mapped into the Guest. */ struct lguest_device_desc *desc; + /* The memory page(s) of this device, if any. Also mapped in Guest. */ void *mem; - /* Watch this fd if handle_input non-NULL. */ + /* If handle_input is set, it wants to be called when this file + * descriptor is ready. */ int fd; bool (*handle_input)(int fd, struct device *me); - /* Watch DMA to this key if handle_input non-NULL. */ + /* If handle_output is set, it wants to be called when the Guest sends + * DMA to this key. 
*/ unsigned long watch_key; u32 (*handle_output)(int fd, const struct iovec *iov, unsigned int num, struct device *me); @@ -83,6 +110,11 @@ struct device void *priv; }; +/*L:130 + * Loading the Kernel. + * + * We start with a couple of simple helper routines. open_or_die() avoids + * error-checking code cluttering the callers: */ static int open_or_die(const char *name, int flags) { int fd = open(name, flags); @@ -91,26 +123,38 @@ static int open_or_die(const char *name, int flags) return fd; } +/* map_zeroed_pages() takes a (page-aligned) address and a number of pages. */ static void *map_zeroed_pages(unsigned long addr, unsigned int num) { + /* We cache the /dev/zero file-descriptor so we only open it once. */ static int fd = -1; if (fd == -1) fd = open_or_die("/dev/zero", O_RDONLY); + /* We use a private mapping (ie. if we write to the page, it will be + * copied), and obviously we insist that it be mapped where we ask. */ if (mmap((void *)addr, getpagesize() * num, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_FIXED|MAP_PRIVATE, fd, 0) != (void *)addr) err(1, "Mmaping %u pages of /dev/zero @%p", num, (void *)addr); + + /* Returning the address is just a courtesy: can simplify callers. */ return (void *)addr; } -/* Find magic string marking entry point, return entry point. */ +/* To find out where to start we look for the magic Guest string, which marks + * the code we see in lguest_asm.S. This is a hack which we are currently + * plotting to replace with the normal Linux entry point. */ static unsigned long entry_point(void *start, void *end, unsigned long page_offset) { void *p; + /* The scan gives us the physical starting address. We want the + * virtual address in this case, and fortunately, we already figured + * out the physical-virtual difference and passed it here in + * "page_offset".
*/ for (p = start; p < end; p++) if (memcmp(p, "GenuineLguest", strlen("GenuineLguest")) == 0) return (long)p + strlen("GenuineLguest") + page_offset; @@ -118,7 +162,17 @@ static unsigned long entry_point(void *start, void *end, err(1, "Is this image a genuine lguest?"); } -/* Returns the entry point */ +/* This routine takes an open vmlinux image, which is in ELF, and maps it into + * the Guest memory. ELF = Executable and Linkable Format, which is the format used + * by all modern binaries on Linux including the kernel. + * + * The ELF headers give *two* addresses: a physical address, and a virtual + * address. The Guest kernel expects to be placed in memory at the physical + * address, and the page tables set up so it will correspond to that virtual + * address. We return the difference between the virtual and physical + * addresses in the "page_offset" pointer. + * + * We return the starting address. */ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr, unsigned long *page_offset) { @@ -127,40 +181,61 @@ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr, unsigned int i; unsigned long start = -1UL, end = 0; - /* Sanity checks. */ + /* Sanity checks on the main ELF header: an x86 executable with a + * reasonable number of correctly-sized program headers. */ if (ehdr->e_type != ET_EXEC || ehdr->e_machine != EM_386 || ehdr->e_phentsize != sizeof(Elf32_Phdr) || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr)) errx(1, "Malformed elf header"); + /* An ELF executable contains an ELF header and a number of "program" + * headers which indicate which parts ("segments") of the program to + * load where. */ + + /* We read in all the program headers at once: */ if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0) err(1, "Seeking to program headers"); if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr)) err(1, "Reading program headers"); + /* We don't know page_offset yet.
*/ *page_offset = 0; - /* We map the loadable segments at virtual addresses corresponding - * to their physical addresses (our virtual == guest physical). */ + + /* Try all the headers: there are usually only three. A read-only one, + * a read-write one, and a "note" section which isn't loadable. */ for (i = 0; i < ehdr->e_phnum; i++) { + /* If this isn't a loadable segment, we ignore it */ if (phdr[i].p_type != PT_LOAD) continue; verbose("Section %i: size %i addr %p\n", i, phdr[i].p_memsz, (void *)phdr[i].p_paddr); - /* We expect linear address space. */ + /* We expect a simple linear address space: every segment must + * have the same difference between virtual (p_vaddr) and + * physical (p_paddr) address. */ if (!*page_offset) *page_offset = phdr[i].p_vaddr - phdr[i].p_paddr; else if (*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr) errx(1, "Page offset of section %i different", i); + /* We track the first and last address we mapped, so we can + * tell entry_point() where to scan. */ if (phdr[i].p_paddr < start) start = phdr[i].p_paddr; if (phdr[i].p_paddr + phdr[i].p_filesz > end) end = phdr[i].p_paddr + phdr[i].p_filesz; - /* We map everything private, writable. */ + /* We map this section of the file at its physical address. We + * map it read & write even if the header says this segment is + * read-only. The kernel really wants to be writable: it + * patches its own instructions which would normally be + * read-only. + * + * MAP_PRIVATE means that the page won't be copied until a + * write is done to it. This allows us to share much of the + * kernel memory between Guests. */ addr = mmap((void *)phdr[i].p_paddr, phdr[i].p_filesz, PROT_READ|PROT_WRITE|PROT_EXEC, @@ -174,7 +249,31 @@ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr, return entry_point((void *)start, (void *)end, *page_offset); } -/* This is amazingly reliable. */ +/*L:170 Prepare to be SHOCKED and AMAZED. And possibly a trifle nauseated. 
+ * + * We know that CONFIG_PAGE_OFFSET sets what virtual address the kernel expects + * to be. We don't know what that option was, but we can figure it out + * approximately by looking at the addresses in the code. I chose the common + * case of reading a memory location into the %eax register: + * + * movl <some-address>, %eax + * + * This gets encoded as five bytes: "0xA1 <4-byte-address>". For example, + * "0xA1 0x18 0x60 0x47 0xC0" reads the address 0xC0476018 into %eax. + * + * In this example we can guess that the kernel was compiled with + * CONFIG_PAGE_OFFSET set to 0xC0000000 (it's always a round number). If the + * kernel were larger than 16MB, we might see 0xC1 addresses show up, but our + * kernel isn't that bloated yet. + * + * Unfortunately, x86 has variable-length instructions, so finding this + * particular instruction properly involves writing a disassembler. Instead, + * we rely on statistics. We look for "0xA1" and tally the different bytes + * which occur 4 bytes later (the "0xC0" in our example above). When one of + * those bytes appears three times, we can be reasonably confident that it + * forms the start of CONFIG_PAGE_OFFSET. + * + * This is amazingly reliable. */ static unsigned long intuit_page_offset(unsigned char *img, unsigned long len) { unsigned int i, possibilities[256] = { 0 }; @@ -187,30 +286,52 @@ static unsigned long intuit_page_offset(unsigned char *img, unsigned long len) errx(1, "could not determine page offset"); } +/*L:160 Unfortunately the entire ELF image isn't compressed: the segments + * which need loading are extracted and compressed raw. This denies us the + * information we need to make a fully-general loader. */ static unsigned long unpack_bzimage(int fd, unsigned long *page_offset) { gzFile f; int ret, len = 0; + /* A bzImage always gets loaded at physical address 1M. This is + * actually configurable as CONFIG_PHYSICAL_START, but as the comment + * there says, "Don't change this unless you know what you are doing". + * Indeed.
*/ void *img = (void *)0x100000; + /* gzdopen takes our file descriptor (carefully placed at the start of + * the GZIP header we found) and returns a gzFile. */ f = gzdopen(fd, "rb"); + /* We read it into memory in 64k chunks until we hit the end. */ while ((ret = gzread(f, img + len, 65536)) > 0) len += ret; if (ret < 0) err(1, "reading image from bzImage"); verbose("Unpacked size %i addr %p\n", len, img); + + /* Without the ELF header, we can't tell the virtual-physical gap. This is + * CONFIG_PAGE_OFFSET, and people do actually change it. Fortunately, + * I have a clever way of figuring it out from the code itself. */ *page_offset = intuit_page_offset(img, len); return entry_point(img, img + len, *page_offset); } +/*L:150 A bzImage, unlike an ELF file, is not meant to be loaded. You're + * supposed to jump into it and it will unpack itself. We can't do that + * because the Guest can't run the unpacking code, and adding features to + * lguest kills puppies, so we don't want to. + * + * The bzImage is formed by putting the decompressing code in front of the + * compressed kernel code. So we can simply scan through it looking for the + * first "gzip" header, and start decompressing from there. */ static unsigned long load_bzimage(int fd, unsigned long *page_offset) { unsigned char c; int state = 0; - /* Ugly brute force search for gzip header. */ + /* GZIP header is 0x1F 0x8B ... . */ while (read(fd, &c, 1) == 1) { switch (state) { case 0: @@ -227,8 +348,10 @@ static unsigned long load_bzimage(int fd, unsigned long *page_offset) state++; break; case 9: + /* Seek back to the start of the gzip header. */ lseek(fd, -10, SEEK_CUR); - if (c != 0x03) /* Compressed under UNIX. */ + /* One final check: "compressed under UNIX".
*/ + if (c != 0x03) state = -1; else return unpack_bzimage(fd, page_offset); @@ -237,25 +360,43 @@ static unsigned long load_bzimage(int fd, unsigned long *page_offset) errx(1, "Could not find kernel in bzImage"); } +/*L:140 Loading the kernel is easy when it's a "vmlinux", but most kernels + * come wrapped up in the self-decompressing "bzImage" format. With some funky + * coding, we can load those, too. */ static unsigned long load_kernel(int fd, unsigned long *page_offset) { Elf32_Ehdr hdr; + /* Read in the first few bytes. */ if (read(fd, &hdr, sizeof(hdr)) != sizeof(hdr)) err(1, "Reading kernel"); + /* If it's an ELF file, it starts with "\177ELF" */ if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) == 0) return map_elf(fd, &hdr, page_offset); + /* Otherwise we assume it's a bzImage, and try to unpack it */ return load_bzimage(fd, page_offset); } +/* This is a trivial little helper to align pages. Andi Kleen hated it because + * it calls getpagesize() twice: "it's dumb code." + * + * Kernel guys get really het up about optimization, even when it's not + * necessary. I leave this code as a reaction against that. */ static inline unsigned long page_align(unsigned long addr) { + /* Add upwards and truncate downwards. */ return ((addr + getpagesize()-1) & ~(getpagesize()-1)); } -/* initrd gets loaded at top of memory: return length. */ +/*L:180 An "initial ram disk" is a disk image loaded into memory along with + * the kernel which the kernel can use to boot from without needing any + * drivers. Most distributions now use this as standard: the initrd contains + * the code to load the appropriate driver modules for the current machine. + * + * Importantly, James Morris works for RedHat, and Fedora uses initrds for its + * kernels. He sent me this (and tells me when I break it). 
*/ static unsigned long load_initrd(const char *name, unsigned long mem) { int ifd; @@ -264,21 +405,35 @@ static unsigned long load_initrd(const char *name, unsigned long mem) void *iaddr; ifd = open_or_die(name, O_RDONLY); + /* fstat() is needed to get the file size. */ if (fstat(ifd, &st) < 0) err(1, "fstat() on initrd '%s'", name); + /* The length needs to be rounded up to a page size: mmap needs the + * address to be page aligned. */ len = page_align(st.st_size); + /* We map the initrd at the top of memory. */ iaddr = mmap((void *)mem - len, st.st_size, PROT_READ|PROT_EXEC|PROT_WRITE, MAP_FIXED|MAP_PRIVATE, ifd, 0); if (iaddr != (void *)mem - len) err(1, "Mmaping initrd '%s' returned %p not %p", name, iaddr, (void *)mem - len); + /* Once a file is mapped, you can close the file descriptor. It's a + * little odd, but quite useful. */ close(ifd); verbose("mapped initrd %s size=%lu @ %p\n", name, st.st_size, iaddr); + + /* We return the initrd size. */ return len; } +/* Once we know how much memory we have, and the address the Guest kernel + * expects, we can construct simple linear page tables which will get the Guest + * far enough into the boot to create its own. + * + * We lay them out of the way, just below the initrd (which is why we need to + * know its size). */ static unsigned long setup_pagetables(unsigned long mem, unsigned long initrd_size, unsigned long page_offset) @@ -287,23 +442,32 @@ static unsigned long setup_pagetables(unsigned long mem, unsigned int mapped_pages, i, linear_pages; unsigned int ptes_per_page = getpagesize()/sizeof(u32); - /* If we can map all of memory above page_offset, we do so. */ + /* Ideally we map all physical memory starting at page_offset. + * However, if page_offset is 0xC0000000 we can only map 1G of physical + * (0xC0000000 + 1G overflows). */ if (mem <= -page_offset) mapped_pages = mem/getpagesize(); else mapped_pages = -page_offset/getpagesize(); - /* Each linear PTE page can map ptes_per_page pages. 
*/ + /* Each PTE page can map ptes_per_page pages: how many do we need? */ linear_pages = (mapped_pages + ptes_per_page-1)/ptes_per_page; - /* We lay out top-level then linear mapping immediately below initrd */ + /* We put the toplevel page directory page at the top of memory. */ pgdir = (void *)mem - initrd_size - getpagesize(); + + /* Now we use the next linear_pages pages as pte pages */ linear = (void *)pgdir - linear_pages*getpagesize(); + /* Linear mapping is easy: put every page's address into the mapping in + * order. PAGE_PRESENT contains the flags Present, Writable and + * Executable. */ for (i = 0; i < mapped_pages; i++) linear[i] = ((i * getpagesize()) | PAGE_PRESENT); - /* Now set up pgd so that this memory is at page_offset */ + /* The top level points to the linear page table pages above. The + * entry representing page_offset points to the first one, and they + * continue from there. */ for (i = 0; i < mapped_pages; i += ptes_per_page) { pgdir[(i + page_offset/getpagesize())/ptes_per_page] = (((u32)linear + i*sizeof(u32)) | PAGE_PRESENT); @@ -312,9 +476,13 @@ static unsigned long setup_pagetables(unsigned long mem, verbose("Linear mapping of %u pages in %u pte pages at %p\n", mapped_pages, linear_pages, linear); + /* We return the top level (guest-physical) address: the kernel needs + * to know where it is. */ return (unsigned long)pgdir; } +/* Simple routine to roll all the commandline arguments together with spaces + * between them. */ static void concat(char *dst, char *args[]) { unsigned int i, len = 0; @@ -328,6 +496,10 @@ static void concat(char *dst, char *args[]) dst[len] = '\0'; } +/* This is where we actually tell the kernel to initialize the Guest. We saw + * the arguments it expects when we looked at initialize() in lguest_user.c: + * the top physical page to allow, the top level pagetable, the entry point and + * the page_offset constant for the Guest. 
*/ static int tell_kernel(u32 pgdir, u32 start, u32 page_offset) { u32 args[] = { LHREQ_INITIALIZE, @@ -337,8 +509,11 @@ static int tell_kernel(u32 pgdir, u32 start, u32 page_offset) fd = open_or_die("/dev/lguest", O_RDWR); if (write(fd, args, sizeof(args)) < 0) err(1, "Writing to /dev/lguest"); + + /* We return the /dev/lguest file descriptor to control this Guest */ return fd; } +/*:*/ static void set_fd(int fd, struct device_list *devices) { @@ -347,61 +522,108 @@ static void set_fd(int fd, struct device_list *devices) devices->max_infd = fd; } -/* When input arrives, we tell the kernel to kick lguest out with -EAGAIN. */ +/*L:200 + * The Waker. + * + * With a console and network devices, we can have lots of input which we need + * to process. We could try to tell the kernel what file descriptors to watch, + * but handing a file descriptor mask through to the kernel is fairly icky. + * + * Instead, we fork off a process which watches the file descriptors and writes + * the LHREQ_BREAK command to the /dev/lguest filedescriptor to tell the Host + * loop to stop running the Guest. This causes it to return from the + * /dev/lguest read with -EAGAIN, where it will write to /dev/lguest to reset + * the LHREQ_BREAK and wake us up again. + * + * This, of course, is merely a different *kind* of icky. + */ static void wake_parent(int pipefd, int lguest_fd, struct device_list *devices) { + /* Add the pipe from the Launcher to the fdset in the device_list, so + * we watch it, too. */ set_fd(pipefd, devices); for (;;) { fd_set rfds = devices->infds; u32 args[] = { LHREQ_BREAK, 1 }; + /* Wait until input is ready from one of the devices. */ select(devices->max_infd+1, &rfds, NULL, NULL, NULL); + /* Is it a message from the Launcher? */ if (FD_ISSET(pipefd, &rfds)) { int ignorefd; + /* If read() returns 0, it means the Launcher has + * exited. We silently follow. 
*/ if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0) exit(0); + /* Otherwise it's telling us there's a problem with one + * of the devices, and we should ignore that file + * descriptor from now on. */ FD_CLR(ignorefd, &devices->infds); - } else + } else /* Send LHREQ_BREAK command. */ write(lguest_fd, args, sizeof(args)); } } +/* This routine just sets up a pipe to the Waker process. */ static int setup_waker(int lguest_fd, struct device_list *device_list) { int pipefd[2], child; + /* We create a pipe to talk to the waker, and also so it knows when the + * Launcher dies (and closes pipe). */ pipe(pipefd); child = fork(); if (child == -1) err(1, "forking"); if (child == 0) { + /* Close the "writing" end of our copy of the pipe */ close(pipefd[1]); wake_parent(pipefd[0], lguest_fd, device_list); } + /* Close the reading end of our copy of the pipe. */ close(pipefd[0]); + /* Here is the fd used to talk to the waker. */ return pipefd[1]; } +/*L:210 + * Device Handling. + * + * When the Guest sends DMA to us, it sends us an array of addresses and sizes. + * We need to make sure it's not trying to reach into the Launcher itself, so + * we have a convenient routine which checks it and exits with an error message + * if something funny is going on: + */ static void *_check_pointer(unsigned long addr, unsigned int size, unsigned int line) { + /* We have to separately check addr and addr+size, because size could + * be huge and addr + size might wrap around. */ if (addr >= top || addr + size >= top) errx(1, "%s:%i: Invalid address %li", __FILE__, line, addr); + /* We return a pointer for the caller's convenience, now we know it's + * safe to use. */ return (void *)addr; } +/* A macro which transparently hands the line number to the real function. */ #define check_pointer(addr,size) _check_pointer(addr, size, __LINE__) -/* Returns pointer to dma->used_len */ +/* The Guest has given us the address of a "struct lguest_dma".
We check it's + * OK and convert it to an iovec (which is a simple array of ptr/size + * pairs). */ static u32 *dma2iov(unsigned long dma, struct iovec iov[], unsigned *num) { unsigned int i; struct lguest_dma *udma; + /* First we make sure that the array memory itself is valid. */ udma = check_pointer(dma, sizeof(*udma)); + /* Now we check each element */ for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) { + /* A zero length ends the array. */ if (!udma->len[i]) break; @@ -409,9 +631,15 @@ static u32 *dma2iov(unsigned long dma, struct iovec iov[], unsigned *num) iov[i].iov_len = udma->len[i]; } *num = i; + + /* We return the pointer to where the caller should write the amount of + * the buffer used. */ return &udma->used_len; } +/* This routine gets a DMA buffer from the Guest for a given key, and converts + * it to an iovec array. It returns the interrupt the Guest wants when we're + * finished, and a pointer to the "used_len" field to fill in. */ static u32 *get_dma_buffer(int fd, void *key, struct iovec iov[], unsigned int *num, u32 *irq) { @@ -419,16 +647,21 @@ static u32 *get_dma_buffer(int fd, void *key, unsigned long udma; u32 *res; + /* Ask the kernel for a DMA buffer corresponding to this key. */ udma = write(fd, buf, sizeof(buf)); + /* They haven't registered any, or they're all used? */ if (udma == (unsigned long)-1) return NULL; - /* Kernel stashes irq in ->used_len. */ + /* Convert it into our iovec array */ res = dma2iov(udma, iov, num); + /* The kernel stashes irq in ->used_len to get it out to us. */ *irq = *res; + /* Return a pointer to ((struct lguest_dma *)udma)->used_len. */ return res; } +/* This is a convenient routine to send the Guest an interrupt. */ static void trigger_irq(int fd, u32 irq) { u32 buf[] = { LHREQ_IRQ, irq }; @@ -436,6 +669,10 @@ static void trigger_irq(int fd, u32 irq) err(1, "Triggering irq %i", irq); } +/* This simply sets up an iovec array where we can put data to be discarded. 
+ * This happens when the Guest doesn't want or can't handle the input: we have + * to get rid of it somewhere, and if we bury it in the ceiling space it will + * start to smell after a week. */ static void discard_iovec(struct iovec *iov, unsigned int *num) { static char discard_buf[1024]; @@ -444,19 +681,24 @@ static void discard_iovec(struct iovec *iov, unsigned int *num) iov->iov_len = sizeof(discard_buf); } +/* Here are the input terminal settings we save, and the routine to restore them + * on exit so the user can see what they type next. */ static struct termios orig_term; static void restore_term(void) { tcsetattr(STDIN_FILENO, TCSANOW, &orig_term); } +/* We associate some data with the console for our exit hack. */ struct console_abort { + /* How many times have they hit ^C? */ int count; + /* When did they start? */ struct timeval start; }; -/* We DMA input to buffer bound at start of console page. */ +/* This is the routine which handles console input (ie. stdin). */ static bool handle_console_input(int fd, struct device *dev) { u32 irq = 0, *lenp; @@ -465,24 +707,38 @@ static bool handle_console_input(int fd, struct device *dev) struct iovec iov[LGUEST_MAX_DMA_SECTIONS]; struct console_abort *abort = dev->priv; + /* First we get the console buffer from the Guest. The key is dev->mem + * which was set to 0 in setup_console(). */ lenp = get_dma_buffer(fd, dev->mem, iov, &num, &irq); if (!lenp) { + /* If it's not ready for input, warn and set up to discard. */ warn("console: no dma buffer!"); discard_iovec(iov, &num); } + /* This is why we convert to iovecs: the readv() call uses them, and so + * it reads straight into the Guest's buffer. */ len = readv(dev->fd, iov, num); if (len <= 0) { + /* This implies that the console is closed, is /dev/null, or + * something went terribly wrong. We still go through the rest + * of the logic, though, especially the exit handling below. 
*/ warnx("Failed to get console input, ignoring console."); len = 0; } + /* If we read the data into the Guest, fill in the length and send the + * interrupt. */ if (lenp) { *lenp = len; trigger_irq(fd, irq); } - /* Three ^C within one second? Exit. */ + /* Three ^C within one second? Exit. + * + * This is such a hack, but works surprisingly well. Each ^C has to be + * in a buffer by itself, so they can't be too fast. But we check that + * we get three within about a second, so they can't be too slow. */ if (len == 1 && ((char *)iov[0].iov_base)[0] == 3) { if (!abort->count++) gettimeofday(&abort->start, NULL); @@ -490,43 +746,60 @@ static bool handle_console_input(int fd, struct device *dev) struct timeval now; gettimeofday(&now, NULL); if (now.tv_sec <= abort->start.tv_sec+1) { - /* Make sure waker is not blocked in BREAK */ u32 args[] = { LHREQ_BREAK, 0 }; + /* Close the fd so Waker will know it has to + * exit. */ close(waker_fd); + /* Just in case waker is blocked in BREAK, send + * unbreak now. */ write(fd, args, sizeof(args)); exit(2); } abort->count = 0; } } else + /* Any other key resets the abort counter. */ abort->count = 0; + /* Now, if we didn't read anything, put the input terminal back and + * return failure (meaning, don't call us again). */ if (!len) { restore_term(); return false; } + /* Everything went OK! */ return true; } +/* Handling console output is much simpler than input. */ static u32 handle_console_output(int fd, const struct iovec *iov, unsigned num, struct device*dev) { + /* Whatever the Guest sends, write it to standard output. Return the + * number of bytes written. */ return writev(STDOUT_FILENO, iov, num); } +/* Guest->Host network output is also pretty easy. */ static u32 handle_tun_output(int fd, const struct iovec *iov, unsigned num, struct device *dev) { - /* Now we've seen output, we should warn if we can't get buffers. */ + /* We put a flag in the "priv" pointer of the network device, and set + * it as soon as we see output. 
We'll see why in handle_tun_input() */ *(bool *)dev->priv = true; + /* Whatever packet the Guest sent us, write it out to the tun + * device. */ return writev(dev->fd, iov, num); } +/* This matches the peer_key() in lguest_net.c. The key for any given slot + * is the address of the network device's page plus 4 * the slot number. */ static unsigned long peer_offset(unsigned int peernum) { return 4 * peernum; } +/* This is where we handle a packet coming in from the tun device */ static bool handle_tun_input(int fd, struct device *dev) { u32 irq = 0, *lenp; @@ -534,17 +807,28 @@ static bool handle_tun_input(int fd, struct device *dev) unsigned num; struct iovec iov[LGUEST_MAX_DMA_SECTIONS]; + /* First we get a buffer the Guest has bound to its key. */ lenp = get_dma_buffer(fd, dev->mem+peer_offset(NET_PEERNUM), iov, &num, &irq); if (!lenp) { + /* Now, it's expected that if we try to send a packet too + * early, the Guest won't be ready yet. This is why we set a + * flag when the Guest sends its first packet. If it's sent a + * packet we assume it should be ready to receive them. + * + * Actually, this is what the status bits in the descriptor are + * for: we should *use* them. FIXME! */ if (*(bool *)dev->priv) warn("network: no dma buffer!"); discard_iovec(iov, &num); } + /* Read the packet from the device directly into the Guest's buffer. */ len = readv(dev->fd, iov, num); if (len <= 0) err(1, "reading network"); + + /* Write the used_len, and trigger the interrupt for the Guest */ if (lenp) { *lenp = len; trigger_irq(fd, irq); @@ -552,9 +836,13 @@ static bool handle_tun_input(int fd, struct device *dev) verbose("tun input packet len %i [%02x %02x] (%s)\n", len, ((u8 *)iov[0].iov_base)[0], ((u8 *)iov[0].iov_base)[1], lenp ? "sent" : "discarded"); + /* All good. */ return true; } +/* The last device handling routine is block output: the Guest has sent a DMA + * to the block device. It will have placed the command it wants in the + * "struct lguest_block_page". 
*/ static u32 handle_block_output(int fd, const struct iovec *iov, unsigned num, struct device *dev) { @@ -564,36 +852,64 @@ static u32 handle_block_output(int fd, const struct iovec *iov, struct iovec reply[LGUEST_MAX_DMA_SECTIONS]; off64_t device_len, off = (off64_t)p->sector * 512; + /* First we extract the device length from the dev->priv pointer. */ device_len = *(off64_t *)dev->priv; + /* We first check that the read or write is within the length of the + * block file. */ if (off >= device_len) err(1, "Bad offset %llu vs %llu", off, device_len); + /* Move to the right location in the block file. This shouldn't fail, + * but best to check. */ if (lseek64(dev->fd, off, SEEK_SET) != off) err(1, "Bad seek to sector %i", p->sector); verbose("Block: %s at offset %llu\n", p->type ? "WRITE" : "READ", off); + /* They were supposed to bind a reply buffer at key equal to the start + * of the block device memory. We need this to tell them when the + * request is finished. */ lenp = get_dma_buffer(fd, dev->mem, reply, &reply_num, &irq); if (!lenp) err(1, "Block request didn't give us a dma buffer"); if (p->type) { + /* A write request. The DMA they sent contained the data, so + * write it out. */ len = writev(dev->fd, iov, num); + /* Grr... Now we know how long the "struct lguest_dma" they + * sent was, we make sure they didn't try to write over the end + * of the block file (possibly extending it). */ if (off + len > device_len) { + /* Trim it back to the correct length */ ftruncate(dev->fd, device_len); + /* Die, bad Guest, die. */ errx(1, "Write past end %llu+%u", off, len); } + /* The reply length is 0: we just send back an empty DMA to + * interrupt them and tell them the write is finished. */ *lenp = 0; } else { + /* A read request. They sent an empty DMA to start the + * request, and we put the read contents into the reply + * buffer. 
*/ len = readv(dev->fd, reply, reply_num); *lenp = len; } + /* The result is 1 (done), 2 if there was an error (short read or + * write). */ p->result = 1 + (p->bytes != len); + /* Now tell them we've used their reply buffer. */ trigger_irq(fd, irq); + + /* We're supposed to return the number of bytes of the output buffer we + * used. But the block device uses the "result" field instead, so we + * don't bother. */ return 0; } +/* This is the generic routine we call when the Guest sends some DMA out. */ static void handle_output(int fd, unsigned long dma, unsigned long key, struct device_list *devices) { @@ -602,30 +918,53 @@ static void handle_output(int fd, unsigned long dma, unsigned long key, struct iovec iov[LGUEST_MAX_DMA_SECTIONS]; unsigned num = 0; + /* Convert the "struct lguest_dma" they're sending to a "struct + * iovec". */ lenp = dma2iov(dma, iov, &num); + + /* Check each device: if they expect output to this key, tell them to + * handle it. */ for (i = devices->dev; i; i = i->next) { if (i->handle_output && key == i->watch_key) { + /* We write the result straight into the used_len field + * for them. */ *lenp = i->handle_output(fd, iov, num, i); return; } } + + /* This can happen: the kernel sends any SEND_DMA which doesn't match + * another Guest to us. It could be that another Guest just left a + * network, for example. But it's unusual. */ warnx("Pending dma %p, key %p", (void *)dma, (void *)key); } +/* This is called when the waker wakes us up: check for incoming file + * descriptors. */ static void handle_input(int fd, struct device_list *devices) { + /* select() wants a zeroed timeval to mean "don't wait". */ struct timeval poll = { .tv_sec = 0, .tv_usec = 0 }; for (;;) { struct device *i; fd_set fds = devices->infds; + /* If nothing is ready, we're done. */ if (select(devices->max_infd+1, &fds, NULL, NULL, &poll) == 0) break; + /* Otherwise, call the device(s) which have readable + * file descriptors and a method of handling them. 
*/ for (i = devices->dev; i; i = i->next) { if (i->handle_input && FD_ISSET(i->fd, &fds)) { + /* If handle_input() returns false, it means we + * should no longer service it. + * handle_console_input() does this. */ if (!i->handle_input(fd, i)) { + /* Clear it from the set of input file + * descriptors kept at the head of the + * device list. */ FD_CLR(i->fd, &devices->infds); /* Tell waker to ignore it too... */ write(waker_fd, &i->fd, sizeof(i->fd)); @@ -635,6 +974,15 @@ static void handle_input(int fd, struct device_list *devices) } } +/*L:190 + * Device Setup + * + * All devices need a descriptor so the Guest knows it exists, and a "struct + * device" so the Launcher can keep track of it. We have common helper + * routines to allocate them. + * + * This routine allocates a new "struct lguest_device_desc" from the descriptor + * table in the devices array just above the Guest's normal memory. */ static struct lguest_device_desc * new_dev_desc(struct lguest_device_desc *descs, u16 type, u16 features, u16 num_pages) @@ -646,6 +994,8 @@ new_dev_desc(struct lguest_device_desc *descs, descs[i].type = type; descs[i].features = features; descs[i].num_pages = num_pages; + /* If they said the device needs memory, we allocate + * that now, bumping up the top of Guest memory. */ if (num_pages) { map_zeroed_pages(top, num_pages); descs[i].pfn = top/getpagesize(); @@ -657,6 +1007,9 @@ new_dev_desc(struct lguest_device_desc *descs, errx(1, "too many devices"); } +/* This monster routine does all the creation and setup of a new device, + * including calling new_dev_desc() to allocate the descriptor and device + * memory. */ static struct device *new_device(struct device_list *devices, u16 type, u16 num_pages, u16 features, int fd, @@ -669,12 +1022,18 @@ static struct device *new_device(struct device_list *devices, { struct device *dev = malloc(sizeof(*dev)); - /* Append to device list. */ + /* Append to device list. 
Prepending to a single-linked list is + * easier, but the user expects the devices to be arranged on the bus + * in command-line order. The first network device on the command line + * is eth0, the first block device /dev/lgba, etc. */ *devices->lastdev = dev; dev->next = NULL; devices->lastdev = &dev->next; + /* Now we populate the fields one at a time. */ dev->fd = fd; + /* If we have an input handler for this file descriptor, then we add it + * to the device_list's fdset and maxfd. */ if (handle_input) set_fd(dev->fd, devices); dev->desc = new_dev_desc(devices->descs, type, features, num_pages); @@ -685,27 +1044,37 @@ static struct device *new_device(struct device_list *devices, return dev; } +/* Our first setup routine is the console. It's a fairly simple device, but + * UNIX tty handling makes it uglier than it could be. */ static void setup_console(struct device_list *devices) { struct device *dev; + /* If we can save the initial standard input settings... */ if (tcgetattr(STDIN_FILENO, &orig_term) == 0) { struct termios term = orig_term; + /* Then we turn off echo, line buffering and ^C etc. We want a + * raw input stream to the Guest. */ term.c_lflag &= ~(ISIG|ICANON|ECHO); tcsetattr(STDIN_FILENO, TCSANOW, &term); + /* If we exit gracefully, the original settings will be + * restored so the user can see what they're typing. */ atexit(restore_term); } - /* We don't currently require a page for the console. */ + /* We don't currently require any memory for the console, so we ask for + * 0 pages. */ dev = new_device(devices, LGUEST_DEVICE_T_CONSOLE, 0, 0, STDIN_FILENO, handle_console_input, LGUEST_CONSOLE_DMA_KEY, handle_console_output); + /* We store the console state in dev->priv, and initialize it. */ dev->priv = malloc(sizeof(struct console_abort)); ((struct console_abort *)dev->priv)->count = 0; verbose("device %p: console\n", (void *)(dev->desc->pfn * getpagesize())); } +/* Setting up a block file is also fairly straightforward. 
*/ static void setup_block_file(const char *filename, struct device_list *devices) { int fd; @@ -713,20 +1082,47 @@ static void setup_block_file(const char *filename, struct device_list *devices) off64_t *device_len; struct lguest_block_page *p; + /* We open with O_LARGEFILE because otherwise we get stuck at 2G. We + * open with O_DIRECT because otherwise our benchmarks go much too + * fast. */ fd = open_or_die(filename, O_RDWR|O_LARGEFILE|O_DIRECT); + + /* We want one page, and have no input handler (the block file never + * has anything interesting to say to us). Our timing will be quite + * random, so it should be a reasonable randomness source. */ dev = new_device(devices, LGUEST_DEVICE_T_BLOCK, 1, LGUEST_DEVICE_F_RANDOMNESS, fd, NULL, 0, handle_block_output); + + /* We store the device size in the private area */ device_len = dev->priv = malloc(sizeof(*device_len)); + /* This is the safe way of establishing the size of our device: it + * might be a normal file or an actual block device like /dev/hdb. */ *device_len = lseek64(fd, 0, SEEK_END); - p = dev->mem; + /* The device memory is a "struct lguest_block_page". It's zeroed + * already, we just need to put in the device size. Block devices + * think in sectors (ie. 512 byte chunks), so we translate here. */ + p = dev->mem; p->num_sectors = *device_len/512; verbose("device %p: block %i sectors\n", (void *)(dev->desc->pfn * getpagesize()), p->num_sectors); } -/* We use fnctl locks to reserve network slots (autocleanup!) */ +/* + * Network Devices. + * + * Setting up network devices is quite a pain, because we have three types. + * First, we have the inter-Guest network. This is a file which is mapped into + * the address space of the Guests who are on the network. Because it is a + * shared mapping, the same page underlies all the devices, and they can send + * DMA to each other. + * + * Remember from our network driver, the Guest is told what slot in the page it + * is to use. 
We use exclusive fcntl locks to reserve a slot. If another + * Guest is using a slot, the lock will fail and we try another. Because fcntl + * locks are cleaned up automatically when we die, this cleverly means that our + * reservation on the slot will vanish if we crash. */ static unsigned int find_slot(int netfd, const char *filename) { struct flock fl; @@ -734,26 +1130,33 @@ static unsigned int find_slot(int netfd, const char *filename) fl.l_type = F_WRLCK; fl.l_whence = SEEK_SET; fl.l_len = 1; + /* Try a 1 byte lock in each possible position number */ for (fl.l_start = 0; fl.l_start < getpagesize()/sizeof(struct lguest_net); fl.l_start++) { + /* If we succeed, return the slot number. */ if (fcntl(netfd, F_SETLK, &fl) == 0) return fl.l_start; } errx(1, "No free slots in network file %s", filename); } +/* This function sets up the network file */ static void setup_net_file(const char *filename, struct device_list *devices) { int netfd; struct device *dev; + /* We don't use open_or_die() here: for friendliness we create the file + * if it doesn't already exist. */ netfd = open(filename, O_RDWR, 0); if (netfd < 0) { if (errno == ENOENT) { netfd = open(filename, O_RDWR|O_CREAT, 0600); if (netfd >= 0) { + /* If we succeeded, initialize the file with a + * blank page. */ char page[getpagesize()]; memset(page, 0, sizeof(page)); write(netfd, page, sizeof(page)); @@ -763,11 +1166,15 @@ static void setup_net_file(const char *filename, err(1, "cannot open net file '%s'", filename); } + /* We need 1 page, and the features indicate the slot to use and that + * no checksum is needed. We never touch this device again; it's + * between the Guests on the network, so we don't register input or + * output handlers. */ dev = new_device(devices, LGUEST_DEVICE_T_NET, 1, find_slot(netfd, filename)|LGUEST_NET_F_NOCSUM, -1, NULL, 0, NULL); - /* We overwrite the /dev/zero mapping with the actual file. */ + /* Map the shared file. 
*/ if (mmap(dev->mem, getpagesize(), PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED, netfd, 0) != dev->mem) err(1, "could not mmap '%s'", filename); @@ -775,6 +1182,7 @@ static void setup_net_file(const char *filename, (void *)(dev->desc->pfn * getpagesize()), filename, dev->desc->features & ~LGUEST_NET_F_NOCSUM); } +/*:*/ static u32 str2ip(const char *ipaddr) { @@ -784,7 +1192,11 @@ static u32 str2ip(const char *ipaddr) return (byte[0] << 24) | (byte[1] << 16) | (byte[2] << 8) | byte[3]; } -/* adapted from libbridge */ +/* This code is "adapted" from libbridge: it attaches the Host end of the + * network device to the bridge device specified by the command line. + * + * This is yet another James Morris contribution (I'm an IP-level guy, so I + * dislike bridging), and I just try not to break it. */ static void add_to_bridge(int fd, const char *if_name, const char *br_name) { int ifidx; @@ -803,12 +1215,16 @@ static void add_to_bridge(int fd, const char *if_name, const char *br_name) err(1, "can't add %s to bridge %s", if_name, br_name); } +/* This sets up the Host end of the network device with an IP address, brings + * it up so packets will flow, then copies the MAC address into the hwaddr + * pointer (in practice, the Host's slot in the network device's memory). */ static void configure_device(int fd, const char *devname, u32 ipaddr, unsigned char hwaddr[6]) { struct ifreq ifr; struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr; + /* Don't read these incantations. Just cut & paste them like I did! */ memset(&ifr, 0, sizeof(ifr)); strcpy(ifr.ifr_name, devname); sin->sin_family = AF_INET; @@ -819,12 +1235,19 @@ static void configure_device(int fd, const char *devname, u32 ipaddr, if (ioctl(fd, SIOCSIFFLAGS, &ifr) != 0) err(1, "Bringing interface %s up", devname); + /* SIOC stands for Socket I/O Control. G means Get (vs S for Set + * above). IF means Interface, and HWADDR is hardware address. + * Simple! 
*/ if (ioctl(fd, SIOCGIFHWADDR, &ifr) != 0) err(1, "getting hw address for %s", devname); - memcpy(hwaddr, ifr.ifr_hwaddr.sa_data, 6); } +/*L:195 The other kind of network is a Host<->Guest network. This can either + * use bridging or routing, but the principle is the same: it uses the "tun" + * device to inject packets into the Host as if they came in from a normal + * network card. We just shunt packets between the Guest and the tun + * device. */ static void setup_tun_net(const char *arg, struct device_list *devices) { struct device *dev; @@ -833,36 +1256,56 @@ static void setup_tun_net(const char *arg, struct device_list *devices) u32 ip; const char *br_name = NULL; + /* We open the /dev/net/tun device and tell it we want a tap device. A + * tap device is like a tun device, only somehow different. To tell + * the truth, I completely blundered my way through this code, but it + * works now! */ netfd = open_or_die("/dev/net/tun", O_RDWR); memset(&ifr, 0, sizeof(ifr)); ifr.ifr_flags = IFF_TAP | IFF_NO_PI; strcpy(ifr.ifr_name, "tap%d"); if (ioctl(netfd, TUNSETIFF, &ifr) != 0) err(1, "configuring /dev/net/tun"); + /* We don't need checksums calculated for packets coming in this + * device: trust us! */ ioctl(netfd, TUNSETNOCSUM, 1); - /* You will be peer 1: we should create enough jitter to randomize */ + /* We create the net device with 1 page, using the features field of + * the descriptor to tell the Guest it is in slot 1 (NET_PEERNUM), and + * that the device has fairly random timing. We do *not* specify + * LGUEST_NET_F_NOCSUM: these packets can reach the real world. + * + * We will put our MAC address in slot 0 for the Guest to see, so + * it will send packets to us using the key "peer_offset(0)": */ dev = new_device(devices, LGUEST_DEVICE_T_NET, 1, NET_PEERNUM|LGUEST_DEVICE_F_RANDOMNESS, netfd, handle_tun_input, peer_offset(0), handle_tun_output); + + /* We keep a flag which says whether we've seen packets come out from + * this network device. 
*/ dev->priv = malloc(sizeof(bool)); *(bool *)dev->priv = false; + /* We need a socket to perform the magic network ioctls to bring up the + * tap interface, connect to the bridge etc. Any socket will do! */ ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP); if (ipfd < 0) err(1, "opening IP socket"); + /* If the command line was --tunnet=bridge: do bridging. */ if (!strncmp(BRIDGE_PFX, arg, strlen(BRIDGE_PFX))) { ip = INADDR_ANY; br_name = arg + strlen(BRIDGE_PFX); add_to_bridge(ipfd, ifr.ifr_name, br_name); - } else + } else /* It is an IP address to set up the device with */ ip = str2ip(arg); - /* We are peer 0, ie. first slot. */ + /* We are peer 0, ie. first slot, so we hand dev->mem to this routine + * to write the MAC address at the start of the device memory. */ configure_device(ipfd, ifr.ifr_name, ip, dev->mem); - /* Set "promisc" bit: we want every single packet. */ + /* Set "promisc" bit: we want every single packet if we're going to + * bridge to other machines (and otherwise it doesn't matter). */ *((u8 *)dev->mem) |= 0x1; close(ipfd); @@ -873,7 +1316,10 @@ static void setup_tun_net(const char *arg, struct device_list *devices) if (br_name) verbose("attached to bridge: %s\n", br_name); } +/* That's the end of device setup. */ +/*L:220 Finally we reach the core of the Launcher, which runs the Guest, serves + * its input and output, and finally, lays it to rest. */ static void __attribute__((noreturn)) run_guest(int lguest_fd, struct device_list *device_list) { @@ -885,20 +1331,37 @@ run_guest(int lguest_fd, struct device_list *device_list) /* We read from the /dev/lguest device to run the Guest. */ readval = read(lguest_fd, arr, sizeof(arr)); + /* The read can only really return sizeof(arr) (the Guest did a + * SEND_DMA to us), or an error. */ + + /* For a successful read, arr[0] is the address of the "struct + * lguest_dma", and arr[1] is the key the Guest sent to. 
*/ if (readval == sizeof(arr)) { handle_output(lguest_fd, arr[0], arr[1], device_list); continue; + /* ENOENT means the Guest died. Reading tells us why. */ } else if (errno == ENOENT) { char reason[1024] = { 0 }; read(lguest_fd, reason, sizeof(reason)-1); errx(1, "%s", reason); + /* EAGAIN means the waker wanted us to look at some input. + * Anything else means a bug or incompatible change. */ } else if (errno != EAGAIN) err(1, "Running guest failed"); + + /* Service input, then unset the BREAK which releases + * the Waker. */ handle_input(lguest_fd, device_list); if (write(lguest_fd, args, sizeof(args)) < 0) err(1, "Resetting break"); } } +/* + * This is the end of the Launcher. + * + * But wait! We've seen I/O from the Launcher, and we've seen I/O from the + * Drivers. If we were to see the Host kernel I/O code, our understanding + * would be complete... :*/ static struct option opts[] = { { "verbose", 0, NULL, 'v' }, @@ -916,20 +1379,49 @@ static void usage(void) " vmlinux [args...]"); } +/*L:100 The Launcher code itself takes us out into userspace, that scary place + * where pointers run wild and free! Unfortunately, like most userspace + * programs, it's quite boring (which is why everyone likes to hack on the + * kernel!). Perhaps if you make up an Lguest Drinking Game at this point, it + * will get you through this section. Or, maybe not. + * + * The Launcher binary sits up high, usually starting at address 0xB8000000. + * Everything below this is the "physical" memory for the Guest. For example, + * if the Guest were to write a "1" at physical address 0, we would see a "1" + * in the Launcher at "(int *)0". Guest physical == Launcher virtual. + * + * This can be tough to get your head around, but usually it just means that we + * don't need to do any conversion when the Guest gives us its "physical" + * addresses. + */ int main(int argc, char *argv[]) { + /* Memory, top-level pagetable, code startpoint, PAGE_OFFSET and size + * of the (optional) initrd. 
*/ unsigned long mem = 0, pgdir, start, page_offset, initrd_size = 0; + /* A temporary and the /dev/lguest file descriptor. */ int i, c, lguest_fd; + /* The list of Guest devices, based on command line arguments. */ struct device_list device_list; + /* The boot information for the Guest: at guest-physical address 0. */ void *boot = (void *)0; + /* If they specify an initrd file to load. */ const char *initrd_name = NULL; + /* First we initialize the device list. Since console and network + * devices receive input from a file descriptor, we keep an fdset + * (infds) and the maximum fd number (max_infd) with the head of the + * list. We also keep a pointer to the last device, for easy appending + * to the list. */ device_list.max_infd = -1; device_list.dev = NULL; device_list.lastdev = &device_list.dev; FD_ZERO(&device_list.infds); - /* We need to know how much memory so we can allocate devices. */ + /* We need to know how much memory so we can set up the device + * descriptor and memory pages for the devices as we parse the command + * line. So we quickly look through the arguments to find the amount + * of memory now. */ for (i = 1; i < argc; i++) { if (argv[i][0] != '-') { mem = top = atoi(argv[i]) * 1024 * 1024; @@ -938,6 +1430,8 @@ int main(int argc, char *argv[]) break; } } + + /* The options are fairly straightforward */ while ((c = getopt_long(argc, argv, "v", opts, NULL)) != EOF) { switch (c) { case 'v': @@ -960,42 +1454,59 @@ int main(int argc, char *argv[]) usage(); } } + /* After the other arguments we expect memory and kernel image name, + * followed by command line arguments for the kernel. */ if (optind + 2 > argc) usage(); - /* We need a console device */ + /* We always have a console device */ setup_console(&device_list); - /* First we map /dev/zero over all of guest-physical memory. */ + /* We start by mapping anonymous pages over the whole guest-physical + * memory range. 
This fills it with 0, and ensures that the Guest + * won't be killed when it tries to access it. */ map_zeroed_pages(0, mem / getpagesize()); /* Now we load the kernel */ start = load_kernel(open_or_die(argv[optind+1], O_RDONLY), &page_offset); - /* Map the initrd image if requested */ + /* Map the initrd image if requested (at top of physical memory) */ if (initrd_name) { initrd_size = load_initrd(initrd_name, mem); + /* These are the locations in the Linux boot header where the + * start and size of the initrd are expected to be found. */ *(unsigned long *)(boot+0x218) = mem - initrd_size; *(unsigned long *)(boot+0x21c) = initrd_size; + /* The bootloader type 0xFF means "unknown"; that's OK. */ *(unsigned char *)(boot+0x210) = 0xFF; } - /* Set up the initial linar pagetables. */ + /* Set up the initial linear pagetables, starting below the initrd. */ pgdir = setup_pagetables(mem, initrd_size, page_offset); - /* E820 memory map: ours is a simple, single region. */ + /* The Linux boot header contains an "E820" memory map: ours is a + * simple, single region. */ *(char*)(boot+E820NR) = 1; *((struct e820entry *)(boot+E820MAP)) = ((struct e820entry) { 0, mem, E820_RAM }); - /* Command line pointer and command line (at 4096) */ + /* The boot header contains a command line pointer: we put the command + * line after the boot header (at address 4096) */ *(void **)(boot + 0x228) = boot + 4096; concat(boot + 4096, argv+optind+2); - /* Paravirt type: 1 == lguest */ + + /* The guest type value of "1" tells the Guest it's under lguest. */ *(int *)(boot + 0x23c) = 1; + /* We tell the kernel to initialize the Guest: this returns the open + * /dev/lguest file descriptor. */ lguest_fd = tell_kernel(pgdir, start, page_offset); + + /* We fork off a child process, which wakes the Launcher whenever one + * of the input file descriptors needs attention. Otherwise we would + * run the Guest until it tries to output something. 
*/ waker_fd = setup_waker(lguest_fd, &device_list); + /* Finally, run the Guest. This doesn't return. */ run_guest(lguest_fd, &device_list); } diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c index 2cea0c80c992..1eb05f9a56b6 100644 --- a/drivers/lguest/core.c +++ b/drivers/lguest/core.c @@ -208,24 +208,39 @@ static int emulate_insn(struct lguest *lg) return 1; } +/*L:305 + * Dealing With Guest Memory. + * + * When the Guest gives us (what it thinks is) a physical address, we can use + * the normal copy_from_user() & copy_to_user() on that address: remember, + * Guest physical == Launcher virtual. + * + * But we can't trust the Guest: it might be trying to access the Launcher + * code. We have to check that the range is below the pfn_limit the Launcher + * gave us. We have to make sure that addr + len doesn't give us a false + * positive by overflowing, too. */ int lguest_address_ok(const struct lguest *lg, unsigned long addr, unsigned long len) { return (addr+len) / PAGE_SIZE < lg->pfn_limit && (addr+len >= addr); } -/* Just like get_user, but don't let guest access lguest binary. */ +/* This is a convenient routine to get a 32-bit value from the Guest (a very + * common operation). Here we can see how useful the kill_guest() routine we + * met in the Launcher can be: we return a random value (0) instead of needing + * to return an error. */ u32 lgread_u32(struct lguest *lg, unsigned long addr) { u32 val = 0; - /* Don't let them access lguest binary */ + /* Don't let them access lguest binary. */ if (!lguest_address_ok(lg, addr, sizeof(val)) || get_user(val, (u32 __user *)addr) != 0) kill_guest(lg, "bad read address %#lx", addr); return val; } +/* Same thing for writing a value. 
*/ void lgwrite_u32(struct lguest *lg, unsigned long addr, u32 val) { if (!lguest_address_ok(lg, addr, sizeof(val)) @@ -233,6 +248,9 @@ void lgwrite_u32(struct lguest *lg, unsigned long addr, u32 val) kill_guest(lg, "bad write address %#lx", addr); } +/* This routine is more generic, and copies a range of Guest bytes into a + * buffer. If the copy_from_user() fails, we fill the buffer with zeroes, so + * the caller doesn't end up using uninitialized kernel memory. */ void lgread(struct lguest *lg, void *b, unsigned long addr, unsigned bytes) { if (!lguest_address_ok(lg, addr, bytes) @@ -243,6 +261,7 @@ void lgread(struct lguest *lg, void *b, unsigned long addr, unsigned bytes) } } +/* Similarly, our generic routine to copy into a range of Guest bytes. */ void lgwrite(struct lguest *lg, unsigned long addr, const void *b, unsigned bytes) { @@ -250,6 +269,7 @@ void lgwrite(struct lguest *lg, unsigned long addr, const void *b, || copy_to_user((void __user *)addr, b, bytes) != 0) kill_guest(lg, "bad write address %#lx len %u", addr, bytes); } +/* (end of memory access helper routines) :*/ static void set_ts(void) { diff --git a/drivers/lguest/io.c b/drivers/lguest/io.c index d2f02f0653ca..da288128e44f 100644 --- a/drivers/lguest/io.c +++ b/drivers/lguest/io.c @@ -27,8 +27,36 @@ #include #include "lg.h" +/*L:300 + * I/O + * + * Getting data in and out of the Guest is quite an art. There are numerous + * ways to do it, and they all suck differently. We try to keep things fairly + * close to "real" hardware so our Guest's drivers don't look like an alien + * visitation in the middle of the Linux code, and yet make sure that Guests + * can talk directly to other Guests, not just the Launcher. + * + * To do this, the Guest gives us a key when it binds or sends DMA buffers. + * The key corresponds to a "physical" address inside the Guest (ie. a virtual + * address inside the Launcher process). We don't, however, use this key + * directly. 
+ * + * We want Guests which share memory to be able to DMA to each other: two + * Launchers can mmap memory the same file, then the Guests can communicate. + * Fortunately, the futex code provides us with a way to get a "union + * futex_key" corresponding to the memory lying at a virtual address: if the + * two processes share memory, the "union futex_key" for that memory will match + * even if the memory is mapped at different addresses in each. So we always + * convert the keys to "union futex_key"s to compare them. + * + * Before we dive into this though, we need to look at another set of helper + * routines used throughout the Host kernel code to access Guest memory. + :*/ static struct list_head dma_hash[61]; +/* An unfortunate side effect of the Linux double-linked list implementation is + * that there's no good way to statically initialize an array of linked + * lists. */ void lguest_io_init(void) { unsigned int i; @@ -60,6 +88,19 @@ kill: return 0; } +/*L:330 This is our hash function, using the wonderful Jenkins hash. + * + * The futex key is a union with three parts: an unsigned long word, a pointer, + * and an int "offset". We could use jhash_2words() which takes three u32s. + * (Ok, the hash functions are great: the naming sucks though). + * + * It's nice to be portable to 64-bit platforms, so we use the more generic + * jhash2(), which takes an array of u32, the number of u32s, and an initial + * u32 to roll in. This is uglier, but breaks down to almost the same code on + * 32-bit platforms like this one. + * + * We want a position in the array, so we modulo ARRAY_SIZE(dma_hash) (ie. 61). + */ static unsigned int hash(const union futex_key *key) { return jhash2((u32*)&key->both.word, @@ -68,6 +109,9 @@ static unsigned int hash(const union futex_key *key) % ARRAY_SIZE(dma_hash); } +/* This is a convenience routine to compare two keys. 
It's a much bemoaned C + * weakness that it doesn't allow '==' on structures or unions, so we have to + * open-code it like this. */ static inline int key_eq(const union futex_key *a, const union futex_key *b) { return (a->both.word == b->both.word @@ -75,22 +119,36 @@ static inline int key_eq(const union futex_key *a, const union futex_key *b) && a->both.offset == b->both.offset); } -/* Must hold read lock on dmainfo owner's current->mm->mmap_sem */ +/*L:360 OK, when we need to actually free up a Guest's DMA array we do several + * things, so we have a convenient function to do it. + * + * The caller must hold a read lock on dmainfo owner's current->mm->mmap_sem + * for the drop_futex_key_refs(). */ static void unlink_dma(struct lguest_dma_info *dmainfo) { + /* You locked this too, right? */ BUG_ON(!mutex_is_locked(&lguest_lock)); + /* This is how we know that the entry is free. */ dmainfo->interrupt = 0; + /* Remove it from the hash table. */ list_del(&dmainfo->list); + /* Drop the references we were holding (to the inode or mm). */ drop_futex_key_refs(&dmainfo->key); } +/*L:350 This is the routine which we call when the Guest asks to unregister a + * DMA array attached to a given key. Returns true if the array was found. */ static int unbind_dma(struct lguest *lg, const union futex_key *key, unsigned long dmas) { int i, ret = 0; + /* We don't bother with the hash table, just look through all this + * Guest's DMA arrays. */ for (i = 0; i < LGUEST_MAX_DMA; i++) { + /* In theory it could have more than one array on the same key, + * or one array on multiple keys, so we check both */ if (key_eq(key, &lg->dma[i].key) && dmas == lg->dma[i].dmas) { unlink_dma(&lg->dma[i]); ret = 1; @@ -100,51 +158,91 @@ static int unbind_dma(struct lguest *lg, return ret; } +/*L:340 BIND_DMA: this is the hypercall which sets up an array of "struct + * lguest_dma" for receiving I/O. 
+ * + * The Guest wants to bind an array of "struct lguest_dma"s to a particular key + * to receive input. This only happens when the Guest is setting up a new + * device, so it doesn't have to be very fast. + * + * It returns 1 on a successful registration (it can fail if we hit the limit + * of registrations for this Guest). + */ int bind_dma(struct lguest *lg, unsigned long ukey, unsigned long dmas, u16 numdmas, u8 interrupt) { unsigned int i; int ret = 0; union futex_key key; + /* Futex code needs the mmap_sem. */ struct rw_semaphore *fshared = &current->mm->mmap_sem; + /* Invalid interrupt? (We could kill the guest here). */ if (interrupt >= LGUEST_IRQS) return 0; + /* We need to grab the Big Lguest Lock, because other Guests may be + * trying to look through this Guest's DMAs to send something while + * we're doing this. */ mutex_lock(&lguest_lock); down_read(fshared); if (get_futex_key((u32 __user *)ukey, fshared, &key) != 0) { kill_guest(lg, "bad dma key %#lx", ukey); goto unlock; } + + /* We want to keep this key valid once we drop mmap_sem, so we have to + * hold a reference. */ get_futex_key_refs(&key); + /* If the Guest specified an interrupt of 0, that means they want to + * unregister this array of "struct lguest_dma"s. */ if (interrupt == 0) ret = unbind_dma(lg, &key, dmas); else { + /* Look through this Guest's dma array for an unused entry. */ for (i = 0; i < LGUEST_MAX_DMA; i++) { + /* If the interrupt is non-zero, the entry is already + * used. */ if (lg->dma[i].interrupt) continue; + /* OK, a free one! Fill on our details. */ lg->dma[i].dmas = dmas; lg->dma[i].num_dmas = numdmas; lg->dma[i].next_dma = 0; lg->dma[i].key = key; lg->dma[i].guestid = lg->guestid; lg->dma[i].interrupt = interrupt; + + /* Now we add it to the hash table: the position + * depends on the futex key that we got. */ list_add(&lg->dma[i].list, &dma_hash[hash(&key)]); + /* Success!
*/ ret = 1; goto unlock; } } + /* If we didn't find a slot to put the key in, drop the reference + * again. */ drop_futex_key_refs(&key); unlock: + /* Unlock and out. */ up_read(fshared); mutex_unlock(&lguest_lock); return ret; } -/* lgread from another guest */ +/*L:385 Note that our routines to access a different Guest's memory are called + * lgread_other() and lgwrite_other(): these names emphasize that they are only + * used when the Guest is *not* the current Guest. + * + * The interface for copying from another process's memory is called + * access_process_vm(), with a final argument of 0 for a read, and 1 for a + * write. + * + * We need lgread_other() to read the destination Guest's "struct lguest_dma" + * array. */ static int lgread_other(struct lguest *lg, void *buf, u32 addr, unsigned bytes) { @@ -157,7 +255,8 @@ static int lgread_other(struct lguest *lg, return 1; } -/* lgwrite to another guest */ +/* "lgwrite()" to another Guest: used to update the destination "used_len" once + * we've transferred data into the buffer. */ static int lgwrite_other(struct lguest *lg, u32 addr, const void *buf, unsigned bytes) { @@ -170,6 +269,15 @@ static int lgwrite_other(struct lguest *lg, u32 addr, return 1; } +/*L:400 This is the generic engine which copies from a source "struct + * lguest_dma" from this Guest into another Guest's "struct lguest_dma". The + * destination Guest's pages have already been mapped, as contained in the + * pages array. + * + * If you're wondering if there's a nice "copy from one process to another" + * routine, so was I. But Linux isn't really set up to copy between two + * unrelated processes, so we have to write it ourselves. + */ static u32 copy_data(struct lguest *srclg, const struct lguest_dma *src, const struct lguest_dma *dst, @@ -178,33 +286,59 @@ static u32 copy_data(struct lguest *srclg, unsigned int totlen, si, di, srcoff, dstoff; void *maddr = NULL; + /* We return the total length transferred. 
*/ totlen = 0; + + /* We keep indexes into the source and destination "struct lguest_dma", + * and an offset within each region. */ si = di = 0; srcoff = dstoff = 0; + + /* We loop until the source or destination is exhausted. */ while (si < LGUEST_MAX_DMA_SECTIONS && src->len[si] && di < LGUEST_MAX_DMA_SECTIONS && dst->len[di]) { + /* We can only transfer the rest of the src buffer, or as much + * as will fit into the destination buffer. */ u32 len = min(src->len[si] - srcoff, dst->len[di] - dstoff); + /* For systems using "highmem" we need to use kmap() to access + * the page we want. We often use the same page over and over, + * so rather than kmap() it on every loop, we set the maddr + * pointer to NULL when we need to move to the next + * destination page. */ if (!maddr) maddr = kmap(pages[di]); - /* FIXME: This is not completely portable, since - archs do different things for copy_to_user_page. */ + /* Copy directly from (this Guest's) source address to the + * destination Guest's kmap()ed buffer. Note that maddr points + * to the start of the page: we need to add the offset of the + * destination address and offset within the buffer. */ + + /* FIXME: This is not completely portable. I looked at + * copy_to_user_page(), and some arch's seem to need special + * flushes. x86 is fine. */ if (copy_from_user(maddr + (dst->addr[di] + dstoff)%PAGE_SIZE, (void __user *)src->addr[si], len) != 0) { + /* If a copy failed, it's the source's fault. */ kill_guest(srclg, "bad address in sending DMA"); totlen = 0; break; } + /* Increment the total and src & dst offsets */ totlen += len; srcoff += len; dstoff += len; + + /* Presumably we reached the end of the src or dest buffers: */ if (srcoff == src->len[si]) { + /* Move to the next buffer at offset 0 */ si++; srcoff = 0; } if (dstoff == dst->len[di]) { + /* We need to unmap that destination page and reset + * maddr ready for the next one. 
*/ kunmap(pages[di]); maddr = NULL; di++; @@ -212,13 +346,15 @@ static u32 copy_data(struct lguest *srclg, } } + /* If we still had a page mapped at the end, unmap now. */ if (maddr) kunmap(pages[di]); return totlen; } -/* Src is us, ie. current. */ +/*L:390 This is how we transfer a "struct lguest_dma" from the source Guest + * (the current Guest which called SEND_DMA) to another Guest. */ static u32 do_dma(struct lguest *srclg, const struct lguest_dma *src, struct lguest *dstlg, const struct lguest_dma *dst) { @@ -226,23 +362,31 @@ static u32 do_dma(struct lguest *srclg, const struct lguest_dma *src, u32 ret; struct page *pages[LGUEST_MAX_DMA_SECTIONS]; + /* We check that both source and destination "struct lguest_dma"s are + * within the bounds of the source and destination Guests */ if (!check_dma_list(dstlg, dst) || !check_dma_list(srclg, src)) return 0; - /* First get the destination pages */ + /* We need to map the pages which correspond to each parts of + * destination buffer. */ for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) { if (dst->len[i] == 0) break; + /* get_user_pages() is a complicated function, especially since + * we only want a single page. But it works, and returns the + * number of pages. Note that we're holding the destination's + * mmap_sem, as get_user_pages() requires. */ if (get_user_pages(dstlg->tsk, dstlg->mm, dst->addr[i], 1, 1, 1, pages+i, NULL) != 1) { + /* This means the destination gave us a bogus buffer */ kill_guest(dstlg, "Error mapping DMA pages"); ret = 0; goto drop_pages; } } - /* Now copy until we run out of src or dst. */ + /* Now copy the data until we run out of src or dst. */ ret = copy_data(srclg, src, dst, pages); drop_pages: @@ -251,6 +395,11 @@ drop_pages: return ret; } +/*L:380 Transferring data from one Guest to another is not as simple as I'd + * like. We've found the "struct lguest_dma_info" bound to the same address as + * the send, we need to copy into it. 
+ * + * This function returns true if the destination array was empty. */ static int dma_transfer(struct lguest *srclg, unsigned long udma, struct lguest_dma_info *dst) @@ -259,15 +408,23 @@ static int dma_transfer(struct lguest *srclg, struct lguest *dstlg; u32 i, dma = 0; + /* From the "struct lguest_dma_info" we found in the hash, grab the + * Guest. */ dstlg = &lguests[dst->guestid]; - /* Get our dma list. */ + /* Read in the source "struct lguest_dma" handed to SEND_DMA. */ lgread(srclg, &src_dma, udma, sizeof(src_dma)); - /* We can't deadlock against them dmaing to us, because this - * is all under the lguest_lock. */ + /* We need the destination's mmap_sem, and we already hold the source's + * mmap_sem for the futex key lookup. Normally this would suggest that + * we could deadlock if the destination Guest was trying to send to + * this source Guest at the same time, which is another reason that all + * I/O is done under the big lguest_lock. */ down_read(&dstlg->mm->mmap_sem); + /* Look through the destination DMA array for an available buffer. */ for (i = 0; i < dst->num_dmas; i++) { + /* We keep a "next_dma" pointer which often helps us avoid + * looking at lots of previously-filled entries. */ dma = (dst->next_dma + i) % dst->num_dmas; if (!lgread_other(dstlg, &dst_dma, dst->dmas + dma * sizeof(struct lguest_dma), @@ -277,30 +434,46 @@ static int dma_transfer(struct lguest *srclg, if (!dst_dma.used_len) break; } + + /* If we found a buffer, we do the actual data copy. */ if (i != dst->num_dmas) { unsigned long used_lenp; unsigned int ret; ret = do_dma(srclg, &src_dma, dstlg, &dst_dma); - /* Put used length in src. */ + /* Put used length in the source "struct lguest_dma"'s used_len + * field. It's a little tricky to figure out where that is, + * though. */ lgwrite_u32(srclg, udma+offsetof(struct lguest_dma, used_len), ret); + /* Tranferring 0 bytes is OK if the source buffer was empty. 
*/ if (ret == 0 && src_dma.len[0] != 0) goto fail; - /* Make sure destination sees contents before length. */ + /* The destination Guest might be running on a different CPU: + * we have to make sure that it will see the "used_len" field + * change to non-zero *after* it sees the data we copied into + * the buffer. Hence a write memory barrier. */ wmb(); + /* Figuring out where the destination's used_len field for this + * "struct lguest_dma" in the array is also a little ugly. */ used_lenp = dst->dmas + dma * sizeof(struct lguest_dma) + offsetof(struct lguest_dma, used_len); lgwrite_other(dstlg, used_lenp, &ret, sizeof(ret)); + /* Move the cursor for next time. */ dst->next_dma++; } up_read(&dstlg->mm->mmap_sem); - /* Do this last so dst doesn't simply sleep on lock. */ + /* We trigger the destination interrupt, even if the destination was + * empty and we didn't transfer anything: this gives them a chance to + * wake up and refill. */ set_bit(dst->interrupt, dstlg->irqs_pending); + /* Wake up the destination process. */ wake_up_process(dstlg->tsk); + /* If we passed the last "struct lguest_dma", the receive had no + * buffers left. */ return i == dst->num_dmas; fail: @@ -308,6 +481,8 @@ fail: return 0; } +/*L:370 This is the counter-side to the BIND_DMA hypercall; the SEND_DMA + * hypercall. We find out who's listening, and send to them. */ void send_dma(struct lguest *lg, unsigned long ukey, unsigned long udma) { union futex_key key; @@ -317,31 +492,43 @@ void send_dma(struct lguest *lg, unsigned long ukey, unsigned long udma) again: mutex_lock(&lguest_lock); down_read(fshared); + /* Get the futex key for the key the Guest gave us */ if (get_futex_key((u32 __user *)ukey, fshared, &key) != 0) { kill_guest(lg, "bad sending DMA key"); goto unlock; } - /* Shared mapping? Look for other guests... 
*/ + /* Since the key must be a multiple of 4, the futex key uses the lower + * bit of the "offset" field (which would always be 0) to indicate a + * mapping which is shared with other processes (ie. Guests). */ if (key.shared.offset & 1) { struct lguest_dma_info *i; + /* Look through the hash for other Guests. */ list_for_each_entry(i, &dma_hash[hash(&key)], list) { + /* Don't send to ourselves. */ if (i->guestid == lg->guestid) continue; if (!key_eq(&key, &i->key)) continue; + /* If dma_transfer() tells us the destination has no + * available buffers, we increment "empty". */ empty += dma_transfer(lg, udma, i); break; } + /* If the destination is empty, we release our locks and + * give the destination Guest a brief chance to restock. */ if (empty == 1) { /* Give any recipients one chance to restock. */ up_read(&current->mm->mmap_sem); mutex_unlock(&lguest_lock); + /* Next time, we won't try again. */ empty++; goto again; } } else { - /* Private mapping: tell our userspace. */ + /* Private mapping: Guest is sending to its Launcher. We set + * the "dma_is_pending" flag so that the main loop will exit + * and the Launcher's read() from /dev/lguest will return. */ lg->dma_is_pending = 1; lg->pending_dma = udma; lg->pending_key = ukey; @@ -350,6 +537,7 @@ unlock: up_read(fshared); mutex_unlock(&lguest_lock); } +/*:*/ void release_all_dma(struct lguest *lg) { @@ -365,7 +553,8 @@ void release_all_dma(struct lguest *lg) up_read(&lg->mm->mmap_sem); } -/* Userspace wants a dma buffer from this guest. */ +/*L:320 This routine looks for a DMA buffer registered by the Guest on the + * given key (using the BIND_DMA hypercall). */ unsigned long get_dma_buffer(struct lguest *lg, unsigned long ukey, unsigned long *interrupt) { @@ -374,15 +563,29 @@ unsigned long get_dma_buffer(struct lguest *lg, struct lguest_dma_info *i; struct rw_semaphore *fshared = &current->mm->mmap_sem; + /* Take the Big Lguest Lock to stop other Guests sending this Guest DMA + * at the same time.
*/ mutex_lock(&lguest_lock); + /* To match between Guests sharing the same underlying memory we steal + * code from the futex infrastructure. This requires that we hold the + * "mmap_sem" for our process (the Launcher), and pass it to the futex + * code. */ down_read(fshared); + + /* This can fail if it's not a valid address, or if the address is not + * divisible by 4 (the futex code needs that, we don't really). */ if (get_futex_key((u32 __user *)ukey, fshared, &key) != 0) { kill_guest(lg, "bad registered DMA buffer"); goto unlock; } + /* Search the hash table for matching entries (the Launcher can only + * send to its own Guest for the moment, so the entry must be for this + * Guest) */ list_for_each_entry(i, &dma_hash[hash(&key)], list) { if (key_eq(&key, &i->key) && i->guestid == lg->guestid) { unsigned int j; + /* Look through the registered DMA array for an + * available buffer. */ for (j = 0; j < i->num_dmas; j++) { struct lguest_dma dma; @@ -391,6 +594,8 @@ unsigned long get_dma_buffer(struct lguest *lg, if (dma.used_len == 0) break; } + /* Store the interrupt the Guest wants when the buffer + * is used. */ *interrupt = i->interrupt; break; } @@ -400,4 +605,12 @@ unlock: mutex_unlock(&lguest_lock); return ret; } +/*:*/ +/*L:410 This really has completed the Launcher. Not only have we now finished + * the longest chapter in our journey, but this also means we are over halfway + * through! + * + * Enough prevaricating around the bush: it is time for us to dive into the + * core of the Host, in "make Host". + */ diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h index 3e2ddfbc816e..3b9dc123a7df 100644 --- a/drivers/lguest/lg.h +++ b/drivers/lguest/lg.h @@ -244,6 +244,30 @@ unsigned long get_dma_buffer(struct lguest *lg, unsigned long key, /* hypercalls.c: */ void do_hypercalls(struct lguest *lg); +/*L:035 + * Let's step aside for the moment, to study one important routine that's used + * widely in the Host code. 
+ * + * There are many cases where the Guest does something invalid, like pass crap + * to a hypercall. Since only the Guest kernel can make hypercalls, it's quite + * acceptable to simply terminate the Guest and give the Launcher a nicely + * formatted reason. It's also simpler for the Guest itself, which doesn't + * need to check most hypercalls for "success"; if you're still running, it + * succeeded. + * + * Once this is called, the Guest will never run again, so most Host code can + * call this then continue as if nothing had happened. This means many + * functions don't have to explicitly return an error code, which keeps the + * code simple. + * + * It also means that this can be called more than once: only the first one is + * remembered. The only trick is that we still need to kill the Guest even if + * we can't allocate memory to store the reason. Linux has a neat way of + * packing error codes into invalid pointers, so we use that here. + * + * Like any macro which uses an "if", it is safely wrapped in a run-once "do { + * } while(0)". + */ #define kill_guest(lg, fmt...) \ do { \ if (!(lg)->dead) { \ @@ -252,6 +276,7 @@ do { \ (lg)->dead = ERR_PTR(-ENOMEM); \ } \ } while(0) +/* (End of aside) :*/ static inline unsigned long guest_pa(struct lguest *lg, unsigned long vaddr) { diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c index 6ae86f20ce3d..80d1b58c7698 100644 --- a/drivers/lguest/lguest_user.c +++ b/drivers/lguest/lguest_user.c @@ -9,33 +9,62 @@ #include #include "lg.h" +/*L:030 setup_regs() doesn't really belong in this file, but it gives us an + * early glimpse deeper into the Host so it's worth having here. + * + * Most of the Guest's registers are left alone: we used get_zeroed_page() to + * allocate the structure, so they will be 0. */ static void setup_regs(struct lguest_regs *regs, unsigned long start) { - /* Write out stack in format lguest expects, so we can switch to it. 
*/ + /* There are four "segment" registers which the Guest needs to boot: + * The "code segment" register (cs) refers to the kernel code segment + * __KERNEL_CS, and the "data", "extra" and "stack" segment registers + * refer to the kernel data segment __KERNEL_DS. + * + * The privilege level is packed into the lower bits. The Guest runs + * at privilege level 1 (GUEST_PL).*/ regs->ds = regs->es = regs->ss = __KERNEL_DS|GUEST_PL; regs->cs = __KERNEL_CS|GUEST_PL; - regs->eflags = 0x202; /* Interrupts enabled. */ + + /* The "eflags" register contains miscellaneous flags. Bit 1 (0x002) + * is supposed to always be "1". Bit 9 (0x200) controls whether + * interrupts are enabled. We always leave interrupts enabled while + * running the Guest. */ + regs->eflags = 0x202; + + /* The "Extended Instruction Pointer" register says where the Guest is + * running. */ regs->eip = start; - /* esi points to our boot information (physical address 0) */ + + /* %esi points to our boot information, at physical address 0, so don't + * touch it. */ } -/* + addr */ +/*L:310 To send DMA into the Guest, the Launcher needs to be able to ask for a + * DMA buffer. This is done by writing LHREQ_GETDMA and the key to + * /dev/lguest. */ static long user_get_dma(struct lguest *lg, const u32 __user *input) { unsigned long key, udma, irq; + /* Fetch the key they wrote to us. */ if (get_user(key, input) != 0) return -EFAULT; + /* Look for a free Guest DMA buffer bound to that key. */ udma = get_dma_buffer(lg, key, &irq); if (!udma) return -ENOENT; - /* We put irq number in udma->used_len. */ + /* We need to tell the Launcher what interrupt the Guest expects after + * the buffer is filled. We stash it in udma->used_len. */ lgwrite_u32(lg, udma + offsetof(struct lguest_dma, used_len), irq); + + /* The (guest-physical) address of the DMA buffer is returned from + * the write(). 
*/ return udma; } -/* To force the Guest to stop running and return to the Launcher, the +/*L:315 To force the Guest to stop running and return to the Launcher, the * Waker sets writes LHREQ_BREAK and the value "1" to /dev/lguest. The * Launcher then writes LHREQ_BREAK and "0" to release the Waker. */ static int break_guest_out(struct lguest *lg, const u32 __user *input) @@ -59,7 +88,8 @@ static int break_guest_out(struct lguest *lg, const u32 __user *input) } } -/* + irq */ +/*L:050 Sending an interrupt is done by writing LHREQ_IRQ and an interrupt + * number to /dev/lguest. */ static int user_send_irq(struct lguest *lg, const u32 __user *input) { u32 irq; @@ -68,14 +98,19 @@ static int user_send_irq(struct lguest *lg, const u32 __user *input) return -EFAULT; if (irq >= LGUEST_IRQS) return -EINVAL; + /* Next time the Guest runs, the core code will see if it can deliver + * this interrupt. */ set_bit(irq, lg->irqs_pending); return 0; } +/*L:040 Once our Guest is initialized, the Launcher makes it run by reading + * from /dev/lguest. */ static ssize_t read(struct file *file, char __user *user, size_t size,loff_t*o) { struct lguest *lg = file->private_data; + /* You must write LHREQ_INITIALIZE first! */ if (!lg) return -EINVAL; @@ -83,27 +118,52 @@ static ssize_t read(struct file *file, char __user *user, size_t size,loff_t*o) if (current != lg->tsk) return -EPERM; + /* If the guest is already dead, we indicate why */ if (lg->dead) { size_t len; + /* lg->dead either contains an error code, or a string. */ if (IS_ERR(lg->dead)) return PTR_ERR(lg->dead); + /* We can only return as much as the buffer they read with. */ len = min(size, strlen(lg->dead)+1); if (copy_to_user(user, lg->dead, len) != 0) return -EFAULT; return len; } + /* If we returned from read() last time because the Guest sent DMA, + * clear the flag. */ if (lg->dma_is_pending) lg->dma_is_pending = 0; + /* Run the Guest until something interesting happens. 
*/ return run_guest(lg, (unsigned long __user *)user); } -/* Take: pfnlimit, pgdir, start, pageoffset. */ +/*L:020 The initialization write supplies 4 32-bit values (in addition to the + * 32-bit LHREQ_INITIALIZE value). These are: + * + * pfnlimit: The highest (Guest-physical) page number the Guest should be + * allowed to access. The Launcher has to live in Guest memory, so it sets + * this to ensure the Guest can't reach it. + * + * pgdir: The (Guest-physical) address of the top of the initial Guest + * pagetables (which are set up by the Launcher). + * + * start: The first instruction to execute ("eip" in x86-speak). + * + * page_offset: The PAGE_OFFSET constant in the Guest kernel. We should + * probably wean the code off this, but it's a very useful constant! Any + * address above this is within the Guest kernel, and any kernel address can + * quickly converted from physical to virtual by adding PAGE_OFFSET. It's + * 0xC0000000 (3G) by default, but it's configurable at kernel build time. + */ static int initialize(struct file *file, const u32 __user *input) { + /* "struct lguest" contains everything we (the Host) know about a + * Guest. */ struct lguest *lg; int err, i; u32 args[4]; @@ -111,7 +171,7 @@ static int initialize(struct file *file, const u32 __user *input) /* We grab the Big Lguest lock, which protects the global array * "lguests" and multiple simultaneous initializations. */ mutex_lock(&lguest_lock); - + /* You can't initialize twice! Close the device and start again... */ if (file->private_data) { err = -EBUSY; goto unlock; @@ -122,37 +182,70 @@ static int initialize(struct file *file, const u32 __user *input) goto unlock; } + /* Find an unused guest. */ i = find_free_guest(); if (i < 0) { err = -ENOSPC; goto unlock; } + /* OK, we have an index into the "lguest" array: "lg" is a convenient + * pointer. 
*/ lg = &lguests[i]; + + /* Populate the easy fields of our "struct lguest" */ lg->guestid = i; lg->pfn_limit = args[0]; lg->page_offset = args[3]; + + /* We need a complete page for the Guest registers: they are accessible + * to the Guest and we can only grant it access to whole pages. */ lg->regs_page = get_zeroed_page(GFP_KERNEL); if (!lg->regs_page) { err = -ENOMEM; goto release_guest; } + /* We actually put the registers at the bottom of the page. */ lg->regs = (void *)lg->regs_page + PAGE_SIZE - sizeof(*lg->regs); + /* Initialize the Guest's shadow page tables, using the toplevel + * address the Launcher gave us. This allocates memory, so can + * fail. */ err = init_guest_pagetable(lg, args[1]); if (err) goto free_regs; + /* Now we initialize the Guest's registers, handing it the start + * address. */ setup_regs(lg->regs, args[2]); + + /* There are a couple of GDT entries the Guest expects when first + * booting. */ setup_guest_gdt(lg); + + /* The timer for lguest's clock needs initialization. */ init_clockdev(lg); + + /* We keep a pointer to the Launcher task (ie. current task) for when + * other Guests want to wake this one (inter-Guest I/O). */ lg->tsk = current; + /* We need to keep a pointer to the Launcher's memory map, because if + * the Launcher dies we need to clean it up. If we don't keep a + * reference, it is destroyed before close() is called. */ lg->mm = get_task_mm(lg->tsk); + + /* Initialize the queue for the waker to wait on */ init_waitqueue_head(&lg->break_wq); + + /* We remember which CPU's pages this Guest used last, for optimization + * when the same Guest runs on the same CPU twice. */ lg->last_pages = NULL; + + /* We keep our "struct lguest" in the file's private_data. */ file->private_data = lg; mutex_unlock(&lguest_lock); + /* And because this is a write() call, we return the length used. */ return sizeof(args); free_regs: @@ -164,9 +257,15 @@ unlock: return err; } +/*L:010 The first operation the Launcher does must be a write. 
All writes + * start with a 32 bit number: for the first write this must be + * LHREQ_INITIALIZE to set up the Guest. After that the Launcher can use + * writes of other values to get DMA buffers and send interrupts. */ static ssize_t write(struct file *file, const char __user *input, size_t size, loff_t *off) { + /* Once the guest is initialized, we hold the "struct lguest" in the + * file private data. */ struct lguest *lg = file->private_data; u32 req; @@ -174,8 +273,11 @@ static ssize_t write(struct file *file, const char __user *input, return -EFAULT; input += sizeof(req); + /* If you haven't initialized, you must do that first. */ if (req != LHREQ_INITIALIZE && !lg) return -EINVAL; + + /* Once the Guest is dead, all you can do is read() why it died. */ if (lg && lg->dead) return -ENOENT; @@ -197,33 +299,72 @@ static ssize_t write(struct file *file, const char __user *input, } } +/*L:060 The final piece of interface code is the close() routine. It reverses + * everything done in initialize(). This is usually called because the + * Launcher exited. + * + * Note that the close routine returns 0 or a negative error number: it can't + * really fail, but it can whine. I blame Sun for this wart, and K&R C for + * letting them do it. :*/ static int close(struct inode *inode, struct file *file) { struct lguest *lg = file->private_data; + /* If we never successfully initialized, there's nothing to clean up */ if (!lg) return 0; + /* We need the big lock, to protect from inter-guest I/O and other + * Launchers initializing guests. */ mutex_lock(&lguest_lock); /* Cancels the hrtimer set via LHCALL_SET_CLOCKEVENT. */ hrtimer_cancel(&lg->hrt); + /* Free any DMA buffers the Guest had bound. */ release_all_dma(lg); + /* Free up the shadow page tables for the Guest. */ free_guest_pagetable(lg); + /* Now all the memory cleanups are done, it's safe to release the + * Launcher's memory management structure. 
*/ mmput(lg->mm); + /* If lg->dead doesn't contain an error code it will be NULL or a + * kmalloc()ed string, either of which is ok to hand to kfree(). */ if (!IS_ERR(lg->dead)) kfree(lg->dead); + /* We can free up the register page we allocated. */ free_page(lg->regs_page); + /* We clear the entire structure, which also marks it as free for the + * next user. */ memset(lg, 0, sizeof(*lg)); + /* Release lock and exit. */ mutex_unlock(&lguest_lock); + return 0; } +/*L:000 + * Welcome to our journey through the Launcher! + * + * The Launcher is the Host userspace program which sets up, runs and services + * the Guest. In fact, many comments in the Drivers which refer to "the Host" + * doing things are inaccurate: the Launcher does all the device handling for + * the Guest. The Guest can't tell what's done by the the Launcher and what by + * the Host. + * + * Just to confuse you: to the Host kernel, the Launcher *is* the Guest and we + * shall see more of that later. + * + * We begin our understanding with the Host kernel interface which the Launcher + * uses: reading and writing a character device called /dev/lguest. All the + * work happens in the read(), write() and close() routines: */ static struct file_operations lguest_fops = { .owner = THIS_MODULE, .release = close, .write = write, .read = read, }; + +/* This is a textbook example of a "misc" character device. Populate a "struct + * miscdevice" and register it with misc_register(). 
*/
 static struct miscdevice lguest_dev = {
 	.minor = MISC_DYNAMIC_MINOR,
 	.name = "lguest",
-- cgit v1.2.3
From f56a384e98aa81065038c4e16f39ed989ccae687 Mon Sep 17 00:00:00 2001
From: Rusty Russell
Date: Thu, 26 Jul 2007 10:41:05 -0700
Subject: lguest: documentation VII: FIXMEs

Documentation: The FIXMEs

Signed-off-by: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/lguest/lguest.c         | 12 ++++++++++++
 drivers/char/hvc_lguest.c             |  3 +++
 drivers/lguest/interrupts_and_traps.c | 14 ++++++++++++++
 drivers/lguest/io.c                   | 10 ++++++++++
 drivers/lguest/lguest.c               |  8 ++++++++
 drivers/lguest/lguest_asm.S           | 14 ++++++++++++++
 drivers/lguest/page_tables.c          |  5 +++++
 drivers/lguest/segments.c             |  4 ++++
 drivers/net/lguest_net.c              | 19 +++++++++++++++++++
 9 files changed, 89 insertions(+)
(limited to 'Documentation')

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index d7e26f025959..f7918401a007 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -1510,3 +1510,15 @@ int main(int argc, char *argv[])
 	/* Finally, run the Guest. This doesn't return. */
 	run_guest(lguest_fd, &device_list);
 }
+/*:*/
+
+/*M:999
+ * Mastery is done: you now know everything I do.
+ *
+ * But surely you have seen code, features and bugs in your wanderings which
+ * you now yearn to attack? That is the real game, and I look forward to you
+ * patching and forking lguest into the Your-Name-Here-visor.
+ *
+ * Farewell, and good coding!
+ * Rusty Russell.
+ */
diff --git a/drivers/char/hvc_lguest.c b/drivers/char/hvc_lguest.c
index 1de8967cce06..feeccbaec438 100644
--- a/drivers/char/hvc_lguest.c
+++ b/drivers/char/hvc_lguest.c
@@ -13,6 +13,9 @@
  * functions.
  :*/
+/*M:002 The console can be flooded: while the Guest is processing input the
+ * Host can send more. Buffering in the Host could alleviate this, but it is a
+ * difficult problem in general.
:*/ /* Copyright (C) 2006 Rusty Russell, IBM Corporation * * This program is free software; you can redistribute it and/or modify diff --git a/drivers/lguest/interrupts_and_traps.c b/drivers/lguest/interrupts_and_traps.c index 3d9830322646..bd0091bf79ec 100644 --- a/drivers/lguest/interrupts_and_traps.c +++ b/drivers/lguest/interrupts_and_traps.c @@ -231,6 +231,20 @@ static int direct_trap(const struct lguest *lg, * go direct, of course 8) */ return idt_type(trap->a, trap->b) == 0xF; } +/*:*/ + +/*M:005 The Guest has the ability to turn its interrupt gates into trap gates, + * if it is careful. The Host will let trap gates can go directly to the + * Guest, but the Guest needs the interrupts atomically disabled for an + * interrupt gate. It can do this by pointing the trap gate at instructions + * within noirq_start and noirq_end, where it can safely disable interrupts. */ + +/*M:006 The Guests do not use the sysenter (fast system call) instruction, + * because it's hardcoded to enter privilege level 0 and so can't go direct. + * It's about twice as fast as the older "int 0x80" system call, so it might + * still be worthwhile to handle it in the Switcher and lcall down to the + * Guest. The sysenter semantics are hairy tho: search for that keyword in + * entry.S :*/ /*H:260 When we make traps go directly into the Guest, we need to make sure * the kernel stack is valid (ie. mapped in the page tables). Otherwise, the diff --git a/drivers/lguest/io.c b/drivers/lguest/io.c index da288128e44f..ea68613b43f6 100644 --- a/drivers/lguest/io.c +++ b/drivers/lguest/io.c @@ -553,6 +553,16 @@ void release_all_dma(struct lguest *lg) up_read(&lg->mm->mmap_sem); } +/*M:007 We only return a single DMA buffer to the Launcher, but it would be + * more efficient to return a pointer to the entire array of DMA buffers, which + * it can cache and choose one whenever it wants. 
+ * + * Currently the Launcher uses a write to /dev/lguest, and the return value is + * the address of the DMA structure with the interrupt number placed in + * dma->used_len. If we wanted to return the entire array, we need to return + * the address, array size and interrupt number: this seems to require an + * ioctl(). :*/ + /*L:320 This routine looks for a DMA buffer registered by the Guest on the * given key (using the BIND_DMA hypercall). */ unsigned long get_dma_buffer(struct lguest *lg, diff --git a/drivers/lguest/lguest.c b/drivers/lguest/lguest.c index 7e7e9fb3aefd..6dfe568523a2 100644 --- a/drivers/lguest/lguest.c +++ b/drivers/lguest/lguest.c @@ -250,6 +250,14 @@ static void irq_enable(void) { lguest_data.irq_enabled = X86_EFLAGS_IF; } +/*:*/ +/*M:003 Note that we don't check for outstanding interrupts when we re-enable + * them (or when we unmask an interrupt). This seems to work for the moment, + * since interrupts are rare and we'll just get the interrupt on the next timer + * tick, but when we turn on CONFIG_NO_HZ, we should revisit this. One way + * would be to put the "irq_enabled" field in a page by itself, and have the + * Host write-protect it when an interrupt comes in when irqs are disabled. + * There will then be a page fault as soon as interrupts are re-enabled. :*/ /*G:034 * The Interrupt Descriptor Table (IDT). diff --git a/drivers/lguest/lguest_asm.S b/drivers/lguest/lguest_asm.S index 3126ae923cc0..f182c6a36209 100644 --- a/drivers/lguest/lguest_asm.S +++ b/drivers/lguest/lguest_asm.S @@ -39,6 +39,20 @@ LGUEST_PATCH(pushf, movl lguest_data+LGUEST_DATA_irq_enabled, %eax) .global lguest_noirq_start .global lguest_noirq_end +/*M:004 When the Host reflects a trap or injects an interrupt into the Guest, + * it sets the eflags interrupt bit on the stack based on + * lguest_data.irq_enabled, so the Guest iret logic does the right thing when + * restoring it. 
However, when the Host sets the Guest up for direct traps, + * such as system calls, the processor is the one to push eflags onto the + * stack, and the interrupt bit will be 1 (in reality, interrupts are always + * enabled in the Guest). + * + * This turns out to be harmless: the only trap which should happen under Linux + * with interrupts disabled is Page Fault (due to our lazy mapping of vmalloc + * regions), which has to be reflected through the Host anyway. If another + * trap *does* go off when interrupts are disabled, the Guest will panic, and + * we'll never get to this iret! :*/ + /*G:045 There is one final paravirt_op that the Guest implements, and glancing * at it you can see why I left it to last. It's *cool*! It's in *assembler*! * diff --git a/drivers/lguest/page_tables.c b/drivers/lguest/page_tables.c index cd047e81cd63..b7a924ace684 100644 --- a/drivers/lguest/page_tables.c +++ b/drivers/lguest/page_tables.c @@ -15,6 +15,11 @@ #include #include "lg.h" +/*M:008 We hold reference to pages, which prevents them from being swapped. + * It'd be nice to have a callback in the "struct mm_struct" when Linux wants + * to swap out. If we had this, and a shrinker callback to trim PTE pages, we + * could probably consider launching Guests as non-root. :*/ + /*H:300 * The Page Table Code * diff --git a/drivers/lguest/segments.c b/drivers/lguest/segments.c index 4d4e5a4586f9..f675a41a80da 100644 --- a/drivers/lguest/segments.c +++ b/drivers/lguest/segments.c @@ -94,6 +94,10 @@ static void check_segment_use(struct lguest *lg, unsigned int desc) || lg->regs->ss / 8 == desc) kill_guest(lg, "Removed live GDT entry %u", desc); } +/*:*/ +/*M:009 We wouldn't need to check for removal of in-use segments if we handled + * faults in the Switcher. However, it's probably not a worthwhile + * optimization. :*/ /*H:610 Once the GDT has been changed, we look through the changed entries and * see if they're OK. 
If not, we'll call kill_guest() and the Guest will never diff --git a/drivers/net/lguest_net.c b/drivers/net/lguest_net.c index 20df6a848923..cab57911a80e 100644 --- a/drivers/net/lguest_net.c +++ b/drivers/net/lguest_net.c @@ -35,6 +35,25 @@ #define MAX_LANS 4 #define NUM_SKBS 8 +/*M:011 Network code master Jeff Garzik points out numerous shortcomings in + * this driver if it aspires to greatness. + * + * Firstly, it doesn't use "NAPI": the networking's New API, and is poorer for + * it. As he says "NAPI means system-wide load leveling, across multiple + * network interfaces. Lack of NAPI can mean competition at higher loads." + * + * He also points out that we don't implement set_mac_address, so users cannot + * change the devices hardware address. When I asked why one would want to: + * "Bonding, and situations where you /do/ want the MAC address to "leak" out + * of the host onto the wider net." + * + * Finally, he would like module unloading: "It is not unrealistic to think of + * [un|re|]loading the net support module in an lguest guest. And, adding + * module support makes the programmer more responsible, because they now have + * to learn to clean up after themselves. Any driver that cannot clean up + * after itself is an incomplete driver in my book." + :*/ + /*D:530 The "struct lguestnet_info" contains all the information we need to * know about the network device. */ struct lguestnet_info -- cgit v1.2.3 From c99c108ac362f5cc37f79fad7e9897bd9d033bcc Mon Sep 17 00:00:00 2001 From: Chuck Ebbert Date: Fri, 27 Jul 2007 10:46:20 +1000 Subject: AGP: document boot options Add documentation for AGP boot options. 
Signed-off-by: Chuck Ebbert
Signed-off-by: Dave Airlie
---
 Documentation/kernel-parameters.txt | 7 +++++++
 1 file changed, 7 insertions(+)
(limited to 'Documentation')

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index fb80e9ffea68..1156653338fe 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -30,6 +30,7 @@ the beginning of each description states the restrictions within which a
 parameter is applicable:

 	ACPI	ACPI support is enabled.
+	AGP	AGP (Accelerated Graphics Port) is enabled.
 	ALSA	ALSA sound support is enabled.
 	APIC	APIC support is enabled.
 	APM	Advanced Power Management support is enabled.
@@ -227,6 +228,12 @@ and is between 256 and 4096 characters. It is defined in the file
 			to assume that this machine's pmtimer latches its value
 			and always returns good values.

+	agp=		[AGP]
+			{ off | try_unsupported }
+			off: disable AGP support
+			try_unsupported: try to drive unsupported chipsets
+				(may crash computer or cause data corruption)
+
 	enable_timer_pin_1 [i386,x86-64]
 			Enable PIN 1 of APIC timer
 			Can be useful to work around chipset bugs
-- cgit v1.2.3
From 79685b8deea4541d18882d8c07d0e99e788292ab Mon Sep 17 00:00:00 2001
From: Randy Dunlap
Date: Fri, 27 Jul 2007 08:08:51 +0200
Subject: docbook: add pipes, other fixes

Fix some typos in pipe.c and splice.c. Add pipes API to kernel-api.tmpl.
Signed-off-by: Randy Dunlap Signed-off-by: Jens Axboe --- Documentation/DocBook/kernel-api.tmpl | 13 +++++++++++-- fs/pipe.c | 2 +- fs/splice.c | 4 ++-- 3 files changed, 14 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index eb42bf9847cb..ec7c498b69fc 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -704,14 +704,23 @@ X!Idrivers/video/console/fonts.c splice API - ) + splice is a method for moving blocks of data around inside the - kernel, without continually transferring it between the kernel + kernel, without continually transferring them between the kernel and user space. !Iinclude/linux/splice.h !Ffs/splice.c + + pipes API + + Pipe interfaces are all for in-kernel (builtin image) use. + They are not exported for use by modules. + +!Iinclude/linux/pipe_fs_i.h +!Ffs/pipe.c + diff --git a/fs/pipe.c b/fs/pipe.c index d007830d9c87..6b3d91a691bf 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -255,7 +255,7 @@ void generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) /** * generic_pipe_buf_confirm - verify contents of the pipe buffer - * @pipe: the pipe that the buffer belongs to + * @info: the pipe that the buffer belongs to * @buf: the buffer to confirm * * Description: diff --git a/fs/splice.c b/fs/splice.c index 0a0973218084..c010a72ca2d2 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -164,7 +164,7 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = { * @spd: data to fill * * Description: - * @spd contains a map of pages and len/offset tupples, a long with + * @spd contains a map of pages and len/offset tuples, along with * the struct pipe_buf_operations associated with these pages. This * function will link that data to the pipe. 
 *
@@ -1000,7 +1000,7 @@ static long do_splice_to(struct file *in, loff_t *ppos,
  * Description:
  *    This is a special case helper to splice directly between two
  *    points, without requiring an explicit pipe. Internally an allocated
- *    pipe is cached in the process, and reused during the life time of
+ *    pipe is cached in the process, and reused during the lifetime of
  *    that process.
  *
  */
-- cgit v1.2.3
From 8059862c636778bc1872c89ae307eb6bccd35581 Mon Sep 17 00:00:00 2001
From: Cornelia Huck
Date: Fri, 27 Jul 2007 12:29:14 +0200
Subject: [S390] cio: Remove deprecated rdc/rcd.

http://marc.info/?l=linux-kernel&m=118481061928246&w=2 seems to
indicate disfavour of "deprecated", so let's just kill it now.

Signed-off-by: Cornelia Huck
Signed-off-by: Martin Schwidefsky
---
 Documentation/feature-removal-schedule.txt |  16 --
 drivers/s390/cio/device_ops.c              | 250 -----------------------------
 include/asm-s390/ccwdev.h                  |   5 -
 3 files changed, 271 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index c175eedadb5f..a43d2878a4ef 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -211,22 +211,6 @@ Who: Richard Purdie

 ---------------------------

-What: read_dev_chars(), read_conf_data{,_lpm}() (s390 common I/O layer)
-When: December 2007
-Why: These functions are a leftover from 2.4 times. They have several
-	problems:
-	- Duplication of checks that are done in the device driver's
-	  interrupt handler
-	- common I/O layer can't do device specific error recovery
-	- device driver can't be notified for conditions happening during
-	  execution of the function
-	Device drivers should issue the read device characteristics and read
-	configuration data ccws and do the appropriate error handling
-	themselves.
-Who: Cornelia Huck
-
----------------------------
-
 What: i2c-ixp2000, i2c-ixp4xx and scx200_i2c drivers
 When: September 2007
 Why: Obsolete. The new i2c-gpio driver replaces all hardware-specific
diff --git a/drivers/s390/cio/device_ops.c b/drivers/s390/cio/device_ops.c
index c8cfbf161d44..14eba854b155 100644
--- a/drivers/s390/cio/device_ops.c
+++ b/drivers/s390/cio/device_ops.c
@@ -288,253 +288,6 @@ ccw_device_get_path_mask(struct ccw_device *cdev)
 	return sch->lpm;
 }

-static void
-ccw_device_wake_up(struct ccw_device *cdev, unsigned long ip, struct irb *irb)
-{
-	if (!ip)
-		/* unsolicited interrupt */
-		return;
-
-	/* Abuse intparm for error reporting. */
-	if (IS_ERR(irb))
-		cdev->private->intparm = -EIO;
-	else if (irb->scsw.cc == 1)
-		/* Retry for deferred condition code. */
-		cdev->private->intparm = -EAGAIN;
-	else if ((irb->scsw.dstat !=
-		  (DEV_STAT_CHN_END|DEV_STAT_DEV_END)) ||
-		 (irb->scsw.cstat != 0)) {
-		/*
-		 * We didn't get channel end / device end. Check if path
-		 * verification has been started; we can retry after it has
-		 * finished. We also retry unit checks except for command reject
-		 * or intervention required. Also check for long busy
-		 * conditions.
-		 */
-		if (cdev->private->flags.doverify ||
-		    cdev->private->state == DEV_STATE_VERIFY)
-			cdev->private->intparm = -EAGAIN;
-		else if ((irb->scsw.dstat & DEV_STAT_UNIT_CHECK) &&
-			 !(irb->ecw[0] &
-			   (SNS0_CMD_REJECT | SNS0_INTERVENTION_REQ)))
-			cdev->private->intparm = -EAGAIN;
-		else if ((irb->scsw.dstat & DEV_STAT_ATTENTION) &&
-			 (irb->scsw.dstat & DEV_STAT_DEV_END) &&
-			 (irb->scsw.dstat & DEV_STAT_UNIT_EXCEP))
-			cdev->private->intparm = -EAGAIN;
-		else
-			cdev->private->intparm = -EIO;
-
-	} else
-		cdev->private->intparm = 0;
-	wake_up(&cdev->private->wait_q);
-}
-
-static int
-__ccw_device_retry_loop(struct ccw_device *cdev, struct ccw1 *ccw, long magic, __u8 lpm)
-{
-	int ret;
-	struct subchannel *sch;
-
-	sch = to_subchannel(cdev->dev.parent);
-	do {
-		ccw_device_set_timeout(cdev, 60 * HZ);
-		ret = cio_start (sch, ccw, lpm);
-		if (ret != 0)
-			ccw_device_set_timeout(cdev, 0);
-		if (ret == -EBUSY) {
-			/* Try again later. */
-			spin_unlock_irq(sch->lock);
-			msleep(10);
-			spin_lock_irq(sch->lock);
-			continue;
-		}
-		if (ret != 0)
-			/* Non-retryable error. */
-			break;
-		/* Wait for end of request. */
-		cdev->private->intparm = magic;
-		spin_unlock_irq(sch->lock);
-		wait_event(cdev->private->wait_q,
-			   (cdev->private->intparm == -EIO) ||
-			   (cdev->private->intparm == -EAGAIN) ||
-			   (cdev->private->intparm == 0));
-		spin_lock_irq(sch->lock);
-		/* Check at least for channel end / device end */
-		if (cdev->private->intparm == -EIO) {
-			/* Non-retryable error. */
-			ret = -EIO;
-			break;
-		}
-		if (cdev->private->intparm == 0)
-			/* Success. */
-			break;
-		/* Try again later.
*/
-		spin_unlock_irq(sch->lock);
-		msleep(10);
-		spin_lock_irq(sch->lock);
-	} while (1);
-
-	return ret;
-}
-
-/**
- * read_dev_chars() - read device characteristics
- * @param cdev target ccw device
- * @param buffer pointer to buffer for rdc data
- * @param length size of rdc data
- * @returns 0 for success, negative error value on failure
- *
- * Context:
- *   called for online device, lock not held
- **/
-int
-read_dev_chars (struct ccw_device *cdev, void **buffer, int length)
-{
-	void (*handler)(struct ccw_device *, unsigned long, struct irb *);
-	struct subchannel *sch;
-	int ret;
-	struct ccw1 *rdc_ccw;
-
-	if (!cdev)
-		return -ENODEV;
-	if (!buffer || !length)
-		return -EINVAL;
-	sch = to_subchannel(cdev->dev.parent);
-
-	CIO_TRACE_EVENT (4, "rddevch");
-	CIO_TRACE_EVENT (4, sch->dev.bus_id);
-
-	rdc_ccw = kzalloc(sizeof(struct ccw1), GFP_KERNEL | GFP_DMA);
-	if (!rdc_ccw)
-		return -ENOMEM;
-	rdc_ccw->cmd_code = CCW_CMD_RDC;
-	rdc_ccw->count = length;
-	rdc_ccw->flags = CCW_FLAG_SLI;
-	ret = set_normalized_cda (rdc_ccw, (*buffer));
-	if (ret != 0) {
-		kfree(rdc_ccw);
-		return ret;
-	}
-
-	spin_lock_irq(sch->lock);
-	/* Save interrupt handler. */
-	handler = cdev->handler;
-	/* Temporarily install own handler. */
-	cdev->handler = ccw_device_wake_up;
-	if (cdev->private->state != DEV_STATE_ONLINE)
-		ret = -ENODEV;
-	else if (((sch->schib.scsw.stctl & SCSW_STCTL_PRIM_STATUS) &&
-		  !(sch->schib.scsw.stctl & SCSW_STCTL_SEC_STATUS)) ||
-		 cdev->private->flags.doverify)
-		ret = -EBUSY;
-	else
-		/* 0x00D9C4C3 == ebcdic "RDC" */
-		ret = __ccw_device_retry_loop(cdev, rdc_ccw, 0x00D9C4C3, 0);
-
-	/* Restore interrupt handler.
*/
-	cdev->handler = handler;
-	spin_unlock_irq(sch->lock);
-
-	clear_normalized_cda (rdc_ccw);
-	kfree(rdc_ccw);
-
-	return ret;
-}
-
-/*
- * Read Configuration data using path mask
- */
-int
-read_conf_data_lpm (struct ccw_device *cdev, void **buffer, int *length, __u8 lpm)
-{
-	void (*handler)(struct ccw_device *, unsigned long, struct irb *);
-	struct subchannel *sch;
-	struct ciw *ciw;
-	char *rcd_buf;
-	int ret;
-	struct ccw1 *rcd_ccw;
-
-	if (!cdev)
-		return -ENODEV;
-	if (!buffer || !length)
-		return -EINVAL;
-	sch = to_subchannel(cdev->dev.parent);
-
-	CIO_TRACE_EVENT (4, "rdconf");
-	CIO_TRACE_EVENT (4, sch->dev.bus_id);
-
-	/*
-	 * scan for RCD command in extended SenseID data
-	 */
-	ciw = ccw_device_get_ciw(cdev, CIW_TYPE_RCD);
-	if (!ciw || ciw->cmd == 0)
-		return -EOPNOTSUPP;
-
-	/* Adjust requested path mask to excluded varied off paths. */
-	if (lpm) {
-		lpm &= sch->opm;
-		if (lpm == 0)
-			return -EACCES;
-	}
-
-	rcd_ccw = kzalloc(sizeof(struct ccw1), GFP_KERNEL | GFP_DMA);
-	if (!rcd_ccw)
-		return -ENOMEM;
-	rcd_buf = kzalloc(ciw->count, GFP_KERNEL | GFP_DMA);
-	if (!rcd_buf) {
-		kfree(rcd_ccw);
-		return -ENOMEM;
-	}
-	rcd_ccw->cmd_code = ciw->cmd;
-	rcd_ccw->cda = (__u32) __pa (rcd_buf);
-	rcd_ccw->count = ciw->count;
-	rcd_ccw->flags = CCW_FLAG_SLI;
-
-	spin_lock_irq(sch->lock);
-	/* Save interrupt handler. */
-	handler = cdev->handler;
-	/* Temporarily install own handler. */
-	cdev->handler = ccw_device_wake_up;
-	if (cdev->private->state != DEV_STATE_ONLINE)
-		ret = -ENODEV;
-	else if (((sch->schib.scsw.stctl & SCSW_STCTL_PRIM_STATUS) &&
-		  !(sch->schib.scsw.stctl & SCSW_STCTL_SEC_STATUS)) ||
-		 cdev->private->flags.doverify)
-		ret = -EBUSY;
-	else
-		/* 0x00D9C3C4 == ebcdic "RCD" */
-		ret = __ccw_device_retry_loop(cdev, rcd_ccw, 0x00D9C3C4, lpm);
-
-	/* Restore interrupt handler.
*/
-	cdev->handler = handler;
-	spin_unlock_irq(sch->lock);
-
-	/*
-	 * on success we update the user input parms
-	 */
-	if (ret) {
-		kfree (rcd_buf);
-		*buffer = NULL;
-		*length = 0;
-	} else {
-		*length = ciw->count;
-		*buffer = rcd_buf;
-	}
-	kfree(rcd_ccw);
-
-	return ret;
-}
-
-/*
- * Read Configuration data
- */
-int
-read_conf_data (struct ccw_device *cdev, void **buffer, int *length)
-{
-	return read_conf_data_lpm (cdev, buffer, length, 0);
-}
-
 /*
  * Try to break the lock on a boxed device.
  */
@@ -649,8 +402,5 @@ EXPORT_SYMBOL(ccw_device_start_timeout_key);
 EXPORT_SYMBOL(ccw_device_start_key);
 EXPORT_SYMBOL(ccw_device_get_ciw);
 EXPORT_SYMBOL(ccw_device_get_path_mask);
-EXPORT_SYMBOL(read_conf_data);
-EXPORT_SYMBOL(read_dev_chars);
 EXPORT_SYMBOL(_ccw_device_get_subchannel_number);
 EXPORT_SYMBOL_GPL(ccw_device_get_chp_desc);
-EXPORT_SYMBOL_GPL(read_conf_data_lpm);
diff --git a/include/asm-s390/ccwdev.h b/include/asm-s390/ccwdev.h
index 4c2e1710f157..1aeda27d5a8b 100644
--- a/include/asm-s390/ccwdev.h
+++ b/include/asm-s390/ccwdev.h
@@ -165,11 +165,6 @@ extern int ccw_device_resume(struct ccw_device *);
 extern int ccw_device_halt(struct ccw_device *, unsigned long);
 extern int ccw_device_clear(struct ccw_device *, unsigned long);

-extern int __deprecated read_dev_chars(struct ccw_device *cdev, void **buffer, int length);
-extern int __deprecated read_conf_data(struct ccw_device *cdev, void **buffer, int *length);
-extern int __deprecated read_conf_data_lpm(struct ccw_device *cdev, void **buffer,
-					    int *length, __u8 lpm);
-
 extern int ccw_device_set_online(struct ccw_device *cdev);
 extern int ccw_device_set_offline(struct ccw_device *cdev);

-- cgit v1.2.3
From 6e3eb0993837fb4a597b88a7e28ce3847cb5777c Mon Sep 17 00:00:00 2001
From: "IKEDA, Munehiro"
Date: Thu, 19 Jul 2007 11:36:56 +0900
Subject: HOWTO: adjust translation header of Japanese stable_api_nonsense.txt

Signed-off-by: IKEDA, Munehiro
Signed-off-by: Greg Kroah-Hartman
---
 Documentation/ja_JP/stable_api_nonsense.txt | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/ja_JP/stable_api_nonsense.txt b/Documentation/ja_JP/stable_api_nonsense.txt
index b3f2b27f0881..7653b5cbfed2 100644
--- a/Documentation/ja_JP/stable_api_nonsense.txt
+++ b/Documentation/ja_JP/stable_api_nonsense.txt
@@ -1,17 +1,17 @@
 NOTE:
-This is a Japanese translated version of
-"Documentation/stable_api_nonsense.txt".
-This one is maintained by
-IKEDA, Munehiro
-and JF Project team .
-If you find difference with original file or problem in translation,
+This is a version of Documentation/stable_api_nonsense.txt into Japanese.
+This document is maintained by IKEDA, Munehiro
+and the JF Project team .
+If you find any difference between this document and the original file
+or a problem with the translation,
 please contact the maintainer of this file or JF project.

-Please also note that purpose of this file is easier to read for non
-English natives and not to be intended to fork. So, if you have any
-comments or updates of this file, please try to update
-Original(English) file at first.
+Please also note that the purpose of this file is to be easier to read
+for non English (read: Japanese) speakers and is not intended as a
+fork. So if you have any comments or updates of this file, please try
+to update the original English file first.
+Last Updated: 2007/07/18
 ==================================
 これは、
 linux-2.6.22-rc4/Documentation/stable_api_nonsense.txt
 の和訳
-- cgit v1.2.3
From 8b43626f0cdfb3154c57d52e732679c9d3484369 Mon Sep 17 00:00:00 2001
From: Tsugikazu Shibata
Date: Thu, 19 Jul 2007 11:24:54 +0900
Subject: HOWTO: sync Japanese HOWTO

Signed-off-by: Tsugikazu Shibata
Signed-off-by: Greg Kroah-Hartman
---
 Documentation/ja_JP/HOWTO | 66 +++++++++++++++++++++++++----------------------
 1 file changed, 35 insertions(+), 31 deletions(-)
(limited to 'Documentation')

diff --git a/Documentation/ja_JP/HOWTO b/Documentation/ja_JP/HOWTO
index b2446a090870..9f08dab1e75b 100644
--- a/Documentation/ja_JP/HOWTO
+++ b/Documentation/ja_JP/HOWTO
@@ -1,23 +1,24 @@
-NOTE:
-This is Japanese translated version of "Documentation/HOWTO".
-This one is maintained by Tsugikazu Shibata
-and JF Project team .
-If you find difference with original file or problem in translation,
-please contact maintainer of this file or JF project.
-
-Please also note that purpose of this file is easier to read for non
-English natives and not to be intended to fork. So, if you have any
-comments or updates of this file, please try to update Original(English)
-file at first.
-
-Last Updated: 2007/06/04
+NOTE:
+This is a version of Documentation/HOWTO translated into Japanese.
+This document is maintained by Tsugikazu Shibata
+and the JF Project team .
+If you find any difference between this document and the original file
+or a problem with the translation,
+please contact the maintainer of this file or JF project.
+
+Please also note that the purpose of this file is to be easier to read
+for non English (read: Japanese) speakers and is not intended as a
+fork. So if you have any comments or updates for this file, please try
+to update the original English file first.
+ +Last Updated: 2007/07/18 ================================== これは、 -linux-2.6.21/Documentation/HOWTO +linux-2.6.22/Documentation/HOWTO の和訳です。 翻訳団体: JF プロジェクト < http://www.linux.or.jp/JF/ > -翻訳日: 2007/06/04 +翻訳日: 2007/07/16 翻訳者: Tsugikazu Shibata 校正者: 松倉さん 小林 雅典さん (Masanori Kobayasi) @@ -52,6 +53,7 @@ Linux カーネル開発コミュニティと共に活動するやり方を学 また、このコミュニティがなぜ今うまくまわっているのかという理由の一部も 説明しようと試みています。 + カーネルは 少量のアーキテクチャ依存部分がアセンブリ言語で書かれている 以外は大部分は C 言語で書かれています。C言語をよく理解していることはカー ネル開発者には必要です。アーキテクチャ向けの低レベル部分の開発をするの @@ -141,6 +143,7 @@ Linux カーネルソースツリーは幅広い範囲のドキュメントを これらのルールに従えばうまくいくことを保証することではありません が (すべてのパッチは内容とスタイルについて精査を受けるので)、 ルールに従わなければ間違いなくうまくいかないでしょう。 + この他にパッチを作る方法についてのよくできた記述は- "The Perfect Patch" @@ -360,44 +363,42 @@ linux-kernel メーリングリストで収集された多数のパッチと同 git ツリー- - Kbuild の開発ツリー、Sam Ravnborg - kernel.org:/pub/scm/linux/kernel/git/sam/kbuild.git + git.kernel.org:/pub/scm/linux/kernel/git/sam/kbuild.git - ACPI の開発ツリー、 Len Brown - kernel.org:/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6.git + git.kernel.org:/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6.git - Block の開発ツリー、Jens Axboe - kernel.org:/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git + git.kernel.org:/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git - DRM の開発ツリー、Dave Airlie - kernel.org:/pub/scm/linux/kernel/git/airlied/drm-2.6.git + git.kernel.org:/pub/scm/linux/kernel/git/airlied/drm-2.6.git - ia64 の開発ツリー、Tony Luck - kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6.git - - - ieee1394 の開発ツリー、Jody McIntyre - kernel.org:/pub/scm/linux/kernel/git/scjody/ieee1394.git + git.kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6.git - infiniband, Roland Dreier - kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git + git.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git - libata, Jeff Garzik - kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev.git + git.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev.git - ネットワークドライバ, Jeff Garzik - kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git + 
git.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git - pcmcia, Dominik Brodowski - kernel.org:/pub/scm/linux/kernel/git/brodo/pcmcia-2.6.git + git.kernel.org:/pub/scm/linux/kernel/git/brodo/pcmcia-2.6.git - SCSI, James Bottomley - kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git - - その他の git カーネルツリーは http://kernel.org/git に一覧表がありま - す。 + git.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git quilt ツリー- - USB, PCI ドライバコアと I2C, Greg Kroah-Hartman kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/ + - x86-64 と i386 の仲間 Andi Kleen + + その他のカーネルツリーは http://git.kernel.org/ と MAINTAINERS ファ + イルに一覧表があります。 バグレポート ------------- @@ -508,6 +509,7 @@ MAINTAINERS ファイルにリストがありますので参照してくださ せん*。単に自分のパッチに対して指摘された問題を全て修正して再送すれば いいのです。 + カーネルコミュニティと企業組織のちがい ----------------------------------------------------------------- @@ -577,6 +579,7 @@ Linux カーネルコミュニティは、一度に大量のコードの塊を かし、500行のパッチは、正しいことをレビューするのに数時間かかるかも しれません(時間はパッチのサイズなどにより指数関数に比例してかかりま す) + 小さいパッチは何かあったときにデバッグもとても簡単になります。パッ チを1個1個取り除くのは、とても大きなパッチを当てた後に(かつ、何かお かしくなった後で)解剖するのに比べればとても簡単です。 @@ -591,6 +594,7 @@ Linux カーネルコミュニティは、一度に大量のコードの塊を う。先生は簡潔な最高の解をみたいのです。良い生徒はこれを知って おり、そして最終解の前の中間作業を提出することは決してないので す" + カーネル開発でもこれは同じです。メンテナー達とレビューア達は、 問題を解決する解の背後になる思考プロセスをみたいとは思いません。 彼らは単純であざやかな解決方法をみたいのです。 -- cgit v1.2.3 From 30b1b28001fef09ea31b1c87e8e8acb962d109e2 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Mon, 23 Jul 2007 21:05:02 -0700 Subject: Fix Doc/sysfs-rules typos Fix typos only (spelling, grammar, duplicate words, etc.). 
Signed-off-by: Randy Dunlap Cc: Kay Sievers --- Documentation/sysfs-rules.txt | 72 +++++++++++++++++++++---------------------- 1 file changed, 35 insertions(+), 37 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysfs-rules.txt b/Documentation/sysfs-rules.txt index 42861bb0bc9b..80ef562160bb 100644 --- a/Documentation/sysfs-rules.txt +++ b/Documentation/sysfs-rules.txt @@ -1,19 +1,18 @@ Rules on how to access information in the Linux kernel sysfs -The kernel exported sysfs exports internal kernel implementation-details +The kernel-exported sysfs exports internal kernel implementation details and depends on internal kernel structures and layout. It is agreed upon by the kernel developers that the Linux kernel does not provide a stable internal API. As sysfs is a direct export of kernel internal -structures, the sysfs interface can not provide a stable interface eighter, +structures, the sysfs interface cannot provide a stable interface either; it may always change along with internal kernel changes. To minimize the risk of breaking users of sysfs, which are in most cases low-level userspace applications, with a new kernel release, the users -of sysfs must follow some rules to use an as abstract-as-possible way to +of sysfs must follow some rules to use an as-abstract-as-possible way to access this filesystem. The current udev and HAL programs already implement this and users are encouraged to plug, if possible, into the -abstractions these programs provide instead of accessing sysfs -directly. +abstractions these programs provide instead of accessing sysfs directly. But if you really do want or need to access sysfs directly, please follow the following rules and then your programs should work with future @@ -25,22 +24,22 @@ versions of the sysfs interface. implementation details in its own API. Therefore it is not better than reading directories and opening the files yourself. 
Also, it is not actively maintained, in the sense of reflecting the - current kernel-development. The goal of providing a stable interface - to sysfs has failed, it causes more problems, than it solves. It + current kernel development. The goal of providing a stable interface + to sysfs has failed; it causes more problems than it solves. It violates many of the rules in this document. - sysfs is always at /sys Parsing /proc/mounts is a waste of time. Other mount points are a system configuration bug you should not try to solve. For test cases, possibly support a SYSFS_PATH environment variable to overwrite the - applications behavior, but never try to search for sysfs. Never try + application's behavior, but never try to search for sysfs. Never try to mount it, if you are not an early boot script. - devices are only "devices" There is no such thing like class-, bus-, physical devices, interfaces, and such that you can rely on in userspace. Everything is just simply a "device". Class-, bus-, physical, ... types are just - kernel implementation details, which should not be expected by + kernel implementation details which should not be expected by applications that look for devices in sysfs. The properties of a device are: @@ -48,11 +47,11 @@ versions of the sysfs interface. - identical to the DEVPATH value in the event sent from the kernel at device creation and removal - the unique key to the device at that point in time - - the kernels path to the device-directory without the leading + - the kernel's path to the device directory without the leading /sys, and always starting with with a slash - all elements of a devpath must be real directories. Symlinks pointing to /sys/devices must always be resolved to their real - target, and the target path must be used to access the device. + target and the target path must be used to access the device. That way the devpath to the device matches the devpath of the kernel used at event time. 
- using or exposing symlink values as elements in a devpath string @@ -73,17 +72,17 @@ versions of the sysfs interface. link - it is retrieved by reading the "driver"-link and using only the last element of the target path - - devices which do not have "driver"-link, just do not have a - driver; copying the driver value in a child device context, is a + - devices which do not have "driver"-link just do not have a + driver; copying the driver value in a child device context is a bug in the application o attributes - - the files in the device directory or files below a subdirectories + - the files in the device directory or files below subdirectories of the same device directory - accessing attributes reached by a symlink pointing to another device, like the "device"-link, is a bug in the application - Everything else is just a kernel driver-core implementation detail, + Everything else is just a kernel driver-core implementation detail that should not be assumed to be stable across kernel releases. - Properties of parent devices never belong into a child device. @@ -91,25 +90,25 @@ versions of the sysfs interface. context properties. If the device 'eth0' or 'sda' does not have a "driver"-link, then this device does not have a driver. Its value is empty. Never copy any property of the parent-device into a child-device. Parent - device-properties may change dynamically without any notice to the + device properties may change dynamically without any notice to the child device. -- Hierarchy in a single device-tree +- Hierarchy in a single device tree There is only one valid place in sysfs where hierarchy can be examined and this is below: /sys/devices. - It is planned, that all device directories will end up in the tree + It is planned that all device directories will end up in the tree below this directory. - Classification by subsystem There are currently three places for classification of devices: /sys/block, /sys/class and /sys/bus. 
It is planned that these will - not contain any device-directories themselves, but only flat lists of + not contain any device directories themselves, but only flat lists of symlinks pointing to the unified /sys/devices tree. All three places have completely different rules on how to access device information. It is planned to merge all three - classification-directories into one place at /sys/subsystem, - following the layout of the bus-directories. All buses and - classes, including the converted block-subsystem, will show up + classification directories into one place at /sys/subsystem, + following the layout of the bus directories. All buses and + classes, including the converted block subsystem, will show up there. The devices belonging to a subsystem will create a symlink in the "devices" directory at /sys/subsystem//devices. @@ -121,38 +120,38 @@ versions of the sysfs interface. subsystem name. Assuming /sys/class/ and /sys/bus/, or - /sys/block and /sys/class/block are not interchangeable, is a bug in + /sys/block and /sys/class/block are not interchangeable is a bug in the application. - Block - The converted block-subsystem at /sys/class/block, or + The converted block subsystem at /sys/class/block or /sys/subsystem/block will contain the links for disks and partitions - at the same level, never in a hierarchy. Assuming the block-subsytem to - contain only disks and not partition-devices in the same flat list is + at the same level, never in a hierarchy. Assuming the block subsystem to + contain only disks and not partition devices in the same flat list is a bug in the application. - "device"-link and :-links Never depend on the "device"-link. The "device"-link is a workaround - for the old layout, where class-devices are not created in - /sys/devices/ like the bus-devices.
If the link-resolving of a - device-directory does not end in /sys/devices/, you can use the + for the old layout, where class devices are not created in + /sys/devices/ like the bus devices. If the link-resolving of a + device directory does not end in /sys/devices/, you can use the "device"-link to find the parent devices in /sys/devices/. That is the - single valid use of the "device"-link, it must never appear in any + single valid use of the "device"-link; it must never appear in any path as an element. Assuming the existence of the "device"-link for a device in /sys/devices/ is a bug in the application. Accessing /sys/class/net/eth0/device is a bug in the application. Never depend on the class-specific links back to the /sys/class directory. These links are also a workaround for the design mistake - that class-devices are not created in /sys/devices. If a device + that class devices are not created in /sys/devices. If a device directory does not contain directories for child devices, these links may be used to find the child devices in /sys/class. That is the single - valid use of these links, they must never appear in any path as an + valid use of these links; they must never appear in any path as an element. Assuming the existence of these links for devices which are - real child device directories in the /sys/devices tree, is a bug in + real child device directories in the /sys/devices tree is a bug in the application. - It is planned to remove all these links when when all class-device + It is planned to remove all these links when all class device directories live in /sys/devices. - Position of devices along device chain can change. @@ -161,6 +160,5 @@ versions of the sysfs interface. the chain. You must always request the parent device you are looking for by its subsystem value. You need to walk up the chain until you find the device that matches the expected subsystem. 
Depending on a specific - position of a parent device, or exposing relative paths, using "../" to - access the chain of parents, is a bug in the application. - + position of a parent device or exposing relative paths using "../" to + access the chain of parents is a bug in the application. -- cgit v1.2.3 From a2765e81d8a58f66e21176ca2a8fd6012b187994 Mon Sep 17 00:00:00 2001 From: Juan Lang Date: Tue, 24 Jul 2007 13:24:19 -0700 Subject: stable_api_nonsense.txt: Disambiguate the use of "this" by using "that" to refer to the syscall interface Signed-off-by: Greg Kroah-Hartman --- Documentation/stable_api_nonsense.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/stable_api_nonsense.txt b/Documentation/stable_api_nonsense.txt index a2afca3b2bab..847b342b7b20 100644 --- a/Documentation/stable_api_nonsense.txt +++ b/Documentation/stable_api_nonsense.txt @@ -10,7 +10,7 @@ kernel to userspace interfaces. The kernel to userspace interface is the one that application programs use, the syscall interface. That interface is _very_ stable over time, and will not break. I have old programs that were built on a pre 0.9something kernel that still work -just fine on the latest 2.6 kernel release. This interface is the one +just fine on the latest 2.6 kernel release. That interface is the one that users and application programmers can count on being stable. -- cgit v1.2.3 From f285ea058001ef534f9e53a21aad42c2952bbad5 Mon Sep 17 00:00:00 2001 From: Cornelia Huck Date: Fri, 27 Jul 2007 13:41:10 +0200 Subject: kobject: update documentation Update kobject documentation: - Update structure definitions. - Remove documentation of removed struct subsystem. (First shot, uevent_ops probably need some documentation as well.) 
Signed-off-by: Cornelia Huck Signed-off-by: Greg Kroah-Hartman --- Documentation/kobject.txt | 178 +++++++++++++++------------------------------- 1 file changed, 59 insertions(+), 119 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kobject.txt b/Documentation/kobject.txt index e44855513b3d..8ee49ee7c963 100644 --- a/Documentation/kobject.txt +++ b/Documentation/kobject.txt @@ -27,7 +27,6 @@ in detail, and briefly here: - kobjects a simple object. - kset a set of objects of a certain type. - ktype a set of helpers for objects of a common type. -- subsystem a controlling object for a number of ksets. The kobject infrastructure maintains a close relationship with the @@ -54,13 +53,15 @@ embedded in larger data structures and replace fields they duplicate. 1.2 Definition struct kobject { + const char * k_name; char name[KOBJ_NAME_LEN]; - atomic_t refcount; + struct kref kref; struct list_head entry; struct kobject * parent; struct kset * kset; struct kobj_type * ktype; - struct dentry * dentry; + struct sysfs_dirent * sd; + wait_queue_head_t poll; }; void kobject_init(struct kobject *); @@ -137,8 +138,7 @@ If a kobject does not have a parent when it is registered, its parent becomes its dominant kset. If a kobject does not have a parent nor a dominant kset, its directory -is created at the top-level of the sysfs partition. This should only -happen for kobjects that are embedded in a struct subsystem. +is created at the top-level of the sysfs partition. @@ -150,10 +150,10 @@ A kset is a set of kobjects that are embedded in the same type. struct kset { - struct subsystem * subsys; struct kobj_type * ktype; struct list_head list; struct kobject kobj; + struct kset_uevent_ops * uevent_ops; }; @@ -169,8 +169,7 @@ struct kobject * kset_find_obj(struct kset *, char *); The type that the kobjects are embedded in is described by the ktype -pointer. The subsystem that the kobject belongs to is pointed to by the -subsys pointer. +pointer. 
A kset contains a kobject itself, meaning that it may be registered in the kobject hierarchy and exported via sysfs. More importantly, the @@ -209,6 +208,58 @@ the hierarchy. kset_find_obj() may be used to locate a kobject with a particular name. The kobject, if found, is returned. +There are also some helper functions whose names point to the formerly +existing "struct subsystem", whose functions have been taken over by +ksets. + + +decl_subsys(name,type,uevent_ops) + +Declares a kset named '_subsys' of type with +uevent_ops . For example, + +decl_subsys(devices, &ktype_devices, &device_uevent_ops); + +is equivalent to doing: + +struct kset devices_subsys = { + .kobj = { + .name = "devices", + }, + .ktype = &ktype_devices, + .uevent_ops = &device_uevent_ops, +}; + + +The objects that are registered with a subsystem that use the +subsystem's default list must have their kset ptr set properly. These +objects may have embedded kobjects or ksets. The +following helpers make setting the kset easier: + + +kobj_set_kset_s(obj,subsys) + +- Assumes that obj->kobj exists, and is a struct kobject. +- Sets the kset of that kobject to the kset . + + +kset_set_kset_s(obj,subsys) + +- Assumes that obj->kset exists, and is a struct kset. +- Sets the kset of the embedded kobject to the kset . + +subsys_set_kset(obj,subsys) + +- Assumes obj->subsys exists, and is a struct subsystem. +- Sets obj->subsys.kset.kobj.kset to the subsystem's embedded kset. + +void subsystem_init(struct kset *s); +int subsystem_register(struct kset *s); +void subsystem_unregister(struct kset *s); +struct kset *subsys_get(struct kset *s); +void kset_put(struct kset *s); + +These are just wrappers around the respective kset_* functions. 2.3 sysfs @@ -254,114 +305,3 @@ Instances of struct kobj_type are not registered; only referenced by the kset. A kobj_type may be referenced by an arbitrary number of ksets, as there may be disparate sets of identical objects. - - -4.
subsystems - -4.1 Description - -A subsystem represents a significant entity of code that maintains an -arbitrary number of sets of objects of various types. Since the number -of ksets and the type of objects they contain are variable, a -generic representation of a subsystem is minimal. - - -struct subsystem { - struct kset kset; - struct rw_semaphore rwsem; -}; - -int subsystem_register(struct subsystem *); -void subsystem_unregister(struct subsystem *); - -struct subsystem * subsys_get(struct subsystem * s); -void subsys_put(struct subsystem * s); - - -A subsystem contains an embedded kset so: - -- It can be represented in the object hierarchy via the kset's - embedded kobject. - -- It can maintain a default list of objects of one type. - -Additional ksets may attach to the subsystem simply by referencing the -subsystem before they are registered. (This one-way reference means -that there is no way to determine the ksets that are attached to the -subsystem.) - -All ksets that are attached to a subsystem share the subsystem's R/W -semaphore. - - -4.2 subsystem Programming Interface. - -The subsystem programming interface is simple and does not offer the -flexibility that the kset and kobject programming interfaces do. They -may be registered and unregistered, as well as reference counted. Each -call forwards the calls to their embedded ksets (which forward the -calls to their embedded kobjects). - - -4.3 Helpers - -A number of macros are available to make dealing with subsystems and -their embedded objects easier. - - -decl_subsys(name,type) - -Declares a subsystem named '_subsys', with an embedded kset of -type . For example, - -decl_subsys(devices,&ktype_devices); - -is equivalent to doing: - -struct subsystem device_subsys = { - .kset = { - .kobj = { - .name = "devices", - }, - .ktype = &ktype_devices, - } -}; - - -The objects that are registered with a subsystem that use the -subsystem's default list must have their kset ptr set properly. 
These -objects may have embedded kobjects, ksets, or other subsystems. The -following helpers make setting the kset easier: - - -kobj_set_kset_s(obj,subsys) - -- Assumes that obj->kobj exists, and is a struct kobject. -- Sets the kset of that kobject to the subsystem's embedded kset. - - -kset_set_kset_s(obj,subsys) - -- Assumes that obj->kset exists, and is a struct kset. -- Sets the kset of the embedded kobject to the subsystem's - embedded kset. - -subsys_set_kset(obj,subsys) - -- Assumes obj->subsys exists, and is a struct subsystem. -- Sets obj->subsys.kset.kobj.kset to the subsystem's embedded kset. - - -4.4 sysfs - -subsystems are represented in sysfs via their embedded kobjects. They -follow the same rules as previously mentioned with no exceptions. They -typically receive a top-level directory in sysfs, except when their -embedded kobject is part of another kset, or the parent of the -embedded kobject is explicitly set. - -Note that the subsystem's embedded kset must be 'attached' to the -subsystem itself in order to use its rwsem. This is done after -kset_add() has been called. (Not before, because kset_add() uses its -subsystem for a default parent if it doesn't already have one). - -- cgit v1.2.3 From add77c64ca8b00dae5dc0a6be9eb89f1514d21ea Mon Sep 17 00:00:00 2001 From: Krzysztof Helt Date: Sun, 8 Jul 2007 22:43:00 +0200 Subject: hwmon: add support for THMC50 and ADM1022 This patch adds support for THMC50 and ADM1022 hardware monitoring chips. Signed-off-by: Krzysztof Helt Acked-by: Jean Delvare Signed-off-by: Mark M. 
Hoffman --- Documentation/hwmon/thmc50 | 74 ++++++++ drivers/hwmon/Kconfig | 10 ++ drivers/hwmon/Makefile | 1 + drivers/hwmon/thmc50.c | 440 +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 525 insertions(+) create mode 100644 Documentation/hwmon/thmc50 create mode 100644 drivers/hwmon/thmc50.c (limited to 'Documentation') diff --git a/Documentation/hwmon/thmc50 b/Documentation/hwmon/thmc50 new file mode 100644 index 000000000000..9639ca93d559 --- /dev/null +++ b/Documentation/hwmon/thmc50 @@ -0,0 +1,74 @@ +Kernel driver thmc50 +===================== + +Supported chips: + * Analog Devices ADM1022 + Prefix: 'adm1022' + Addresses scanned: I2C 0x2c - 0x2e + Datasheet: http://www.analog.com/en/prod/0,2877,ADM1022,00.html + * Texas Instruments THMC50 + Prefix: 'thmc50' + Addresses scanned: I2C 0x2c - 0x2e + Datasheet: http://focus.ti.com/docs/prod/folders/print/thmc50.html + +Author: Krzysztof Helt + +This driver was derived from the 2.4 kernel thmc50.c source file. + +Credits: + thmc50.c (2.4 kernel): + Frodo Looijaard + Philip Edelbrock + +Module Parameters +----------------- + +* adm1022_temp3: short array + List of adapter,address pairs to force chips into ADM1022 mode with + second remote temperature. This does not work for original THMC50 chips. + +Description +----------- + +The THMC50 implements: an internal temperature sensor, support for an +external diode-type temperature sensor (compatible w/ the diode sensor inside +many processors), and a controllable fan/analog_out DAC. For the temperature +sensors, limits can be set through the appropriate Overtemperature Shutdown +register and Hysteresis register. Each value can be set and read to half-degree +accuracy. An alarm is issued (usually to a connected LM78) when the +temperature gets higher than the Overtemperature Shutdown value; it stays on +until the temperature falls below the Hysteresis value.
All temperatures are in +degrees Celsius, and are guaranteed within a range of -55 to +125 degrees. + +The THMC50 only updates its values every 1.5 seconds; reading it more often +will do no harm, but will return 'old' values. + +The THMC50 is usually used in combination with LM78-like chips, to measure +the temperature of the processor(s). + +The ADM1022 works the same as THMC50 but it is faster (5 Hz instead of +1 Hz for THMC50). It can also be put in a new mode to handle an additional +remote temperature sensor. The driver uses the mode set by the BIOS by default. + +In case the BIOS is broken and the mode is set incorrectly, you can force +the mode with additional remote temperature with the adm1022_temp3 parameter. +A typical symptom of a wrong setting is a fan forced to full speed. + +Driver Features +--------------- + +The driver provides up to three temperatures: + +temp1 -- internal +temp2 -- remote +temp3 -- 2nd remote only for ADM1022 + +pwm1 -- fan speed (0 = stop, 255 = full) +pwm1_mode -- always 0 (DC mode) + +The value of 0 for pwm1 also forces the FAN_OFF signal from the chip, +so it stops fans even if writing the value 0 into the ANALOG_OUT register does not. + +The driver was tested on Compaq AP550 with two ADM1022 chips (one works +in the temp3 mode), five temperature readings and two fans. + diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig index dbdca6f10e46..192953b29b28 100644 --- a/drivers/hwmon/Kconfig +++ b/drivers/hwmon/Kconfig @@ -520,6 +520,16 @@ config SENSORS_SMSC47B397 This driver can also be built as a module. If so, the module will be called smsc47b397. +config SENSORS_THMC50 + tristate "Texas Instruments THMC50 / Analog Devices ADM1022" + depends on I2C && EXPERIMENTAL + help + If you say yes here you get support for Texas Instruments THMC50 + sensor chips and clones: the Analog Devices ADM1022. + + This driver can also be built as a module. If so, the module + will be called thmc50.
+ config SENSORS_VIA686A tristate "VIA686A" depends on PCI diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile index 59f81fae40a0..d04f90031ebf 100644 --- a/drivers/hwmon/Makefile +++ b/drivers/hwmon/Makefile @@ -56,6 +56,7 @@ obj-$(CONFIG_SENSORS_SIS5595) += sis5595.o obj-$(CONFIG_SENSORS_SMSC47B397)+= smsc47b397.o obj-$(CONFIG_SENSORS_SMSC47M1) += smsc47m1.o obj-$(CONFIG_SENSORS_SMSC47M192)+= smsc47m192.o +obj-$(CONFIG_SENSORS_THMC50) += thmc50.o obj-$(CONFIG_SENSORS_VIA686A) += via686a.o obj-$(CONFIG_SENSORS_VT1211) += vt1211.o obj-$(CONFIG_SENSORS_VT8231) += vt8231.o diff --git a/drivers/hwmon/thmc50.c b/drivers/hwmon/thmc50.c new file mode 100644 index 000000000000..9395b52d9b99 --- /dev/null +++ b/drivers/hwmon/thmc50.c @@ -0,0 +1,440 @@ +/* + thmc50.c - Part of lm_sensors, Linux kernel modules for hardware + monitoring + Copyright (C) 2007 Krzysztof Helt + Based on 2.4 driver by Frodo Looijaard and + Philip Edelbrock + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
+*/ + +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_LICENSE("GPL"); + +/* Addresses to scan */ +static unsigned short normal_i2c[] = { 0x2c, 0x2d, 0x2e, I2C_CLIENT_END }; + +/* Insmod parameters */ +I2C_CLIENT_INSMOD_2(thmc50, adm1022); +I2C_CLIENT_MODULE_PARM(adm1022_temp3, "List of adapter,address pairs " + "to enable 3rd temperature (ADM1022 only)"); + +/* Many THMC50 constants specified below */ + +/* The THMC50 registers */ +#define THMC50_REG_CONF 0x40 +#define THMC50_REG_COMPANY_ID 0x3E +#define THMC50_REG_DIE_CODE 0x3F +#define THMC50_REG_ANALOG_OUT 0x19 + +static const u8 THMC50_REG_TEMP[] = { 0x27, 0x26, 0x20 }; +static const u8 THMC50_REG_TEMP_MIN[] = { 0x3A, 0x38, 0x2C }; +static const u8 THMC50_REG_TEMP_MAX[] = { 0x39, 0x37, 0x2B }; + +#define THMC50_REG_CONF_nFANOFF 0x20 + +/* Each client has this additional data */ +struct thmc50_data { + struct i2c_client client; + struct class_device *class_dev; + + struct mutex update_lock; + enum chips type; + unsigned long last_updated; /* In jiffies */ + char has_temp3; /* !=0 if it is ADM1022 in temp3 mode */ + char valid; /* !=0 if following fields are valid */ + + /* Register values */ + s8 temp_input[3]; + s8 temp_max[3]; + s8 temp_min[3]; + u8 analog_out; +}; + +static int thmc50_attach_adapter(struct i2c_adapter *adapter); +static int thmc50_detach_client(struct i2c_client *client); +static void thmc50_init_client(struct i2c_client *client); +static struct thmc50_data *thmc50_update_device(struct device *dev); + +static struct i2c_driver thmc50_driver = { + .driver = { + .name = "thmc50", + }, + .attach_adapter = thmc50_attach_adapter, + .detach_client = thmc50_detach_client, +}; + +static ssize_t show_analog_out(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct thmc50_data *data = thmc50_update_device(dev); + return sprintf(buf, "%d\n", data->analog_out); +} + +static ssize_t set_analog_out(struct device *dev, + struct device_attribute
*attr, + const char *buf, size_t count) +{ + struct i2c_client *client = to_i2c_client(dev); + struct thmc50_data *data = i2c_get_clientdata(client); + int tmp = simple_strtoul(buf, NULL, 10); + int config; + + mutex_lock(&data->update_lock); + data->analog_out = SENSORS_LIMIT(tmp, 0, 255); + i2c_smbus_write_byte_data(client, THMC50_REG_ANALOG_OUT, + data->analog_out); + + config = i2c_smbus_read_byte_data(client, THMC50_REG_CONF); + if (data->analog_out == 0) + config &= ~THMC50_REG_CONF_nFANOFF; + else + config |= THMC50_REG_CONF_nFANOFF; + i2c_smbus_write_byte_data(client, THMC50_REG_CONF, config); + + mutex_unlock(&data->update_lock); + return count; +} + +/* There is only one PWM mode = DC */ +static ssize_t show_pwm_mode(struct device *dev, struct device_attribute *attr, + char *buf) +{ + return sprintf(buf, "0\n"); +} + +/* Temperatures */ +static ssize_t show_temp(struct device *dev, struct device_attribute *attr, + char *buf) +{ + int nr = to_sensor_dev_attr(attr)->index; + struct thmc50_data *data = thmc50_update_device(dev); + return sprintf(buf, "%d\n", data->temp_input[nr] * 1000); +} + +static ssize_t show_temp_min(struct device *dev, struct device_attribute *attr, + char *buf) +{ + int nr = to_sensor_dev_attr(attr)->index; + struct thmc50_data *data = thmc50_update_device(dev); + return sprintf(buf, "%d\n", data->temp_min[nr] * 1000); +} + +static ssize_t set_temp_min(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + int nr = to_sensor_dev_attr(attr)->index; + struct i2c_client *client = to_i2c_client(dev); + struct thmc50_data *data = i2c_get_clientdata(client); + int val = simple_strtol(buf, NULL, 10); + + mutex_lock(&data->update_lock); + data->temp_min[nr] = SENSORS_LIMIT(val / 1000, -128, 127); + i2c_smbus_write_byte_data(client, THMC50_REG_TEMP_MIN[nr], + data->temp_min[nr]); + mutex_unlock(&data->update_lock); + return count; +} + +static ssize_t show_temp_max(struct device *dev, struct device_attribute 
*attr, + char *buf) +{ + int nr = to_sensor_dev_attr(attr)->index; + struct thmc50_data *data = thmc50_update_device(dev); + return sprintf(buf, "%d\n", data->temp_max[nr] * 1000); +} + +static ssize_t set_temp_max(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + int nr = to_sensor_dev_attr(attr)->index; + struct i2c_client *client = to_i2c_client(dev); + struct thmc50_data *data = i2c_get_clientdata(client); + int val = simple_strtol(buf, NULL, 10); + + mutex_lock(&data->update_lock); + data->temp_max[nr] = SENSORS_LIMIT(val / 1000, -128, 127); + i2c_smbus_write_byte_data(client, THMC50_REG_TEMP_MAX[nr], + data->temp_max[nr]); + mutex_unlock(&data->update_lock); + return count; +} + +#define temp_reg(offset) \ +static SENSOR_DEVICE_ATTR(temp##offset##_input, S_IRUGO, show_temp, \ + NULL, offset - 1); \ +static SENSOR_DEVICE_ATTR(temp##offset##_min, S_IRUGO | S_IWUSR, \ + show_temp_min, set_temp_min, offset - 1); \ +static SENSOR_DEVICE_ATTR(temp##offset##_max, S_IRUGO | S_IWUSR, \ + show_temp_max, set_temp_max, offset - 1); + +temp_reg(1); +temp_reg(2); +temp_reg(3); + +static SENSOR_DEVICE_ATTR(pwm1, S_IRUGO | S_IWUSR, show_analog_out, + set_analog_out, 0); +static SENSOR_DEVICE_ATTR(pwm1_mode, S_IRUGO, show_pwm_mode, NULL, 0); + +static struct attribute *thmc50_attributes[] = { + &sensor_dev_attr_temp1_max.dev_attr.attr, + &sensor_dev_attr_temp1_min.dev_attr.attr, + &sensor_dev_attr_temp1_input.dev_attr.attr, + &sensor_dev_attr_temp2_max.dev_attr.attr, + &sensor_dev_attr_temp2_min.dev_attr.attr, + &sensor_dev_attr_temp2_input.dev_attr.attr, + &sensor_dev_attr_pwm1.dev_attr.attr, + &sensor_dev_attr_pwm1_mode.dev_attr.attr, + NULL +}; + +static const struct attribute_group thmc50_group = { + .attrs = thmc50_attributes, +}; + +/* for ADM1022 3rd temperature mode */ +static struct attribute *adm1022_attributes[] = { + &sensor_dev_attr_temp3_max.dev_attr.attr, + &sensor_dev_attr_temp3_min.dev_attr.attr, + 
&sensor_dev_attr_temp3_input.dev_attr.attr, + NULL +}; + +static const struct attribute_group adm1022_group = { + .attrs = adm1022_attributes, +}; + +static int thmc50_detect(struct i2c_adapter *adapter, int address, int kind) +{ + unsigned company; + unsigned revision; + unsigned config; + struct i2c_client *client; + struct thmc50_data *data; + struct device *dev; + int err = 0; + const char *type_name = ""; + + if (!i2c_check_functionality(adapter, I2C_FUNC_SMBUS_BYTE_DATA)) { + pr_debug("thmc50: detect failed, " + "smbus byte data not supported!\n"); + goto exit; + } + + /* OK. For now, we presume we have a valid client. We now create the + client structure, even though we cannot fill it completely yet. + But it allows us to access thmc50 registers. */ + if (!(data = kzalloc(sizeof(struct thmc50_data), GFP_KERNEL))) { + pr_debug("thmc50: detect failed, kzalloc failed!\n"); + err = -ENOMEM; + goto exit; + } + + client = &data->client; + i2c_set_clientdata(client, data); + client->addr = address; + client->adapter = adapter; + client->driver = &thmc50_driver; + dev = &client->dev; + + pr_debug("thmc50: Probing for THMC50 at 0x%2X on bus %d\n", + client->addr, i2c_adapter_id(client->adapter)); + + /* Now, we do the remaining detection. 
*/ + company = i2c_smbus_read_byte_data(client, THMC50_REG_COMPANY_ID); + revision = i2c_smbus_read_byte_data(client, THMC50_REG_DIE_CODE); + config = i2c_smbus_read_byte_data(client, THMC50_REG_CONF); + + if (kind == 0) + kind = thmc50; + else if (kind < 0) { + err = -ENODEV; + if (revision >= 0xc0 && ((config & 0x10) == 0)) { + if (company == 0x49) { + kind = thmc50; + err = 0; + } else if (company == 0x41) { + kind = adm1022; + err = 0; + } + } + } + if (err == -ENODEV) { + pr_debug("thmc50: Detection of THMC50/ADM1022 failed\n"); + goto exit_free; + } + data->type = kind; + + if (kind == thmc50) + type_name = "thmc50"; + else if (kind == adm1022) { + int id = i2c_adapter_id(client->adapter); + int i; + + type_name = "adm1022"; + data->has_temp3 = (config >> 7) & 1; /* config MSB */ + for (i = 0; i + 1 < adm1022_temp3_num; i += 2) + if (adm1022_temp3[i] == id && + adm1022_temp3[i + 1] == address) { + /* enable 2nd remote temp */ + data->has_temp3 = 1; + break; + } + } + /* Report the chip only after type_name has been set */ + pr_debug("thmc50: Detected %s (version %x, revision %x)\n", + type_name, (revision >> 4) - 0xc, revision & 0xf); + + /* Fill in the remaining client fields & put it into the global list */ + strlcpy(client->name, type_name, I2C_NAME_SIZE); + mutex_init(&data->update_lock); + + /* Tell the I2C layer a new client has arrived */ + if ((err = i2c_attach_client(client))) + goto exit_free; + + thmc50_init_client(client); + + /* Register sysfs hooks */ + if ((err = sysfs_create_group(&client->dev.kobj, &thmc50_group))) + goto exit_detach; + + /* Register ADM1022 sysfs hooks */ + if (data->type == adm1022) + if ((err = sysfs_create_group(&client->dev.kobj, + &adm1022_group))) + goto exit_remove_sysfs_thmc50; + + /* Register a new directory entry with module sensors */ + data->class_dev = hwmon_device_register(&client->dev); + if (IS_ERR(data->class_dev)) { + err = PTR_ERR(data->class_dev); + goto exit_remove_sysfs; + } + + return 0; + +exit_remove_sysfs: + if (data->type == adm1022) +
sysfs_remove_group(&client->dev.kobj, &adm1022_group); +exit_remove_sysfs_thmc50: + sysfs_remove_group(&client->dev.kobj, &thmc50_group); +exit_detach: + i2c_detach_client(client); +exit_free: + kfree(data); +exit: + return err; +} + +static int thmc50_attach_adapter(struct i2c_adapter *adapter) +{ + if (!(adapter->class & I2C_CLASS_HWMON)) + return 0; + return i2c_probe(adapter, &addr_data, thmc50_detect); +} + +static int thmc50_detach_client(struct i2c_client *client) +{ + struct thmc50_data *data = i2c_get_clientdata(client); + int err; + + hwmon_device_unregister(data->class_dev); + sysfs_remove_group(&client->dev.kobj, &thmc50_group); + if (data->type == adm1022) + sysfs_remove_group(&client->dev.kobj, &adm1022_group); + + if ((err = i2c_detach_client(client))) + return err; + + kfree(data); + + return 0; +} + +static void thmc50_init_client(struct i2c_client *client) +{ + struct thmc50_data *data = i2c_get_clientdata(client); + int config; + + data->analog_out = i2c_smbus_read_byte_data(client, + THMC50_REG_ANALOG_OUT); + /* set up to at least 1 */ + if (data->analog_out == 0) { + data->analog_out = 1; + i2c_smbus_write_byte_data(client, THMC50_REG_ANALOG_OUT, + data->analog_out); + } + config = i2c_smbus_read_byte_data(client, THMC50_REG_CONF); + config |= 0x1; /* start the chip if it is in standby mode */ + if (data->has_temp3) + config |= 0x80; /* enable 2nd remote temp */ + i2c_smbus_write_byte_data(client, THMC50_REG_CONF, config); +} + +static struct thmc50_data *thmc50_update_device(struct device *dev) +{ + struct i2c_client *client = to_i2c_client(dev); + struct thmc50_data *data = i2c_get_clientdata(client); + int timeout = HZ / 5 + (data->type == thmc50 ? HZ : 0); + + mutex_lock(&data->update_lock); + + if (time_after(jiffies, data->last_updated + timeout) + || !data->valid) { + + int temps = data->has_temp3 ? 
3 : 2; + int i; + for (i = 0; i < temps; i++) { + data->temp_input[i] = i2c_smbus_read_byte_data(client, + THMC50_REG_TEMP[i]); + data->temp_max[i] = i2c_smbus_read_byte_data(client, + THMC50_REG_TEMP_MAX[i]); + data->temp_min[i] = i2c_smbus_read_byte_data(client, + THMC50_REG_TEMP_MIN[i]); + } + data->analog_out = + i2c_smbus_read_byte_data(client, THMC50_REG_ANALOG_OUT); + data->last_updated = jiffies; + data->valid = 1; + } + + mutex_unlock(&data->update_lock); + + return data; +} + +static int __init sm_thmc50_init(void) +{ + return i2c_add_driver(&thmc50_driver); +} + +static void __exit sm_thmc50_exit(void) +{ + i2c_del_driver(&thmc50_driver); +} + +MODULE_AUTHOR("Krzysztof Helt "); +MODULE_DESCRIPTION("THMC50 driver"); + +module_init(sm_thmc50_init); +module_exit(sm_thmc50_exit); -- cgit v1.2.3 From 517ef0d2a470c69b303c66694b0c45f31ff716cd Mon Sep 17 00:00:00 2001 From: Jean Delvare Date: Fri, 13 Jul 2007 14:29:41 +0200 Subject: hwmon: (adm1031) Fix broken links in documentation The Analog Devices chip information pages moved to a different location. Signed-off-by: Jean Delvare Signed-off-by: Mark M. 
Hoffman --- Documentation/hwmon/adm1031 | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/hwmon/adm1031 b/Documentation/hwmon/adm1031 index 130a38382b98..be92a77da1d5 100644 --- a/Documentation/hwmon/adm1031 +++ b/Documentation/hwmon/adm1031 @@ -6,13 +6,13 @@ Supported chips: Prefix: 'adm1030' Addresses scanned: I2C 0x2c to 0x2e Datasheet: Publicly available at the Analog Devices website - http://products.analog.com/products/info.asp?product=ADM1030 + http://www.analog.com/en/prod/0%2C2877%2CADM1030%2C00.html * Analog Devices ADM1031 Prefix: 'adm1031' Addresses scanned: I2C 0x2c to 0x2e Datasheet: Publicly available at the Analog Devices website - http://products.analog.com/products/info.asp?product=ADM1031 + http://www.analog.com/en/prod/0%2C2877%2CADM1031%2C00.html Authors: Alexandre d'Alton -- cgit v1.2.3 From 7f8e00f2b9797ce7235634431d65269d21ef80d2 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 31 Jul 2007 00:37:26 -0700 Subject: update dontdiff file Updates based on recent .gitignore updates: *.o.*: Says Alexey Dobriyan: These are presumably temporary gcc files, which aren't interesting. 
setup.bin, setup.elf: new x86 boot code files (from Matthew Wilcox)

Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/dontdiff | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/dontdiff b/Documentation/dontdiff
index 595a5ea4c690..7b9551fc6fe3 100644
--- a/Documentation/dontdiff
+++ b/Documentation/dontdiff
@@ -18,6 +18,7 @@
 *.moc
 *.mod.c
 *.o
+*.o.*
 *.orig
 *.out
 *.pdf
@@ -163,6 +164,8 @@ raid6tables.c
 relocs
 series
 setup
+setup.bin
+setup.elf
 sim710_d.h*
 sImage
 sm_tbl*
-- cgit v1.2.3
From c8facbb62111f9333d00870b0d523f5036822d04 Mon Sep 17 00:00:00 2001
From: Randy Dunlap
Date: Tue, 31 Jul 2007 00:37:40 -0700
Subject: various doc/kernel-parameters fixes

- tell what APIC (by request), MTD, & PARIDE mean
- correct some source file names
- remove IA64 "llsc*=" (seems to have been removed from source tree)
- remove SCSI "53c7xx=" (driver already removed)

Signed-off-by: Randy Dunlap
Acked-by: Jesper Juhl
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/kernel-parameters.txt | 21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1156653338fe..00254c018a25 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -58,14 +58,14 @@ parameter is applicable:
  MDA MDA console support is enabled.
  MOUSE Appropriate mouse support is enabled.
  MSI Message Signaled Interrupts (PCI).
- MTD MTD support is enabled.
+ MTD MTD (Memory Technology Device) support is enabled.
  NET Appropriate network support is enabled.
  NUMA NUMA support is enabled.
  GENERIC_TIME The generic timeofday code is enabled.
  NFS Appropriate NFS support is enabled.
  OSS OSS sound support is enabled.
- PV_OPS A paravirtualized kernel
- PARIDE The ParIDE subsystem is enabled.
+ PV_OPS A paravirtualized kernel is enabled. + PARIDE The ParIDE (parallel port IDE) subsystem is enabled. PARISC The PA-RISC architecture is enabled. PCI PCI bus support is enabled. PCMCIA The PCMCIA subsystem is enabled. @@ -123,10 +123,6 @@ and is between 256 and 4096 characters. It is defined in the file ./include/asm/setup.h as COMMAND_LINE_SIZE. - 53c7xx= [HW,SCSI] Amiga SCSI controllers - See header of drivers/scsi/53c7xx.c. - See also Documentation/scsi/ncr53c7xx.txt. - acpi= [HW,ACPI,X86-64,i386] Advanced Configuration and Power Interface Format: { force | off | ht | strict | noirq } @@ -286,7 +282,8 @@ and is between 256 and 4096 characters. It is defined in the file not play well with APC CPU idle - disable it if you have APC and your system crashes randomly. - apic= [APIC,i386] Change the output verbosity whilst booting + apic= [APIC,i386] Advanced Programmable Interrupt Controller + Change the output verbosity whilst booting Format: { quiet (default) | verbose | debug } Change the amount of debugging information output when initialising the APIC and IO-APIC components. @@ -775,7 +772,8 @@ and is between 256 and 4096 characters. It is defined in the file See Documentation/nfsroot.txt. ip2= [HW] Set IO/IRQ pairs for up to 4 IntelliPort boards - See comment before ip2_setup() in drivers/char/ip2.c. + See comment before ip2_setup() in + drivers/char/ip2/ip2base.c. ips= [HW,SCSI] Adaptec / IBM ServeRAID controller See header of drivers/scsi/ips.c. @@ -871,9 +869,6 @@ and is between 256 and 4096 characters. It is defined in the file if PNPBIOS or ACPI should describe them. This is for working around firmware defects. - llsc*= [IA64] See function print_params() in - arch/ia64/sn/kernel/llsc4.c. - load_ramdisk= [RAM] List of ramdisks to load from floppy See Documentation/ramdisk.txt. @@ -1046,7 +1041,7 @@ and is between 256 and 4096 characters. It is defined in the file ,[,,,,] mtdparts= [MTD] - See drivers/mtd/cmdline.c. + See drivers/mtd/cmdlinepart.c. 
mtouchusb.raw_coordinates= [HW] Make the MicroTouch USB driver use raw coordinates -- cgit v1.2.3 From b8a367935fc649c071a91c648c4a9c892f72113e Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 31 Jul 2007 00:37:50 -0700 Subject: pnp: fix kernel-doc warnings Fix PNP docbook warnings: Warning(linux-2623-rc1g4//drivers/pnp/core.c): no structured comments found Warning(linux-2623-rc1g4//drivers/pnp/driver.c): no structured comments found Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/DocBook/kernel-api.tmpl | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index ec7c498b69fc..31bf1eabc0dc 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -398,12 +398,12 @@ X!Edrivers/acpi/pci_bind.c --> Device drivers PnP support -!Edrivers/pnp/core.c +!Idrivers/pnp/core.c !Edrivers/pnp/card.c -!Edrivers/pnp/driver.c +!Idrivers/pnp/driver.c !Edrivers/pnp/manager.c !Edrivers/pnp/support.c -- cgit v1.2.3 From cd4f0ef7c03e79f92a883843662e3d0eaae26fb4 Mon Sep 17 00:00:00 2001 From: Alan Cox Date: Tue, 31 Jul 2007 00:37:59 -0700 Subject: doc/kernel-parameters: use X86-32 tag instead of IA-32 Signed-off-by: Alan Cox Acked-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 114 ++++++++++++++++++------------------ 1 file changed, 57 insertions(+), 57 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 00254c018a25..d763ebe11afe 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -41,7 +41,6 @@ parameter is applicable: EIDE EIDE/ATAPI support is enabled. FB The frame buffer device is enabled. HW Appropriate hardware is enabled. 
- IA-32 IA-32 aka i386 architecture is enabled. IA-64 IA-64 architecture is enabled. IOSCHED More than one I/O scheduler is enabled. IP_PNP IP DHCP, BOOTP, or RARP is enabled. @@ -92,6 +91,7 @@ parameter is applicable: VT Virtual terminal support is enabled. WDT Watchdog support is enabled. XT IBM PC/XT MFM hard disk support is enabled. + X86-32 X86-32, aka i386 architecture is enabled. X86-64 X86-64 architecture is enabled. More X86-64 boot options can be found in Documentation/x86_64/boot-options.txt . @@ -219,7 +219,7 @@ and is between 256 and 4096 characters. It is defined in the file acpi_fake_ecdt [HW,ACPI] Workaround failure due to BIOS lacking ECDT - acpi_pm_good [IA-32,X86-64] + acpi_pm_good [X86-32,X86-64] Override the pmtimer bug detection: force the kernel to assume that this machine's pmtimer latches its value and always returns good values. @@ -357,7 +357,7 @@ and is between 256 and 4096 characters. It is defined in the file c101= [NET] Moxa C101 synchronous serial card - cachesize= [BUGS=IA-32] Override level 2 CPU cache size detection. + cachesize= [BUGS=X86-32] Override level 2 CPU cache size detection. Sometimes CPU hardware bugs make them report the cache size incorrectly. The kernel will attempt work arounds to fix known problems, but for some CPUs it is not @@ -376,7 +376,7 @@ and is between 256 and 4096 characters. It is defined in the file Value can be changed at runtime via /selinux/checkreqprot. - clock= [BUGS=IA-32, HW] gettimeofday clocksource override. + clock= [BUGS=X86-32, HW] gettimeofday clocksource override. [Deprecated] Forces specified clocksource (if available) to be used when calculating gettimeofday(). If specified @@ -394,7 +394,7 @@ and is between 256 and 4096 characters. 
It is defined in the file [ARM] imx_timer1,OSTS,netx_timer,mpu_timer2, pxa_timer,timer3,32k_counter,timer0_1 [AVR32] avr32 - [IA-32] pit,hpet,tsc,vmi-timer; + [X86-32] pit,hpet,tsc,vmi-timer; scx200_hrt on Geode; cyclone on IBM x440 [MIPS] MIPS [PARISC] cr16 @@ -414,7 +414,7 @@ and is between 256 and 4096 characters. It is defined in the file over the 8254 in addition to over the IO-APIC. The kernel tries to set a sensible default. - hpet= [IA-32,HPET] option to disable HPET and use PIT. + hpet= [X86-32,HPET] option to disable HPET and use PIT. Format: disable com20020= [HW,NET] ARCnet - COM20020 chipset @@ -551,7 +551,7 @@ and is between 256 and 4096 characters. It is defined in the file dtc3181e= [HW,SCSI] - earlyprintk= [IA-32,X86-64,SH] + earlyprintk= [X86-32,X86-64,SH] earlyprintk=vga earlyprintk=serial[,ttySn[,baudrate]] @@ -589,7 +589,7 @@ and is between 256 and 4096 characters. It is defined in the file eisa_irq_edge= [PARISC,HW] See header of drivers/parisc/eisa.c. - elanfreq= [IA-32] + elanfreq= [X86-32] See comment before function elanfreq_setup() in arch/i386/kernel/cpu/cpufreq/elanfreq.c. @@ -598,7 +598,7 @@ and is between 256 and 4096 characters. It is defined in the file See Documentation/block/as-iosched.txt and Documentation/block/deadline-iosched.txt for details. - elfcorehdr= [IA-32, X86_64] + elfcorehdr= [X86-32, X86_64] Specifies physical address of start of kernel core image elf header. Generally kexec loader will pass this option to capture kernel. @@ -680,7 +680,7 @@ and is between 256 and 4096 characters. It is defined in the file hisax= [HW,ISDN] See Documentation/isdn/README.HiSax. - hugepages= [HW,IA-32,IA-64] Maximal number of HugeTLB pages. + hugepages= [HW,X86-32,IA-64] Maximal number of HugeTLB pages. i8042.direct [HW] Put keyboard port into non-translated mode i8042.dumbkbd [HW] Pretend that controller can only read data from @@ -822,7 +822,7 @@ and is between 256 and 4096 characters. 
It is defined in the file js= [HW,JOY] Analog joystick See Documentation/input/joystick.txt. - kernelcore=nn[KMG] [KNL,IA-32,IA-64,PPC,X86-64] This parameter + kernelcore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter specifies the amount of memory usable by the kernel for non-movable allocations. The requested amount is spread evenly throughout all nodes in the system. The @@ -838,7 +838,7 @@ and is between 256 and 4096 characters. It is defined in the file use the HighMem zone if it exists, and the Normal zone if it does not. - movablecore=nn[KMG] [KNL,IA-32,IA-64,PPC,X86-64] This parameter + movablecore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter is similar to kernelcore except it specifies the amount of memory used for migratable allocations. If both kernelcore and movablecore is specified, @@ -850,21 +850,21 @@ and is between 256 and 4096 characters. It is defined in the file keepinitrd [HW,ARM] - kstack=N [IA-32,X86-64] Print N words from the kernel stack + kstack=N [X86-32,X86-64] Print N words from the kernel stack in oops dumps. l2cr= [PPC] - lapic [IA-32,APIC] Enable the local APIC even if BIOS + lapic [X86-32,APIC] Enable the local APIC even if BIOS disabled it. - lapic_timer_c2_ok [IA-32,x86-64,APIC] trust the local apic timer in + lapic_timer_c2_ok [X86-32,x86-64,APIC] trust the local apic timer in C2 power state. lasi= [HW,SCSI] PARISC LASI driver for the 53c700 chip Format: addr:,irq: - legacy_serial.force [HW,IA-32,X86-64] + legacy_serial.force [HW,X86-32,X86-64] Probe for COM ports at legacy addresses even if PNPBIOS or ACPI should describe them. This is for working around firmware defects. @@ -974,11 +974,11 @@ and is between 256 and 4096 characters. It is defined in the file [SCSI] Maximum number of LUNs received. Should be between 1 and 16384. 
- mca-pentium [BUGS=IA-32] + mca-pentium [BUGS=X86-32] mcatest= [IA-64] - mce [IA-32] Machine Check Exception + mce [X86-32] Machine Check Exception md= [HW] RAID subsystems devices and level See Documentation/md.txt. @@ -990,14 +990,14 @@ and is between 256 and 4096 characters. It is defined in the file mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory Amount of memory to be used when the kernel is not able to see the whole system memory or for test. - [IA-32] Use together with memmap= to avoid physical + [X86-32] Use together with memmap= to avoid physical address space collisions. Without memmap= PCI devices could be placed at addresses belonging to unused RAM. - mem=nopentium [BUGS=IA-32] Disable usage of 4MB pages for kernel + mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel memory. - memmap=exactmap [KNL,IA-32,X86_64] Enable setting of an exact + memmap=exactmap [KNL,X86-32,X86_64] Enable setting of an exact E820 memory map, as specified by the user. Such memmap=exactmap lines can be constructed based on BIOS output or other requirements. See the memmap=nn@ss @@ -1083,9 +1083,9 @@ and is between 256 and 4096 characters. It is defined in the file [NFS] set the maximum lifetime for idmapper cache entries. - nmi_watchdog= [KNL,BUGS=IA-32] Debugging features for SMP kernels + nmi_watchdog= [KNL,BUGS=X86-32] Debugging features for SMP kernels - no387 [BUGS=IA-32] Tells the kernel to use the 387 maths + no387 [BUGS=X86-32] Tells the kernel to use the 387 maths emulation library even if a 387 maths coprocessor is present. @@ -1116,17 +1116,17 @@ and is between 256 and 4096 characters. It is defined in the file noexec [IA-64] - noexec [IA-32,X86-64] + noexec [X86-32,X86-64] noexec=on: enable non-executable mappings (default) noexec=off: disable nn-executable mappings - nofxsr [BUGS=IA-32] Disables x86 floating point extended + nofxsr [BUGS=X86-32] Disables x86 floating point extended register save and restore. 
The kernel will only save legacy floating-point registers on task switch. nohlt [BUGS=ARM] - no-hlt [BUGS=IA-32] Tells the kernel that the hlt + no-hlt [BUGS=X86-32] Tells the kernel that the hlt instruction doesn't work correctly and not to use it. @@ -1141,12 +1141,12 @@ and is between 256 and 4096 characters. It is defined in the file Valid arguments: on, off Default: on - noirqbalance [IA-32,SMP,KNL] Disable kernel irq balancing + noirqbalance [X86-32,SMP,KNL] Disable kernel irq balancing - noirqdebug [IA-32] Disables the code which attempts to detect and + noirqdebug [X86-32] Disables the code which attempts to detect and disable unhandled interrupt sources. - no_timer_check [IA-32,X86_64,APIC] Disables the code which tests for + no_timer_check [X86-32,X86_64,APIC] Disables the code which tests for broken timer IRQ sources. noisapnp [ISAPNP] Disables ISA PnP code. @@ -1158,20 +1158,20 @@ and is between 256 and 4096 characters. It is defined in the file nojitter [IA64] Disables jitter checking for ITC timers. - nolapic [IA-32,APIC] Do not enable or use the local APIC. + nolapic [X86-32,APIC] Do not enable or use the local APIC. - nolapic_timer [IA-32,APIC] Do not use the local APIC timer. + nolapic_timer [X86-32,APIC] Do not use the local APIC timer. noltlbs [PPC] Do not use large page/tlb entries for kernel lowmem mapping on PPC40x. nomca [IA-64] Disable machine check abort handling - nomce [IA-32] Machine Check Exception + nomce [X86-32] Machine Check Exception - noreplace-paravirt [IA-32,PV_OPS] Don't patch paravirt_ops + noreplace-paravirt [X86-32,PV_OPS] Don't patch paravirt_ops - noreplace-smp [IA-32,SMP] Don't replace SMP instructions + noreplace-smp [X86-32,SMP] Don't replace SMP instructions with UP alternatives noresidual [PPC] Don't use residual data on PReP machines. @@ -1185,7 +1185,7 @@ and is between 256 and 4096 characters. It is defined in the file nosbagart [IA-64] - nosep [BUGS=IA-32] Disables x86 SYSENTER/SYSEXIT support. 
+ nosep [BUGS=X86-32] Disables x86 SYSENTER/SYSEXIT support. nosmp [SMP] Tells an SMP kernel to act as a UP kernel. @@ -1193,7 +1193,7 @@ and is between 256 and 4096 characters. It is defined in the file nosync [HW,M68K] Disables sync negotiation for all devices. - notsc [BUGS=IA-32] Disable Time Stamp Counter + notsc [BUGS=X86-32] Disable Time Stamp Counter nousb [USB] Disable the USB subsystem @@ -1266,28 +1266,28 @@ and is between 256 and 4096 characters. It is defined in the file See also Documentation/paride.txt. pci=option[,option...] [PCI] various PCI subsystem options: - off [IA-32] don't probe for the PCI bus - bios [IA-32] force use of PCI BIOS, don't access + off [X86-32] don't probe for the PCI bus + bios [X86-32] force use of PCI BIOS, don't access the hardware directly. Use this if your machine has a non-standard PCI host bridge. - nobios [IA-32] disallow use of PCI BIOS, only direct + nobios [X86-32] disallow use of PCI BIOS, only direct hardware access methods are allowed. Use this if you experience crashes upon bootup and you suspect they are caused by the BIOS. - conf1 [IA-32] Force use of PCI Configuration + conf1 [X86-32] Force use of PCI Configuration Mechanism 1. - conf2 [IA-32] Force use of PCI Configuration + conf2 [X86-32] Force use of PCI Configuration Mechanism 2. - nommconf [IA-32,X86_64] Disable use of MMCONFIG for PCI + nommconf [X86-32,X86_64] Disable use of MMCONFIG for PCI Configuration nomsi [MSI] If the PCI_MSI kernel config parameter is enabled, this kernel boot option can be used to disable the use of MSI interrupts system-wide. - nosort [IA-32] Don't sort PCI devices according to + nosort [X86-32] Don't sort PCI devices according to order given by the PCI BIOS. This sorting is done to get a device order compatible with older kernels. - biosirq [IA-32] Use PCI BIOS calls to get the interrupt + biosirq [X86-32] Use PCI BIOS calls to get the interrupt routing table. 
These calls are known to be buggy on several machines and they hang the machine when used, but on other computers it's the only @@ -1295,32 +1295,32 @@ and is between 256 and 4096 characters. It is defined in the file this option if the kernel is unable to allocate IRQs or discover secondary PCI buses on your motherboard. - rom [IA-32] Assign address space to expansion ROMs. + rom [X86-32] Assign address space to expansion ROMs. Use with caution as certain devices share address decoders between ROMs and other resources. - irqmask=0xMMMM [IA-32] Set a bit mask of IRQs allowed to be + irqmask=0xMMMM [X86-32] Set a bit mask of IRQs allowed to be assigned automatically to PCI devices. You can make the kernel exclude IRQs of your ISA cards this way. - pirqaddr=0xAAAAA [IA-32] Specify the physical address + pirqaddr=0xAAAAA [X86-32] Specify the physical address of the PIRQ table (normally generated by the BIOS) if it is outside the F0000h-100000h range. - lastbus=N [IA-32] Scan all buses thru bus #N. Can be + lastbus=N [X86-32] Scan all buses thru bus #N. Can be useful if the kernel is unable to find your secondary buses and you want to tell it explicitly which ones they are. - assign-busses [IA-32] Always assign all PCI bus + assign-busses [X86-32] Always assign all PCI bus numbers ourselves, overriding whatever the firmware may have done. - usepirqmask [IA-32] Honor the possible IRQ mask stored + usepirqmask [X86-32] Honor the possible IRQ mask stored in the BIOS $PIR table. This is needed on some systems with broken BIOSes, notably some HP Pavilion N5400 and Omnibook XE3 notebooks. This will have no effect if ACPI IRQ routing is enabled. - noacpi [IA-32] Do not use ACPI for IRQ routing + noacpi [X86-32] Do not use ACPI for IRQ routing or for PCI scanning. routeirq Do IRQ routing for all PCI devices. This is normally done in pci_enable_device(), @@ -1469,13 +1469,13 @@ and is between 256 and 4096 characters. 
It is defined in the file Run specified binary instead of /init from the ramdisk, used for early userspace startup. See initrd. - reboot= [BUGS=IA-32,BUGS=ARM,BUGS=IA-64] Rebooting mode + reboot= [BUGS=X86-32,BUGS=ARM,BUGS=IA-64] Rebooting mode Format: [,[,...]] See arch/*/kernel/reboot.c or arch/*/kernel/process.c reserve= [KNL,BUGS] Force the kernel to ignore some iomem area - reservetop= [IA-32] + reservetop= [X86-32] Format: nn[KMG] Reserves a hole at the top of the kernel virtual address space. @@ -1566,7 +1566,7 @@ and is between 256 and 4096 characters. It is defined in the file Value can be changed at runtime via /selinux/compat_net. - serialnumber [BUGS=IA-32] + serialnumber [BUGS=X86-32] sg_def_reserved_size= [SCSI] @@ -1619,7 +1619,7 @@ and is between 256 and 4096 characters. It is defined in the file smart2= [HW] Format: [,[,...,]] - smp-alt-once [IA-32,SMP] On a hotplug CPU system, only + smp-alt-once [X86-32,SMP] On a hotplug CPU system, only attempt to substitute SMP alternatives once at boot. smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices @@ -1884,7 +1884,7 @@ and is between 256 and 4096 characters. It is defined in the file usbhid.mousepoll= [USBHID] The interval which mice are to be polled at. - vdso= [IA-32,SH,x86-64] + vdso= [X86-32,SH,x86-64] vdso=2: enable compat VDSO (default with COMPAT_VDSO) vdso=1: enable VDSO (default) vdso=0: disable VDSO mapping @@ -1895,7 +1895,7 @@ and is between 256 and 4096 characters. It is defined in the file video= [FB] Frame buffer configuration See Documentation/fb/modedb.txt. - vga= [BOOT,IA-32] Select a particular video mode + vga= [BOOT,X86-32] Select a particular video mode See Documentation/i386/boot.txt and Documentation/svga.txt. Use vga=ask for menu. 
-- cgit v1.2.3
From 57d4810ea0d9ca58a7bcc1336607f0cede0a2abf Mon Sep 17 00:00:00 2001
From: Andrew Morton
Date: Tue, 31 Jul 2007 00:38:02 -0700
Subject: revert "x86, serial: convert legacy COM ports to platform devices"
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Revert 7e92b4fc345f5b6f57585fbe5ffdb0f24d7c9b26. It broke Sébastien Dugué's
machine and Jeff said (persuasively)

  This seems like it will break decades-long-working stuff, in favor of
  breaking new ground in our favorite area, "trusting the BIOS." It's just
  not worth it for serial ports, IMO.

  Serial ports are something that just shouldn't break at this late stage
  in the game. My new Intel platform boxes don't even have serial ports, so
  I question the value of messing with serial port probing even more...
  because... just wait a year, and your box won't have a serial port
  either! :)

  I certainly don't object to the use of platform devices (or isa_driver),
  but the probe change seems questionable. That's sorta analogous to
  rewriting the floppy driver probe routine. Sure you could do it... but
  why risk all that damage and go through debugging all over again?

  It seems clear from this report that we cannot, should not, trust BIOS
  for something (a) so simple and (b) that has been working for over a
  decade.

Much discussion ensued and we've decided to have another go at all of this.
Cc: Sébastien Dugué Cc: Bjorn Helgaas Cc: Len Brown Cc: Adam Belay Cc: Matthew Garrett Cc: Russell King Cc: Jeff Garzik Acked-by: Alan Cox Cc: Michal Piotrowski Cc: Sascha Sommer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 5 --- arch/i386/kernel/Makefile | 1 - arch/i386/kernel/legacy_serial.c | 67 ------------------------------------- arch/x86_64/kernel/Makefile | 2 -- drivers/serial/Kconfig | 14 +++----- include/asm-i386/serial.h | 16 +++++++++ include/asm-x86_64/serial.h | 16 +++++++++ 7 files changed, 37 insertions(+), 84 deletions(-) delete mode 100644 arch/i386/kernel/legacy_serial.c (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index d763ebe11afe..efdb42fd3fb8 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -864,11 +864,6 @@ and is between 256 and 4096 characters. It is defined in the file lasi= [HW,SCSI] PARISC LASI driver for the 53c700 chip Format: addr:,irq: - legacy_serial.force [HW,X86-32,X86-64] - Probe for COM ports at legacy addresses even - if PNPBIOS or ACPI should describe them. This - is for working around firmware defects. - load_ramdisk= [RAM] List of ramdisks to load from floppy See Documentation/ramdisk.txt. 
diff --git a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile index dbe5e87e0d66..9d33b00de659 100644 --- a/arch/i386/kernel/Makefile +++ b/arch/i386/kernel/Makefile @@ -35,7 +35,6 @@ obj-y += sysenter.o vsyscall.o obj-$(CONFIG_ACPI_SRAT) += srat.o obj-$(CONFIG_EFI) += efi.o efi_stub.o obj-$(CONFIG_DOUBLEFAULT) += doublefault.o -obj-$(CONFIG_SERIAL_8250) += legacy_serial.o obj-$(CONFIG_VM86) += vm86.o obj-$(CONFIG_EARLY_PRINTK) += early_printk.o obj-$(CONFIG_HPET_TIMER) += hpet.o diff --git a/arch/i386/kernel/legacy_serial.c b/arch/i386/kernel/legacy_serial.c deleted file mode 100644 index 21510118544e..000000000000 --- a/arch/i386/kernel/legacy_serial.c +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Legacy COM port devices for x86 platforms without PNPBIOS or ACPI. - * Data taken from include/asm-i386/serial.h. - * - * (c) Copyright 2007 Hewlett-Packard Development Company, L.P. - * Bjorn Helgaas - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. 
- */ -#include -#include -#include -#include - -/* Standard COM flags (except for COM4, because of the 8514 problem) */ -#ifdef CONFIG_SERIAL_DETECT_IRQ -#define COM_FLAGS (UPF_BOOT_AUTOCONF | UPF_SKIP_TEST | UPF_AUTO_IRQ) -#define COM4_FLAGS (UPF_BOOT_AUTOCONF | UPF_AUTO_IRQ) -#else -#define COM_FLAGS (UPF_BOOT_AUTOCONF | UPF_SKIP_TEST) -#define COM4_FLAGS UPF_BOOT_AUTOCONF -#endif - -#define PORT(_base,_irq,_flags) \ - { \ - .iobase = _base, \ - .irq = _irq, \ - .uartclk = 1843200, \ - .iotype = UPIO_PORT, \ - .flags = _flags, \ - } - -static struct plat_serial8250_port x86_com_data[] = { - PORT(0x3F8, 4, COM_FLAGS), - PORT(0x2F8, 3, COM_FLAGS), - PORT(0x3E8, 4, COM_FLAGS), - PORT(0x2E8, 3, COM4_FLAGS), - { }, -}; - -static struct platform_device x86_com_device = { - .name = "serial8250", - .id = PLAT8250_DEV_PLATFORM, - .dev = { - .platform_data = x86_com_data, - }, -}; - -static int force_legacy_probe; -module_param_named(force, force_legacy_probe, bool, 0); -MODULE_PARM_DESC(force, "Force legacy serial port probe"); - -static int __init serial8250_x86_com_init(void) -{ - if (pnp_platform_devices && !force_legacy_probe) - return -ENODEV; - - return platform_device_register(&x86_com_device); -} - -module_init(serial8250_x86_com_init); - -MODULE_AUTHOR("Bjorn Helgaas"); -MODULE_LICENSE("GPL"); -MODULE_DESCRIPTION("Generic 8250/16x50 legacy probe module"); diff --git a/arch/x86_64/kernel/Makefile b/arch/x86_64/kernel/Makefile index d1d18c1ea0f4..ff5d8c9b96d9 100644 --- a/arch/x86_64/kernel/Makefile +++ b/arch/x86_64/kernel/Makefile @@ -32,7 +32,6 @@ obj-$(CONFIG_EARLY_PRINTK) += early_printk.o obj-$(CONFIG_IOMMU) += pci-gart.o aperture.o obj-$(CONFIG_CALGARY_IOMMU) += pci-calgary.o tce.o obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o -obj-$(CONFIG_SERIAL_8250) += legacy_serial.o obj-$(CONFIG_KPROBES) += kprobes.o obj-$(CONFIG_X86_PM_TIMER) += pmtimer.o obj-$(CONFIG_X86_VSMP) += vsmp.o @@ -51,7 +50,6 @@ CFLAGS_vsyscall.o := $(PROFILING) -g0 therm_throt-y += 
../../i386/kernel/cpu/mcheck/therm_throt.o bootflag-y += ../../i386/kernel/bootflag.o -legacy_serial-y += ../../i386/kernel/legacy_serial.o cpuid-$(subst m,y,$(CONFIG_X86_CPUID)) += ../../i386/kernel/cpuid.o topology-y += ../../i386/kernel/topology.o microcode-$(subst m,y,$(CONFIG_MICROCODE)) += ../../i386/kernel/microcode.o diff --git a/drivers/serial/Kconfig b/drivers/serial/Kconfig index 18f629706448..819fc3efc468 100644 --- a/drivers/serial/Kconfig +++ b/drivers/serial/Kconfig @@ -88,21 +88,17 @@ config SERIAL_8250_PCI depends on SERIAL_8250 && PCI default SERIAL_8250 help - Say Y here if you have PCI serial ports. - - To compile this driver as a module, choose M here: the module - will be called 8250_pci. + This builds standard PCI serial support. You may be able to + disable this feature if you only need legacy serial support. + Saves about 9K. config SERIAL_8250_PNP tristate "8250/16550 PNP device support" if EMBEDDED depends on SERIAL_8250 && PNP default SERIAL_8250 help - Say Y here if you have serial ports described by PNPBIOS or ACPI. - These are typically ports built into the system board. - - To compile this driver as a module, choose M here: the module - will be called 8250_pnp. + This builds standard PNP serial support. You may be able to + disable this feature if you only need legacy serial support. config SERIAL_8250_HP300 tristate diff --git a/include/asm-i386/serial.h b/include/asm-i386/serial.h index 57a4306cdf63..bd67480ca109 100644 --- a/include/asm-i386/serial.h +++ b/include/asm-i386/serial.h @@ -11,3 +11,19 @@ * megabits/second; but this requires the faster clock. 
*/ #define BASE_BAUD ( 1843200 / 16 ) + +/* Standard COM flags (except for COM4, because of the 8514 problem) */ +#ifdef CONFIG_SERIAL_DETECT_IRQ +#define STD_COM_FLAGS (ASYNC_BOOT_AUTOCONF | ASYNC_SKIP_TEST | ASYNC_AUTO_IRQ) +#define STD_COM4_FLAGS (ASYNC_BOOT_AUTOCONF | ASYNC_AUTO_IRQ) +#else +#define STD_COM_FLAGS (ASYNC_BOOT_AUTOCONF | ASYNC_SKIP_TEST) +#define STD_COM4_FLAGS ASYNC_BOOT_AUTOCONF +#endif + +#define SERIAL_PORT_DFNS \ + /* UART CLK PORT IRQ FLAGS */ \ + { 0, BASE_BAUD, 0x3F8, 4, STD_COM_FLAGS }, /* ttyS0 */ \ + { 0, BASE_BAUD, 0x2F8, 3, STD_COM_FLAGS }, /* ttyS1 */ \ + { 0, BASE_BAUD, 0x3E8, 4, STD_COM_FLAGS }, /* ttyS2 */ \ + { 0, BASE_BAUD, 0x2E8, 3, STD_COM4_FLAGS }, /* ttyS3 */ diff --git a/include/asm-x86_64/serial.h b/include/asm-x86_64/serial.h index 8ebd765c674a..b0496e0d72a6 100644 --- a/include/asm-x86_64/serial.h +++ b/include/asm-x86_64/serial.h @@ -11,3 +11,19 @@ * megabits/second; but this requires the faster clock. */ #define BASE_BAUD ( 1843200 / 16 ) + +/* Standard COM flags (except for COM4, because of the 8514 problem) */ +#ifdef CONFIG_SERIAL_DETECT_IRQ +#define STD_COM_FLAGS (ASYNC_BOOT_AUTOCONF | ASYNC_SKIP_TEST | ASYNC_AUTO_IRQ) +#define STD_COM4_FLAGS (ASYNC_BOOT_AUTOCONF | ASYNC_AUTO_IRQ) +#else +#define STD_COM_FLAGS (ASYNC_BOOT_AUTOCONF | ASYNC_SKIP_TEST) +#define STD_COM4_FLAGS ASYNC_BOOT_AUTOCONF +#endif + +#define SERIAL_PORT_DFNS \ + /* UART CLK PORT IRQ FLAGS */ \ + { 0, BASE_BAUD, 0x3F8, 4, STD_COM_FLAGS }, /* ttyS0 */ \ + { 0, BASE_BAUD, 0x2F8, 3, STD_COM_FLAGS }, /* ttyS1 */ \ + { 0, BASE_BAUD, 0x3E8, 4, STD_COM_FLAGS }, /* ttyS2 */ \ + { 0, BASE_BAUD, 0x2E8, 3, STD_COM4_FLAGS }, /* ttyS3 */ -- cgit v1.2.3 From 60fd4d6a1953accd3d57f8e4f3b0f4692598bf4e Mon Sep 17 00:00:00 2001 From: Wyatt Banks Date: Tue, 31 Jul 2007 00:38:10 -0700 Subject: Documentation: document HFSPlus Documentation: document HFSPlus filesystem and its mount options. 
Signed-off-by: Wyatt Banks Cc: "Randy.Dunlap" Cc: Roman Zippel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/hfsplus.txt | 59 +++++++++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) create mode 100644 Documentation/filesystems/hfsplus.txt (limited to 'Documentation') diff --git a/Documentation/filesystems/hfsplus.txt b/Documentation/filesystems/hfsplus.txt new file mode 100644 index 000000000000..af1628a1061c --- /dev/null +++ b/Documentation/filesystems/hfsplus.txt @@ -0,0 +1,59 @@ + +Macintosh HFSPlus Filesystem for Linux +====================================== + +HFSPlus is a filesystem first introduced in MacOS 8.1. +HFSPlus has several extensions to HFS, including 32-bit allocation +blocks, 255-character unicode filenames, and file sizes of 2^63 bytes. + + +Mount options +============= + +When mounting an HFSPlus filesystem, the following options are accepted: + + creator=cccc, type=cccc + Specifies the creator/type values as shown by the MacOS finder + used for creating new files. Default values: '????'. + + uid=n, gid=n + Specifies the user/group that owns all files on the filesystem + that have uninitialized permissions structures. + Default: user/group id of the mounting process. + + umask=n + Specifies the umask (in octal) used for files and directories + that have uninitialized permissions structures. + Default: umask of the mounting process. + + session=n + Select the CDROM session to mount as HFSPlus filesystem. Defaults to + leaving that decision to the CDROM driver. This option will fail + with anything but a CDROM as underlying devices. + + part=n + Select partition number n from the devices. This option only makes + sense for CDROMs because they can't be partitioned under Linux. + For disk devices the generic partition parsing code does this + for us. Defaults to not parsing the partition table at all. + + decompose + Decompose file name characters. 
+ + nodecompose + Do not decompose file name characters. + + force + Used to force write access to volumes that are marked as journalled + or locked. Use at your own risk. + + nls=cccc + Encoding to use when presenting file names. + + +References +========== + +kernel source: + +Apple Technote 1150 http://developer.apple.com/technotes/tn/tn1150.html -- cgit v1.2.3 From a12e2c6cde6392287b9cd3b4bd8d843fd1458087 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 31 Jul 2007 00:38:17 -0700 Subject: Doc: DMA-API update Fix typos and update function parameters. Signed-off-by: Randy Dunlap Acked-by: Muli Ben-Yehuda Cc: James Bottomley Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/DMA-API.txt | 79 +++++++++++++++++++++++------------------------ 1 file changed, 38 insertions(+), 41 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index 805db4b2cba6..cc7a8c39fb6f 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -26,7 +26,7 @@ Part Ia - Using large dma-coherent buffers void * dma_alloc_coherent(struct device *dev, size_t size, - dma_addr_t *dma_handle, int flag) + dma_addr_t *dma_handle, gfp_t flag) void * pci_alloc_consistent(struct pci_dev *dev, size_t size, dma_addr_t *dma_handle) @@ -38,7 +38,7 @@ to make sure to flush the processor's write buffers before telling devices to read that memory.) This routine allocates a region of <size> bytes of consistent memory. -it also returns a <dma_handle> which may be cast to an unsigned +It also returns a <dma_handle> which may be cast to an unsigned integer the same width as the bus and used as the physical address base of the region. @@ -52,21 +52,21 @@ The simplest way to do that is to use the dma_pool calls (see below).
The flag parameter (dma_alloc_coherent only) allows the caller to specify the GFP_ flags (see kmalloc) for the allocation (the -implementation may chose to ignore flags that affect the location of +implementation may choose to ignore flags that affect the location of the returned memory, like GFP_DMA). For pci_alloc_consistent, you must assume GFP_ATOMIC behaviour. void -dma_free_coherent(struct device *dev, size_t size, void *cpu_addr +dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t dma_handle) void -pci_free_consistent(struct pci_dev *dev, size_t size, void *cpu_addr +pci_free_consistent(struct pci_dev *dev, size_t size, void *cpu_addr, dma_addr_t dma_handle) Free the region of consistent memory you previously allocated. dev, size and dma_handle must all be the same as those passed into the consistent allocate. cpu_addr must be the virtual address returned by -the consistent allocate +the consistent allocate. Part Ib - Using small dma-coherent buffers @@ -77,9 +77,9 @@ To get this part of the dma_ API, you must #include Many drivers need lots of small dma-coherent memory regions for DMA descriptors or I/O buffers. Rather than allocating in units of a page or more using dma_alloc_coherent(), you can use DMA pools. These work -much like a struct kmem_cache, except that they use the dma-coherent allocator +much like a struct kmem_cache, except that they use the dma-coherent allocator, not __get_free_pages(). Also, they understand common hardware constraints -for alignment, like queue heads needing to be aligned on N byte boundaries. +for alignment, like queue heads needing to be aligned on N-byte boundaries. struct dma_pool * @@ -102,15 +102,15 @@ crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated from this pool must not cross 4KByte boundaries. 
- void *dma_pool_alloc(struct dma_pool *pool, int gfp_flags, + void *dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags, dma_addr_t *dma_handle); - void *pci_pool_alloc(struct pci_pool *pool, int gfp_flags, + void *pci_pool_alloc(struct pci_pool *pool, gfp_t gfp_flags, dma_addr_t *dma_handle); This allocates memory from the pool; the returned memory will meet the size and alignment requirements specified at creation time. Pass GFP_ATOMIC to -prevent blocking, or if it's permitted (not in_interrupt, not holding SMP locks) +prevent blocking, or if it's permitted (not in_interrupt, not holding SMP locks), pass GFP_KERNEL to allow blocking. Like dma_alloc_coherent(), this returns two values: an address usable by the cpu, and the dma address usable by the pool's device. @@ -123,7 +123,7 @@ pool's device. dma_addr_t addr); This puts memory back into the pool. The pool is what was passed to -the pool allocation routine; the cpu and dma addresses are what +the pool allocation routine; the cpu (vaddr) and dma addresses are what were returned when that routine allocated the memory being freed. @@ -209,18 +209,18 @@ Notes: Not all memory regions in a machine can be mapped by this API. Further, regions that appear to be physically contiguous in kernel virtual space may not be contiguous as physical memory. Since this API does not provide any scatter/gather capability, it will fail -if the user tries to map a non physically contiguous piece of memory. +if the user tries to map a non-physically contiguous piece of memory. For this reason, it is recommended that memory mapped by this API be -obtained only from sources which guarantee to be physically contiguous +obtained only from sources which guarantee it to be physically contiguous (like kmalloc). Further, the physical address of the memory must be within the dma_mask of the device (the dma_mask represents a bit mask of the -addressable region for the device. i.e. 
if the physical address of +addressable region for the device. I.e., if the physical address of the memory anded with the dma_mask is still equal to the physical address, then the device can perform DMA to the memory). In order to ensure that the memory allocated by kmalloc is within the dma_mask, -the driver may specify various platform dependent flags to restrict +the driver may specify various platform-dependent flags to restrict the physical memory range of the allocation (e.g. on x86, GFP_DMA guarantees to be within the first 16Mb of available physical memory, as required by ISA devices). @@ -244,14 +244,14 @@ are guaranteed also to be cache line boundaries). DMA_TO_DEVICE synchronisation must be done after the last modification of the memory region by the software and before it is handed off to -the driver. Once this primitive is used. Memory covered by this -primitive should be treated as read only by the device. If the device +the driver. Once this primitive is used, memory covered by this +primitive should be treated as read-only by the device. If the device may write to it at any point, it should be DMA_BIDIRECTIONAL (see below). DMA_FROM_DEVICE synchronisation must be done before the driver accesses data that may be changed by the device. This memory should -be treated as read only by the driver. If the driver needs to write +be treated as read-only by the driver. If the driver needs to write to it at any point, it should be DMA_BIDIRECTIONAL (see below). DMA_BIDIRECTIONAL requires special handling: it means that the driver @@ -261,7 +261,7 @@ you must always sync bidirectional memory twice: once before the memory is handed off to the device (to make sure all memory changes are flushed from the processor) and once before the data may be accessed after being used by the device (to make sure any processor -cache lines are updated with data that the device may have changed. +cache lines are updated with data that the device may have changed). 
void dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size, @@ -302,8 +302,8 @@ pci_dma_mapping_error(dma_addr_t dma_addr) In some circumstances dma_map_single and dma_map_page will fail to create a mapping. A driver can check for these errors by testing the returned -dma address with dma_mapping_error(). A non zero return value means the mapping -could not be created and the driver should take appropriate action (eg +dma address with dma_mapping_error(). A non-zero return value means the mapping +could not be created and the driver should take appropriate action (e.g. reduce current DMA mapping usage or delay and try again later). int @@ -315,7 +315,7 @@ reduce current DMA mapping usage or delay and try again later). Maps a scatter gather list from the block layer. -Returns: the number of physical segments mapped (this may be shorted +Returns: the number of physical segments mapped (this may be shorter than passed in if the block layer determines that some elements of the scatter/gather list are physically adjacent and thus may be mapped with a single entry). @@ -357,7 +357,7 @@ accessed sg->address and sg->length as shown above. pci_unmap_sg(struct pci_dev *hwdev, struct scatterlist *sg, int nents, int direction) -unmap the previously mapped scatter/gather list. All the parameters +Unmap the previously mapped scatter/gather list. All the parameters must be the same as those and passed in to the scatter/gather mapping API. @@ -377,7 +377,7 @@ void pci_dma_sync_sg(struct pci_dev *hwdev, struct scatterlist *sg, int nelems, int direction) -synchronise a single contiguous or scatter/gather mapping. All the +Synchronise a single contiguous or scatter/gather mapping. All the parameters must be the same as those passed into the single mapping API. @@ -406,7 +406,7 @@ API at all. 
void * dma_alloc_noncoherent(struct device *dev, size_t size, - dma_addr_t *dma_handle, int flag) + dma_addr_t *dma_handle, gfp_t flag) Identical to dma_alloc_coherent() except that the platform will choose to return either consistent or non-consistent memory as it sees @@ -426,34 +426,34 @@ void dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t dma_handle) -free memory allocated by the nonconsistent API. All parameters must +Free memory allocated by the nonconsistent API. All parameters must be identical to those passed in (and returned by dma_alloc_noncoherent()). int dma_is_consistent(struct device *dev, dma_addr_t dma_handle) -returns true if the device dev is performing consistent DMA on the memory +Returns true if the device dev is performing consistent DMA on the memory area pointed to by the dma_handle. int dma_get_cache_alignment(void) -returns the processor cache alignment. This is the absolute minimum +Returns the processor cache alignment. This is the absolute minimum alignment *and* width that you must observe when either mapping memory or doing partial flushes. Notes: This API may return a number *larger* than the actual cache line, but it will guarantee that one or more cache lines fit exactly into the width returned by this call. It will also always be a power -of two for easy alignment +of two for easy alignment. void dma_sync_single_range(struct device *dev, dma_addr_t dma_handle, unsigned long offset, size_t size, enum dma_data_direction direction) -does a partial sync. starting at offset and continuing for size. You +Does a partial sync, starting at offset and continuing for size. You must be careful to observe the cache alignment and width when doing anything like this. You must also be extra careful about accessing memory you intend to sync partially. 
@@ -472,21 +472,20 @@ dma_declare_coherent_memory(struct device *dev, dma_addr_t bus_addr, dma_addr_t device_addr, size_t size, int flags) - Declare region of memory to be handed out by dma_alloc_coherent when it's asked for coherent memory for this device. bus_addr is the physical address to which the memory is currently assigned in the bus responding region (this will be used by the -platform to perform the mapping) +platform to perform the mapping). device_addr is the physical address the device needs to be programmed with actually to address this memory (this will be handed out as the -dma_addr_t in dma_alloc_coherent()) +dma_addr_t in dma_alloc_coherent()). size is the size of the area (must be multiples of PAGE_SIZE). -flags can be or'd together and are +flags can be or'd together and are: DMA_MEMORY_MAP - request that the memory returned from dma_alloc_coherent() be directly writable. @@ -494,7 +493,7 @@ dma_alloc_coherent() be directly writable. DMA_MEMORY_IO - request that the memory returned from dma_alloc_coherent() be addressable using read/write/memcpy_toio etc. -One or both of these flags must be present +One or both of these flags must be present. DMA_MEMORY_INCLUDES_CHILDREN - make the declared memory be allocated by dma_alloc_coherent of any child devices of this one (for memory residing @@ -528,7 +527,7 @@ dma_release_declared_memory(struct device *dev) Remove the memory region previously declared from the system. This API performs *no* in-use checking for this region and will return unconditionally having removed all the required structures. It is the -drivers job to ensure that no parts of this memory region are +driver's job to ensure that no parts of this memory region are currently in use. void * @@ -538,12 +537,10 @@ dma_mark_declared_memory_occupied(struct device *dev, This is used to occupy specific regions of the declared space (dma_alloc_coherent() will hand out the first free region it finds). 
-device_addr is the *device* address of the region requested +device_addr is the *device* address of the region requested. -size is the size (and should be a page sized multiple). +size is the size (and should be a page-sized multiple). The return value will be either a pointer to the processor virtual address of the memory, or an error (via PTR_ERR()) if any part of the region is occupied. - - -- cgit v1.2.3 From 7eacbbd32a98ab5b607f7773bb2692cc195db9b2 Mon Sep 17 00:00:00 2001 From: Satyam Sharma Date: Tue, 31 Jul 2007 00:38:17 -0700 Subject: Fix a typo in Documentation/keys.txt Signed-off-by: Satyam Sharma Acked-by: David Howells Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/keys.txt | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/keys.txt b/Documentation/keys.txt index 81d9aa097298..947d57d53453 100644 --- a/Documentation/keys.txt +++ b/Documentation/keys.txt @@ -859,9 +859,8 @@ payload contents" for more information. void unregister_key_type(struct key_type *type); -Under some circumstances, it may be desirable to desirable to deal with a -bundle of keys. The facility provides access to the keyring type for managing -such a bundle: +Under some circumstances, it may be desirable to deal with a bundle of keys. +The facility provides access to the keyring type for managing such a bundle: struct key_type key_type_keyring; -- cgit v1.2.3 From 22b238bdb93ed2fcb1d627ce81d8a2fcbe24de85 Mon Sep 17 00:00:00 2001 From: Anton Vorontsov Date: Tue, 31 Jul 2007 00:38:44 -0700 Subject: spidev_test utility This is a simple utility used to test SPI functionality. It could stand growing options to support using other test data patterns; this initial version only issues full duplex transfers, which rules out 3WIRE or Microwire links. 
Signed-off-by: Anton Vorontsov Signed-off-by: David Brownell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/spi/spidev_test.c | 202 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 202 insertions(+) create mode 100644 Documentation/spi/spidev_test.c (limited to 'Documentation') diff --git a/Documentation/spi/spidev_test.c b/Documentation/spi/spidev_test.c new file mode 100644 index 000000000000..218e86215297 --- /dev/null +++ b/Documentation/spi/spidev_test.c @@ -0,0 +1,202 @@ +/* + * SPI testing utility (using spidev driver) + * + * Copyright (c) 2007 MontaVista Software, Inc. + * Copyright (c) 2007 Anton Vorontsov + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License. + * + * Cross-compile with cross-gcc -I/path/to/cross-kernel/include + */ + +#include <stdint.h> +#include <unistd.h> +#include <stdio.h> +#include <stdlib.h> +#include <getopt.h> +#include <fcntl.h> +#include <sys/ioctl.h> +#include <linux/types.h> +#include <linux/spi/spidev.h> + +#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0])) + +static void pabort(const char *s) +{ + perror(s); + abort(); +} + +static char *device = "/dev/spidev1.1"; +static uint8_t mode; +static uint8_t bits = 8; +static uint32_t speed = 500000; +static uint16_t delay; + +static void transfer(int fd) +{ + int ret; + uint8_t tx[] = { + 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, + 0x40, 0x00, 0x00, 0x00, 0x00, 0x95, + 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, + 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, + 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, + 0xDE, 0xAD, 0xBE, 0xEF, 0xBA, 0xAD, + 0xF0, 0x0D, + }; + uint8_t rx[ARRAY_SIZE(tx)] = {0, }; + struct spi_ioc_transfer tr = { + .tx_buf = (unsigned long)tx, + .rx_buf = (unsigned long)rx, + .len = ARRAY_SIZE(tx), + .delay_usecs = delay, + .speed_hz = speed, + .bits_per_word = bits, + }; + + ret = ioctl(fd, SPI_IOC_MESSAGE(1), &tr); + if (ret == 1) + pabort("can't send spi message"); + + for (ret = 0; ret < ARRAY_SIZE(tx);
ret++) { + if (!(ret % 6)) + puts(""); + printf("%.2X ", rx[ret]); + } + puts(""); +} + +void print_usage(char *prog) +{ + printf("Usage: %s [-DsbdlHOLC3]\n", prog); + puts(" -D --device device to use (default /dev/spidev1.1)\n" + " -s --speed max speed (Hz)\n" + " -d --delay delay (usec)\n" + " -b --bpw bits per word \n" + " -l --loop loopback\n" + " -H --cpha clock phase\n" + " -O --cpol clock polarity\n" + " -L --lsb least significant bit first\n" + " -C --cs-high chip select active high\n" + " -3 --3wire SI/SO signals shared\n"); + exit(1); +} + +void parse_opts(int argc, char *argv[]) +{ + while (1) { + static struct option lopts[] = { + { "device", 1, 0, 'D' }, + { "speed", 1, 0, 's' }, + { "delay", 1, 0, 'd' }, + { "bpw", 1, 0, 'b' }, + { "loop", 0, 0, 'l' }, + { "cpha", 0, 0, 'H' }, + { "cpol", 0, 0, 'O' }, + { "lsb", 0, 0, 'L' }, + { "cs-high", 0, 0, 'C' }, + { "3wire", 0, 0, '3' }, + { NULL, 0, 0, 0 }, + }; + int c; + + c = getopt_long(argc, argv, "D:s:d:b:lHOLC3", lopts, NULL); + + if (c == -1) + break; + + switch (c) { + case 'D': + device = optarg; + break; + case 's': + speed = atoi(optarg); + break; + case 'd': + delay = atoi(optarg); + break; + case 'b': + bits = atoi(optarg); + break; + case 'l': + mode |= SPI_LOOP; + break; + case 'H': + mode |= SPI_CPHA; + break; + case 'O': + mode |= SPI_CPOL; + break; + case 'L': + mode |= SPI_LSB_FIRST; + break; + case 'C': + mode |= SPI_CS_HIGH; + break; + case '3': + mode |= SPI_3WIRE; + break; + default: + print_usage(argv[0]); + break; + } + } +} + +int main(int argc, char *argv[]) +{ + int ret = 0; + int fd; + + parse_opts(argc, argv); + + fd = open(device, O_RDWR); + if (fd < 0) + pabort("can't open device"); + + /* + * spi mode + */ + ret = ioctl(fd, SPI_IOC_WR_MODE, &mode); + if (ret == -1) + pabort("can't set spi mode"); + + ret = ioctl(fd, SPI_IOC_RD_MODE, &mode); + if (ret == -1) + pabort("can't get spi mode"); + + /* + * bits per word + */ + ret = ioctl(fd, SPI_IOC_WR_BITS_PER_WORD, &bits); + if 
(ret == -1) + pabort("can't set bits per word"); + + ret = ioctl(fd, SPI_IOC_RD_BITS_PER_WORD, &bits); + if (ret == -1) + pabort("can't get bits per word"); + + /* + * max speed hz + */ + ret = ioctl(fd, SPI_IOC_WR_MAX_SPEED_HZ, &speed); + if (ret == -1) + pabort("can't set max speed hz"); + + ret = ioctl(fd, SPI_IOC_RD_MAX_SPEED_HZ, &speed); + if (ret == -1) + pabort("can't get max speed hz"); + + printf("spi mode: %d\n", mode); + printf("bits per word: %d\n", bits); + printf("max speed: %d Hz (%d KHz)\n", speed, speed/1000); + + transfer(fd); + + close(fd); + + return ret; +} -- cgit v1.2.3 From 73c21e8024296760c450a0bded131cb573f83328 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 31 Jul 2007 00:39:04 -0700 Subject: docbook bad file references Fix docbook warnings: Warning(linux-2.6.22-git12//drivers/base/power/main.c): no structured comments found Warning(linux-2.6.22-git12//include/linux/splice.h): no structured comments found Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/DocBook/kernel-api.tmpl | 2 -- 1 file changed, 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index 31bf1eabc0dc..b886f52a9aac 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -380,7 +380,6 @@ X!Edrivers/base/interface.c !Edrivers/base/bus.c Device Drivers Power Management -!Edrivers/base/power/main.c !Edrivers/base/power/resume.c !Edrivers/base/power/suspend.c @@ -709,7 +708,6 @@ X!Idrivers/video/console/fonts.c kernel, without continually transferring them between the kernel and user space. -!Iinclude/linux/splice.h !Ffs/splice.c -- cgit v1.2.3 From 07ff8ee79e254ac07205d2c9c9e8fbe2f4b3e6e8 Mon Sep 17 00:00:00 2001 From: Dave Young Date: Tue, 31 Jul 2007 00:39:11 -0700 Subject: hpet.txt: broken link fix The specification link in hpet document is broken. 
Signed-off-by: Dave Young Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/hpet.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/hpet.txt b/Documentation/hpet.txt index b7a3dc38dd52..6ad52d9dad6c 100644 --- a/Documentation/hpet.txt +++ b/Documentation/hpet.txt @@ -5,7 +5,7 @@ for the 8254 and Real Time Clock (RTC) periodic timer functionality. Each HPET can have up to 32 timers. It is possible to configure the first two timers as legacy replacements for 8254 and RTC periodic timers. A specification done by Intel and Microsoft can be found at -. +. The driver supports detection of HPET driver allocation and initialization of the HPET before the driver module_init routine is called. This enables -- cgit v1.2.3