(See also RunNutchInEclipse.) Now there is a directory runtime/local which contains a ready-to-use Nutch installation. When the source distribution is used, ${NUTCH_RUNTIME_HOME} refers to apache-nutch-1.X/runtime/local/. Note that config files should be modified in apache-nutch-1.X/runtime/local/conf/, and that ant clean will remove this directory (keep copies of modified config files).

Running bin/nutch with no arguments prints the available commands:

  Usage: nutch COMMAND
  where COMMAND is one of:
    readdb      read / dump crawl db
    mergedb     merge crawldb-s, with optional filtering
    readlinkdb  read / dump link db
    inject      inject new urls into the database
    generate    generate new segments to fetch from crawl db
    freegen     generate new segments to fetch from text files
    fetch       fetch a segment's pages
    ...

For reference, a typical hosts file (/etc/hosts) host database looks like the following, where LMC-032857 is the local machine's hostname:

  # Host Database
  #
  # localhost is used to configure the loopback interface
  # when the system is booting. Do not change this entry.
  127.0.0.1       localhost.localdomain localhost LMC-032857
  ::1             ip6-localhost ip6-loopback
  fe80::1%lo0     ip6-localhost ip6-loopback

Customize your crawl properties

Default crawl properties can be viewed and edited within conf/nutch-default.xml; most of these can be used without modification. The file conf/nutch-site.xml serves as a place to add your own custom crawl properties that override conf/nutch-default.xml. The only required modification for this file is to override the value field of http.agent.name, i.e. add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml (a sketch of this property is given after this section).

Create a URL seed list

A URL seed list contains a list of websites, one per line, which Nutch will crawl. The file conf/regex-urlfilter.txt provides regular expressions that allow Nutch to filter and narrow the types of web resources to crawl and download (an example seed list and filter entry are sketched after this section).

Whole-Web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines. This also permits more control over the crawl process, and incremental crawling. It is important to note that whole-Web crawling does not necessarily mean crawling the entire World Wide Web. We can limit a whole-Web crawl to just a list of the URLs we want to crawl. This is done by using a filter, just like the one we used when we ran the crawl command (above).

Nutch data is composed of:

- The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched and, if so, when.
- The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.
- A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:
  - a crawl_generate names a set of URLs to be fetched
  - a crawl_fetch contains the status of fetching each URL
  - a content contains the raw content retrieved from each URL
  - a parse_text contains the parsed text of each URL
  - a parse_data contains outlinks and metadata parsed from each URL
  - a crawl_parse contains the outlink URLs, used to update the crawldb
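The http.agent.name override described above under "Customize your crawl properties" looks roughly like the following. This is a minimal sketch; the value shown is a placeholder agent name that you should replace with your own:

  <!-- add inside the <configuration> element of conf/nutch-site.xml -->
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>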
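The seed list described above can be created with a couple of shell commands. This is only a sketch; the urls/ directory, the seed.txt file name, and the example URL are illustrative assumptions, not requirements:

  # create a directory holding the seed file, one URL per line
  mkdir -p urls
  echo "http://nutch.apache.org/" > urls/seed.txt

To narrow the crawl to that site, a matching rule could be added to conf/regex-urlfilter.txt (again, just an illustration of the "+ regex" format):

  +^https?://([a-z0-9-]+\.)*nutch\.apache\.org/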
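The crawldb, linkdb, and segments described above can be inspected with the corresponding read commands. A minimal sketch, assuming the data lives under a crawl/ directory (the paths and the linkdump output directory are assumptions):

  # print summary statistics for the crawl db
  bin/nutch readdb crawl/crawldb -stats
  # dump known inlinks and anchor text from the link db
  bin/nutch readlinkdb crawl/linkdb -dump linkdump
  # list the segments and their fetch status
  bin/nutch readseg -list -dir crawl/segments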
Next we select a random subset of these pages. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We select one out of every 5,000, so that we end up with around 1,000 URLs; a sketch of this subsetting step is given at the end of this section.

The solrindex command sends crawl data to Solr for indexing:

  Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]

  Example: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize

Deduplication of documents with identical digests is implemented as a MapReduce job:

- Map: Identity map where keys are digests and values are SolrRecord instances (which contain id, boost, and timestamp).
- Reduce: After the map, SolrRecords with the same digest are grouped together. Of these documents with the same digest, all are deleted except the one with the highest score (boost field). If two (or more) documents have the same score, the document with the latest timestamp is kept; every other one is deleted from the Solr index.
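Returning to the DMOZ subsetting step mentioned earlier, a sketch of how the subset can be produced and injected, assuming the uncompressed DMOZ dump is available as content.rdf.u8 and the crawl db lives at crawl/crawldb (both assumptions):

  # keep roughly one out of every 5,000 DMOZ URLs
  mkdir dmoz
  bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
  # inject the selected URLs into the crawl db
  bin/nutch inject crawl/crawldb dmoz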