talk: jan pedersen
 

July 24, 2003

Today Jan Pedersen, former PARC researcher and current Chief Scientist of AltaVista, spoke at the PARC Forum. His talk was entitled Internet Search: Past, Present, and Future. It seems particularly relevant given my recent exposure to personalized search start-up Kaltix. Jan primarily covered the developmental and economic history of search engines and spoke about current search technologies. Read on for my notes from the talk.

Notes: PARC Forum, July 24, 2003

Internet Search: Past, Present, and Future
Jan Pedersen, Chief Scientist, AltaVista

  • Search Engine Timeline
    • Precursors
      • Information Retrieval research
      • Discovery that free text queries win over Boolean queries (Salton)
    • 1st Generation
      • 1993 NCSA Mosaic
        • Webcrawler
        • Yahoo!
        • Lycos (400k indexed pages)
        • Infoseek
      • Power Players
        • 1994 AltaVista
          • DEC labs, advanced query syntax, large index
          • Actually a showcase for DEC Alpha machines
        • 1996 Inktomi
          • Berkeley Systems Lab, Eric Brewer
          • Massively parallel solution
    • 2nd Generation
      • Relevance
        • 1998 DirectHit
          • re-ranking results using user click-through rates
        • 1998 Google
          • re-ranking results using link authority
      • Size
        • 1999 FAST/AllTheWeb
          • scalable architecture
      • User Matters
        • 1996 AskJeeves
          • Users ask questions, natural language input
      • Money
        • 1997 Goto/Overture
          • Pay-for-performance, pay for search rankings
    • 3rd Generation
      • Consolidation
        • 2002 Yahoo! purchases Inktomi
        • 2003 Overture purchases AltaVista, AllTheWeb
        • 2003 MSN announces intention to build its own search engine
        • 2003 Yahoo! purchases Overture
      • Maturity
        • $2B market, $6B by 2005
        • Requires large capital investment, limiting newcomers
          • Although Gigablast is an exception (2 years private development, mid-size search index)
        • Traffic focused on Yahoo!, Google, AOL
        • Consumer use driven by brand marketing
  • Economics
    • Overview
      • Popularity
        • Search is the most used Internet application after email
        • 400M queries / day
      • High bar for quality in search results
        • Users spend 1.5 hours / week searching
        • Experience search rage after 12 minutes
      • Expensive centralized service
        • Indices cover billions of documents
          • The FAST index is 30TB!
        • Query service is a high-performance application
          • Google claims 50K machines
      • Cost: $0.001 per query
        • Amortizes capital, operations, and engineering costs
    • Business Models
      • Early Monetization Models
        • Subscription services
          • Infoseek, Northern Light
          • Failed: users can find equal results for free
        • Advertising
          • Invented by Infoseek, Netscape
            • Untargeted ads (banners, sponsorships)
            • Limited keyword targeting (low keyword coverage)
        • Portalitis
          • Search not profitable enough, need stickier services
          • Email, shopping, content channels
          • Tried by Excite, Infoseek, AltaVista -> disastrous
          • Led to a lack of focus on core technology, which opened the door for 2nd generation search engines
      • Performance Search Market
        • Goto/Overture – keyword auction
        • 80k+ advertisers give good keyword coverage, currently exceeding 40%
        • Pay per click revenue
          • Marketers easily project to conversions
          • Search engine projects to CPM
        • Triple Win
          • Consumer: relevant ads
          • Marketer: qualified traffic
          • Search Engine: high-monetized impressions
        • Successful
          • Overture makes ~$1B/year
          • Strategy adopted by Google
        • Current Evolution
          • Greater automation
            • self-serve sign up, automated bidding
          • Increased competition
            • Google splits market with Overture
            • 19% of Yahoo! Revenue from paid listings
            • MSN Search most profitable MS product group by headcount (50 people)
    • Trends available online
      • SearchEngineWatch.com (Nielsen//NetRatings)
      • Traffic concentration
        • Google > Yahoo! > MSN…
      • Loyalty
        • AOL > Google > IS > Yahoo!
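
The per-query cost and pay-per-click figures in the Economics notes lend themselves to a quick back-of-envelope check. The sketch below is my own arithmetic, not something shown in the talk: it multiplies the quoted query volume by the quoted cost per query, and projects pay-per-click revenue onto a per-thousand-impressions (CPM) basis. The click-through rate and cost per click are made-up placeholder values, and applying the $0.001 cost to the full 400M queries is an assumption.

```python
# Figures taken from the notes.
QUERIES_PER_DAY = 400_000_000   # "400M queries / day"
COST_PER_QUERY = 0.001          # "$0.001 per query"
KEYWORD_COVERAGE = 0.40         # keyword coverage "exceeding 40%"

# Hypothetical values, NOT from the talk; chosen only to make the example run.
ASSUMED_CLICK_THROUGH_RATE = 0.05   # fraction of covered result pages that get a paid click
ASSUMED_COST_PER_CLICK = 0.30       # dollars paid by the advertiser per click


def daily_serving_cost(queries_per_day: float, cost_per_query: float) -> float:
    """Total daily cost if every query costs the same amortized amount."""
    return queries_per_day * cost_per_query


def effective_cpm(coverage: float, click_through_rate: float, cost_per_click: float) -> float:
    """Project pay-per-click revenue to revenue per 1,000 result pages.

    coverage: fraction of queries that have at least one paid listing.
    """
    return 1000 * coverage * click_through_rate * cost_per_click


if __name__ == "__main__":
    cost = daily_serving_cost(QUERIES_PER_DAY, COST_PER_QUERY)
    cpm = effective_cpm(KEYWORD_COVERAGE, ASSUMED_CLICK_THROUGH_RATE, ASSUMED_COST_PER_CLICK)
    print(f"Serving cost per day: ${cost:,.0f}")
    print(f"Serving cost per 1,000 queries: ${1000 * COST_PER_QUERY:.2f}")
    print(f"Projected revenue per 1,000 result pages (assumed CTR/CPC): ${cpm:.2f}")
```

The point of the projection is that marketers think in cost per click while the search engine thinks in revenue per thousand impressions; the same auction supports both views once coverage and click-through are known.
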
  • Technology
    • WWW Size
      • Dynamic pages -> effectively infinite pages
      • Domains: .com (23M), .net (4M), .org (2.5M)
    • Crawling
      • Index parameterized by size and freshness
      • Batch (discover, grab, index) and Incremental (mixed) approaches to crawling
    • Relative Size
      • Google – 3B, FAST – 2.5B, AltaVista – 1B
      • Anchor text only index (discovered links that are not yet crawled)
        • FAST 1.2B fully indexed pages (rest anchor text only)
        • Google 1.5B fully indexed pages
    • Freshness
      • Graph from (G. Notess)
      • Note use of hybrid indices
        • Subindices with differing update rates
    • Ranking
      • 2.4 query terms -> 2B documents -> 10 highly relevant pages. All in 300ms.
      • Trouble queries: Travel, Cobra, John Ellis
      • Ingredients
        • Keyword match
        • Anchor text
        • Link authority
        • Click-through rates
    • SPAM – An Arms Race
      • Manipulate content purely to influence ranking
      • Dictionary spam, link sharing, domain hijacking, link farms
      • Robotic use of search results
        • Meta-search engines
        • Search engine optimizers
        • Fraud
    • UI
      • Ranked results lists
        • Document summaries are critical
        • Hit highlight, dynamic abstract
        • NO RECENT INNOVATIONS!
      • Blending
        • Pre-defined segmentation (e.g. paid listing)
        • Intermixed results from multiple sources
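
The four ranking ingredients listed under Technology (keyword match, anchor text, link authority, click-through rates) can be pictured as signals combined into one score. The talk did not say how any engine actually combines them, so the following is only a minimal sketch assuming a simple weighted sum; the `PageSignals` type, the weights, and the 0-to-1 signal scale are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Per-page signals for one query, each assumed to be normalized to [0, 1]."""
    url: str
    keyword_match: float
    anchor_text: float
    link_authority: float
    click_through: float


# Hypothetical weights; real engines do not publish theirs.
ASSUMED_WEIGHTS = {
    "keyword_match": 0.4,
    "anchor_text": 0.2,
    "link_authority": 0.3,
    "click_through": 0.1,
}


def score(page: PageSignals, weights: dict = ASSUMED_WEIGHTS) -> float:
    """Weighted sum of the four signals named in the notes."""
    return sum(weight * getattr(page, name) for name, weight in weights.items())


def top_k(pages: list, k: int = 10) -> list:
    """Return the k highest-scoring pages, best first."""
    return sorted(pages, key=score, reverse=True)[:k]


if __name__ == "__main__":
    candidates = [
        PageSignals("http://example.com/a", 1.0, 0.0, 0.0, 0.0),
        PageSignals("http://example.com/b", 0.5, 0.5, 0.5, 0.5),
    ]
    for page in top_k(candidates, k=2):
        print(page.url, round(score(page), 3))
```

A linear blend is the simplest possible choice; it makes the arms race with spammers easy to see, since inflating any single signal (for example via link farms) raises the score directly.
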
  • Future
    • Question Answering
      • Natural Language Processing
      • Dumais, SIGIR 2002 paper
      • WWW as language model
    • New Contexts
      • Ubiquitous searching
      • Implicit searching
    • New Tasks
      • Local / community search
  • Questions…
    • Personalization
      • Currently searchers are anonymous
      • Personalized search requires some form of user model
        • How much does the engine need to know?
        • Geographic location
        • Use context of surfing behavior
    • Personal Search Agents
      • Technical challenges to this
      • My idea: have distributed agents coupled with access to large, centralized indices.
      • Most importantly: what is the big advantage??
        • Need qualitative change in searching experience
        • Interesting, but not shown useful yet
      • My idea: have agents be pre-fetchers to automatically hunt for content for which you have a high probability of interest
        • e.g. citation mining to collect all research papers within a particular domain
Posted by jheer at July 24, 2003 05:42 PM

