Vicente,
One of the biggest problems with search engines is the knowledge in the head of the user that doesn't get entered onto the search line. There are two big reasons to not get the results that you want: 1) The content is not a part of the collection being searched, 2) you had implied knowledge in your head that you didn't tell the search engine.
Many search engines will try to give you some context about how a page is related to your search. For example, the home page of a veterinarian might talk about cats, and dogs, and birds. If you are searching for cats, the information about birds on the page is probably not interesting, and if you are searching for dogs, you want an entirely different summary of the page. Many search engines will try to do that. My systems did the best job of that process by storing the pages to refilter them according to "refine" queries entered by the user.
But one of the weaknesses of search engines is that when they indexed the page, they had no idea of exactly what searches people would use to pull up the page. They have no idea what is going to be relevant in the future. So the static analysis that they did of the page when they indexed it might not be well tuned to what a particular user wants on a particular query.
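To make the query-dependent summary idea concrete, here is a minimal sketch, assuming the engine keeps the full page text and can re-score it at query time. The function name, the sentence splitting, and the sample page are all made up for illustration; it is not how any particular engine does it, and it does no stemming, so "dogs" would not match "dog".

```python
import re

def query_biased_summary(page_text, query, max_sentences=2):
    """Return the stored page's sentences that share the most words with the query."""
    terms = {t.lower() for t in re.findall(r"\w+", query)}
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", page_text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        words = {w.lower() for w in re.findall(r"\w+", sentence)}
        overlap = len(terms & words)
        if overlap:
            # Negative overlap so that sorting puts the best matches first.
            scored.append((-overlap, position, sentence))
    scored.sort()
    # Put the chosen sentences back into page order before joining them.
    best = sorted(scored[:max_sentences], key=lambda item: item[1])
    return " ".join(sentence for _, _, sentence in best)

# Hypothetical veterinarian home page.
page = ("We treat cats of every breed. Our dog clinic is open on weekends. "
        "Birds need yearly checkups too.")
print(query_biased_summary(page, "cats"))
print(query_biased_summary(page, "dog clinic"))
```

The same stored page gives a different summary for each query, which is the part a summary computed once at index time cannot do.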
Yesterday I bought an Iris Pen II Executive. It's a USB connected scanning wand that reads text. Quick Review: It worked surprisingly well, and the OCR is very good. However, on a test of scanning the several-page table of contents of a book, I found it took 9.5 minutes to scan and 6 minutes to correct the OCR to match what I would have typed, 15.5 minutes in all. I found I could just type in the table of contents in 11 minutes. So this is typical for most scanners: If you type 40 WPM or faster, scanning might not be any faster than just typing the text once you include the time to correct the inevitable errors.
Point for this discussion: Before buying it, I sat down in the store to do some online research about it. (I almost never shop in a store that doesn't provide me an internet terminal so that I can comparison shop online for reviews.) This new product already has 1700 pages in google ("Iris Pen II Executive")! This company might be lucky to sell 1700 pens in their entire production run!
I found many, many copies of the Amazon reviews on these other sites. After I read those reviews, I wanted to find other reviews, but with so many clones via AWS, it wasn't easy.
So how do I find some more independent reviews? How do I communicate this to a search engine that sometime in the past blindly indexed all these sites, not knowing what my particular need would be today? There are not a lot of reviews of this device, and all the clones of Amazon make it difficult to find a new point of view. If you instead use the query:
"Iris Pen II Executive" -amazon
the results count drops to 651. Does this mean there are roughly 1050 affiliates of Amazon listing this product? Very possibly... Using
"Iris Pen II Executive" -"I sat in my easy chair with my laptop on the table"
to exclude pages that include the first Amazon review drops the results to 1,580. So one could say there are at least 120 sites copying that review for this product, probably via AWS.
Excluding the contrived Amazon part number,
"Iris Pen II Executive" -B0000636X5
results in 1520 pages.
Sadly, attempting to filter out both of the above with
"Iris Pen II Executive" -B0000636X5 -"I sat in my easy chair with my laptop on the table"
gets the warning:
"chair" (and any subsequent words) was ignored because we limit queries to 10 words.
But now the results drop to 1460. So it's a safe bet there are at least 240 sites using the AWS feed for this product.
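The estimates so far are just subtractions from the 1700 total. A small sketch of that tally, using only the counts quoted in this post (the variable and function names are mine, and search engine result counts are themselves rough estimates):

```python
TOTAL = 1700  # results for the bare quoted product name

# Result counts reported above after adding each exclusion term.
counts_after_exclusion = {
    "-amazon": 651,
    "-(first Amazon review sentence)": 1580,
    "-B0000636X5": 1520,
    "-B0000636X5 plus the truncated review sentence": 1460,
}

def pages_removed(total, remaining):
    """Pages that matched the excluded term: total minus what is left."""
    return total - remaining

removed = {term: pages_removed(TOTAL, n) for term, n in counts_after_exclusion.items()}
for term, n in removed.items():
    print(f"{n:5d} pages removed by {term}")
```

Each figure is how many pages matched the excluded term, which is why the post reads it as a lower bound on the number of sites carrying the Amazon feed.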
Combining all three exclusions gives the query below. Written with -amazon.com it shows 620 results, but leaving off the .com drops it to 583 results, about 70 documents fewer than the -amazon query alone:
"Iris Pen II Executive" -amazon -B0000636X5 -"I sat in my easy chair with my laptop on the table"
This query was useful. It gave me some results that excluded most of the comparison people... It brought an innovative site like www.brainx.com out of the noise...
Excluding the Amazon price filters out even more associates:
"Iris Pen II Executive" -amazon -B0000636X5 -180.49
This gives 546 results. Unfortunately, this still leaves some Amazon people in the list, like PCBuyReview, obviously an associate with their URL that includes the Amazon ASIN. Obviously an industrial strength version of the "-" operator is needed to look at other things like URL, etc.
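A rough sketch of what a stronger "-" operator might look like, assuming you already have the result list in hand with the URL, title, and page text of each hit. The Result record, the exclude function, and the example URLs are all hypothetical. This is a post-filter on the client side, not a feature of any search engine.

```python
from dataclasses import dataclass

@dataclass
class Result:
    url: str
    title: str
    text: str

def exclude(results, markers, fields=("url", "title", "text")):
    """Drop any result where a marker appears in ANY of the given fields."""
    lowered = [m.lower() for m in markers]
    kept = []
    for r in results:
        # Unlike the plain "-" operator, the URL is searched too.
        haystack = " ".join(getattr(r, f) for f in fields).lower()
        if not any(m in haystack for m in lowered):
            kept.append(r)
    return kept

# Made-up hits: one with the ASIN in its URL, one quoting the Amazon price,
# and one independent write-up.
results = [
    Result("http://reviews.example/asin/B0000636X5", "Iris Pen II Executive", "Great pen scanner."),
    Result("http://shop.example/pens", "Iris Pen II Executive", "Our price: 180.49"),
    Result("http://blog.example/iris-pen", "Iris Pen II Executive hands-on", "I tested it on a book index."),
]
independent = exclude(results, ["amazon", "B0000636X5", "180.49"])
print([r.url for r in independent])
```

The associate whose page text is clean but whose URL carries the ASIN gets caught here, which is exactly the case the plain "-" operator missed.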
Using these techniques I eventually found some more reviews of the product, and I decided to buy it.....
In case anyone didn't notice, I'll state the obvious: Imagine a search tool that included Amazon's page on the product, and then with strategies like the above excluded all the copycats. Such a filter could easily be embedded in google as a checkbox item ("filter to reduce cloned pages"). While it might be tough to decide whether Ian stole his content from other sites, or if they stole from him, certainly no one would accuse Amazon of stealing Michael Lapore's writings, so if Amazon and another site contain the same words, guess who would lose..... What would your commission check look like without Google's traffic? Such a tool could tremendously raise the fortunes of people like Michael who write their own content. But can he write fast enough? It's hard to hand write more pages than can be cranked out by a gerbil on a wheel....
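Here is one way such a "filter cloned pages" checkbox could work, as a sketch only: treat one page as the trusted original and drop any other page that reproduces most of its word sequences. Word shingles are a standard near-duplicate technique that I am swapping in for illustration; the function names, the 0.5 threshold, the URLs, and the sample text are all invented.

```python
import re

def shingles(text, size=5):
    """Set of every run of `size` consecutive words in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def containment(candidate, original, size=5):
    """Fraction of the original's shingles that also occur in the candidate."""
    orig = shingles(original, size)
    if not orig:
        return 0.0
    return len(orig & shingles(candidate, size)) / len(orig)

def filter_clones(pages, trusted_url, threshold=0.5):
    """Keep the trusted page plus every page that does not mostly copy it."""
    original = pages[trusted_url]
    return [url for url, text in pages.items()
            if url == trusted_url or containment(text, original) < threshold]

# Invented sample text standing in for a product page and its copies.
original = ("this pen scanner reads printed text one line at a time "
            "and sends it to the computer")
pages = {
    "http://store.example/pen": original,
    "http://clone.example/pen": "Buy now! " + original + " Free shipping.",
    "http://blog.example/review": "I typed the table of contents faster than I could scan and correct it",
}
print(filter_clones(pages, "http://store.example/pen"))
```

The hard part is the one raised above: deciding which page counts as the original. The sketch simply takes that as an input.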
At a recent search engine conference I personally thanked all the people at the google booth for all the money they send me each month, but I didn't let them have my card, or see my name tag, to prevent any chance of jealous retaliation!!!! Years ago a shift in inktomi traffic reduced one of my internet sites from making $500 per day down to $3 per day. I knew the site was at risk, but I didn't expect such a dramatic change..... I had sold another larger site, with a million pages, for a good chunk of change a year before that happened. I found out from the new owner that his site had also been reduced to ashes when he was contacted by inktomi and asked to pay 29 cents a visitor for the traffic they were sending. Of course he didn't have $20,000 a day to send them, so they yanked that site from their results...
Sadly, today any script kiddy can produce a dozen copies of a million page site with a rodent script that they don't even have to write. Bac
...[Message truncated]