First, a warning: to those who read my blog for the usual posts on Lean Manufacturing and Change Management, this article is on neither of those topics. Okay, read on. I concede: I know almost nothing about SEO, but other publications I read have been discussing this “Panda” update, or “Farmer” update, that Google rolled out recently. So, I decided to read up and learn what’s going on. Furthermore, I don’t know jack about Google. Other than my 2006 job interview with Google, I am simply a user of Google search. I don’t have any insider dealings. I’m just another guy. When the first major algorithm update came out, many in the SEO world dubbed it the “Farmer” update because its aim was to devalue content farms and, by doing so, raise the value of high-quality sites by reducing the value of low-quality ones – pretty much sites that are like a neighborhood of dog poopy.
I guess a bunch of websites were affected by the update – big sites, too. That’s pretty much all I know. But what got lost in all the debate was this important question: who the heck is Panda?
Who is Panda?
Well, we know that Panda is a Google engineer, as Amit Singhal explained in this interview with Wired magazine:
Wired.com: What’s the code name of this update? Danny Sullivan of Search Engine Land has been calling it “Farmer” because its apparent target is content farms.
Amit Singhal: Well, we named it internally after an engineer, and his name is Panda. So internally we called it “big Panda.” He was one of the key guys. He basically came up with the breakthrough a few months back that made it possible.
Here, Amit Singhal – a Google Fellow who oversees search quality, so a pretty important person in the search world – confirms that “Panda” is a person: yeah, a real human being with a pretty cool name. And the update was based on his breakthrough. So, if Panda is a person whose recent breakthrough led to a massive change in how websites are valued in the eyes of Google, then what we can learn about him might help a largely confused SEO world make sense of the Panda, or Farmer, or whatever update. So, what do we know about him? Can some knowledge of his background or research interests give us a hint as to how one can survive the dreaded update? Can it help Black Hat SEOs better game Google? Obviously I’m not the best person to answer those questions, but here’s what we know about Panda, taken from a simple search on Google, LinkedIn, Facebook, and Twitter.
Who is Navneet Panda?
- Navneet Panda studied at the Indian Institute of Technology Kharagpur, in the Department of Mathematics, and earned an MSc in Mathematics and Computing (an integrated 5-year course).
- Navneet Panda then went on to the University of California, Santa Barbara, where he earned a Ph.D. in Computer Science. His advisor was Edward Y. Chang.
It appears that before he joined Google in 2007, he did summer internships at Intel and at the IBM T. J. Watson Research Center in New York. Navneet Panda has filed two patents, described below:
- Learning Concept Templates from Web Images to Query Personal Image Databases, Navneet Panda, Yi Y. Wu, Jean-Yves Bouguet, Ara Nefian (filed with Intel, June 2007)
- Fast Approximate SVM Classification for Large-Scale Stream Filtering, Navneet Panda, Ching-Yung Lin, and Lisa D. Amini (filed with IBM, September 2005)
Below is a list of his publications, each followed by a short abstract, which might give us a sense of what was behind the Google Panda update (a toy code sketch of the recurring idea follows the list):
- Efficient Top-k Hyperplane Query Processing for Multimedia Information Retrieval. Abstract: A query can be answered by a binary classifier, which separates the instances that are relevant to the query from the ones that are not. When kernel methods are employed to train such a classifier, the class boundary is represented as a hyperplane in a projected space. Data instances that are farthest from the hyperplane are deemed to be most relevant to the query, and that are nearest to the hyperplane to be most uncertain to the query. In this paper, we address the twin problems of efficient retrieval of the approximate set of instances (a) farthest from and (b) nearest to a query hyperplane. Retrieval of instances for this hyperplane-based query scenario is mapped to the range-query problem allowing for the reuse of existing index structures. Empirical evaluation on large image datasets confirms the effectiveness of our approach (link).
- Concept Boundary Detection for Speeding up SVMs: Support Vector Machines (SVMs) suffer from an O(n²) training cost, where n denotes the number of training instances. In this paper, we propose an algorithm to select boundary instances as training data to substantially reduce n. Our proposed algorithm is motivated by the result of (Burges, 1999) that removing non-support vectors from the training set does not change SVM training results. Our algorithm eliminates instances that are likely to be non-support vectors. In the concept-independent preprocessing step of our algorithm, we prepare nearest-neighbor lists for training instances. In the concept-specific sampling step, we can then effectively select useful training data for each target concept. Empirical studies show our algorithm to be effective in reducing n, outperforming other competing downsampling algorithms without significantly compromising testing accuracy (link).
- KDX: An Indexer for Support Vector Machines: Support Vector Machines (SVMs) have been adopted by many data-mining and information-retrieval applications for learning a mining or query concept, and then retrieving the “top-k” best matches to the concept. However, when the dataset is large, naively scanning the entire dataset to find the top matches is not scalable. In this work, we propose a kernel indexing strategy to substantially prune the search space and thus improve the performance of top-k queries. Our kernel indexer (KDX) takes advantage of the underlying geometric properties and quickly converges on an approximate set of top-k instances of interest. More importantly, once the kernel (e.g., Gaussian kernel) has been selected and the indexer has been constructed, the indexer can work with different kernel-parameter settings without performance compromise. Through theoretical analysis, and empirical studies on a wide variety of datasets, we demonstrate KDX to be very effective (link).
- Exploiting Geometry for Support Vector Machine Indexing: Support Vector Machines (SVMs) have been adopted by many data-mining and information-retrieval applications for learning a mining or query concept, and then retrieving the “top-k” best matches to the concept. However, when the dataset is large, naively scanning the entire dataset to find the top matches is not scalable. In this work, we propose a kernel indexing strategy to substantially prune the search space and thus improve the performance of top-k queries. Our kernel indexer (KDX) takes advantage of the underlying geometric properties and quickly converges on an approximate set of top-k instances of interest. More importantly, once the kernel (e.g., Gaussian kernel) has been selected and the indexer has been constructed, the indexer can work with different kernel-parameter settings without performance compromise. Through theoretical analysis, and empirical studies on a wide variety of datasets, we demonstrate KDX to be very effective (link).
- Hypersphere Indexer: Indexing high-dimensional data for efficient nearest-neighbor searches poses interesting research challenges. It is well known that when data dimension is high, the search time can exceed the time required for performing a linear scan on the entire dataset. To alleviate this dimensionality curse, indexing schemes such as locality sensitive hashing (LSH) and M-trees were proposed to perform approximate searches. In this paper, we propose a hypersphere indexer, named Hydex, to perform such searches. Hydex partitions the data space using concentric hyperspheres. By exploiting geometric properties, Hydex can perform effective pruning. Our empirical study shows that Hydex enjoys three advantages over competing schemes for achieving the same level of search accuracy. First, Hydex requires fewer seek operations. Second, Hydex can maintain sequential disk accesses most of the time. And third, it requires fewer distance computations (link).
- Active Learning in Very Large Databases: Query-by-example and query-by-keyword both suffer from the problem of “aliasing,” meaning that example-images and keywords potentially have variable interpretations or multiple semantics. For discerning which semantic is appropriate for a given query, we have established that combining active learning with kernel methods is a very effective approach. In this work, we first examine active-learning strategies, and then focus on addressing the challenges of two scalability issues: scalability in concept complexity and in dataset size. We present remedies, explain limitations, and discuss future directions that research might take (link).
- Formulating Context-dependent Similarity: Tasks of information retrieval depend on a good distance function for measuring similarity between data instances. The most effective distance function must be formulated in a context-dependent (also application-, data-, and user-dependent) way. In this paper, we present a novel method, which learns a distance function by capturing the nonlinear relationships among contextual information provided by the application, data, or user. We show that through a process called the “kernel trick,” such nonlinear relationships can be learned efficiently in a projected space. In addition to using the kernel trick, we propose two algorithms to further enhance efficiency and effectiveness of function learning. For efficiency, we propose an SMO-like solver to achieve O(N²) learning performance. For effectiveness, we propose using unsupervised learning in an innovative way to address the challenge of lack of labeled data (contextual information). Theoretically, we substantiate that our method is both sound and optimal. Empirically, we demonstrate that our method is effective and useful (link).
- Formulating Distance Functions via the Kernel Trick: Tasks of data mining and information retrieval depend on a good distance function for measuring similarity between data instances. The most effective distance function must be formulated in a context-dependent (also application-, data-, and user-dependent) way. In this paper, we propose to learn a distance function by capturing the nonlinear relationships among contextual information provided by the application, data, or user. We show that through a process called the “kernel trick,” such nonlinear relationships can be learned efficiently in a projected space. Theoretically, we substantiate that our method is both sound and optimal. Empirically, using several datasets and applications, we demonstrate that our method is effective and useful (link).
- Speeding up Approximate SVM Classification for Data Streams
- Improving Accuracy of SVMs by Allowing Support Vector Control
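Reading these abstracts as a layman, one theme keeps coming up: treat a query as a classifier, rank every instance by how far it sits from the learned hyperplane (farthest on the positive side = most relevant, closest = most uncertain), and then find clever ways to compute that ranking fast on huge datasets. Just to make that concrete for fellow non-experts, here is a toy sketch of the basic farthest-from/nearest-to-hyperplane idea. To be clear: this is my own illustration using scikit-learn, not code from the papers and certainly not anything Google runs; the indexing work above (KDX, Hydex) exists precisely to avoid the brute-force scoring this sketch does.

```python
# Toy illustration of a "top-k hyperplane query": train a kernel SVM
# as the query concept, then rank instances by signed distance from
# the decision hyperplane. Brute force on purpose -- the KDX/Hydex
# indexes in the papers exist to approximate this without a full scan.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up data: 1,000 points in 10 dimensions with a fuzzy "concept".
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000) > 0).astype(int)

# A Gaussian-kernel SVM stands in for the query concept.
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)

# Signed distance of every instance from the hyperplane (in the projected space).
scores = clf.decision_function(X)

k = 5
top_k_relevant = np.argsort(-scores)[:k]          # farthest out on the positive side
top_k_uncertain = np.argsort(np.abs(scores))[:k]  # closest to the boundary

print("most relevant:", top_k_relevant)
print("most uncertain:", top_k_uncertain)
```

The “most uncertain” list at the end is also the heart of the active-learning paper above: those are exactly the instances you would ask a human to label next.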
Here is what he lists as Research Projects on his resume:
Machine Learning:
- Development of indexing structures for support vector machines to enable relevant instance search in high-dimensional datasets
- Speeding up SVM training in multi-category large dataset scenarios
- Speeding up approximate SVM classification of data streams
- Improving concept identification and classification for personal image retrieval
- Using idealizing kernels to develop distance metrics incorporating user preferences for high-dimensional data
- Design of a real-time web page classifier for text and image data (see the toy sketch after this list)
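That last item jumps out at me: a real-time web page classifier for text and image data sounds an awful lot like what the Panda update is reported to do, namely score pages by quality. Purely as speculation on my part, here is the flavor of such a thing in a few lines of scikit-learn. The labels, example pages, and features are all invented by me; nobody outside Google knows what signals Panda actually uses.

```python
# A deliberately toy "web page quality" classifier over text only --
# my speculation about the flavor of such work, not Panda's resume
# project and not anything from Google. All labels/pages are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented training pages: 1 = substantive, 0 = shallow/farmed.
pages = [
    "original analysis with sources, data, and a clear argument",
    "in-depth tutorial written from first-hand experience",
    "top ten tips click here best cheap deals click here now",
    "keyword keyword buy now cheap cheap limited offer act fast",
]
labels = [1, 1, 0, 0]

# TF-IDF features plus a linear SVM: cheap to score on new pages,
# hence plausibly "real time" once trained.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(pages, labels)

new_page = "a researched walkthrough of the method with worked examples"
print("substantive" if model.predict([new_page])[0] == 1 else "shallow")
```

If something even vaguely like this runs at Google scale, you can see why writing substantive pages beats trying to game any one signal.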
Grid Computing and Distributed Systems:
- Development of scheduling strategies for numerous large jobs in a grid environment under heavy load conditions using the Network Weather Service and Globus
- Development of scheduling strategies for executing compute-intensive jobs in a dynamically evolving simulated market of servers providing priced slots of CPU time for process execution
- Development of a distributed dictionary enforcing causal ordering
- Development of a dynamic peer-to-peer system with query lookup modeling the CAN architecture
- Design of a snoopy cache for a multiprocessor system
- Design of a superscalar instruction dispatch unit
I don’t know. I’ll leave it to the SEO people to decide. I just write and don’t pay much attention to SEO, because I don’t know much about it. But at least now we can put a face to a generically named Google algorithm update called “Panda”. Now, when someone references a Google algorithm update as “Panda”, we can all say, under our breath, “Yeah, that Navneet Panda guy”.