links for 2009-12-23
-
I'm thrilled to learn that I can now dual-boot my new Intel-based Mac so that I can live in the warm fuzzy world of Mac OS X or flip over to the business world of Windows XP. I have a shiny new copy of WinXP from my IT people, an Intel-based Mac Mini, and lots of enthusiasm, but that's about it. How the heck do I actually install Windows XP on my Mac so I can work in either operating system?
links for 2009-12-21
-
Index a DB table directly into Solr. Quick tutorial on how to index a table from a MySQL database through the Solr DataImportHandler.
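Once the DataImportHandler is configured, an import is usually started with an HTTP request to the handler. A minimal sketch of building that request URL follows; it assumes the handler is registered at `/dataimport` and that Solr runs at `http://localhost:8983/solr`, both of which depend on the local setup, and the helper name `dataimport_url` is made up for illustration.

```python
from urllib.parse import urlencode


def dataimport_url(solr_base, command="full-import", handler="/dataimport", **params):
    """Build the URL that triggers a DataImportHandler command.

    `handler` is an assumption: it must match the request handler path
    registered in solrconfig.xml for this Solr instance.
    """
    # The command goes in the query string, followed by any extra parameters.
    query = urlencode({"command": command, **params})
    return f"{solr_base.rstrip('/')}{handler}?{query}"


if __name__ == "__main__":
    # Assumed base URL; adjust to the actual Solr instance.
    print(dataimport_url("http://localhost:8983/solr", clean="true", commit="true"))
```

Fetching the printed URL (for example in a browser) would start the import; the sketch only builds the string and makes no network call.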
links for 2009-12-18
-
Information – it’s the key to knowledge. According to industry analyst Gartner, “information access technology will locate and analyze more than 90% of data in more than 50% of Global 2000 enterprises by YE12”. To help organizations meet this challenge Mindbreeze Enterprise Search enables organizations to mature their information “ecosystem” with an easy to handle but impressively powerful enterprise search software solution.
links for 2009-12-12
-
Ashlee Vance’s insightful piece in Monday’s NYTimes on the implications of the wrangling between the EU and Larry Ellison over Sun and MySQL lit up a lot of conversation in open source circles. And with Open Source reaching something like a ten-year mark since Red Hat and Linux broke forth in a big way, it’s a good time to ask the question: Is Open Source a business model, and if so, can it succeed? I think the answer lies in a more nuanced understanding of open source, from three perspectives: as a business model, as a development method, and as a social network.
-
Faceted Search with Solr | Enterprise Search support for Apache Lucene and Solr by Lucid Imagination
Faceted search has become a critical feature for enhancing findability and the user search experience for all types of search applications. In this article, Solr creator Yonik Seeley gives an introduction to faceted search with Solr.
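The idea behind faceting can be illustrated without Solr at all: for the documents that match a query, count how many carry each value of a field, then show those counts as navigation links. The sketch below is a plain-Python illustration of that counting step, not how Solr implements it; the function name `facet_counts` and the sample documents are invented for the example.

```python
from collections import Counter


def facet_counts(docs, field):
    """Return (value, document_count) pairs for `field`, most frequent first."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # documents without the field contribute nothing
        # Treat multi-valued fields (lists, tuples, sets) and single values alike.
        values = value if isinstance(value, (list, tuple, set)) else [value]
        # Use a set so a document counts once per distinct value.
        counts.update(set(values))
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical search results.
    results = [
        {"title": "Camera A", "brand": "Canon", "tags": ["dslr", "video"]},
        {"title": "Camera B", "brand": "Nikon", "tags": ["dslr"]},
        {"title": "Camera C", "brand": "Canon"},
    ]
    print(facet_counts(results, "brand"))
    print(facet_counts(results, "tags"))
```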
-
Semantic search has been the new black in the high fashion of content management and the industries around it. Nstein (news, site), a provider of Web CMS, DAM and text-mining technologies, just released a new product — which they say is more flexible, intuitive and extensible than Google Search Appliance — called Semantic Site Search, or the “new kind of site search,” as the vendor humbly refers to it.
links for 2009-12-09
-
Solr 1.3 brings a powerful set of features that make it more attractive than ever. The rest of this article takes a look at new Solr features and how you can incorporate them into your applications. To demonstrate them, I'll build a simple application that combines an RSS feed with a rating of that feed. The ratings will be stored in a database, and the RSS feed will be taken from my Lucene blog's RSS feeds. Given this simple setup, I'll demonstrate the use of:
-
"Named entities" is the NLP jargon for proper nouns which represent people, places, organisations, and so on. This module provides a very simple way of extracting these from a text. If we run the extract_entities routine on a piece of news coverage of recent UK political events, we should expect to see it return a list of hash references looking like this:
-
CRFClassifier is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. The software provides a general (arbitrary order) implementation of linear chain Conditional Random Field (CRF) sequence models, of the sort pioneered by Lafferty, McCallum, and Pereira (2001), coupled with well-engineered feature extractors for Named Entity Recognition. Included are a good 3 class (PERSON, ORGANIZATION, LOCATION) named entity recognizer for English (in versions with and without additional distributional similarity features) and another pair of models trained on the CoNLL 2003 English training data. The distributional similarity features improve performance but the models require considerably more memory.
-
The HTML tags on a web page must be stripped away to get clean text for a PHP search engine, keyword extractor, or some other page analysis tool. PHP's standard strip_tags() function will do part of the job, but you need to strip out styles, scripts, embedded objects, and other unwanted page code first. This tip shows how.
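The same two-step idea (drop the contents of non-text elements, then drop the remaining tags) can be sketched in Python with the standard library's `html.parser`. This is an illustration of the approach, not the linked PHP code; the class name `TextExtractor`, the helper `html_to_text`, and the choice of elements to skip are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of non-text elements."""

    # Elements whose contents are discarded (an assumed, non-exhaustive list).
    SKIP = {"script", "style", "object", "noscript", "iframe"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><head><style>p{color:red}</style></head><body><p>Hello <b>world</b></p></body></html>"
    print(html_to_text(page))
```

Joining chunks with a space keeps words in adjacent block elements apart, at the cost of occasionally splitting a word that is only partly wrapped in an inline tag.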
links for 2009-12-08
-
If you’ve ever tried to write a program that fetches search results from Google, you’ll no doubt be familiar with the excruciating annoyances of parsing the results and getting blocked periodically. Run a couple hundred queries in a row and bam! – your script is banned until proven innocent by entering a captcha. Even that would provide only a short reprieve, as you’d soon get blocked again.
-
I’ve created a PHP class that can extract the main content parts from an HTML page, stripping away superfluous components like JavaScript blocks, menus, advertisements and so on. The script isn’t 100% effective, but it is good enough for many practical purposes. It can also serve as a starting point for more complex systems.